Get last 10 lines of very large text file > 10GB

Question

What is the most efficient way to display the last 10 lines of a very large text file (this particular file is over 10GB)? I was thinking of just writing a simple C# app but I'm not sure how to do this effectively.

Answer 1:

Seek to the end of the file, then step backwards until you have found ten newlines, and then read forward to the end, taking the encoding into account. Be sure to handle the case where the file has fewer than ten lines. Below is an implementation (in C#, as you tagged this), generalized to find the last numberOfTokens tokens in the file located at path, encoded in encoding, where the token separator is tokenSeparator; the result is returned as a string (this could be improved by returning an IEnumerable<string> that enumerates the tokens).

public static string ReadEndTokens(string path, Int64 numberOfTokens, Encoding encoding, string tokenSeparator) {

    // Byte width of one character; this approach assumes a fixed-width encoding
    // (or UTF-8, where the separator bytes cannot occur inside another character).
    int sizeOfChar = encoding.GetByteCount("\n");
    byte[] buffer = encoding.GetBytes(tokenSeparator);

    using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read)) {
        Int64 tokenCount = 0;

        // position is a byte offset measured back from the end of the file.
        // Start one full separator from the end so the buffer is always filled.
        for (Int64 position = buffer.Length; position <= fs.Length; position += sizeOfChar) {
            fs.Seek(-position, SeekOrigin.End);
            fs.Read(buffer, 0, buffer.Length);

            if (encoding.GetString(buffer) == tokenSeparator) {
                tokenCount++;
                if (tokenCount == numberOfTokens) {
                    // The stream is now just past the separator; return everything after it.
                    byte[] returnBuffer = new byte[fs.Length - fs.Position];
                    fs.Read(returnBuffer, 0, returnBuffer.Length);
                    return encoding.GetString(returnBuffer);
                }
            }
        }

        // handle case where number of tokens in file is less than numberOfTokens
        fs.Seek(0, SeekOrigin.Begin);
        buffer = new byte[fs.Length];
        fs.Read(buffer, 0, buffer.Length);
        return encoding.GetString(buffer);
    }
}

Answer 2:

I'd likely just open it as a binary stream, seek to the end, then back up looking for line breaks. Back up 10 (or 11 depending on that last line) to find your 10 lines, then just read to the end and use Encoding.GetString on what you read to get it into a string format. Split as desired.
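
A minimal sketch of that idea, reading backwards in chunks rather than byte by byte. The names TailSketch, LastLines and ChunkSize are made up for illustration, and it assumes an encoding in which the byte 0x0A always means a line feed (ASCII, UTF-8, Latin-1 and similar), so it would not work as written for UTF-16. If the file has fewer lines than requested, the whole file is read into memory, and a BOM at the start of the file is not stripped.

```csharp
using System;
using System.IO;
using System.Text;

public static class TailSketch
{
    // Returns the last lineCount lines of the file, decoded with the given encoding.
    public static string[] LastLines(string path, int lineCount, Encoding encoding)
    {
        if (lineCount <= 0) return new string[0];
        const int ChunkSize = 64 * 1024;

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            long end = fs.Length;
            if (end == 0) return new string[0];

            // Ignore one trailing line feed so it does not count as an extra empty line.
            fs.Seek(end - 1, SeekOrigin.Begin);
            long scanEnd = fs.ReadByte() == '\n' ? end - 1 : end;

            long start = 0;   // where the tail begins; stays 0 if the file has too few lines
            int found = 0;
            long pos = scanEnd;
            var chunk = new byte[ChunkSize];

            while (pos > 0 && found < lineCount)
            {
                int size = (int)Math.Min(ChunkSize, pos);
                pos -= size;
                fs.Seek(pos, SeekOrigin.Begin);
                ReadFully(fs, chunk, size);

                for (int i = size - 1; i >= 0; i--)
                {
                    if (chunk[i] == (byte)'\n' && ++found == lineCount)
                    {
                        start = pos + i + 1;   // first byte after that line feed
                        break;
                    }
                }
            }

            // Only the tail is decoded; this allocation fails if the tail exceeds ~2GB.
            var tail = new byte[end - start];
            fs.Seek(start, SeekOrigin.Begin);
            ReadFully(fs, tail, tail.Length);

            string text = encoding.GetString(tail);
            if (text.Length > 0 && text[text.Length - 1] == '\n')
                text = text.Substring(0, text.Length - 1);

            string[] lines = text.Split('\n');
            for (int i = 0; i < lines.Length; i++)
                lines[i] = lines[i].TrimEnd('\r');   // tolerate CRLF line endings
            return lines;
        }
    }

    // Stream.Read may return fewer bytes than asked for, so loop until count bytes arrive.
    private static void ReadFully(Stream s, byte[] buffer, int count)
    {
        int read = 0;
        while (read < count)
        {
            int n = s.Read(buffer, read, count - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }
    }
}
```

Calling TailSketch.LastLines(path, 10, Encoding.UTF8) would give the last ten lines; only the final few chunks of the file are touched when the lines are short.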

Answer 3:

Tail? Tail is a unix command that will display the last few lines of a file. There is a Windows version in the Windows 2003 Server resource kit.

Answer 4:

As the others have suggested, you can go to the end of the file and read backwards, effectively. However, it's slightly tricky - particularly because if you have a variable-length encoding (such as UTF-8) you need to be cunning about making sure you get "whole" characters.
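
To illustrate the "whole characters" point for UTF-8 specifically: continuation bytes always have the bit pattern 10xxxxxx, so after landing at an arbitrary byte offset you can back up until you reach a byte that is not a continuation byte. The sketch below assumes a seekable stream containing valid UTF-8; Utf8Align and its method names are invented for this example.

```csharp
using System;
using System.IO;

public static class Utf8Align
{
    // In UTF-8 every continuation byte looks like 10xxxxxx; any other byte starts a character.
    public static bool IsContinuationByte(int b)
    {
        return (b & 0xC0) == 0x80;
    }

    // Moves position backwards until it points at the first byte of a character.
    // For valid UTF-8 this steps back at most three bytes.
    public static long AlignToCharStart(Stream stream, long position)
    {
        while (position > 0 && position < stream.Length)
        {
            stream.Seek(position, SeekOrigin.Begin);
            if (!IsContinuationByte(stream.ReadByte())) break;
            position--;
        }
        return position;
    }
}
```

Searching for the byte 0x0A is safe without this step, because in UTF-8 an ASCII byte never appears inside a multi-byte character; the alignment only matters when you start decoding from an offset you picked by byte count.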

Answer 5:

You should be able to use FileStream.Seek() to move to the end of the file, then work your way backwards, looking for \n until you have enough lines.

Answer 6:

I'm not sure how efficient it will be, but in Windows PowerShell getting the last ten lines of a file is as easy as

Get-Content file.txt | Select-Object -last 10

Answer 7:

That is what the Unix tail command does. See http://en.wikipedia.org/wiki/Tail_(Unix)

There are lots of open source implementations on the internet, and here is one for Win32: Tail for Win32

Answer 8:

I think the following code will solve the problem, with subtle changes regarding encoding:

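using (StreamReader reader = new StreamReader(@"c:\test.txt")) // pick the appropriate Encoding
{
    Stream stream = reader.BaseStream;
    stream.Seek(0, SeekOrigin.End);
    int count = 0;
    // Walk backwards one byte at a time, counting line feeds.
    // Note: a line feed at the very end of the file counts as one of the ten.
    while ((count < 10) && (stream.Position > 0))
    {
        stream.Position--;
        int c = stream.ReadByte(); // reading moves the position forward again
        stream.Position--;         // so step back over the byte just read
        if (c == '\n')
        {
            ++count;
        }
    }
    // If ten line feeds were found, the stream sits on the tenth; skip it.
    if (count == 10)
        stream.Position++;
    string str = reader.ReadToEnd();
    string[] arr = str.Replace("\r", "").Split('\n');
}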

Answer 9:

You could use the Windows version of the tail command and just pipe its output to a text file with the > symbol, or view it on the screen, depending on what your needs are.

Answer 10:

Here is my version. HTH

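using (StreamReader sr = new StreamReader(path))
{
  long length = sr.BaseStream.Length;

  int c;
  int count = 0;
  long pos = -1;

  // Scan backwards from the end, but never seek past the start of the file.
  while(count < 10 && -pos <= length)
  {
    sr.BaseStream.Seek(pos, SeekOrigin.End);
    c = sr.Read();
    sr.DiscardBufferedData();

    if(c == '\n')
      ++count;
    --pos;
  }

  if(count == 10)
    sr.BaseStream.Seek(pos + 2, SeekOrigin.End);  // first byte after the 10th line feed
  else
    sr.BaseStream.Seek(0, SeekOrigin.Begin);      // fewer than 10 line feeds: take the whole file
  sr.DiscardBufferedData();

  string str = sr.ReadToEnd();
  string[] arr = str.Split('\n');
}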

Answer 11:

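If you open the file with FileMode.Append it will seek to the end of the file for you, but that mode only works with FileAccess.Write, so you cannot read from the stream or seek back before the end. To read, open the file with FileMode.Open, seek to the end yourself, then seek back the number of bytes you want and read them. It might not be fast though regardless of what you do since that's a pretty massive file.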

Answer 12:

One useful method is FileInfo.Length. It gives the size of a file in bytes.

What structure is your file? Are you sure the last 10 lines will be near the end of the file? If you have a file with 12 lines of text and 10GB of 0s, then looking at the end won't really be that fast. Then again, you might have to look through the whole file.

If you are sure that the file contains numerous short strings each on a new line, seek to the end, then check back until you've counted 11 end of lines. Then you can read forward for the next 10 lines.

Answer 13:

I think the other posters have all shown that there is no real shortcut.

You can either use a tool such as tail (or powershell) or you can write some dumb code that seeks end of file and then looks back for n newlines.

There are plenty of implementations of tail out there on the web - take a look at the source code to see how they do it. Tail is pretty efficient (even on very very large files) and so they must have got it right when they wrote it!

Answer 14:

Open the file and start reading lines. After you've read 10 lines open another pointer, starting at the front of the file, so the second pointer lags the first by 10 lines. Keep reading, moving the two pointers in unison, until the first reaches the end of the file. Then use the second pointer to read the result. It works with any size file including empty and shorter than the tail length. And it's easy to adjust for any length of tail. The drawback, of course, is that you end up reading the entire file and that may be exactly what you're trying to avoid.

Answer 15:

If you have a file with an even format per line (such as output from a DAQ system), you can use StreamReader to get the length of the file, then read one of the lines (ReadLine()).

Divide the total length by the length of that line. Now you have a long that approximates the number of lines in the file.

The key is to call ReadLine() once before reading the data into your array or whatever. This ensures that you start at the beginning of a new line and don't pick up leftover data from the previous one.

StreamReader leader = new StreamReader(GetReadFile);
leader.BaseStream.Position = 0;
StreamReader follower = new StreamReader(GetReadFile);

int count = 0;
string tmper = null;
while (count <= 12)
{
    tmper = leader.ReadLine();
    count++;
}

long total = follower.BaseStream.Length; // get total length of file
long step = tmper.Length; // get length of 1 line
long size = total / step; // divide to get number of lines
long go = step * (size - 12); // get the bit location

long cut = follower.BaseStream.Seek(go, SeekOrigin.Begin); // Go to that location
follower.BaseStream.Position = go;

string led = null;
string[] lead = null ;
List<string[]> samples = new List<string[]>();

follower.ReadLine();

while (!follower.EndOfStream)
{
    led = follower.ReadLine();
    lead = Tokenize(led);
    samples.Add(lead);
}
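
The arithmetic behind this can be written more compactly. The following is only a sketch under the assumption that every record, including its line terminator, occupies exactly recordBytes bytes in a single-byte encoding; FixedWidthTail, OffsetOfLastRecords and ReadLastRecords are hypothetical names, not part of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class FixedWidthTail
{
    // Byte offset at which the last `wanted` records start, given that each record
    // (line plus terminator) is exactly recordBytes long.
    public static long OffsetOfLastRecords(long fileLength, int recordBytes, int wanted)
    {
        long totalRecords = fileLength / recordBytes;       // whole records in the file
        long skip = Math.Max(0, totalRecords - wanted);     // records to jump over
        return skip * recordBytes;
    }

    public static List<string> ReadLastRecords(string path, int recordBytes, int wanted)
    {
        var lines = new List<string>();
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            // Jump straight to the computed offset; the reader starts from there.
            fs.Seek(OffsetOfLastRecords(fs.Length, recordBytes, wanted), SeekOrigin.Begin);
            using (var reader = new StreamReader(fs, Encoding.ASCII))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                    lines.Add(line);
            }
        }
        return lines;
    }
}
```

Because the offset is always a multiple of the record size, the read starts exactly on a line boundary and no extra ReadLine() is needed to discard a partial line.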

Answer 16:

Using Sisutil's answer as a starting point, you could read the file line by line and load them into a Queue<String>. It does read the file from the start, but it has the virtue of not trying to read the file backwards. This can be really difficult if you have a file with a variable character width encoding like UTF-8 as Jon Skeet pointed out. It also doesn't make any assumptions about line length.

I tested this against a 1.7GB file (didn't have a 10GB one handy) and it took about 14 seconds. Of course, the usual caveats apply when comparing load and read times between computers.

int numberOfLines = 10;
string fullFilePath = @"C:\Your\Large\File\BigFile.txt";
var queue = new Queue<string>(numberOfLines);

using (FileStream fs = File.Open(fullFilePath, FileMode.Open, FileAccess.Read, FileShare.Read)) 
using (BufferedStream bs = new BufferedStream(fs))  // May not make much difference.
using (StreamReader sr = new StreamReader(bs)) {
    while (!sr.EndOfStream) {
        if (queue.Count == numberOfLines) {
            queue.Dequeue();
        }

        queue.Enqueue(sr.ReadLine());
    }
}

// The queue now has our set of lines. So print to console, save to another file, etc.
while (queue.Count > 0) {
    Console.WriteLine(queue.Dequeue());
}

Answer 17:

I just had the same problem: a huge log file that should be accessed via a REST interface. Of course, loading it all into memory and sending it complete via HTTP was no solution.

As Jon pointed out, this solution has a very specific use case. In my case, I know for sure (and check) that the encoding is UTF-8 (with BOM!) and thus can profit from all the blessings of UTF. It is surely not a general-purpose solution.

Here is what worked for me extremely well and fast (I forgot to close the stream - fixed now):

    private string tail(StreamReader streamReader, long numberOfBytesFromEnd)
    {
        Stream stream = streamReader.BaseStream;
        long length = streamReader.BaseStream.Length;
        if (length < numberOfBytesFromEnd)
            numberOfBytesFromEnd = length;
        stream.Seek(numberOfBytesFromEnd * -1, SeekOrigin.End);

        int LF = '\n';
        int CR = '\r';
        bool found = false;

        while (!found) {
            int c = stream.ReadByte();
            // Stop at the first line feed, or at end of stream (-1) so we never loop forever.
            if (c == LF || c == -1)
                found = true;
        }

        string readToEnd = streamReader.ReadToEnd();
        streamReader.Close();
        return readToEnd;
    }

We first seek to somewhere near the end with the BaseStream, and once we have the right stream position, read to the end with the usual StreamReader.

This doesn't really let you specify the number of lines from the end, which is not a good idea anyway, as the lines could be arbitrarily long and thus kill the performance again. So I specify the number of bytes, read until we get to the first newline, and then comfortably read to the end. Theoretically, you could also look for the carriage return, but in my case that was not necessary.

If we use this code, it will not disturb a writer thread:

        FileStream fileStream = new FileStream(
            filename,
            FileMode.Open,
            FileAccess.Read,
            FileShare.ReadWrite);

        StreamReader streamReader = new StreamReader(fileStream);

Answer 18:

In case you need to read any number of lines in reverse from a text file, here's a LINQ-compatible class you can use. It focuses on performance and support for large files. You could read several lines and call Reverse() to get the last several lines in forward order:

Usage:

var reader = new ReverseTextReader(@"C:\Temp\ReverseTest.txt");
string line;
// ReadLine() returns null once the beginning of the file has been reached.
while ((line = reader.ReadLine()) != null)
    Console.WriteLine(line);

ReverseTextReader Class:

/// <summary>
/// Reads a text file backwards, line-by-line.
/// </summary>
/// <remarks>This class uses file seeking to read a text file of any size in reverse order.  This
/// is useful for needs such as reading a log file newest-entries first.</remarks>
public sealed class ReverseTextReader : IEnumerable<string>
{
    private const int BufferSize = 16384;   // The number of bytes read from the underlying stream.
    private readonly Stream _stream;        // Stores the stream feeding data into this reader
    private readonly Encoding _encoding;    // Stores the encoding used to process the file
    private byte[] _leftoverBuffer;         // Stores the leftover partial line after processing a buffer
    private readonly Queue<string> _lines;  // Stores the lines parsed from the buffer

    #region Constructors

    /// <summary>
    /// Creates a reader for the specified file.
    /// </summary>
    /// <param name="filePath"></param>
    public ReverseTextReader(string filePath)
        : this(new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read), Encoding.Default)
    { }

    /// <summary>
    /// Creates a reader using the specified stream.
    /// </summary>
    /// <param name="stream"></param>
    public ReverseTextReader(Stream stream)
        : this(stream, Encoding.Default)
    { }

    /// <summary>
    /// Creates a reader using the specified path and encoding.
    /// </summary>
    /// <param name="filePath"></param>
    /// <param name="encoding"></param>
    public ReverseTextReader(string filePath, Encoding encoding)
        : this(new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read), encoding)
    { }

    /// <summary>
    /// Creates a reader using the specified stream and encoding.
    /// </summary>
    /// <param name="stream"></param>
    /// <param name="encoding"></param>
    public ReverseTextReader(Stream stream, Encoding encoding)
    {          
        _stream = stream;
        _encoding = encoding;
        _lines = new Queue<string>(128);            
        // The stream needs to support seeking for this to work
        if(!_stream.CanSeek)
            throw new InvalidOperationException("The specified stream needs to support seeking to be read backwards.");
        if (!_stream.CanRead)
            throw new InvalidOperationException("The specified stream needs to support reading to be read backwards.");
        // Set the current position to the end of the file
        _stream.Position = _stream.Length;
        _leftoverBuffer = new byte[0];
    }

    #endregion

    #region Overrides

    /// <summary>
    /// Reads the next previous line from the underlying stream.
    /// </summary>
    /// <returns></returns>
    public string ReadLine()
    {
        // Are there lines left to read? If so, return the next one
        if (_lines.Count != 0) return _lines.Dequeue();
        // Are we at the beginning of the stream? If so, we're done
        if (_stream.Position == 0) return null;

        #region Read and Process the Next Chunk

        // Remember the current position
        var currentPosition = _stream.Position;
        var newPosition = currentPosition - BufferSize;
        // Are we before the beginning of the stream?
        if (newPosition < 0) newPosition = 0;
        // Calculate the buffer size to read
        var count = (int)(currentPosition - newPosition);
        // Set the new position
        _stream.Position = newPosition;
        // Make a new buffer but append the previous leftovers
        var buffer = new byte[count + _leftoverBuffer.Length];
        // Read the next buffer
        _stream.Read(buffer, 0, count);
        // Move the position of the stream back
        _stream.Position = newPosition;
        // And copy in the leftovers from the last buffer
        if (_leftoverBuffer.Length != 0)
            Array.Copy(_leftoverBuffer, 0, buffer, count, _leftoverBuffer.Length);
        // Look for CrLf delimiters
        var end = buffer.Length - 1;
        var start = buffer.Length - 2;
        // Search backwards for a line feed
        while (start >= 0)
        {
            // Is it a line feed?
            if (buffer[start] == 10)
            {
                // Yes.  Extract a line and queue it (but exclude the \r\n)
                _lines.Enqueue(_encoding.GetString(buffer, start + 1, end - start - 2));
                // And reset the end
                end = start;
            }
            // Move to the previous character
            start--;
        }
        // What's left over is a portion of a line. Save it for later.
        _leftoverBuffer = new byte[end + 1];
        Array.Copy(buffer, 0, _leftoverBuffer, 0, end + 1);
        // Are we at the beginning of the stream?
        if (_stream.Position == 0)
            // Yes.  Add the last line.
            _lines.Enqueue(_encoding.GetString(_leftoverBuffer, 0, end - 1));

        #endregion

        // If we have something in the queue, return it
        return _lines.Count == 0 ? null : _lines.Dequeue();
    }

    #endregion

    #region IEnumerator<string> Interface

    public IEnumerator<string> GetEnumerator()
    {
        string line;
        // So long as the next line isn't null...
        while ((line = ReadLine()) != null)
            // Read and return it.
            yield return line;
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        // Delegate to the generic enumerator so non-generic enumeration also works.
        return GetEnumerator();
    }

    #endregion
}

Answer 19:

Using PowerShell, Get-Content big_file_name.txt -Tail 10 where 10 is the number of bottom lines to retrieve.

This has no performance problems. I ran it on a text file that is over 100 GB and got an instant result.

Answer 20:

I used this code for a small utility some time ago. I hope it can help you!

private string ReadRows(int offset)     /*offset: how many lines it reads from the end (10 in your case)*/
{
    /*no lines to read*/
    if (offset == 0)
        return string.Empty;

    using (FileStream fs = new FileStream(FullName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 2048, true))
    {
        List<char> charBuilder = new List<char>(); /*StringBuilder doesn't work with Encoding: example char 𐍈 */
        StringBuilder sb = new StringBuilder();

        int count = 0;

        /*tested with utf8 file encoded by notepad-pp; other encoding may not work*/

        var decoder = ReaderEncoding.GetDecoder();
        byte[] buffer;
        int bufferLength;

        fs.Seek(0, SeekOrigin.End);

        while (true)
        {
            bufferLength = 1;
            buffer = new byte[1];

            /*for encoding with variable byte size, every time I read a byte that is part of the character and not an entire character the decoder returns '�' (invalid character) */

            char[] chars = { '�' }; //� 65533
            int iteration = 0;

            while (chars.Contains('�'))
            {
                /*at every iteration that does not produce character, buffer get bigger, up to 4 byte*/
                if (iteration > 0)
                {
                    bufferLength = buffer.Length + 1;

                    byte[] newBuffer = new byte[bufferLength];

                    Array.Copy(buffer, newBuffer, bufferLength - 1);

                    buffer = newBuffer;
                }

                /*there are no characters with more than 4 bytes in utf-8*/
                if (iteration > 4)
                    throw new Exception();


                /*if all is ok, the last seek return IOError with chars = empty*/
                try
                {
                    fs.Seek(-(bufferLength), SeekOrigin.Current);
                }
                catch
                {
                    chars = new char[] { '\0' };
                    break;
                }

                fs.Read(buffer, 0, bufferLength);

                var charCount = decoder.GetCharCount(buffer, 0, bufferLength);
                chars = new char[charCount];

                decoder.GetChars(buffer, 0, bufferLength, chars, 0);

                ++iteration;
            }

            /*when i get a char*/
            charBuilder.InsertRange(0, chars);

            if (chars.Length > 0 && chars[0] == '\n')
                ++count;

            /*exit when i get the correctly number of line (*last row is in interval)*/
            if (count == offset + 1)
                break;

            /*the first search goes back, the reading goes on then we come back again, except the last */
            try
            {
                fs.Seek(-(bufferLength), SeekOrigin.Current);
            }
            catch (Exception)
            {
                break;
            }

        }
    }

    /*everything must be reversed, but not \0*/
    charBuilder.RemoveAt(0);

    /*yuppi!*/
    return new string(charBuilder.ToArray());
}

I attach a screen for the speed

Answer 21:

Why not use File.ReadAllLines, which returns a string[]?

Then you can get the last 10 lines (or members of the array) which would be a trivial task.

This approach isn't taking into account any encoding issues and I'm not sure on the exact efficiency of this approach (time taken to complete method, etc).

Source: https://stackoverflow.com/questions/398378/get-last-10-lines-of-very-large-text-file-10gb

Tags

text

large-files