I have a huge text file with 25k lines. Inside that text file each line starts with "1 \t (linenumber)"
Example:
1 1 ITEM_ETC_GOLD_01 골드(소)
If you are going to be looking up a lot of different lines from the file (but not all), then you may get some benefit from building an index as you go. Use any of the suggestions that are already here, but as you go along build up an array of byte-offsets for any lines that you have already located so that you can save yourself from re-scanning the file from the beginning each time.
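A sketch of that idea (the class name IndexedLineReader and the byte-based reading are my own choices, not from any library): remember the byte offset of each line start the first time you pass it, so a later lookup can Seek straight to the nearest known line instead of rescanning from the top. It reads raw bytes rather than using a StreamReader, because a StreamReader buffers ahead and would make the stream's Position useless as a line offset.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public class IndexedLineReader : IDisposable {
    private readonly FileStream stream;
    // offsets[i] = byte offset where line i starts; line 0 starts at offset 0
    private readonly List<long> offsets = new List<long> { 0 };

    public IndexedLineReader(string path) {
        stream = File.OpenRead(path);
    }

    // Returns the zero-based line, or null if the file is shorter than that.
    public string GetLine(int lineIndex) {
        // start from the nearest line start we have already recorded
        int start = Math.Min(lineIndex, offsets.Count - 1);
        stream.Seek(offsets[start], SeekOrigin.Begin);
        string line = null;
        for (int i = start; i <= lineIndex; i++) {
            line = ReadLineBytes();
            if (line == null) return null;
            if (i + 1 == offsets.Count) offsets.Add(stream.Position);
        }
        return line;
    }

    // Byte-by-byte line read, so stream.Position stays an accurate offset.
    private string ReadLineBytes() {
        List<byte> bytes = new List<byte>();
        int b;
        while ((b = stream.ReadByte()) != -1 && b != '\n') {
            bytes.Add((byte)b);
        }
        if (b == -1 && bytes.Count == 0) return null;
        if (bytes.Count > 0 && bytes[bytes.Count - 1] == '\r') {
            bytes.RemoveAt(bytes.Count - 1);
        }
        return Encoding.UTF8.GetString(bytes.ToArray());
    }

    public void Dispose() {
        stream.Dispose();
    }
}
```

The first call to GetLine(n) scans forward once; any later call for a line up to n is a single Seek plus one line read.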
ADDENDUM:
There is one more way you can do it fast if you only need the occasional 'random' line, but at the cost of a more complicated search (If Jon's answer is fast enough, I'd definitely stick with that for simplicity's sake).
You could do a 'binary search', by starting to look halfway down the file for the start of a line (in this format, a newline followed by the "1 \t" prefix); the first occurrence you find tells you roughly what line number you have landed on. Then, based on where the line you are looking for sits relative to that number, you keep splitting recursively.
For extra performance, you could also assume that the lines are roughly the same length and have the algorithm 'guess' the approximate position of the line you are looking for relative to the total number of lines in the file, then perform the search from there onwards. If you do not want to make assumptions about the length of the file, you can even make it self-priming by splitting in half first and using the line number it finds there as an approximation of how many lines the file contains as a whole.
Definitely not trivial to implement, but if you have a lot of random access in files with a large number of lines, it may pay off in performance gains.
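For what it's worth, here is a rough sketch of such a binary search (my own illustration, assuming tab-separated lines with the line number in the second column, as in the question's sample; the boundary handling is exactly the non-trivial part):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LineBinarySearch {
    // Binary search over a file sorted by the line number embedded in each
    // line ("1\t<n>\t..."). Returns the matching line, or null if absent.
    public static string FindLine(string path, int target) {
        using (FileStream fs = File.OpenRead(path)) {
            long lo = 0, hi = fs.Length - 1;
            while (lo <= hi) {
                long mid = lo + (hi - lo) / 2;
                long lineStart = FirstLineStartAtOrAfter(fs, mid);
                if (lineStart < 0 || lineStart >= fs.Length) {
                    hi = mid - 1; // mid is inside the file's trailing bytes
                    continue;
                }
                string line = ReadLineAt(fs, lineStart);
                int number = int.Parse(line.Split('\t')[1]);
                if (number == target) return line;
                if (number < target) lo = fs.Position; // start of the next line
                else hi = mid - 1;                     // target starts before mid
            }
            return null;
        }
    }

    // Byte offset of the first line starting at or after 'offset', -1 at EOF.
    private static long FirstLineStartAtOrAfter(FileStream fs, long offset) {
        if (offset == 0) return 0;
        fs.Seek(offset - 1, SeekOrigin.Begin);
        int b;
        while ((b = fs.ReadByte()) != -1) {
            if (b == '\n') return fs.Position;
        }
        return -1;
    }

    private static string ReadLineAt(FileStream fs, long start) {
        fs.Seek(start, SeekOrigin.Begin);
        List<byte> bytes = new List<byte>();
        int b;
        while ((b = fs.ReadByte()) != -1 && b != '\n') {
            if (b != '\r') bytes.Add((byte)b);
        }
        return Encoding.UTF8.GetString(bytes.ToArray());
    }
}
```

Each probe seeks to the midpoint, skips to the next line start, and parses the embedded number, so a 25k-line file is resolved in around 15 probes instead of thousands of ReadLine calls.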
If you are dealing with a fixed-width data format (i.e. you know all the lines are the same length), you can multiply the line length by your desired line number and use Stream.Seek to find the start point of the nth line.
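For example (a sketch, assuming single-byte characters and that lineLength includes the line terminator):

```csharp
using System;
using System.IO;
using System.Text;

public static class FixedWidthReader {
    // Line n starts at byte n * lineLength, so we can seek straight to it.
    public static string ReadLine(string path, int lineIndex, int lineLength) {
        using (FileStream fs = File.OpenRead(path)) {
            fs.Seek((long)lineIndex * lineLength, SeekOrigin.Begin);
            byte[] buffer = new byte[lineLength];
            int read = fs.Read(buffer, 0, lineLength);
            return Encoding.ASCII.GetString(buffer, 0, read).TrimEnd('\r', '\n');
        }
    }
}
```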
If the lines are not fixed length, you need to find the right number of line breaks until you are at the beginning of the line you want. That is easiest done with StreamReader.ReadLine. (You can make an extension method to expose the file as an IEnumerable<string>, as Jon Skeet suggests - this gets you nicer syntax, but under the hood you will still be using ReadLine).
If performance is an issue, it might be (a little bit) more efficient to scan for <CR><LF> byte sequences in the file manually using the Stream.Read method. I haven't tested that; but the StreamReader obviously needs to do some work to construct a string out of the byte sequence - if you don't care about the earlier lines, that work can be saved, so theoretically you should be able to make a scanning method that performs better. This would be a lot more work for you, however.
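A rough sketch of what that scanning method could look like (my own illustration: it counts '\n' bytes in 8 KB chunks, so no strings are built for the skipped lines, and only the target line is decoded):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class NewlineScanner {
    // Returns the zero-based line, or null if the file is shorter than that.
    public static string ReadLineByScanning(string path, int lineIndex) {
        using (FileStream fs = File.OpenRead(path)) {
            byte[] buffer = new byte[8192];
            int newlines = 0, read;
            while (newlines < lineIndex && (read = fs.Read(buffer, 0, buffer.Length)) > 0) {
                for (int i = 0; i < read; i++) {
                    if (buffer[i] == (byte)'\n' && ++newlines == lineIndex) {
                        // rewind to just after the newline ending line lineIndex-1
                        fs.Seek(i + 1 - read, SeekOrigin.Current);
                        break;
                    }
                }
            }
            if (newlines < lineIndex) return null; // file has fewer lines

            // decode only the line we actually want
            List<byte> bytes = new List<byte>();
            int b;
            while ((b = fs.ReadByte()) != -1 && b != '\n') {
                if (b != '\r') bytes.Add((byte)b);
            }
            return Encoding.UTF8.GetString(bytes.ToArray());
        }
    }
}
```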
You can't jump directly to a line in a text file unless every line is fixed-width and you are using a fixed-width encoding (i.e. not UTF-8, which is now one of the most common).
The only way to do it is to read lines and discard the ones you don't want.
Alternatively, you might put an index at the top of the file (or in an external file) that tells it (for example) that line 1000 starts at byte offset [x], line 2000 starts at byte offset [y], etc. Then use .Position or .Seek() on the FileStream to move to the nearest indexed point, and walk forwards.
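A sketch of that scheme, kept in memory rather than at the top of the file (the stride of 100 lines is an arbitrary choice of mine):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class SparseLineIndex {
    private const int Stride = 100; // record the offset of every 100th line

    // One pass over the file, recording where every Stride-th line starts.
    public static List<long> Build(string path) {
        List<long> offsets = new List<long>();
        using (FileStream fs = File.OpenRead(path)) {
            if (fs.Length > 0) offsets.Add(0); // line 0
            int lineNumber = 0;
            int b;
            while ((b = fs.ReadByte()) != -1) {
                if (b == '\n') {
                    lineNumber++;
                    if (lineNumber % Stride == 0 && fs.Position < fs.Length) {
                        offsets.Add(fs.Position);
                    }
                }
            }
        }
        return offsets;
    }

    // Seek to the nearest indexed line at or before lineIndex, walk forward.
    public static string GetLine(string path, List<long> index, int lineIndex) {
        using (FileStream fs = File.OpenRead(path)) {
            fs.Seek(index[lineIndex / Stride], SeekOrigin.Begin);
            using (StreamReader reader = new StreamReader(fs)) {
                string line = null;
                for (int i = lineIndex % Stride; i >= 0; i--) {
                    line = reader.ReadLine();
                }
                return line;
            }
        }
    }
}
```

With this, any lookup reads at most Stride - 1 unwanted lines, and the index for a 25k-line file is only a few hundred longs.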
Assuming the simplest approach (no index), the code in Jon's example should work fine. If you don't want LINQ, you can knock up something similar in .NET 2.0 + C# 2.0:
// to read multiple lines in a block
public static IEnumerable<string> ReadLines(
        string path, int lineIndex, int count) {
    if (string.IsNullOrEmpty(path)) throw new ArgumentNullException("path");
    if (lineIndex < 0) throw new ArgumentOutOfRangeException("lineIndex");
    if (count < 0) throw new ArgumentOutOfRangeException("count");
    using (StreamReader reader = File.OpenText(path)) {
        string line;
        while (count > 0 && (line = reader.ReadLine()) != null) {
            if (lineIndex > 0) {
                lineIndex--; // skip
                continue;
            }
            count--;
            yield return line;
        }
    }
}

// to read a single line
public static string ReadLine(string path, int lineIndex) {
    foreach (string line in ReadLines(path, lineIndex, 1)) {
        return line;
    }
    throw new IndexOutOfRangeException();
}
If you need to test values of the line (rather than just line index), then that is easy enough to do too; just tweak the iterator block.
You can use my LineReader class (either the one in MiscUtil or a simple version here) to implement IEnumerable<string>, and then use LINQ:
string line5 = new LineReader(file).Skip(4).First();
This assumes .NET 3.5, admittedly. Otherwise, open a TextReader (e.g. with File.OpenText) and just call ReadLine() four times to skip the lines you don't want, and then once more to read the fifth line.
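In other words, something like this sketch for the pre-3.5 route:

```csharp
using System.IO;

public static class FifthLine {
    // Skip four lines, then read the fifth (null if the file is shorter).
    public static string Read(string path) {
        using (TextReader reader = File.OpenText(path)) {
            for (int i = 0; i < 4; i++) {
                reader.ReadLine(); // discard lines 1-4
            }
            return reader.ReadLine();
        }
    }
}
```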
There's no way of "shortcutting" this unless you know exactly how many bytes are in each line.
If you need to be able to jump to line 24,000, a function that calls ReadLine() in the background will be a bit slow.
If the line number is high, you may want to make an educated guess as to where in the file the line may be and start reading from there. That way, to get to line 24,567, you don't have to read 24,566 lines first. You can skip to somewhere in the middle, find out which line you are on from the number after the \t, and then count from there.
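A sketch of that guessing approach (my own illustration; it assumes tab-separated lines with the line number in the second column, starts the guess at the middle of the file, and simply backs the guess off when it overshoots):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class GuessingReader {
    // Returns the line whose embedded number equals target, or null.
    public static string FindLine(string path, int target) {
        using (FileStream fs = File.OpenRead(path)) {
            long guess = fs.Length / 2; // could also be target * avg line length
            while (true) {
                fs.Seek(guess, SeekOrigin.Begin);
                SkipToNextLineStart(fs);
                string line = ReadOneLine(fs);
                int n = line == null ? int.MaxValue : int.Parse(line.Split('\t')[1]);
                if (n > target) {
                    if (guess == 0) return null; // even the first line is past it
                    guess /= 2;                  // overshot: back the guess off
                    continue;
                }
                // landed at or before the target: count forward from here
                while (n < target && (line = ReadOneLine(fs)) != null) {
                    n = int.Parse(line.Split('\t')[1]);
                }
                return n == target ? line : null;
            }
        }
    }

    private static void SkipToNextLineStart(FileStream fs) {
        if (fs.Position == 0) return; // already at a line start
        int b;
        while ((b = fs.ReadByte()) != -1 && b != '\n') { }
    }

    private static string ReadOneLine(FileStream fs) {
        List<byte> bytes = new List<byte>();
        int b;
        while ((b = fs.ReadByte()) != -1 && b != '\n') {
            if (b != '\r') bytes.Add((byte)b);
        }
        if (b == -1 && bytes.Count == 0) return null;
        return Encoding.UTF8.GetString(bytes.ToArray());
    }
}
```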
A while back I worked with a dev who had to build a DB before RDBMSs were common. His solution to your problem was similar to what I just described, but in his case he kept a map in a separate file. The map linked every hundredth line to its location in the document. A map like this can be loaded very quickly, and it can reduce read times considerably. At the time his system was very fast and efficient for read-only data, but not very good for read/write data (every time the lines change you have to rebuild the whole map, which is not very efficient).