Problem: A very, very large file that I need to parse line by line to get 3 values from each line. Everything works, but it takes a long time to parse through the whole file. Is it poss
Right now, you recreate your Regex each time you call GetRateLine, which occurs every time you read a line.
If you create a Regex instance once in advance, and then use the non-static Match method, you will save on regex compilation time, which could potentially give you a speed gain.
That being said, it will likely not take you from minutes to seconds...
At a brief glance there are a few things I would try...
First, increase your file stream buffer to at least 64 KB:
using (var sr = new StreamReader(inputFile, Encoding.UTF8, true, 65536))
Second, construct the Regex once instead of passing a pattern string inside the loop:
static readonly Regex rateExpression = new Regex(@"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$", RegexOptions.IgnoreCase);
//In GetRateLine() change to:
Match match = rateExpression.Match(justALine);
Third, use a single list instance by having Responder.GetRate() return a list or array.
// replace: 'rp.GetRate(rate)', with:
rate = rp.GetRate();
I would preallocate the list to a 'reasonable' limit:
List<int> rate = new List<int>(10000);
You might also consider changing your encoding from UTF-8 to ASCII if available and applicable to your specific needs.
Comments
Generally, if getting the parse time down really is a requirement, you are going to want to build a tokenizer and skip Regex entirely. Since your input format looks to be all ASCII and fairly simple, this should be easy enough to do, but it will probably be a little more brittle than a regex. In the end you will need to weigh the need for speed against the reliability and maintainability of the code.
If you need an example of by-hand parsing, look at the answer to this question
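As a rough illustration of what by-hand parsing could look like, here is a minimal sketch. It assumes lines resemble a common access-log layout such as: 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 1534. That layout, and the names LogLineParser and TryParse, are assumptions made for the example and are not taken from the question. It is also looser than the regex (for instance, it does not check for GET or for the status code), which is part of the brittleness trade-off mentioned above.

```csharp
using System;
using System.Globalization;

public static class LogLineParser
{
    // Hypothetical helper: pulls the bracketed timestamp (without its UTC offset)
    // and the last two space-separated integers out of one log line.
    public static bool TryParse(string line, out string timestamp, out int first, out int second)
    {
        timestamp = string.Empty;
        first = 0;
        second = 0;
        if (string.IsNullOrEmpty(line)) return false;

        // Locate the "[...]" section that holds the timestamp.
        int open = line.IndexOf('[');
        if (open < 0) return false;
        int close = line.IndexOf(']', open + 1);
        if (close < 0) return false;

        // Drop the trailing " -0700" style offset: keep everything before the last space.
        string inside = line.Substring(open + 1, close - open - 1);
        int offsetSpace = inside.LastIndexOf(' ');
        if (offsetSpace <= 0) return false;
        timestamp = inside.Substring(0, offsetSpace);

        // The two values of interest are the last two tokens on the line.
        int lastSpace = line.LastIndexOf(' ');
        if (lastSpace <= close) return false;
        int prevSpace = line.LastIndexOf(' ', lastSpace - 1);
        if (prevSpace <= close) return false;

        string firstText = line.Substring(prevSpace + 1, lastSpace - prevSpace - 1);
        string secondText = line.Substring(lastSpace + 1);

        // NumberStyles.None accepts digits only (no sign, no whitespace).
        return int.TryParse(firstText, NumberStyles.None, CultureInfo.InvariantCulture, out first)
            && int.TryParse(secondText, NumberStyles.None, CultureInfo.InvariantCulture, out second);
    }
}
```

Because the method only uses IndexOf, LastIndexOf and Substring, no regex engine is involved. Whether that is actually faster on your data is something to measure rather than assume.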
Instead of recreating a regex for each call to GetRateLine, create it in advance, passing the RegexOptions.Compiled option to the Regex(String, RegexOptions) constructor.
You may also want to try reading in the entire file to memory, but I doubt that's your bottleneck. It shouldn't take a minute to read in ~100MB from disk.
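A sketch of the precompiled-regex idea, assuming the pattern is the one quoted earlier in this thread. The class name RateLineMatcher and the method MatchLine are made up for the example; in your code the field would live wherever GetRateLine is defined.

```csharp
using System.Text.RegularExpressions;

public static class RateLineMatcher
{
    // Built once per process. RegexOptions.Compiled trades a slower start-up
    // for faster matching, which tends to pay off when matching many lines.
    private static readonly Regex RateExpression = new Regex(
        @"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Hypothetical wrapper: reuse the shared instance for every line.
    public static Match MatchLine(string line)
    {
        return RateExpression.Match(line);
    }
}
```

The caller then checks match.Success and reads match.Groups[1] through match.Groups[3] as before.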
Consider Memory Mapped Files and the Task Parallel Library for help.
Expose the input as an IEnumerable<string>, basically to abstract a set of not-yet-parsed lines.
Run Parse(IEnumerable<string>) as a Task action.
See the Pipelines pattern on MSDN.
Note that this solution requires .NET Framework 4 or later.
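A minimal sketch of the producer/consumer shape of such a pipeline. To stay short it reads from a plain TextReader rather than memory-mapped views, so it only illustrates the Task and IEnumerable<string> part of the suggestion. LinePipeline and Run are hypothetical names, and Parse here is a stand-in that just counts non-empty lines; the real per-line work would go in its place.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class LinePipeline
{
    // Stand-in for the real parsing stage (assumption): counts non-empty lines.
    public static int Parse(IEnumerable<string> lines)
    {
        int count = 0;
        foreach (string line in lines)
        {
            if (line.Length > 0) count++;
        }
        return count;
    }

    public static int Run(TextReader reader)
    {
        // A bounded queue keeps the reader from racing far ahead of the parser.
        using (var queue = new BlockingCollection<string>(boundedCapacity: 10000))
        {
            // Consumer stage: parses lines as they arrive, on a thread-pool thread.
            Task<int> consumer = Task.Factory.StartNew(
                () => Parse(queue.GetConsumingEnumerable()));

            // Producer stage: reads lines on the calling thread.
            try
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    queue.Add(line);
                }
            }
            finally
            {
                // Always signal completion so the consumer's enumeration ends.
                queue.CompleteAdding();
            }

            return consumer.Result;
        }
    }
}
```

For a file, the call would be something like LinePipeline.Run(new StreamReader(inputFile)). A pipeline like this only helps if parsing, not disk I/O, is the bottleneck, so it is worth profiling first.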