Is there a fast way to parse through a large file with regex?

面向向阳花 2021-02-20 17:49

Problem: I need to parse a very large file line by line to get 3 values from each line. Everything works, but it takes a long time to parse through the whole file. Is it poss

4 answers
  • 2021-02-20 18:28

    Right now, you recreate your Regex each time you call GetRateLine, which occurs every time you read a line.

    If you create a Regex instance once in advance, and then use the non-static Match method, you will save on regex compilation time, which could potentially give you a speed gain.

    That being said, it will likely not take you from minutes to seconds...

  • 2021-02-20 18:38

    At a brief glance there are a few things I would try...

    First, increase your file stream buffer to at least 64 KB:

    using (var sr = new StreamReader(inputFile, Encoding.UTF8, true, 65536))
    

    Second, construct the Regex once instead of passing a pattern string inside the loop:

    static readonly Regex rateExpression = new Regex(@"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$", RegexOptions.IgnoreCase);
    //In GetRateLine() change to:
    Match match = rateExpression.Match(justALine);
    

    Third, use a single list instance by having Responder.GetRate() return a list or array.

    // replace: 'rp.GetRate(rate)', with:
    rate = rp.GetRate();
    

    I would preallocate the list to a 'reasonable' limit:

    List<int> rate = new List<int>(10000);
    

    You might also consider changing your encoding from UTF-8 to ASCII if available and applicable to your specific needs.
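
    Put together, the reading side of these suggestions might look like the sketch below. LogReader, ReadValues, and the tryParseLine callback are hypothetical names, not part of the original code, and ASCII is only safe if the file really contains nothing but ASCII.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LogReader
{
    // Hypothetical helper: reads a file line by line with a 64 KB buffer and
    // collects every value that tryParseLine manages to extract.
    public static List<int> ReadValues(string path, Func<string, int?> tryParseLine, int expectedCount)
    {
        // Preallocate so the list does not have to grow repeatedly.
        var values = new List<int>(expectedCount);

        // Assumption: the input is plain ASCII; keep Encoding.UTF8 otherwise.
        using (var sr = new StreamReader(path, Encoding.ASCII, false, 65536))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                int? value = tryParseLine(line);
                if (value.HasValue)
                {
                    values.Add(value.Value);
                }
            }
        }
        return values;
    }
}
```

    The callback keeps the reader independent of how a line is parsed, so the same loop works with a Regex or with hand-written parsing.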

    Comments

    Generally, if getting the parse time down is a hard requirement, you will want to build a tokenizer and skip Regex entirely. Since your input format looks to be all ASCII and fairly simple, this should be easy enough to do, though the result will probably be a little more brittle than a regex. In the end you will need to weigh the need for speed against the reliability and maintainability of the code.

    If you need some example by-hand parsing look at the answer to this question
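
    As a rough illustration of what by-hand parsing could look like here, the sketch below pulls out the bracketed timestamp and the last two numeric fields without any Regex. It assumes an access-log-style line such as the one in the usage note. HandParser and TryParse are hypothetical names, and unlike the regex this version does not check for GET or HTTP.

```csharp
using System;

public static class HandParser
{
    // Hypothetical helper: extracts the bracketed timestamp (without the UTC
    // offset) and the last two space-separated integers of the line.
    public static bool TryParse(string line, out string timestamp, out int first, out int second)
    {
        timestamp = null;
        first = 0;
        second = 0;
        if (string.IsNullOrEmpty(line)) return false;

        // The two trailing fields are found by scanning back from the end.
        int lastSpace = line.LastIndexOf(' ');
        if (lastSpace <= 0) return false;
        int prevSpace = line.LastIndexOf(' ', lastSpace - 1);
        if (prevSpace < 0) return false;
        if (!TryParseDigits(line, prevSpace + 1, lastSpace, out first)) return false;
        if (!TryParseDigits(line, lastSpace + 1, line.Length, out second)) return false;

        // The timestamp sits between '[' and the space before the offset.
        int open = line.IndexOf('[');
        if (open < 0) return false;
        int close = line.IndexOf(']', open + 1);
        if (close < 0) return false;
        int offsetSpace = line.LastIndexOf(' ', close - 1, close - open - 1);
        if (offsetSpace < 0) return false;

        timestamp = line.Substring(open + 1, offsetSpace - open - 1);
        return true;
    }

    // Parses s[start..end) as a non-negative decimal integer without allocating.
    private static bool TryParseDigits(string s, int start, int end, out int value)
    {
        value = 0;
        if (start >= end) return false;
        for (int i = start; i < end; i++)
        {
            int digit = s[i] - '0';
            if (digit < 0 || digit > 9) return false;
            if (value > (int.MaxValue - digit) / 10) return false; // would overflow
            value = value * 10 + digit;
        }
        return true;
    }
}
```

    For a made-up line like 192.168.0.1 - - [20/Feb/2021:17:49:00 -0800] "GET /index.html HTTP/1.1" 200 1234 56, this yields the timestamp 20/Feb/2021:17:49:00 and the values 1234 and 56. Any change to the field layout breaks it, which is the brittleness mentioned above.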

  • 2021-02-20 18:39

    Instead of recreating a regex for each call to GetRateLine, create it in advance, passing the RegexOptions.Compiled option to the Regex(String,RegexOptions) constructor.

    You may also want to try reading in the entire file to memory, but I doubt that's your bottleneck. It shouldn't take a minute to read in ~100MB from disk.
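
    A minimal sketch of the precompiled approach, assuming the same pattern quoted elsewhere in this thread. RateParser, TryParse, and the out parameter names are hypothetical, since the original GetRateLine is not shown in full.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RateParser
{
    // Built once per process. RegexOptions.Compiled costs more at startup
    // but usually makes each individual match faster.
    private static readonly Regex RateExpression = new Regex(
        @"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Hypothetical helper: returns the three captured values for one line.
    public static bool TryParse(string line, out string timestamp, out int first, out int second)
    {
        timestamp = null;
        first = 0;
        second = 0;

        // Instance Match call: no pattern parsing or cache lookup per line.
        Match match = RateExpression.Match(line);
        if (!match.Success) return false;

        timestamp = match.Groups[1].Value;
        return int.TryParse(match.Groups[2].Value, out first)
            && int.TryParse(match.Groups[3].Value, out second);
    }
}
```

    Whether Compiled pays off depends on how many lines are matched; for a file of this size it is worth measuring both ways.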

  • 2021-02-20 18:41

    Memory-mapped files (MMF) and the Task Parallel Library (TPL) can help.

    1. Create a persisted MMF with multiple random-access views. Each view corresponds to a particular part of the file.
    2. Define a parsing method that takes a parameter such as IEnumerable<string>, which abstracts a set of unparsed lines.
    3. Create and start one TPL task per MMF view, with Parse(IEnumerable<string>) as the task action.
    4. Each worker task adds its parsed data to a shared queue of type BlockingCollection.
    5. Another task listens to the BlockingCollection (via GetConsumingEnumerable()) and processes the data already parsed by the worker tasks.

    See the Pipelines pattern on MSDN.

    Note that this solution requires .NET Framework 4 or later.
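
    The producer/consumer half of this idea (steps 3 to 5) can be sketched as follows. This is a simplification: the chunks are plain in-memory line sequences standing in for MMF views, and splitting a memory-mapped file on line boundaries is left out. ParsePipeline, Run, and parseLine are hypothetical names.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ParsePipeline
{
    // Runs one worker task per chunk and a single consumer task that
    // collects the parsed items. Result order across chunks is not guaranteed.
    public static List<TResult> Run<TResult>(
        IEnumerable<IEnumerable<string>> chunks,
        Func<string, TResult> parseLine)
    {
        var results = new List<TResult>();

        // Bounded so fast workers cannot outrun the consumer without limit.
        using (var queue = new BlockingCollection<TResult>(10000))
        {
            // Consumer: drains the queue until CompleteAdding is called.
            Task consumer = Task.Factory.StartNew(() =>
            {
                foreach (TResult item in queue.GetConsumingEnumerable())
                {
                    results.Add(item);
                }
            }, TaskCreationOptions.LongRunning);

            // Workers: each parses its own chunk and enqueues the results.
            Task[] workers = chunks
                .Select(chunk => Task.Factory.StartNew(() =>
                {
                    foreach (string line in chunk)
                    {
                        queue.Add(parseLine(line));
                    }
                }))
                .ToArray();

            try
            {
                Task.WaitAll(workers);
            }
            finally
            {
                // Always release the consumer, even if a worker failed.
                queue.CompleteAdding();
                consumer.Wait();
            }
        }
        return results;
    }
}
```

    Only the consumer touches the results list, so no extra locking is needed there. If the order of lines matters, each item would have to carry its chunk and line position.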
