Is there a fast way to parse through a large file with regex?

面向向阳花 2021-02-20 17:49

Problem: I have a very, very large file that I need to parse line by line to get 3 values from each line. Everything works, but it takes a long time to parse through the whole file. Is it poss

4 answers
  •  礼貌的吻别
    2021-02-20 18:38

    At a brief glance there are a few things I would try...

    First, increase your file stream buffer to at least 64 KB:

    using (var sr = new StreamReader(inputFile, Encoding.UTF8, true, 65536))
    

    Second, construct the Regex once instead of passing the pattern as a string inside the loop:

    static readonly Regex rateExpression = new Regex(@"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$", RegexOptions.IgnoreCase);
    //In GetRateLine() change to:
    Match match = rateExpression.Match(justALine);
    

    Third, use a single list instance by having Responder.GetRate() return a list or array.

    // replace: 'rp.GetRate(rate)', with:
    rate = rp.GetRate();
    

    I would preallocate the list to a 'reasonable' limit:

    List<int> rate = new List<int>(10000); // element type int is an assumption; use the type GetRate() returns
    

    You might also consider changing your encoding from UTF-8 to ASCII if available and applicable to your specific needs.
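
    Put together, the suggestions above might look roughly like the sketch below. The names RateReader and ReadRates are hypothetical, and storing the three captured groups as a string[] is an assumption, since the original GetRate() code is not shown here. Treat it as an illustration of the buffer size, the single Regex instance, and the preallocated list, not as the asker's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper; the class and method names are not from the original question.
static class RateReader
{
    // Constructed once and reused for every line.
    static readonly Regex RateExpression = new Regex(
        @"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$",
        RegexOptions.IgnoreCase);

    // Reads the stream line by line and returns the three captured groups
    // of every line that matches the expression.
    public static List<string[]> ReadRates(Stream input)
    {
        // Preallocate to a 'reasonable' size to avoid repeated list growth.
        var rates = new List<string[]>(10000);

        // 64 KB read buffer (65536 bytes).
        using (var sr = new StreamReader(input, Encoding.UTF8, true, 65536))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                Match match = RateExpression.Match(line);
                if (!match.Success)
                {
                    continue; // skip lines that do not look like a GET entry
                }

                rates.Add(new[]
                {
                    match.Groups[1].Value,
                    match.Groups[2].Value,
                    match.Groups[3].Value
                });
            }
        }

        return rates;
    }
}
```

    For a file on disk, passing File.OpenRead(path) as the stream should work the same way.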

    Comments

    Generally, if getting the parse time down is a hard requirement, you will want to build a tokenizer and skip Regex entirely. Since your input format looks to be all ASCII and fairly simple, this should be easy enough to do, but it will probably be a little more brittle than the regex. In the end you will need to weigh the need for speed against the reliability and maintainability of the code.

    If you need an example of by-hand parsing, look at the answer to this question
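
    As a rough illustration of what by-hand parsing could look like here, the sketch below pulls the same three values out of a line using only index searches. It assumes a layout inferred from the regex above: a bracketed timestamp followed by a zone offset, and two space-separated numbers at the end of the line. LogLineParser and TryParse are hypothetical names. The sketch is looser than the regex, since it does not check for GET, HTTP, or the 3-digit status, so it would need hardening for real input.

```csharp
using System;

// Hypothetical hand-written parser; names and assumed line layout are illustrative.
static class LogLineParser
{
    // Assumed layout: ... [timestamp zone] ... number number
    // Returns false when the line does not fit that layout.
    public static bool TryParse(string line, out string timestamp, out long first, out long second)
    {
        timestamp = string.Empty;
        first = 0;
        second = 0;

        // Locate the bracketed section.
        int open = line.IndexOf('[');
        if (open < 0) return false;
        int close = line.IndexOf(']', open + 1);
        if (close < 0) return false;

        // Split "timestamp zone" at the last space and keep the timestamp.
        string inside = line.Substring(open + 1, close - open - 1);
        int zone = inside.LastIndexOf(' ');
        if (zone < 0) return false;
        timestamp = inside.Substring(0, zone);

        // The last two space-separated tokens are the two numeric values.
        int lastSpace = line.LastIndexOf(' ');
        if (lastSpace <= close) return false;
        int prevSpace = line.LastIndexOf(' ', lastSpace - 1);
        if (prevSpace <= close) return false;

        return long.TryParse(line.Substring(prevSpace + 1, lastSpace - prevSpace - 1), out first)
            && long.TryParse(line.Substring(lastSpace + 1), out second);
    }
}
```

    Whether this beats the regex by enough to justify the extra brittleness is something to measure on the real file.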
