Why is my regex so much slower compiled than interpreted?

花落未央 2021-02-18 18:17

I have a large and complex C# regex that runs OK when interpreted, but is a bit slow. I'm trying to speed this up by setting RegexOptions.Compiled, and this seems

5 answers
  •  旧巷少年郎
    2021-02-18 19:05

    After extensive testing of my own, I can confirm that mikel's suspicions are essentially correct. Even when using Regex.CompileToAssembly() and statically linking the resulting DLL into the application, there is a substantial delay on the first real matching call (at least for patterns involving many ORed alternatives). Moreover, the size of that first-call delay depends on the text you match against. Matching against an empty string or some other arbitrary non-matching text causes a smaller initial delay, but you still pay additional delays later, when actual positive matches are first encountered in new text. The only way to guarantee that all future matches are fast is to force a positive match once at runtime, using text that does match. This gives the largest possible initial delay, in exchange for every later match being fast.
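
    The warm-up idea above can be sketched roughly as follows. This is only an illustration, not code from my tests: the RegexWarmUp class, the WarmUp method and its parameter names are hypothetical, and the caller is assumed to supply a sample string that is known to match the pattern.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the .NET API): pays the first-match cost
// up front by forcing one positive match at startup.
public static class RegexWarmUp
{
    // Runs a single match against text that is expected to match and
    // returns how long that first call took.
    public static TimeSpan WarmUp(Regex regex, string knownMatch)
    {
        Stopwatch timer = Stopwatch.StartNew();
        bool matched = regex.Match(knownMatch).Success;
        timer.Stop();

        // A non-matching sample would leave part of the cost unpaid,
        // so treat it as a caller error.
        if (!matched)
            throw new ArgumentException("Warm-up text must match the pattern.", nameof(knownMatch));

        return timer.Elapsed;
    }
}
```

    At application startup this would be called once per regex, for example RegexWarmUp.WarmUp(myRegex, "beta") for a pattern that contains the alternative "beta". The returned TimeSpan can be logged to see how large the first-call delay is on a given machine.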

    I dug deeper in order to understand this better. For each regex compiled into the assembly, a triplet of classes is written with the following naming template: {RegexName, RegexNameFactoryN, RegexNameRunnerN}. The RegexNameFactoryN class is instantiated in the RegexName ctor, but the RegexNameRunnerN class is not. See the private factory and runnerref fields in the base Regex class; runnerref is a cached weak reference to a RegexNameRunnerN object. After various experiments with reflection, I can confirm that the ctors of all 3 of these compiled classes are fast, and that the RegexNameFactoryN.CreateInstance() function (which returns the initial RegexNameRunnerN reference) is also fast. The initial delay occurs somewhere within RegexRunner.Scan(), or its call tree, and is thus likely outside the reach of the compiled MSIL generated by Regex.CompileToAssembly(), since this call tree involves numerous non-abstract functions. This is very unfortunate and means the performance benefits of C# regex compilation only extend so far: at runtime there will always be some substantial delay the first time a positive match is encountered (at least for this class of many-ORed patterns).
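
    For anyone who wants to poke at the same fields, here is a minimal reflection sketch. It is an assumption-laden illustration rather than the exact code I used: RegexInternals and DescribeField are hypothetical names, and the field names factory and runnerref are the ones mentioned above, which may be absent or named differently on other runtime versions, so the code reports a missing field instead of failing.

```csharp
using System;
using System.Reflection;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the .NET API) for looking at
// non-public fields declared on the base Regex class.
public static class RegexInternals
{
    // Describes the runtime type held in the named field of a Regex instance,
    // or reports that the field does not exist on this runtime.
    public static string DescribeField(Regex regex, string fieldName)
    {
        const BindingFlags flags =
            BindingFlags.Instance | BindingFlags.NonPublic | BindingFlags.Public;

        FieldInfo field = typeof(Regex).GetField(fieldName, flags);
        if (field == null)
            return fieldName + ": not present on this runtime";

        object value = field.GetValue(regex);
        return fieldName + ": " + (value == null ? "null" : value.GetType().FullName);
    }
}
```

    Calling RegexInternals.DescribeField(myRegex, "factory") and RegexInternals.DescribeField(myRegex, "runnerref") before and after the first match is one way to see when each object gets populated.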

    I theorize that this has to do with how the Nondeterministic Finite Automaton (NFA) engine performs some of its own internal caching/instantiations at runtime as the pattern is processed.

    jessehouwing's suggestion of ngen is interesting and could possibly improve performance. I have not tested it.
