Multi-line regex search in whole file

悲哀的现实 2021-02-15 16:47

I've found loads of examples of how to replace text in files using regex. However, it all boils down to two versions:
1. Iterate over all lines in the file and apply regex to

6 Answers
  • 2021-02-15 16:49

    If you don't mind getting your hands a little dirty (and your regex is simple enough, or you have a strong desire for speed and don't mind suffering a bit), you can use Ragel. It can target C#, though the site doesn't mention it. To use it with large files, though, you'll need to wrap a FileStream to provide a buffered indexer, or use a memory-mapped file (with unsafe pointers) in a 64-bit process.
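To illustrate only the memory-mapped-file part of this suggestion, here is a minimal sketch. It does not use Ragel: it swaps in Python's standard `re` and `mmap` modules, chosen for brevity even though the thread appears to be about .NET. The function name `find_multiline_matches` is hypothetical, and the sketch assumes a non-empty file and a bytes pattern.

```python
import mmap
import re


def find_multiline_matches(path, pattern):
    """Return (start, end) byte offsets of every match of a bytes pattern."""
    # DOTALL lets "." cross line breaks, so a match may span several lines.
    regex = re.compile(pattern, re.DOTALL)
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in on demand
        # instead of the program reading everything into a string first.
        # Note: mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [m.span() for m in regex.finditer(mm)]
```

Because the regex engine scans the mapping directly, matches can cross line boundaries without any line-by-line bookkeeping, but the offsets are byte offsets, not character offsets.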

  • 2021-02-15 16:53

    I would say you should pre-parse/normalize the data before doing your replacements so that each line describes one possible set of data that needs to have replacements applied. Otherwise you get into complications with data integrity that cannot really be solved without a host of other difficulties.

    If there is a way to chunk the data into logical blocks, then you could build a program that uses a MapReduce pattern to parse the data.

  • 2021-02-15 16:55

    Perhaps you could load two lines at a time (or more, depending on how many lines you think your matches will span) and overlap them: load lines 1-2, then lines 2-3 in the next loop, then lines 3-4, and run your multi-line regexes over the combined lines in each loop.
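The overlapping-window idea could be sketched roughly as follows, in Python for brevity. The name `search_line_windows` and the `window` parameter are hypothetical. One detail the description leaves open is how to avoid reporting the same match twice in adjacent windows; this sketch assumes it is acceptable to report only matches that start in the first line of each window, and to leave the remaining lines to later windows.

```python
import re
from collections import deque


def search_line_windows(lines, pattern, window=2):
    """Yield (absolute_start, text) for matches spanning at most `window` lines."""
    regex = re.compile(pattern, re.DOTALL)
    buf = deque()   # lines currently in the window
    buf_start = 0   # absolute offset of the first buffered character
    last_end = 0    # absolute end offset of the last reported match

    def scan(final):
        nonlocal last_end
        text = "".join(buf)
        # Except for the last window, only accept matches starting in the
        # first buffered line; later lines are handled as the window slides.
        limit = len(text) if final else len(buf[0])
        # Resume after the previous match so reported matches never overlap.
        for m in regex.finditer(text, max(0, last_end - buf_start)):
            if m.start() >= limit:
                break
            last_end = buf_start + m.end()
            yield buf_start + m.start(), m.group()

    for line in lines:
        buf.append(line)
        if len(buf) == window:
            yield from scan(final=False)
            buf_start += len(buf.popleft())
    if buf:
        yield from scan(final=True)
```

Only `window` lines are held in memory at once. The price is that a match spanning more lines than the window is silently missed, so the window size has to be chosen from what you know about the data.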

  • 2021-02-15 16:59

    I'm with Bart; you really should be using some kind of parser for this.

    Or, if you don't mind spawning a child process, you could just use sed (there's a native port for Windows, or you can use Cygwin).

  • 2021-02-15 17:11

    Here's the Answer:
    There is no easy way

    I found a StreamRegex class that might be able to do what I am looking for.
    From what I could grasp of the algorithm:

    • Start at the beginning of the file with an empty buffer
    • do (
      • add a chunk of the file to the buffer
      • if there is a match in the buffer
        • mark the match
        • drop all data which appeared before the end of the match from the buffer
    • ) while there is still something of the file left

    That way it is not necessary to load the full file, or at least the chances of loading the full file into memory are reduced.
    However, the worst case is that there is no match in the whole file, in which case the full file will be loaded into memory.
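As a rough illustration of the buffering loop described above, here is a sketch in Python. It is not the StreamRegex class itself, whose internals are not shown in this thread; the name `stream_matches` and the `chunk_size` parameter are hypothetical. It assumes the pattern never matches the empty string, and it postpones any match that touches the end of the buffer, since more data might extend it.

```python
import io
import re


def stream_matches(stream, pattern, chunk_size=64 * 1024):
    """Yield (absolute_start, text) for matches while reading chunk by chunk."""
    regex = re.compile(pattern, re.DOTALL)
    buffer = ""
    offset = 0  # absolute position of buffer[0] within the stream
    while True:
        chunk = stream.read(chunk_size)
        eof = not chunk
        buffer += chunk
        consumed = 0
        for m in regex.finditer(buffer):
            if m.end() == len(buffer) and not eof:
                break  # may grow once the next chunk arrives; retry later
            yield offset + m.start(), m.group()
            consumed = m.end()
        # Drop everything up to the end of the last reported match.
        offset += consumed
        buffer = buffer[consumed:]
        if eof:
            break
```

For example, `list(stream_matches(io.StringIO("a /* one\ntwo */ b /* three */ c"), r"/\*.*?\*/", chunk_size=4))` gives `[(2, "/* one\ntwo */"), (18, "/* three */")]`. As the answer notes, a file with no match at all still ends up entirely in the buffer. A further caveat of this sketch is that a pattern with alternatives can match differently on a partial buffer than it would on the whole file.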

  • 2021-02-15 17:12

    Regex is not the way to go, especially not with these large amounts of text. Create a little parser of your own:

    • read the file line by line;
    • for each line:
      • loop through the line char by char keeping track of any opening/closing string literals
      • when you encounter '/*' (and you're not 'inside' a string), store that offset number and loop until you encounter the first '*/' and store that number as well

    That will give you the start and end offsets of all the comment blocks. You should now be able to replace them by creating a temp file and copying the text from the original file into it (writing something else instead whenever you're inside a comment block, of course).
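The scanning steps above might look roughly like this in Python. The function name `find_block_comments` is hypothetical, and the sketch makes simplifying assumptions that are not in the answer: string literals are double-quoted with backslash escapes, and character literals, `//` line comments, and verbatim strings are ignored.

```python
def find_block_comments(lines):
    """Return [(start, end)] absolute offsets of /* ... */ blocks outside strings."""
    spans = []
    offset = 0            # absolute offset of the current line's first character
    in_string = False     # currently inside a double-quoted literal?
    comment_start = None  # absolute offset of an open "/*", or None
    for line in lines:
        i = 0
        while i < len(line):
            two = line[i:i + 2]
            if comment_start is not None:
                if two == "*/":
                    spans.append((comment_start, offset + i + 2))
                    comment_start = None
                    i += 2
                    continue
            elif in_string:
                if line[i] == "\\":
                    i += 2  # skip the escaped character
                    continue
                if line[i] == '"':
                    in_string = False
            elif line[i] == '"':
                in_string = True
            elif two == "/*":
                comment_start = offset + i
                i += 2
                continue
            i += 1
        offset += len(line)
    return spans
```

Only one line is held in memory at a time, and the state (inside a string, inside a comment) carries over from line to line, so comments spanning many lines are handled. The returned offsets can then drive the copy to the temp file.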

    Edit: source files of 2GiB??
