Bug in Mathematica: regular expression applied to very long string

深忆病人 2021-01-02 07:14

In the following code, if the string s is extended to something like 10 or 20 thousand characters, the Mathematica kernel segfaults.

s = "This is the fi

3 answers
  • 2021-01-02 07:39

    Mathematica is a great executive toy, but I'd advise against trying to do anything serious with it, such as running regexes over long strings or any kind of computation over significant amounts of data (or where correctness is important). Use something tried and tested. Visual F# 2010 takes 5 milliseconds and one line of code to get the correct answer without crashing:

    > let str =
        "This is the first line.\nMAGIC_STRING\nEverything after this line should get removed." +
          String.replicate 2000 "0123456789";;
    val str : string =
      "This is the first line.
    MAGIC_STRING
    Everything after this li"+[20022 chars]
    
    > open System.Text.RegularExpressions;;
    > #time;;
    --> Timing now on
    
    > (Regex "(^|\\n)[^\\n]*MAGIC_STRING(.|\\n)*").Replace(str, "");;
    Real: 00:00:00.005, CPU: 00:00:00.015, GC gen0: 0, gen1: 0, gen2: 0
    val it : string = "This is the first line."
    
  • 2021-01-02 07:48

    Mathematica uses PCRE syntax, so it does have the /s (aka DOTALL, aka Singleline) modifier: you prepend (?s) to the part of the expression where you want it to apply.

    See the RegularExpression documentation (expand the section labeled "More Information"):
    http://reference.wolfram.com/mathematica/ref/RegularExpression.html

    The following set options for all regular expression elements that follow them:
    (?i) treat uppercase and lowercase as equivalent (ignore case)
    (?m) make ^ and $ match start and end of lines (multiline mode)
    (?s) allow . to match newline
    (?-c) unset options
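
    To make the first three options concrete, here is a small Python sketch. This is an illustration only: Python's `re` module is not the PCRE engine that Mathematica uses, but it accepts the same `(?i)`, `(?m)` and `(?s)` prefixes at the start of a pattern. The sample text is made up, and the `(?-c)` form is not covered.

```python
import re

# Made-up sample text (an assumption for illustration, not from the question)
text = "first line\nMAGIC_string on line two"

# (?i): ignore case, so "MAGIC_STRING" also matches "MAGIC_string"
ignore_case = re.search(r"(?i)MAGIC_STRING", text) is not None

# (?m): ^ matches at the start of every line, not only at the start of the string
line_starts = re.findall(r"(?m)^\w+", text)

# (?s): . also matches a newline, so .* can run across line breaks
without_s = re.search(r"first.*two", text)
with_s = re.search(r"(?s)first.*two", text)

print(ignore_case)          # True
print(line_starts)          # ['first', 'MAGIC_string']
print(without_s is None)    # True: . stops at the newline
print(with_s is not None)   # True: . crosses the newline
```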

    With a string that is 15,000 characters long, this modified input doesn't crash Mathematica 7.0.1 for me (the original did), and it produces the same output as your expression:

    s = StringReplace[s,RegularExpression@".*MAGIC_STRING(?s).*"->""]

    It should also be a bit faster, for the reasons @AlanMoore explained.
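
    For readers who want to try the same idea outside Mathematica, here is a rough Python equivalent. It is a sketch under assumptions: Python 3.11 and later reject a global `(?s)` in the middle of a pattern, so the scoped form `(?s:...)` stands in for it, and the test string is the one from the F# answer.

```python
import re

# Test string as in the F# answer: marker on line two, plus 20,000 filler characters
s = ("This is the first line.\nMAGIC_STRING\n"
     "Everything after this line should get removed."
     + "0123456789" * 2000)

# .* before the marker stays on one line; (?s:.*) after it may cross newlines
pattern = re.compile(r".*MAGIC_STRING(?s:.*)")
result = pattern.sub("", s)

print(repr(result))  # 'This is the first line.\n'
```

    In this Python sketch the newline that ends the first line is left in place, because nothing in the pattern consumes it.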

  • 2021-01-02 07:54

    The best way to optimize the regex depends on the internals of Mathematica's regex engine, but I would definitely get rid of the (.|\\n)*, as @Simon mentioned. It's not just the alternation--although it's almost always a mistake to have an alternation in which every alternative matches exactly one character; that's what character classes are for. But you're also capturing each character when you match it (because of the parentheses), only to throw it away when you match the next character.

    A quick scan of the Mathematica regex docs doesn't turn up anything like the /s (Singleline or DOTALL) modifier, so I recommend the old JavaScript standby, [\\s\\S]* -- match anything that is whitespace or anything that isn't whitespace. Also, it might help to add the $ anchor to the end of the regex:

    "(^|\\n)[^\\n]*MAGIC_STRING[\\s\\S]*$"
    
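
    The two pattern shapes can be compared side by side in Python. This is only a sketch: Python's `re` engine is not Mathematica's, so it says nothing about the crash itself, and the doubled backslashes above become single ones in a Python raw string. The variable names are made up for illustration.

```python
import re

base = ("This is the first line.\nMAGIC_STRING\n"
        "Everything after this line should get removed.")
long_text = base + "0123456789" * 2000

# Original shape: a capturing group that alternates between two one-character matches
capturing = re.compile(r"(^|\n)[^\n]*MAGIC_STRING(.|\n)*")
# Suggested shape: one character class, no per-character capture, anchored at the end
char_class = re.compile(r"(^|\n)[^\n]*MAGIC_STRING[\s\S]*$")

# Both shapes agree on the short input
short_capturing = capturing.sub("", base)
short_class = char_class.sub("", base)

# The character-class version applied to the 20,000+ character input
long_class = char_class.sub("", long_text)

print(short_capturing == short_class)  # True
print(repr(long_class))                # 'This is the first line.'
```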

    But your best option would probably be not to use regexes at all. I don't see anything here that requires them, and it would probably be much easier as well as more efficient to use Mathematica's normal string-manipulation functions.
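
    As a rough illustration of the regex-free approach, here is the same cut written with plain string searches in Python. It does not use Mathematica's string functions, which are what the advice above refers to; the function name `cut_at_marker_line` is hypothetical.

```python
MARKER = "MAGIC_STRING"

def cut_at_marker_line(text: str, marker: str = MARKER) -> str:
    """Drop the line containing `marker` and everything after it."""
    hit = text.find(marker)
    if hit == -1:
        return text                         # no marker: nothing to remove
    line_break = text.rfind("\n", 0, hit)   # newline just before the marker's line
    if line_break == -1:
        return ""                           # marker is on the very first line
    return text[:line_break]

sample = ("This is the first line.\nMAGIC_STRING\n"
          "Everything after this line should get removed."
          + "0123456789" * 2000)

print(repr(cut_at_marker_line(sample)))  # 'This is the first line.'
```

    Two linear scans replace the pattern match, so there is no backtracking to worry about regardless of string length.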
