Regular expression to find unescaped double quotes in CSV file

盖世英雄少女心 2021-01-03 06:00

What would a regular expression be to find sets of 2 unescaped double quotes that are contained in columns set off by double quotes in a CSV file?

Not a matc

5 answers
  • 2021-01-03 06:25

    Try this:

    (?m)""(?![ \t]*(,|$))
    

    Explanation:

    (?m)       // enable multi-line mode (^ also matches at the start of each line and $ at the end of each line (i))
    ""         // match two successive double quotes
    (?!        // start negative look ahead
      [ \t]*   //   zero or more spaces or tabs
      (        //   open group 1
        ,      //     match a comma 
        |      //     OR
        $      //     the end of the line or string
      )        //   close group 1
    )          // stop negative look ahead
    

    So, in plain English: "match two successive double quotes, but only if they are NOT followed by a comma or the end of the line, optionally with spaces or tabs in between".

    (i) in addition to their usual meaning as start-of-string and end-of-string anchors.
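
    As a rough way to try this pattern out, here is a minimal Ruby sketch run against a few sample rows. Two assumptions: the `(?m)` flag is dropped, because Ruby's `^` and `$` already match at line boundaries (in Ruby, `(?m)` instead lets `.` match newlines), and `doubled_quotes?` is a hypothetical helper name used only for illustration.

    ```ruby
    # Pattern from the answer above, without (?m): in Ruby, $ already matches at the end of each line.
    DOUBLED_QUOTES = /""(?![ \t]*(,|$))/

    # Hypothetical helper (not part of the original answer):
    # true if the line contains "" that is not followed by a comma or the end of the line.
    def doubled_quotes?(line)
      DOUBLED_QUOTES.match?(line)
    end

    samples = [
      '"asdf","asdf"',
      '"", "asdf"',
      '"asdf", ""',
      '"asdf""asdf", "asdf"',
      '"asdf", """"'
    ]

    samples.each do |line|
      puts "#{line}: #{doubled_quotes?(line)}"
    end
    ```

    With these samples, only the last two rows should be reported as `true`; the empty fields (`""` followed by a comma or the end of the line) are skipped by the negative lookahead.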

  • 2021-01-03 06:25

    Try this regular expression:

    "(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+"
    

    That will match any quoted string with at least one pair of unescaped double quotes.
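
    Because this pattern matches the whole quoted field, it can also be used to pull the offending field out of a row. A minimal Ruby sketch, assuming rows like the ones below; `offending_field` is a hypothetical helper name, not something from the answer.

    ```ruby
    # Pattern from the answer above, written as a Ruby regex literal.
    QUOTED_WITH_PAIR = /"(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+"/

    # Hypothetical helper: returns the first quoted field that contains
    # a pair of unescaped double quotes, or nil if there is none.
    def offending_field(line)
      line[QUOTED_WITH_PAIR]
    end

    puts offending_field('"asdf""asdf", "asdf"').inspect
    puts offending_field('"asdf", """"').inspect
    puts offending_field('"asdf","asdf"').inspect
    ```

    The first two calls should return the fields `"asdf""asdf"` and `""""`, and the last one `nil`.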

  • 2021-01-03 06:32

    For single-line matches:

    ^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"
    

    or for multi-line:

    (^|\r\n)("[^\r\n"]*"\s*,\s*)*"[^\r\n"]*""[^\r\n"]*"
    

    Edit/Note: Depending on the regex engine used, you could use lookbehinds and other features to make the regex leaner, but this version should work as is in most regex engines.
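
    One way to use the single-line pattern on a whole file is to apply it row by row and report line numbers. A minimal Ruby sketch under that assumption; `flagged_lines` is a hypothetical helper and the sample text is made up for illustration.

    ```ruby
    # Single-line pattern from the answer above.
    ROW_WITH_PAIR = /^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"/

    # Hypothetical helper: returns [line, line_number] pairs for every row
    # that contains a doubled quote inside a quoted column.
    def flagged_lines(text)
      text.each_line(chomp: true).with_index(1).select do |line, _number|
        line.match?(ROW_WITH_PAIR)
      end
    end

    csv_text = <<~CSV
      "asdf","asdf"
      "asdf""asdf", "asdf"
      "asdf", ""
      "asdf", """"
    CSV

    flagged_lines(csv_text).each do |line, number|
      puts "line #{number}: #{line}"
    end
    ```

    For this sample, rows 2 and 4 should be reported. Checking one row at a time also sidesteps the need for the `(^|\r\n)` prefix of the multi-line variant.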

  • 2021-01-03 06:37
    ".*"(\n|(".*",)*)
    

    should work, I guess...
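
    A quick check may help here. The Ruby sketch below, assuming the goal is to flag only rows that contain doubled quotes inside a column, runs the pattern against one clean row and one row with doubled quotes; `CANDIDATE` is just an illustrative constant name.

    ```ruby
    # Pattern from the answer above.
    CANDIDATE = /".*"(\n|(".*",)*)/

    clean = '"asdf","asdf"'
    dirty = '"asdf""asdf","asdf"'

    puts "clean row matches: #{CANDIDATE.match?(clean)}"
    puts "dirty row matches: #{CANDIDATE.match?(dirty)}"
    ```

    Both rows match, because the greedy `".*"` accepts anything between two quotes and the trailing group is allowed to match nothing, so the pattern probably needs more constraints before it can tell the two cases apart.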

  • 2021-01-03 06:46

    Due to the complexity of your problem, the solution depends on the engine you are using. This is because solving it requires lookbehind and lookahead, and engines differ in how they support them.

    My answer uses the Ruby engine. The check itself is a single regex, but I include the whole program here to explain it better.

    NOTE that, due to the Ruby regex engine (or the limits of my knowledge), an optional lookahead/lookbehind is not possible, so I need a small preprocessing step that removes the spaces before and after commas.

    Here is my code:

    org_texts = [
        '"asdf","asdf"',
        '"", "asdf"',
        '"asdf", ""',
        '"adsf", "", "asdf"',
        '"asdf""asdf", "asdf"',
        '"asdf", """asdf"""',
        '"asdf", """"'
    ]
    
    org_texts.each do |org_text|
        # Preprocessing: remove spaces before and after each separating comma.
        # Only needed if valid commas may be surrounded by spaces.
        org_text = org_text.gsub(Regexp.new('\" *, *\"'), '","')
    
        # Replace every valid character (a non-quote or a valid quote) with '-'
        res_text = org_text.gsub(Regexp.new('([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\\\)\")'), '-')
        # Equivalent, more compact form:
        # res_text = org_text.gsub(Regexp.new('([^\"]|(^|(?<=,)|(?<=\\\\))\"|\"($|(?=,)))'), '-')
        # [^\"]       ===> a non-quote
        # |           ===> or
        # ^\"         ===> the opening quote
        # |           ===> or
        # \"$         ===> the closing quote
        # |           ===> or
        # (?<=,)\"    ===> a quote just after a comma
        # |           ===> or
        # \"(?=,)     ===> a quote just before a comma
        # |           ===> or
        # (?<=\\\\)\" ===> an escaped quote
    
        # Show the invalid unescaped quotes, marked with '^'
        puts org_text
        puts res_text.gsub('"', '^')
    
        # The actual check: use this alone if you don't need to know which quotes are unescaped.
        # The line is matched from start to end (^...$) and may contain only valid characters.
        # Note: inside a regex literal a single backslash is written \\, not \\\\ as in the strings above.
        is_match = (org_text !~ /^([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\)\")*$/).to_s
    
        puts org_text + ": " + is_match
        puts
        puts
    end
    

    When executed the code prints:

    "asdf","asdf"
    -------------
    "asdf","asdf": false
    
    
    "","asdf"
    ---------
    "","asdf": false
    
    
    "asdf",""
    ---------
    "asdf","": false
    
    
    "adsf","","asdf"
    ----------------
    "adsf","","asdf": false
    
    
    "asdf""asdf","asdf"
    -----^^------------
    "asdf""asdf","asdf": true
    
    
    "asdf","""asdf"""
    --------^^----^^-
    "asdf","""asdf""": true
    
    
    "asdf",""""
    --------^^-
    "asdf","""": true
    

    I hope this gives you some ideas that you can adapt to other engines and languages.
