What's a semantically-correct way to parse CSV from SQL Server 2008?

无人共我 2021-01-21 05:04

I got a CSV dump from SQL Server 2008 that has lines like this:

Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
Construction,1971         


        
2 Answers
  • 2021-01-21 05:19

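    If your CSV never uses a double quote as a legitimate quoting character, pass :quote_char => "\0" in the options to CSV. Since a NUL byte should never occur in the data, no field is treated as quoted, and you can do this (the string literals are wrapped here for readability; each should be on a single line when run):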

    1.9.3p327 > puts 'Construction,197133031B,"MORGAN SHOES" ALT,
                      1997-05-13 00:00:00'.parse_csv(:quote_char => "\0")
    Construction
    197133031B
    "MORGAN SHOES" ALT
    1997-05-13 00:00:00
    
    1.9.3p327 > puts 'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,
                      1996-08-09 00:00:00'.parse_csv(:quote_char => "\0")
    Plumbing
    196222006P
    REPLACE LEAD WATER SERVICE W/1" COPPER
    1996-08-09 00:00:00
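
    The condition in the first sentence matters. As a minimal sketch, assuming a current Ruby with the standard csv library, the same option applied to a line that does rely on quotes to protect a comma (the Electrical sample line that appears elsewhere in this thread) splits the quoted field in two:

```ruby
require 'csv'

# A line where the double quotes are ordinary data: parsed as intended.
kept = 'Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00'
         .parse_csv(quote_char: "\0")
p kept
# ["Construction", "197133031B", "\"MORGAN SHOES\" ALT", "1997-05-13 00:00:00"]

# A line where the quotes protect a comma: with quoting disabled,
# the comma inside the quotes is treated as a field separator.
broken = 'Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00'
           .parse_csv(quote_char: "\0")
p broken
# ["Electrical", "197135021E", "\"SERVICE", " OUTLETS\"", "1997-05-15 00:00:00"]
```

    So this approach only fits a dump in which no field contains the separator.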
    
  • 2021-01-21 05:20

    The following uses a regexp and String#scan. In the broken CSV format you're dealing with, " only acts as a quoting character when it appears at both the beginning and the end of a field.

    scan moves through the string, matching the regexp repeatedly, so the regexp can assume that each match starts at the beginning of a field. The regexp matches either a balanced quoted field with no internal quotes (QUOTED) or a run of non-comma characters (UNQUOTED). Whichever alternative matches must be followed by a separator, which is either a comma or the end of the string (SEP).

    Because UNQUOTED can match a zero-length field before a separator, scan always produces one extra empty field at the end, which we discard with [0...-1]. scan returns an array of tuples; each tuple is an array of the capture groups, so we map over the tuples and pick whichever alternative was captured with matches[0] || matches[1].

    None of your example lines shows a field that contains both a comma and a quote. I have no idea how such a field would be legally represented, and this code probably won't recognize it correctly.

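    # A field is terminated by a comma or by the end of the string.
    SEP = /(?:,|\Z)/
    # A balanced quoted field with no internal quotes; captures the contents.
    QUOTED = /"([^"]*)"/
    # Any run of non-comma characters (possibly empty).
    UNQUOTED = /([^,]*)/
    
    FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/
    
    def ugly_parse(line)
      # Drop the trailing zero-length match, then keep whichever group captured.
      line.scan(FIELD)[0...-1].map { |matches| matches[0] || matches[1] }
    end
    
    # Sample input: the three lines whose output is shown below.
    lines = [
      'Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00',
      'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00',
      'Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00'
    ]
    
    lines.each do |l|
      puts l
      puts ugly_parse(l).inspect
      puts
    end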
    
    # Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00
    # ["Electrical", "197135021E", "SERVICE, OUTLETS", "1997-05-15 00:00:00"]
    # 
    # Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
    # ["Plumbing", "196222006P", "REPLACE LEAD WATER SERVICE W/1\" COPPER", "1996-08-09 00:00:00"]
    # 
    # Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
    # ["Construction", "197133031B", "\"MORGAN SHOES\" ALT", "1997-05-13 00:00:00"]
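
    To make the description of the tuples and the trailing empty match concrete, here is a small sketch that prints what scan returns before any post-processing. The patterns are the same as in the answer, held in local variables (names chosen here for illustration) so the snippet runs on its own:

```ruby
# Same patterns as in the answer, as locals so this snippet is self-contained.
sep      = /(?:,|\Z)/
quoted   = /"([^"]*)"/
unquoted = /([^,]*)/
field    = /(?:#{quoted}|#{unquoted})#{sep}/

line = 'Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00'

# Each tuple is [quoted_capture, unquoted_capture];
# the alternative that did not match is nil.
raw = line.scan(field)
raw.each { |tuple| p tuple }
# [nil, "Electrical"]
# [nil, "197135021E"]
# ["SERVICE, OUTLETS", nil]
# [nil, "1997-05-15 00:00:00"]
# [nil, ""]    (zero-length match at the end of the string)

# Drop the trailing empty match and keep whichever group captured.
fields = raw[0...-1].map { |q, u| q || u }
p fields
# ["Electrical", "197135021E", "SERVICE, OUTLETS", "1997-05-15 00:00:00"]
```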
    