What's a semantically-correct way to parse CSV from SQL Server 2008?

I got a CSV dump from SQL Server 2008 that has lines like this:

Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
Construction,1971


                      
2 answers

灰色年华, 2021-01-21 05:19

If your CSV never uses a double quote as a legitimate quoting character, pass :quote_char => "\0" in the options to CSV so that no character in the data is treated as a quote. You can then do this (strings wrapped for clarity):

1.9.3p327 > puts 'Construction,197133031B,"MORGAN SHOES" ALT,
                  1997-05-13 00:00:00'.parse_csv(:quote_char => "\0")
Construction
197133031B
"MORGAN SHOES" ALT
1997-05-13 00:00:00

1.9.3p327 > puts 'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,
                  1996-08-09 00:00:00'.parse_csv(:quote_char => "\0")
Plumbing
196222006P
REPLACE LEAD WATER SERVICE W/1" COPPER
1996-08-09 00:00:00
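
The same idea as a small script, as a sketch that assumes Ruby's standard csv library is available. The helper name parse_unquoted is made up for illustration. It also shows the limitation implied above: a field that really is quoted to protect a comma gets split, because no character acts as a quote any more.

```ruby
require 'csv'

# Hypothetical helper: parse one line with quoting effectively disabled.
# NUL is used as the quote character on the assumption that it never
# occurs in the data.
def parse_unquoted(line)
  CSV.parse_line(line, quote_char: "\0")
end

p parse_unquoted('Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00')
p parse_unquoted('Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00')
```

The first call should return four fields with the double quotes kept as ordinary data. The second should return five, because the comma inside "SERVICE, OUTLETS" is treated as a separator.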

                                                                        
                                                        
            
            
              
                
青春惊慌失措, 2021-01-21 05:20

The following uses a regexp and String#scan. In the broken CSV format you're dealing with, a " only has quoting properties when it comes at the beginning and end of a field.

scan moves through the string, successively matching the regexp, so the regexp can assume its start match point is the beginning of a field. We construct the regexp so it can match either a balanced quoted field with no internal quotes (QUOTED) or a string of non-commas (UNQUOTED). Whichever alternative is matched, it must be followed by a separator, which can be either a comma or the end of the string (SEP).

Because UNQUOTED can match a zero-length field before a separator, the scan always matches an empty field at the end, which we discard with [0...-1]. scan produces an array of tuples; each tuple is an array of the capture groups, so we map over each element, picking the captured alternative with matches[0] || matches[1].

None of your example lines show a field that contains both a comma and a quote. I have no idea how it would be legally represented, and this code probably won't recognize such a field correctly.
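
To make the intermediate result concrete, here is a minimal sketch of the raw scan output on a short made-up line, before the trailing element is dropped and the groups are collapsed. The names DEMO_FIELD and raw_scan are hypothetical; the pattern is the one described above, written inline.

```ruby
# A balanced quoted field or a run of non-commas, followed by a comma
# or the end of the string.
DEMO_FIELD = /(?:"([^"]*)"|([^,]*))(?:,|\Z)/

# Hypothetical helper returning the raw scan result, before any clean-up.
def raw_scan(line)
  line.scan(DEMO_FIELD)
end

p raw_scan('a,"b, c",d')
# => [[nil, "a"], ["b, c", nil], [nil, "d"], [nil, ""]]
```

Each tuple has exactly one non-nil group, and the last tuple is the empty match at the end of the string.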

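# A separator is a comma or the end of the string.
SEP = /(?:,|\Z)/
# A balanced quoted field with no internal quotes; captures the content.
QUOTED = /"([^"]*)"/
# A (possibly empty) run of non-commas.
UNQUOTED = /([^,]*)/

FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/

def ugly_parse(line)
  # Drop the trailing empty match, then keep whichever group captured.
  line.scan(FIELD)[0...-1].map { |matches| matches[0] || matches[1] }
end

# Sample lines from the dump.
lines = [
  'Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00',
  'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00',
  'Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00'
]

lines.each do |l|
  puts l
  puts ugly_parse(l).inspect
  puts
end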

# Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00
# ["Electrical", "197135021E", "SERVICE, OUTLETS", "1997-05-15 00:00:00"]
# 
# Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
# ["Plumbing", "196222006P", "REPLACE LEAD WATER SERVICE W/1\" COPPER", "1996-08-09 00:00:00"]
# 
# Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
# ["Construction", "197133031B", "\"MORGAN SHOES\" ALT", "1997-05-13 00:00:00"]
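
For a field that contains both a comma and a quote, standard CSV (RFC 4180) quotes the whole field and doubles each embedded quote. Whether the SQL Server dump does this is not known, so the following is only a sketch under that assumption. The names SEP_D, QUOTED_D, UNQUOTED_D, FIELD_D and parse_with_doubled_quotes are made up here, and the sample line is invented.

```ruby
SEP_D      = /(?:,|\Z)/
QUOTED_D   = /"((?:[^"]|"")*)"/ # quoted field; "" inside stands for one "
UNQUOTED_D = /([^,]*)/
FIELD_D    = /(?:#{QUOTED_D}|#{UNQUOTED_D})#{SEP_D}/

def parse_with_doubled_quotes(line)
  # Drop the trailing empty match, then un-double quotes in quoted fields.
  line.scan(FIELD_D)[0...-1].map do |quoted, unquoted|
    quoted ? quoted.gsub('""', '"') : unquoted
  end
end

p parse_with_doubled_quotes('Electrical,"2"" PIPE, COPPER",1997-05-15')
# => ["Electrical", "2\" PIPE, COPPER", "1997-05-15"]
```

A field such as "MORGAN SHOES" ALT still falls through to the unquoted alternative, so it keeps its quotes as data.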

                                                                        
                                                        
            
            
              
                