Validate a CSV file

暖寄归人 · 2021-01-20 02:34

This is my sample file

#%cty_id1,#%ccy_id2,#%cty_src,#%cty_cd3,#%cty_nm4,#%cty_reg5,#%cty_natnl6,#%cty_bus7,#%cty_data8
690,ALL2,,AL,ALBALODMNIA,,,,
90,ALL2,,

6 Answers
  • 2021-01-20 02:35

    You could delete the empty field in column 4 whenever column 4 is not a quoted two-character field, as follows:

    awk 'BEGIN {FS=OFS=","}
    {
        for (i=1; i<=NF; i++) {
            if (!(i==4 && length($4)!=4))
                printf "%s%s",$i,(i<NF)?OFS:ORS
        }
    }' file.csv
    

    Output:

    "id","cty_ccy_id","cty_src","cty_nm","cty_region","cty_natnl","cty_bus_load","cty_data_load"
    6,"ALL",,"AL","ALBANIA",,,,
    9,"ALL",,"AQ","ANTARCTICA",,,
    16,"IDR",,"AZ","AZERBAIJAN",,,,
    25,"LTL",,"BJ","BENIN",,,,
    26,"CVE",,"BL","SAINT BARTHÉLEMY",,,,
    36,,,"BW","BOTSWANA",,,,
    41,"BNS",,"CF","CENTRAL AFRICAN REPUBLIC",,,,
    47,"CVE",,"CL","CHILE",,,,
    50,"IDR",,"CO","COLOMBIA",,,,
    61,"BNS",,"DK","DENMARK",,,,
    

    Note:

    • We test length($4)!=4 because we assume two characters in column 4, plus two extra characters for the surrounding double quotes.
  • 2021-01-20 02:43

    If that's the only problem (and if you never have a comma in the field bt_cty_ccy_id), you could remove the extra comma by loading your file into an editor that supports regexes and having it replace

    ^([^,]*,[^,]*,[^,]*,),(?="[A-Z]{2}")
    

    with \1.
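    If you would rather script it than use an editor, the same look-ahead replacement can be sketched in Python (a hypothetical illustration; the sample line is taken from the expected output above):

```python
import re

# Capture the first three fields (the fourth comma is the stray one) and
# replace only when a quoted two-letter country code follows.
pattern = re.compile(r'^([^,]*,[^,]*,[^,]*,),(?="[A-Z]{2}")')

line = '9,"ALL",,,"AQ","ANTARCTICA",,,'
print(pattern.sub(r'\1', line))  # 9,"ALL",,"AQ","ANTARCTICA",,,
```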

  • 2021-01-20 02:44

    The solution is to use a look-ahead regex, as suggested before. To reproduce your issue I used this:

    "\\,\\,\\,(?=\\\"[A-Z]{2}\\\")"
    

    which matches three commas followed by a quoted pair of uppercase letters, without including the quoted code in the match. Of course you may need to adjust it a bit for your needs (e.g. an arbitrary number of commas rather than exactly three).

    But you cannot use it in Talend directly without tons of errors. Here's how to design your job: job design

    In other words, you need to read the file line by line, no fields yet. Then, inside the tMap, do the match&replace, like:

    row1.line.replaceAll("\\,\\,\\,(?=\\\"[A-Z]{2}\\\")", ",,")
    

    tMap definition

    and finally tokenize the line using "," as separator to get your final schema. You probably need to manually trim out the quotes here and there, since tExtractDelimitedFields won't.
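    Outside Talend, the tokenize-and-trim step can be sketched in Python to show what is meant (an illustration only; in the job itself tExtractDelimitedFields does the splitting for you):

```python
def tokenize(line: str) -> list[str]:
    # Split on the separator, then strip the surrounding double quotes
    # that the extraction step leaves in place.
    return [field.strip('"') for field in line.split(',')]

print(tokenize('9,"ALL",,"AQ","ANTARCTICA",,,'))
# ['9', 'ALL', '', 'AQ', 'ANTARCTICA', '', '', '']
```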

    Here's an output example (needs some cleaning, ofc):

    output snippet

    You don't need to enter the schema for tExtractDelimitedFields by hand. Use the wizard to record a DelimitedFile schema into the metadata repository, as you probably already did. You can use this schema as a Generic Schema, too, fitting it to the outgoing connection of tExtractDelimitedFields. Not something the purists would approve of, but it works and saves time.

    As for your UI problems, they are often related to file encodings and locale settings. Don't worry too much; they (usually) won't affect the job execution.

    EDIT: here's a sample TOS job which shows the solution, just import it into your project: TOS job archive

    EDIT2: added some screenshots

  • 2021-01-20 02:46

    Coming to the party late with a VBA-based approach. An alternative to regex is to parse the file and remove a comma when the 4th field is empty. Using the Microsoft Scripting Runtime, the code opens the file and reads each line, copying it to a new temporary file; if the 4th element is empty, it writes the line with the extra comma removed. The cleaned data is then copied back to the original file and the temporary file is deleted. It seems a bit of a long way round, but when I tested it on a file of 14,000 rows based on your sample it took under 2 seconds to complete.

    Sub Remove4thFieldIfEmpty()
    
        Const iNUMBER_OF_FIELDS As Integer = 9
    
        Dim str As String
        Dim fileHandleInput As Scripting.TextStream
        Dim fileHandleCleaned As Scripting.TextStream
        Dim fsoObject As Scripting.FileSystemObject
        Dim sPath As String
        Dim sFilenameCleaned As String
        Dim sFilenameInput As String
        Dim vFields As Variant
        Dim iCounter As Integer
        Dim sNewString As String
    
        sFilenameInput = "Regex.CSV"
        sFilenameCleaned = "Cleaned.CSV"
        Set fsoObject = New FileSystemObject
    
        sPath = ThisWorkbook.Path & "\"
    
    
        Set fileHandleInput = fsoObject.OpenTextFile(sPath & sFilenameInput)
    
        If fsoObject.FileExists(sPath & sFilenameCleaned) Then
            Set fileHandleCleaned = fsoObject.OpenTextFile(sPath & sFilenameCleaned, ForWriting)
        Else
            Set fileHandleCleaned = fsoObject.CreateTextFile((sPath & sFilenameCleaned), True)
        End If
    
    
        Do While Not fileHandleInput.AtEndOfStream
            str = fileHandleInput.ReadLine
                vFields = Split(str, ",")
                If vFields(3) = "" Then
                    sNewString = vFields(0)
                    For iCounter = 1 To UBound(vFields) 
                        If iCounter <> 3 Then sNewString = sNewString & "," & vFields(iCounter)
                    Next iCounter
                    str = sNewString
                End If
            fileHandleCleaned.WriteLine (str)
        Loop
    
    
        fileHandleInput.Close
        fileHandleCleaned.Close
    
        Set fileHandleInput = fsoObject.OpenTextFile(sPath & sFilenameInput, ForWriting)
        Set fileHandleCleaned = fsoObject.OpenTextFile(sPath & sFilenameCleaned)
    
        Do While Not fileHandleCleaned.AtEndOfStream
            fileHandleInput.WriteLine (fileHandleCleaned.ReadLine)
        Loop
    
        fileHandleInput.Close
        fileHandleCleaned.Close
    
    
    
        Set fileHandleCleaned = Nothing
        Set fileHandleInput = Nothing
    
        fsoObject.DeleteFile sPath & sFilenameCleaned
    
        Set fsoObject = Nothing
    
    
    End Sub
    
  • 2021-01-20 03:00

    Your best bet here may be to use the tSchemaComplianceCheck component in Talend.

    example job set up

    If you read the file in with a tFileInputDelimited component and then check it with the tSchemaComplianceCheck where you set cty_cd to not nullable then it will reject your Antarctica row simply for the null where you expect no nulls.

    tSchemaComplianceCheck to reject rows with null 'cty_cd'

    From here you can use a tMap and simply map the fields to the one above.

    tMap to sort right fielding

    You should be able to tweak this as necessary, potentially with further tSchemaComplianceChecks down the reject lines and mappings to suit. This method is far more self-explanatory, and you avoid complicated regexes that need careful management whenever you want to accommodate a different variation of your file structure, with the benefit that you will always capture all of the well-formed rows.
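    The null-rejection idea can also be sketched outside Talend, e.g. in Python (a rough equivalent, not the component itself; column positions follow the sample header, where cty_cd is the 4th field):

```python
import csv

def split_rows(lines):
    """Partition rows into compliant and rejected, mirroring a
    tSchemaComplianceCheck with cty_cd set to "not nullable"."""
    good, rejects = [], []
    for row in csv.reader(lines):
        if len(row) > 3 and row[3] != '':
            good.append(row)
        else:
            rejects.append(row)
    return good, rejects

good, rejects = split_rows([
    '690,ALL2,,AL,ALBANIA,,,,',      # well-formed row
    '90,ALL2,,,AQ,ANTARCTICA,,,,',   # extra comma shifts cty_cd to empty
])
print(len(good), len(rejects))  # 1 1
```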

  • 2021-01-20 03:02

    I would question the source system that is sending you this file as to why there is an extra comma in some rows. I assume you are using a comma as the delimiter when importing this .csv file into Talend.

    (Another suggestion would be to ask for a semicolon as the column separator in the input file.)

    9,"ALL",,,"AQ","ANTARCTICA",,,,

    will be

    9;"ALL";,;"AQ";"ANTARCTICA";;;;
