Validate a csv file

暖寄归人 2021-01-20 02:34

This is my sample file

#%cty_id1,#%ccy_id2,#%cty_src,#%cty_cd3,#%cty_nm4,#%cty_reg5,#%cty_natnl6,#%cty_bus7,#%cty_data8
690,ALL2,,AL,ALBALODMNIA,,,,
90,ALL2,,         

6 Answers
  •  花落未央
    2021-01-20 02:44

    The solution is to use a look-ahead regex, as suggested before. To reproduce your issue I used this:

    "\\,\\,\\,(?=\\\"[A-Z]{2}\\\")"
    

    which matches three commas followed by two quoted uppercase letters, without including the quoted part in the match. Of course you may need to adjust it for your needs (e.g. an arbitrary number of commas rather than exactly three).
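    To see the look-ahead in action outside Talend, here is a minimal standalone Java sketch. The input line is a guess at the original quoted format (the sample in the question doesn't show the quotes), and `"AL"` stands in for a quoted two-letter country code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    public static void main(String[] args) {
        // Hypothetical line with a quoted two-letter code,
        // guessing at the original file's quoted format
        String line = "690,ALL2,,,\"AL\",ALBANIA";
        // Three literal commas, followed (but not consumed) by "XX"
        String regex = "\\,\\,\\,(?=\\\"[A-Z]{2}\\\")";
        Matcher m = Pattern.compile(regex).matcher(line);
        System.out.println(m.find()); // true: the look-ahead matches
        // The quoted code survives because the look-ahead is zero-width
        System.out.println(line.replaceAll(regex, ",,"));
        // -> 690,ALL2,,"AL",ALBANIA
    }
}
```

    Since the look-ahead group is zero-width, only the three commas are replaced and the quoted field itself is left untouched.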

    But you cannot use this regex in Talend directly without tons of errors. Here's how to design your job (see the job design screenshot):

    In other words, you need to read the file line by line, with no field separation yet. Then, inside the tMap, do the match-and-replace, like:

    row1.line.replaceAll("\\,\\,\\,(?=\\\"[A-Z]{2}\\\")", ",,")
    

    (tMap definition screenshot)

    and finally tokenize the line using "," as the separator to get your final schema. You'll probably need to manually trim out the quotes here and there, since tExtractDelimitedFields won't do it.
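    The tokenize-and-trim step can be sketched in plain Java like this (the input is a hypothetical line as it would look after the tMap `replaceAll` step):

```java
public class TokenizeDemo {
    public static void main(String[] args) {
        // Hypothetical line after the tMap replaceAll step
        String line = "690,ALL2,,\"AL\",ALBANIA,,,,";
        // limit -1 keeps the trailing empty fields
        String[] fields = line.split(",", -1);
        for (int i = 0; i < fields.length; i++) {
            // strip a leading/trailing quote manually,
            // since tExtractDelimitedFields won't
            fields[i] = fields[i].replaceAll("^\"|\"$", "");
        }
        System.out.println(fields.length);            // 9 fields
        System.out.println(String.join("|", fields)); // 690|ALL2||AL|ALBANIA||||
    }
}
```

    Note the `-1` limit on `split`: without it, Java drops the trailing empty strings, and the four empty columns at the end of the row would be lost.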

    Here's an output example (it needs some cleaning, of course):

    (output snippet screenshot)

    You don't need to enter the schema for tExtractDelimitedFields by hand. Use the wizard to record a DelimitedFile schema into the metadata repository, as you probably already did. You can use this schema as a Generic Schema too, fitting it to the outgoing connection of tExtractDelimitedFields. Not something purists would approve of, but it works and saves time.

    As for your UI problems, they are often related to file encodings and locale settings. Don't worry too much; they (usually) won't affect job execution.

    EDIT: here's a sample TOS job which shows the solution; just import it into your project: TOS job archive

    EDIT2: added some screenshots
