Validate a csv file

暖寄归人 2021-01-20 02:34

This is my sample file

#%cty_id1,#%ccy_id2,#%cty_src,#%cty_cd3,#%cty_nm4,#%cty_reg5,#%cty_natnl6,#%cty_bus7,#%cty_data8
690,ALL2,,AL,ALBALODMNIA,,,,
90,ALL2,,         

6 Answers
  •  花落未央
    2021-01-20 02:44

    The solution is to use a look-ahead regex, as suggested before. To reproduce your issue I used this:

    "\\,\\,\\,(?=\\\"[A-Z]{2}\\\")"
    

    which matches three commas followed by two quoted uppercase letters, without including the quoted part in the match. Of course you may need to adjust it for your needs (e.g. an arbitrary number of commas rather than exactly three).
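    To see the look-ahead in action outside Talend, here is a minimal standalone Java sketch. The input line is a guess at the original quoted format (the sample in the question doesn't show the quotes), and `"AL"` stands in for a quoted two-letter country code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    public static void main(String[] args) {
        // Hypothetical line with a quoted two-letter code,
        // guessing at the original file's quoted format
        String line = "690,ALL2,,,\"AL\",ALBANIA";
        // Three literal commas, followed (but not consumed) by "XX"
        String regex = "\\,\\,\\,(?=\\\"[A-Z]{2}\\\")";
        Matcher m = Pattern.compile(regex).matcher(line);
        System.out.println(m.find()); // true: the look-ahead matches
        // The quoted code survives because the look-ahead is zero-width
        System.out.println(line.replaceAll(regex, ",,"));
        // -> 690,ALL2,,"AL",ALBANIA
    }
}
```

    Since the look-ahead group is zero-width, only the three commas are replaced and the quoted field itself is left untouched.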

    But you cannot use this regex in Talend directly without tons of errors. Here's how to design your job (see the job design screenshot):

    In other words, you need to read the file line by line, with no field separation yet. Then, inside the tMap, do the match-and-replace, like:

    row1.line.replaceAll("\\,\\,\\,(?=\\\"[A-Z]{2}\\\")", ",,")
    

    (tMap definition screenshot)

    and finally tokenize the line using "," as the separator to get your final schema. You'll probably need to manually trim out the quotes here and there, since tExtractDelimitedFields won't do it.
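    The tokenize-and-trim step can be sketched in plain Java like this (the input is a hypothetical line as it would look after the tMap `replaceAll` step):

```java
public class TokenizeDemo {
    public static void main(String[] args) {
        // Hypothetical line after the tMap replaceAll step
        String line = "690,ALL2,,\"AL\",ALBANIA,,,,";
        // limit -1 keeps the trailing empty fields
        String[] fields = line.split(",", -1);
        for (int i = 0; i < fields.length; i++) {
            // strip a leading/trailing quote manually,
            // since tExtractDelimitedFields won't
            fields[i] = fields[i].replaceAll("^\"|\"$", "");
        }
        System.out.println(fields.length);            // 9 fields
        System.out.println(String.join("|", fields)); // 690|ALL2||AL|ALBANIA||||
    }
}
```

    Note the `-1` limit on `split`: without it, Java drops the trailing empty strings, and the four empty columns at the end of the row would be lost.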

    Here's an output example (it needs some cleaning, of course):

    (output snippet screenshot)

    You don't need to enter the schema for tExtractDelimitedFields by hand. Use the wizard to record a DelimitedFile schema into the metadata repository, as you probably already did. You can use this schema as a Generic Schema too, fitting it to the outgoing connection of tExtractDelimitedFields. Not something purists would approve of, but it works and saves time.

    As for your UI problems, they are often related to file encodings and locale settings. Don't worry too much; they (usually) won't affect job execution.

    EDIT: here's a sample TOS job which shows the solution; just import it into your project: TOS job archive

    EDIT2: added some screenshots
