Oracle SQL*Loader: efficiently handling internal double quotes in values

[愿得一人] 2021-01-25 11:03

I have some Oracle SQL*Loader challenges and am looking for an efficient and simple solution. My source files to be loaded are pipe (|) delimited, where values are enclo

1 answer
  • 2021-01-25 11:49

    If you never had pipes in the enclosed fields you could do it from the control file. If you can have both pipes and double-quotes within a field then I think you have no choice but to preprocess the files, unfortunately.

    Your solution [1], to replace double-quotes with an SQL operator, is happening too late to be useful; the delimiters and enclosures have already been interpreted by SQL*Loader before it does the SQL step. Your solution [2], to ignore the enclosure, would work in combination with [1] - until one of the fields did contain a pipe character. And solution [3] has the same problems as using [1] and/or [2] globally.

    The documentation for specifying delimiters mentions that:

    Sometimes the punctuation mark that is a delimiter must also be included in the data. To make that possible, two adjacent delimiter characters are interpreted as a single occurrence of the character, and this character is included in the data.

    In other words, if you repeated the double-quotes inside the fields then they would be escaped and would appear in the table data. As you can't control the data generation, you could preprocess the files you get to replace all the double-quotes with escaped double quotes. Except you don't want to replace all of them - the ones that are actually real enclosures should not be escaped.

    You could use a regular expression to target the relevant characters while skipping others. Not my strong area, but I think you can do this with lookahead and lookbehind assertions.

    If you had a file called orig.txt containing:

    "1"|A|"B"|"C|D"
    "2"|A|"B"|"C"D"
    3|A|""B""|"C|D"
    4|A|"B"|"C"D|E"F"G|H""
    

    you could do:

    perl -pe 's/(?<!^)(?<!\|)"(?!\|)(?!$)/""/g' orig.txt > new.txt
    

    That looks for a double-quote which is not preceded by the line-start anchor or a pipe character; and is not followed by a pipe character or line end anchor; and replaces only those with escaped (doubled) double-quotes. Which would make new.txt contain:

    "1"|A|"B"|"C|D"
    "2"|A|"B"|"C""D"
    3|A|"""B"""|"C|D"
    4|A|"B"|"C""D|E""F""G|H"""
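
    If Perl is not available, the same substitution can be sketched in Python. This is only an illustration: the function name escape_inner_quotes is made up here, and it relies on Python's re module handling these lookaround assertions the way Perl does for this pattern, which seems to hold for the sample lines above.

```python
import re

# Same pattern as the Perl one-liner: a double-quote that is not at the start
# of a line, not right after a pipe, not right before a pipe, and not at the
# end of a line. MULTILINE makes ^ and $ apply to every line of the input.
INNER_QUOTE = re.compile(r'(?<!^)(?<!\|)"(?!\|)(?!$)', re.MULTILINE)


def escape_inner_quotes(text):
    """Double every double-quote that is not adjacent to a field boundary."""
    return INNER_QUOTE.sub('""', text)


if __name__ == "__main__":
    # The four sample lines from orig.txt.
    sample = (
        '"1"|A|"B"|"C|D"\n'
        '"2"|A|"B"|"C"D"\n'
        '3|A|""B""|"C|D"\n'
        '4|A|"B"|"C"D|E"F"G|H""\n'
    )
    print(escape_inner_quotes(sample), end="")
```

    Run against the sample lines, this should print the same four lines as new.txt.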
    

    The double-quotes at the start and end of fields are not modified, but those in the middle are now escaped. If you then loaded that with a control file with double-quote enclosures:

    load data
    truncate
    into table t42
    fields terminated by '|' optionally enclosed by '"'
    (
      col1,
      col2,
      col3,
      col4
    )
    

    Then you would end up with:

    select * from t42 order by col1;
    
          COL1 COL2       COL3       COL4                
    ---------- ---------- ---------- --------------------
             1 A          B          C|D                 
             2 A          B          C"D                 
             3 A          "B"        C|D                 
             4 A          B          C"D|E"F"G|H"        
    

    which hopefully matches your original data. There may be edge cases that don't work (like a double-quote followed by a pipe within a field) but there's a limit to what you can do to attempt to interpret someone else's data... There may also be (much) better regular expression patterns, of course.
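
    One rough way to catch such edge cases before loading is to split each preprocessed line and flag any that do not produce the expected number of fields. The sketch below uses Python's csv module as a stand-in; it is not SQL*Loader and may differ from it in corner cases, so treat it as a sanity check only. The names EXPECTED_FIELDS, split_fields and suspicious_lines are invented for this example, and the field count of four is an assumption taken from the control file above.

```python
import csv
import io

EXPECTED_FIELDS = 4  # assumption: the four columns listed in the control file


def split_fields(line):
    """Split one pipe-delimited line, reading "" inside an enclosure as one "."""
    reader = csv.reader(
        io.StringIO(line), delimiter="|", quotechar='"', doublequote=True
    )
    # An empty line yields no row at all, so fall back to an empty list.
    return next(reader, [])


def suspicious_lines(lines):
    """Yield (line_number, line) for lines without EXPECTED_FIELDS fields."""
    for number, line in enumerate(lines, start=1):
        if len(split_fields(line.rstrip("\n"))) != EXPECTED_FIELDS:
            yield number, line


if __name__ == "__main__":
    preprocessed = [
        '"2"|A|"B"|"C""D"\n',
        # A quote followed by a pipe inside the last field is left unescaped,
        # so the field is split in two and the line gets reported.
        '5|A|"B"|"C"|D"\n',
    ]
    for number, line in suspicious_lines(preprocessed):
        print(number, line, end="")
```

    A line that is reported this way still needs a manual look; the check only says that the enclosures and delimiters do not add up, not how the data was meant to read.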


    You could also consider using an external table instead of SQL*Loader, if the data file is (or can be) in an Oracle directory and you have the right permissions. You still have to modify the file, but you could do it automatically with the preprocessor directive, rather than needing to do that explicitly before calling SQL*Loader.
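
    As far as I understand the preprocessor mechanism, Oracle runs the named program with the data file path as its argument and reads the converted records from the program's standard output. A filter along the following lines could play that role. It is a sketch under several assumptions: the script name escape_quotes.py is hypothetical, the file is assumed to be UTF-8, and a Python interpreter has to be available to the database's operating system user (in practice a small wrapper script with full paths may be needed).

```python
#!/usr/bin/env python3
import re
import sys

# Same idea as the Perl one-liner, applied one line at a time: escape a
# double-quote unless it sits at the start or end of the line or next to a pipe.
INNER_QUOTE = re.compile(r'(?<!^)(?<!\|)"(?!\|)(?!$)')


def escape_line(line):
    """Double the double-quotes that are not adjacent to a field boundary."""
    return INNER_QUOTE.sub('""', line)


def filter_file(path):
    """Write the escaped version of the file at `path` to standard output."""
    # Assumption: the data file is UTF-8 encoded.
    with open(path, encoding="utf-8") as source:
        for line in source:
            sys.stdout.write(escape_line(line))


if __name__ == "__main__":
    # The data file path is expected as the only argument.
    if len(sys.argv) == 2:
        filter_file(sys.argv[1])
    else:
        sys.stderr.write("usage: escape_quotes.py DATAFILE\n")
```

    The original file is never modified; the external table would simply see the escaped records each time it is queried.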
