Finding and removing non-ASCII characters from an Oracle VARCHAR2

猫巷女王i 2020-12-02 23:03

We are currently migrating one of our Oracle databases to UTF8 and we have found a few records that are near the 4000-byte VARCHAR2 limit. When we try to migrate these reco

17 Answers
  • 2020-12-02 23:49

    This removes NUL characters (chr(0)) from the column and trims the surrounding spaces:

    trim(replace(ntwk_slctor_key_txt, chr(0), ''))
    
  • 2020-12-02 23:52

    The following also works:

    select dump(a,1016), a from (
    SELECT REGEXP_REPLACE (
              CONVERT (
                 '3735844533120%$03  ',
                 'US7ASCII',
                 'WE8ISO8859P1'),
              '[^!@/\.,;:<>#$%&()_=[:alnum:][:blank:]]') a
      FROM DUAL);
    
  • 2020-12-02 23:54

    There's probably a more direct way using regular expressions. With luck, somebody else will provide it. But here's what I'd do without needing to go to the manuals.

    Create a PL/SQL function that takes your input string and returns a VARCHAR2.

    In the function, call ASCIISTR() on your input. PL/SQL is needed because the result may be longer than 4000 bytes, and a PL/SQL VARCHAR2 can hold up to 32K.

    ASCIISTR converts the non-ASCII characters to \xxxx notation, so you can use a regular expression to find and remove those escapes. Then return the result.
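
    The steps above could be sketched roughly as follows. The function name strip_non_ascii and the parameter name p_input are made up for illustration. One caveat, to the best of my knowledge: ASCIISTR also escapes a literal backslash as \005C, so backslashes in the input would be stripped as well.

    ```sql
    -- Hypothetical helper: name and parameter are illustrative only.
    CREATE OR REPLACE FUNCTION strip_non_ascii (p_input IN VARCHAR2)
      RETURN VARCHAR2
      DETERMINISTIC
    IS
      -- A PL/SQL VARCHAR2 can hold up to 32767 bytes, which leaves room
      -- for the escaped form of a value close to the 4000-byte SQL limit.
      l_escaped VARCHAR2(32767);
    BEGIN
      -- Turn every non-ASCII character into a \xxxx escape.
      l_escaped := ASCIISTR(p_input);

      -- Remove each backslash followed by exactly four hex digits.
      RETURN REGEXP_REPLACE(l_escaped, '\\[[:xdigit:]]{4}', '');
    END strip_non_ascii;
    /
    ```

    It could then be called from SQL as, say, SELECT strip_non_ascii(my_col) FROM my_table, where my_col and my_table are placeholders.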

  • 2020-12-02 23:54

    Please note that whenever you use

    regexp_like(column, '[A-Z]')
    

    Oracle's regexp engine will also match certain characters from the Latin-1 range: this applies to characters that look similar to ASCII letters, such as Ä->A, Ö->O, Ü->U, so [A-Z] does not behave the way you may know it from other environments like, say, Perl.

    Instead of fiddling with regular expressions, consider changing the column to the NVARCHAR2 datatype before the character set upgrade.

    Another approach: instead of cutting away part of the fields' contents, you might try the SOUNDEX function, provided your database contains European (i.e. Latin-1) characters only. Or you could write a function that translates characters from the Latin-1 range into similar-looking ASCII characters, like

    • å => a
    • ä => a
    • ö => o

    of course only for text blocks exceeding 4000 bytes when transformed to UTF-8.
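
    A minimal sketch of such a mapping, assuming the built-in TRANSLATE function is acceptable in place of a hand-written one. The sample string and the character lists are illustrative, and the lists would need extending to cover whatever actually occurs in the data.

    ```sql
    -- Map each character in the second argument to the character at the same
    -- position in the third argument; all other characters pass through unchanged.
    SELECT TRANSLATE('Smörgåsbord', 'åäöÅÄÖ', 'aaoAAO') AS ascii_text
      FROM DUAL;
    -- ascii_text: Smorgasbord
    ```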

  • 2020-12-02 23:55

    If you use the ASCIISTR function to convert the Unicode to literals of the form \nnnn, you can then use REGEXP_REPLACE to strip those literals out, like so...

    UPDATE table SET field = REGEXP_REPLACE(ASCIISTR(field), '\\[[:xdigit:]]{4}', '')
    

    ...where field and table are your field and table names respectively.
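
    Before running the UPDATE, it may be worth previewing which rows would change. This is only a sketch: my_table and my_field are placeholder names, and for values close to the 4000-byte limit the ASCIISTR result may not fit in a SQL VARCHAR2, in which case a PL/SQL wrapper is probably the safer route.

    ```sql
    -- List rows whose value contains something ASCIISTR would escape,
    -- next to the value the UPDATE would write.
    SELECT my_field,
           REGEXP_REPLACE(ASCIISTR(my_field), '\\[[:xdigit:]]{4}', '') AS cleaned
      FROM my_table
     WHERE my_field <> ASCIISTR(my_field);
    ```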
