Finding and removing non-ASCII characters from an Oracle VARCHAR2

猫巷女王i 2020-12-02 23:03

We are currently migrating one of our Oracle databases to UTF8 and have found a few records that are near the 4000-byte VARCHAR2 limit. When we try to migrate these reco

17 answers
  • 2020-12-02 23:45

    Thanks, this worked for my purposes. BTW, the example as originally posted was missing a single quote.

    REGEXP_REPLACE(COLUMN, '[^' || CHR(32) || '-' || CHR(127) || ']', ' ')
    

    I used it in a word-wrap function. Occasionally the incoming text contained an embedded newline (NL, CHR(10), hex 0A) that was messing things up.
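
    A minimal sketch of that newline case, assuming NLS_SORT is BINARY so the range is evaluated by character code. The input literal is made up for illustration:

    ```sql
    -- CHR(10) lies outside CHR(32)-CHR(127), so it is replaced by a space
    SELECT REGEXP_REPLACE('line1' || CHR(10) || 'line2',
                          '[^' || CHR(32) || '-' || CHR(127) || ']',
                          ' ') AS cleaned
    FROM dual;
    -- expected: line1 line2
    ```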

  • 2020-12-02 23:46

    In a single-byte ASCII-compatible encoding (e.g. Latin-1), ASCII characters are simply bytes in the range 0 to 127. So you can use something like [\x80-\xFF] to detect non-ASCII characters.
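
    Oracle's regular expression functions do not, to my knowledge, accept \xNN hex escapes, so in Oracle the same range would have to be built with CHR. A rough sketch, assuming a single-byte database character set such as WE8ISO8859P1 and NLS_SORT = BINARY; your_table and your_col are placeholder names:

    ```sql
    -- Rows containing at least one character with a code from 128 to 255
    SELECT *
    FROM your_table
    WHERE REGEXP_LIKE(your_col, '[' || CHR(128) || '-' || CHR(255) || ']');
    ```

    In a multi-byte character set such as AL32UTF8 this byte-range approach does not apply as written.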

  • 2020-12-02 23:46

    You can try something like the following to find rows where the column contains non-ASCII characters:

    select * from your_table where your_col <> asciistr(your_col);
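
    Why this works: ASCIISTR rewrites each non-ASCII character as a \XXXX escape, so the result differs from the input only when such a character is present. A small sketch (UNISTR is used here only to build a test value that does not depend on the client character set):

    ```sql
    -- e-acute (U+00E9) comes back as the escape \00E9
    SELECT ASCIISTR(UNISTR('caf\00E9')) AS escaped
    FROM dual;
    -- expected: caf\00E9
    ```

    One caveat worth checking on your version: ASCIISTR also seems to escape a literal backslash (as \005C), so rows containing only ASCII text plus a backslash may be reported too.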
    
  • 2020-12-02 23:47

    I wouldn't recommend it for production code, but it makes sense and seems to work:

    SELECT REGEXP_REPLACE(COLUMN,'[^' || CHR(1) || '-' || CHR(127) || ']','')
    
  • 2020-12-02 23:47

    I had a similar issue and blogged about it here. I started with the regular expression for alphanumerics, then added the few basic punctuation characters I wanted to allow:

    select dump(a,1016), a, b
    from
     (select regexp_replace(COLUMN,'[[:alnum:]/''%()> -.:=;[]','') a,
             COLUMN b
      from TABLE)
    where a is not null
    order by a;
    

    I used dump with the 1016 format to show the hex codes of the characters I wanted to replace, which I could then use in a utl_raw.cast_to_varchar2 call.
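
    A sketch of how those two pieces might fit together, assuming an AL32UTF8 database where a non-breaking space (U+00A0) is stored as the bytes C2 A0; your_table and your_col are placeholder names:

    ```sql
    -- DUMP(..., 1016) lists the stored bytes in hex along with the character set name.
    -- The hex seen there can be turned back into a string and passed to REPLACE.
    SELECT DUMP(your_col, 1016) AS raw_bytes,
           REPLACE(your_col,
                   UTL_RAW.CAST_TO_VARCHAR2(HEXTORAW('C2A0')),
                   ' ') AS cleaned
    FROM your_table;
    ```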

  • 2020-12-02 23:47

    I'm a bit late in answering this question, but had the same problem recently (people cut and paste all sorts of stuff into a string and we don't always know what it is). The following is a simple character whitelist approach:

    SELECT est.clients_ref
      ,TRANSLATE (
                  est.clients_ref
                 ,   'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
                  || REPLACE (
                              TRANSLATE (
                                         est.clients_ref
                                        ,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
                                        ,'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
                                        )
                             ,'~'
                             )
                 ,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
                 )
          clean_ref
    FROM edms_staging_table est
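
    To see the mechanics on a small scale, here is a sketch with the whitelist cut down to 'abc' and a made-up input literal:

    ```sql
    -- Inner TRANSLATE turns whitelisted characters into '~': '~!~?~'
    -- REPLACE drops the '~', leaving only the unwanted characters: '!?'
    -- Outer TRANSLATE maps 'abc' to itself and deletes '!' and '?',
    -- because they have no counterpart in the shorter third argument.
    SELECT TRANSLATE('a!b?c',
                     'abc' || REPLACE(TRANSLATE('a!b?c', 'abc', '~~~'), '~'),
                     'abc') AS clean_ref
    FROM dual;
    -- expected: abc
    ```

    Any character missing from the whitelist is removed. In the list above that appears to include the space, so add it to the whitelist if spaces should survive.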
