We are currently migrating one of our Oracle databases to UTF8 and have found a few records that are near the 4000-byte varchar limit. When we try to migrate these reco
Try this. It removes any NUL characters (chr(0)) from the column and then trims leading and trailing spaces:
trim(replace(ntwk_slctor_key_txt, chr(0), ''))
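To see which rows would be affected before changing anything, something along these lines might help. It is only a sketch: my_table is a placeholder table name, and it assumes the stray characters really are NUL bytes.

```sql
-- my_table is a hypothetical table name; ntwk_slctor_key_txt is the column used above.
-- Count the rows that contain at least one NUL character.
SELECT COUNT(*) AS rows_with_nul
  FROM my_table
 WHERE INSTR(ntwk_slctor_key_txt, CHR(0)) > 0;

-- Clean only those rows: remove every NUL, then trim surrounding spaces.
UPDATE my_table
   SET ntwk_slctor_key_txt = TRIM(REPLACE(ntwk_slctor_key_txt, CHR(0), ''))
 WHERE INSTR(ntwk_slctor_key_txt, CHR(0)) > 0;
```

Restricting the UPDATE with the same INSTR condition leaves rows without NUL characters untouched.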
The following also works:
SELECT DUMP(a, 1016), a
  FROM (SELECT REGEXP_REPLACE(
                 CONVERT('3735844533120%$03 ', 'US7ASCII', 'WE8ISO8859P1'),
                 '[^!@/\.,;:<>#$%&()_=[:alnum:][:blank:]]') a
          FROM DUAL);
There's probably a more direct way using regular expressions. With luck, somebody else will provide it. But here's what I'd do without needing to go to the manuals.
Create a PL/SQL function that receives your input string and returns a varchar2.
In the PL/SQL function, apply asciistr() to your input. PL/SQL is needed because the result may be longer than 4000 bytes, and a varchar2 in PL/SQL can hold up to 32K.
asciistr() converts the non-ASCII characters to \xxxx notation, so you can use regular expressions to find and remove those. Then return the result.
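The steps above could be sketched roughly as follows. The function name strip_non_ascii and the parameter name are made up for illustration. One caveat, as far as I know: ASCIISTR also escapes a literal backslash as \005C, so this sketch removes backslashes along with the non-ASCII characters.

```sql
-- strip_non_ascii and p_input are hypothetical names, not from the original post.
CREATE OR REPLACE FUNCTION strip_non_ascii (p_input IN VARCHAR2)
  RETURN VARCHAR2
  DETERMINISTIC
IS
  -- A PL/SQL VARCHAR2 can hold up to 32767 bytes, so the escaped form
  -- of a 4000-byte input (at most 5 bytes per character) fits.
  l_escaped VARCHAR2(32767);
BEGIN
  -- Turn every non-ASCII character into a \XXXX escape sequence.
  l_escaped := ASCIISTR(p_input);

  -- Remove each backslash followed by exactly four hex digits.
  -- Assumption: a literal backslash in the input arrives here as \005C
  -- and is therefore removed as well.
  RETURN REGEXP_REPLACE(l_escaped, '\\[[:xdigit:]]{4}', '');
END strip_non_ascii;
/
```

It could then be called from SQL, for example SELECT strip_non_ascii(some_column) FROM some_table, where both names are again placeholders.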
Please note that whenever you use
regexp_like(column, '[A-Z]')
Oracle's regexp engine will match certain characters from the Latin-1 range as well: this applies to all characters that look similar to ASCII characters like Ä->A, Ö->O, Ü->U, etc., so that [A-Z] is not what you know from other environments like, say, Perl.
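To check how your own session behaves, a quick test like the one below may be useful. As far as I understand it, the way a range such as [A-Z] is interpreted depends on the session's NLS_SORT setting, so the result can differ between environments. The column aliases are just illustrative.

```sql
-- Show the sort order the current session uses for linguistic comparisons.
SELECT value AS nls_sort
  FROM nls_session_parameters
 WHERE parameter = 'NLS_SORT';

-- UNISTR('\00C4') is the character Ä, written as an escape to avoid
-- depending on the client's character set.
SELECT CASE
         WHEN REGEXP_LIKE(UNISTR('\00C4'), '[A-Z]') THEN 'matched'
         ELSE 'not matched'
       END AS a_umlaut_vs_range
  FROM dual;
```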
Instead of fiddling with regular expressions, try changing the column to the NVARCHAR2 datatype before the character set upgrade.
Another approach: instead of cutting away part of the fields' contents, you might try the SOUNDEX function, provided your database contains European (i.e. Latin-1) characters only. Or you could write a function that translates characters from the Latin-1 range into similar-looking ASCII characters, of course only for text blocks exceeding 4000 bytes when transformed to UTF-8.
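A translation function of that kind might look something like this. It is a sketch under several assumptions: the database character set is a single-byte Latin-1 set such as WE8ISO8859P1, the function name latin1_to_ascii is invented here, the character list is deliberately incomplete (no ß or Æ, for instance), and the source must be compiled with a client character set that preserves the accented literals.

```sql
-- latin1_to_ascii and p_input are hypothetical names.
CREATE OR REPLACE FUNCTION latin1_to_ascii (p_input IN VARCHAR2)
  RETURN VARCHAR2
IS
  -- Up to 32767 bytes in PL/SQL, enough room for the UTF-8 form of a 4000-byte value.
  l_utf8 VARCHAR2(32767);
BEGIN
  -- Convert from the database character set to UTF-8 only to measure the byte length.
  l_utf8 := CONVERT(p_input, 'AL32UTF8');

  IF p_input IS NULL OR LENGTHB(l_utf8) <= 4000 THEN
    RETURN p_input;  -- still fits after migration, so leave it unchanged
  END IF;

  -- Replace each accented letter with the unaccented letter at the same
  -- position in the second list (both lists have the same length).
  RETURN TRANSLATE(
           p_input,
           'ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöùúûüýÿ',
           'AAAAAACEEEEIIIINOOOOOUUUUYaaaaaaceeeeiiiinooooouuuuyy');
END latin1_to_ascii;
/
```

Characters outside the list are left as they are, so a value could still be too long after translation and would need a second look.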
If you use the ASCIISTR function to convert the Unicode to literals of the form \nnnn, you can then use REGEXP_REPLACE to strip those literals out, like so...
UPDATE table SET field = REGEXP_REPLACE(ASCIISTR(field), '\\[[:xdigit:]]{4}', '')
...where field and table are your field and table names respectively.
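Before running an update like that, it may be worth previewing what would change. The query below is a sketch with placeholder names my_table and my_field. It lists only the rows whose escaped form differs from the stored value, which, as far as I know, also includes rows containing a literal backslash, since ASCIISTR escapes that too.

```sql
-- my_table and my_field are hypothetical names.
-- Show the stored value next to the value the update would write.
SELECT my_field AS original_value,
       REGEXP_REPLACE(ASCIISTR(my_field), '\\[[:xdigit:]]{4}', '') AS cleaned_value
  FROM my_table
 WHERE my_field <> ASCIISTR(my_field);
```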