Using nzload to load special characters

Submitted on 2019-12-20 05:34:27

Question


I have extended ASCII characters in Oracle table data which I am able to extract to a file using sqlplus, with the \ escape character prefixed. I want to use nzload to load the exact same data into a Netezza table.

nzload adds a couple of extra bytes when it encounters this character sequence (c2bf) in the extracted file data:

echo "PROFESSIONAL¿" | od -x
0000000  5052 4f46 4553 5349 4f4e 414c c2bf 0a00

after nzload:

echo "PROFESSIONALÂ¿" | od -x
0000000  5052 4f46 4553 5349 4f4e 414c c382 c2bf

on the nzload command line, I have these options: -escapechar \ -ctrlchars

Can anyone provide any help with this?


Answer 1:


I'm not very savvy with Unicode conversion issues, but I've done this to myself before, and I'll demonstrate what I think is happening.

I believe what you are seeing here is not an issue with loading special characters with nzload, but rather an issue with how your display/terminal software is displaying the data and/or how Netezza is storing the character data. I suspect a double conversion to/from UTF-8 (the Unicode encoding that Netezza supports). Let's see if we can suss out which it is.
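As a quick illustration (just a sketch on my part, assuming GNU iconv is available), pushing already-UTF-8 bytes through a spurious Latin-1-to-UTF-8 conversion produces exactly the extra byte pair you are seeing:

$ printf 'PROFESSIONAL\xc2\xbf' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
 50 52 4f 46 45 53 53 49 4f 4e 41 4c c3 82 c2 bf

The c2 byte, read as the Latin-1 character Â, is re-encoded as c3 82, and bf (¿) becomes c2 bf.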

Here I am using PuTTY with the default (for me) Remote Character Set as Latin-1.

$ od -xa input.txt
0000000    5250    464f    5345    4953    4e4f    4c41    bfc2    000a
          P   R   O   F   E   S   S   I   O   N   A   L   B   ?  nl
0000017

$ cat input.txt
PROFESSIONALÂ¿

Here we can see from od that the file contains only the data we expect; however, when we cat the file we see an extra character. If it's not in the file, then the character is likely coming from the display translation.

If I change the PuTTY settings to have UTF-8 be the remote character set, we see it this way:

$ od -xa input.txt
0000000    5250    464f    5345    4953    4e4f    4c41    bfc2    000a
          P   R   O   F   E   S   S   I   O   N   A   L   B   ?  nl
0000017
$ cat input.txt
PROFESSIONAL¿

So, the same source data, but two different on-screen representations, which are, not coincidentally, the same as your two different outputs. The same data can be displayed at least two ways.

Now let's see how it loads into Netezza, once into a VARCHAR column, and again into an NVARCHAR column.

create table test_enc_vchar (col1 varchar(50));
create table test_enc_nvchar (col1 nvarchar(50));

$ nzload -db testdb -df input.txt -t test_enc_vchar -escapechar '\' -ctrlchars
Load session of table 'TEST_ENC_VCHAR' completed successfully
$ nzload -db testdb -df input.txt -t test_enc_nvchar -escapechar '\' -ctrlchars
Load session of table 'TEST_ENC_NVCHAR' completed successfully

The data loaded with no errors. Note that while I specify the escapechar option for nzload, none of the characters in this particular sample of input data require escaping, nor are they escaped.

I will now use the rawtohex function from the SQL Extension Toolkit as an in-database equivalent of the od command we used above.

select rawtohex(col1) from test_enc_vchar;
           RAWTOHEX
------------------------------
 50524F46455353494F4E414CC2BF
(1 row)

select rawtohex(col1) from test_enc_nvchar;
           RAWTOHEX
------------------------------
 50524F46455353494F4E414CC2BF
(1 row)

At this point both columns seem to have exactly the same data as the input file. So far, so good.

What if we select the column? For the record, I am doing this in a PuTTY session with remote character set of UTF-8.

select col1 from test_enc_vchar;
      COL1
----------------
 PROFESSIONALÂ¿
(1 row)

select col1 from test_enc_nvchar;
     COL1
---------------
 PROFESSIONAL¿
(1 row)

Same binary data, but different display. If I then copy the output of each of those selects into echo piped to od,

$ echo PROFESSIONALÂ¿ | od -xa
0000000    5250    464f    5345    4953    4e4f    4c41    82c3    bfc2
          P   R   O   F   E   S   S   I   O   N   A   L   C stx   B   ?
0000020    000a
         nl
0000021

$ echo PROFESSIONAL¿ | od -xa
0000000    5250    464f    5345    4953    4e4f    4c41    bfc2    000a
          P   R   O   F   E   S   S   I   O   N   A   L   B   ?  nl
0000017
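Running that doubled sequence back through a single UTF-8-to-Latin-1 conversion (again just a sketch, assuming iconv is on hand) recovers the original two bytes, which fits the double-conversion theory:

$ printf 'PROFESSIONAL\xc3\x82\xc2\xbf' | iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1
 50 52 4f 46 45 53 53 49 4f 4e 41 4c c2 bf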

Based on this output, I'd wager that you are loading your sample data, which I'd also wager is UTF-8, into a VARCHAR column rather than an NVARCHAR column. This is not, in and of itself, a problem, but it can lead to display/conversion issues down the line.

Generally speaking, you'd want to load UTF-8 data into NVARCHAR columns.
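If the data must end up in the VARCHAR column anyway, one option, sketched here on the assumptions that the non-Unicode side expects Latin-9 and that every character in the file has a Latin-9 equivalent (input_latin9.txt is just an illustrative name), is to convert the file itself before loading instead of letting the bytes be reinterpreted:

$ iconv -f UTF-8 -t ISO-8859-15 input.txt > input_latin9.txt
$ nzload -db testdb -df input_latin9.txt -t test_enc_vchar -escapechar '\' -ctrlchars

Any character with no Latin-9 mapping will make iconv fail with an error, which is usually preferable to silently storing mis-encoded data.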



Source: https://stackoverflow.com/questions/34537853/using-nzload-to-load-special-characters
