Converting a PostgreSQL database from SQL_ASCII, containing mixed encoding types, to UTF-8

Asked by 没有蜡笔的小新 on 2021-01-13 14:11

I have a PostgreSQL database I would like to convert to UTF-8.

The problem is that it is currently SQL_ASCII, so it hasn't been doing any kind of encoding conversion on its input, and it has ended up containing data in a mix of encodings.

4 Answers
  •  被撕碎了的回忆 (answered 2021-01-13 14:35)

    I've seen exactly this problem myself, actually. The short answer: there's no straightforward algorithm. But there is some hope.

    First, in my experience, the data tends to be:

    • 99% ASCII
    • 0.9% UTF-8
    • 0.1% other, 75% of which is Windows-1252.

    So let's use that. You'll want to analyze your own dataset, to see if it follows this pattern. (I am in America, so this is typical. I imagine a DB containing data based in Europe might not be so lucky, and something further east even less so.)

    First, almost every encoding in use today contains ASCII as a subset. UTF-8 does, ISO-8859-1 does, etc. Thus, if a field contains only octets in the range [0, 0x7F] (i.e., ASCII characters), then it's probably encoded as ASCII/UTF-8/ISO-8859-1/etc. If you're dealing with American English, this will probably take care of 99% of your data.
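
    For example, testing that ASCII-only case on one field's raw bytes is a one-liner in Python (the variable raw here is just a placeholder for the field's bytes):

        raw = b"plain old ASCII text"          # placeholder: one field's raw bytes
        # True only if every octet falls in the range 0x00-0x7F
        is_ascii = all(b < 0x80 for b in raw)  # or raw.isascii() on Python 3.7+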

    On to what's left.

    UTF-8 has some nice properties: a character is either a single ASCII byte, or a multi-byte sequence in which every byte after the first is 10xxxxxx in binary. So: run your remaining fields through a UTF-8 decoder, one that will choke if you give it garbage. In my experience, the fields it doesn't choke on are probably valid UTF-8. (A false positive is possible here: a tricky ISO-8859-1 field can also happen to be valid UTF-8.)
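
    In Python, for instance, the built-in UTF-8 codec is strict by default, so a plain decode attempt works as that "choking" decoder (a minimal sketch):

        def is_valid_utf8(raw: bytes) -> bool:
            # decode() raises UnicodeDecodeError on any malformed UTF-8 sequence
            try:
                raw.decode("utf-8")
                return True
            except UnicodeDecodeError:
                return False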

    Last, if it's not ASCII, and it doesn't decode as UTF-8, Windows-1252 seems to be the next good choice to try. Almost everything is valid Windows-1252 though, so it's hard to get failures here.

    You might do this (a rough sketch of the cascade follows the list):

    • Attempt to decode as ASCII. If successful, assume ASCII.
    • Attempt to decode as UTF-8.
    • Attempt to decode as Windows-1252.
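
    A minimal Python sketch of that cascade might look like this (the priority order is as above; the function name and the "unknown" label are just placeholders):

        def guess_encoding(raw: bytes) -> str:
            """Best-guess encoding label for one field's raw bytes."""
            if raw.isascii():
                return "ascii"
            try:
                raw.decode("utf-8")
                return "utf-8"
            except UnicodeDecodeError:
                pass
            try:
                # Windows-1252: only bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are undefined
                raw.decode("cp1252")
                return "cp1252"
            except UnicodeDecodeError:
                return "unknown"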

    For the UTF-8 and Windows-1252 cases, output the table's PK and the "guess" decoded text to a text file (convert the Windows-1252 text to UTF-8 before outputting). Have a human look it over to see whether anything seems out of place. If there isn't too much non-ASCII data (and as I said, ASCII tends to dominate if you're in America), a human can review the whole thing.
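
    Here is one way that export step might look with psycopg2 (a sketch, not a drop-in script: the DSN, table, and column names are placeholders, and convert_to(..., 'SQL_ASCII') is used so the stored bytes come back untouched as bytea):

        import psycopg2

        conn = psycopg2.connect("dbname=mydb")   # placeholder connection string
        cur = conn.cursor()
        cur.execute("SELECT id, convert_to(name, 'SQL_ASCII') FROM people")

        with open("review_utf8.txt", "w", encoding="utf-8") as utf8_out, \
             open("review_cp1252.txt", "w", encoding="utf-8") as cp1252_out:
            for pk, raw in cur:
                if raw is None:
                    continue                      # skip NULLs
                raw = bytes(raw)                  # psycopg2 returns bytea as memoryview
                if raw.isascii():
                    continue                      # pure ASCII: nothing to review
                try:
                    utf8_out.write(f"{pk}\t{raw.decode('utf-8')}\n")
                except UnicodeDecodeError:
                    # fall back to Windows-1252; the file handle re-encodes it as UTF-8
                    cp1252_out.write(f"{pk}\t{raw.decode('cp1252', errors='replace')}\n")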

    Also, if you have some idea of what your data looks like, you can sanity-check the decoded text against the characters you would expect in that field. For example, if a field decodes as valid UTF-8 but contains a "©", and the field is a person's name, then it was probably a false positive and should be looked at more closely.
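
    A rough illustration of that kind of sanity check (the character set here is only an example; tune it to whatever is plausible for the field):

        # Characters that are valid UTF-8 but unlikely in, say, a name column;
        # their presence often means the bytes were really Windows-1252/ISO-8859-1.
        SUSPECT_CHARS = set("©®±Â")

        def needs_human_review(decoded: str) -> bool:
            return any(ch in SUSPECT_CHARS for ch in decoded)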

    Lastly, be aware that when you change to a UTF-8 database, whatever has been inserting this garbage data in the past is probably still there: you'll need to track down this system and teach it character encoding.
