Converting a PostgreSQL database from SQL_ASCII, containing mixed encoding types, to UTF-8

Asked by 没有蜡笔的小新 on 2021-01-13 14:11

I have a PostgreSQL database I would like to convert to UTF-8.

The problem is that it is currently SQL_ASCII, so it hasn't been doing any kind of encoding conversion on its input, and it has ended up containing data in a mix of encodings.

4 Answers
  •  被撕碎了的回忆 (answered 2021-01-13 14:35)

    I've seen exactly this problem myself, actually. The short answer: there's no straightforward algorithm. But there is some hope.

    First, in my experience, the data tends to be:

    • 99% ASCII
    • 0.9% UTF-8
    • 0.1% other, 75% of which is Windows-1252.

    So let's use that. You'll want to analyze your own dataset, to see if it follows this pattern. (I am in America, so this is typical. I imagine a DB containing data based in Europe might not be so lucky, and something further east even less so.)

    First, almost every encoding in use today contains ASCII as a subset. UTF-8 does, ISO-8859-1 does, etc. Thus, if a field contains only octets in the range [0, 0x7F] (i.e., ASCII characters), then it's probably encoded as ASCII/UTF-8/ISO-8859-1/etc. If you're dealing with American English, this will probably take care of 99% of your data.
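
    For example, testing that ASCII-only case on one field's raw bytes is a one-liner in Python (the variable raw here is just a placeholder for the field's bytes):

        raw = b"plain old ASCII text"          # placeholder: one field's raw bytes
        # True only if every octet falls in the range 0x00-0x7F
        is_ascii = all(b < 0x80 for b in raw)  # or raw.isascii() on Python 3.7+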

    On to what's left.

    UTF-8 has some nice properties: a character is either a single ASCII byte, or a multi-byte sequence in which every byte after the first is 10xxxxxx in binary. So: run your remaining fields through a UTF-8 decoder, one that will choke if you give it garbage. In my experience, the fields it doesn't choke on are probably valid UTF-8. (A false positive is possible here: a tricky ISO-8859-1 field can also happen to be valid UTF-8.)
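
    In Python, for instance, the built-in UTF-8 codec is strict by default, so a plain decode attempt works as that "choking" decoder (a minimal sketch):

        def is_valid_utf8(raw: bytes) -> bool:
            # decode() raises UnicodeDecodeError on any malformed UTF-8 sequence
            try:
                raw.decode("utf-8")
                return True
            except UnicodeDecodeError:
                return False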

    Last, if it's not ASCII, and it doesn't decode as UTF-8, Windows-1252 seems to be the next good choice to try. Almost everything is valid Windows-1252 though, so it's hard to get failures here.

    You might do this (a rough sketch of the cascade follows the list):

    • Attempt to decode as ASCII. If successful, assume ASCII.
    • Attempt to decode as UTF-8.
    • Attempt to decode as Windows-1252.
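
    A minimal Python sketch of that cascade might look like this (the priority order is as above; the function name and the "unknown" label are just placeholders):

        def guess_encoding(raw: bytes) -> str:
            """Best-guess encoding label for one field's raw bytes."""
            if raw.isascii():
                return "ascii"
            try:
                raw.decode("utf-8")
                return "utf-8"
            except UnicodeDecodeError:
                pass
            try:
                # Windows-1252: only bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are undefined
                raw.decode("cp1252")
                return "cp1252"
            except UnicodeDecodeError:
                return "unknown"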

    For the UTF-8 and Windows-1252 cases, output the table's PK and the "guess" decoded text to a text file (convert the Windows-1252 text to UTF-8 before outputting). Have a human look it over to see whether anything seems out of place. If there isn't too much non-ASCII data (and as I said, ASCII tends to dominate if you're in America), a human can review the whole thing.
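
    Here is one way that export step might look with psycopg2 (a sketch, not a drop-in script: the DSN, table, and column names are placeholders, and convert_to(..., 'SQL_ASCII') is used so the stored bytes come back untouched as bytea):

        import psycopg2

        conn = psycopg2.connect("dbname=mydb")   # placeholder connection string
        cur = conn.cursor()
        cur.execute("SELECT id, convert_to(name, 'SQL_ASCII') FROM people")

        with open("review_utf8.txt", "w", encoding="utf-8") as utf8_out, \
             open("review_cp1252.txt", "w", encoding="utf-8") as cp1252_out:
            for pk, raw in cur:
                if raw is None:
                    continue                      # skip NULLs
                raw = bytes(raw)                  # psycopg2 returns bytea as memoryview
                if raw.isascii():
                    continue                      # pure ASCII: nothing to review
                try:
                    utf8_out.write(f"{pk}\t{raw.decode('utf-8')}\n")
                except UnicodeDecodeError:
                    # fall back to Windows-1252; the file handle re-encodes it as UTF-8
                    cp1252_out.write(f"{pk}\t{raw.decode('cp1252', errors='replace')}\n")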

    Also, if you have some idea of what your data looks like, you can sanity-check the decoded text against the characters you would expect in that field. For example, if a field decodes as valid UTF-8 but contains a "©", and the field is a person's name, then it was probably a false positive and should be looked at more closely.
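
    A rough illustration of that kind of sanity check (the character set here is only an example; tune it to whatever is plausible for the field):

        # Characters that are valid UTF-8 but unlikely in, say, a name column;
        # their presence often means the bytes were really Windows-1252/ISO-8859-1.
        SUSPECT_CHARS = set("©®±Â")

        def needs_human_review(decoded: str) -> bool:
            return any(ch in SUSPECT_CHARS for ch in decoded)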

    Lastly, be aware that when you change to a UTF-8 database, whatever has been inserting this garbage data in the past is probably still there: you'll need to track down this system and teach it character encoding.
