Converting a PostgreSQL database from SQL_ASCII, containing mixed encoding types, to UTF-8

没有蜡笔的小新 2021-01-13 14:11

I have a PostgreSQL database I would like to convert to UTF-8.

The problem is that it is currently SQL_ASCII, so it hasn't been doing any kind of encoding conversion o

4 Answers
  • 2021-01-13 14:35

    I've seen exactly this problem myself, actually. The short answer: there's no straightforward algorithm. But there is some hope.

    First, in my experience, the data tends to be:

    • 99% ASCII
    • 0.9% UTF-8
    • 0.1% other, about 75% of which is Windows-1252.

    So let's use that. You'll want to analyze your own dataset, to see if it follows this pattern. (I am in America, so this is typical. I imagine a DB containing data based in Europe might not be so lucky, and something further east even less so.)

    To begin with, almost every encoding in use today contains ASCII as a subset. UTF-8 does, ISO-8859-1 does, etc. Thus, if a field contains only octets within the range [0, 0x7F] (i.e., ASCII characters), then it reads the same whether it was written as ASCII, UTF-8, ISO-8859-1, etc., and needs no conversion. If you're dealing with American English, this will probably take care of 99% of your data.
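To see whether your data follows this pattern, you can measure how much of it is pure ASCII. This is a minimal sketch, assuming you have already fetched the column values as raw `bytes`; the function name `ascii_share` is made up for illustration.

```python
from typing import Iterable


def ascii_share(values: Iterable[bytes]) -> float:
    """Return the fraction of values whose bytes all fall in [0, 0x7F]."""
    values = list(values)
    if not values:
        return 0.0
    # bytes.isascii() (Python 3.7+) is True when every byte is below 0x80.
    return sum(v.isascii() for v in values) / len(values)


# Hypothetical sample: three pure-ASCII values and one with a high byte.
sample = [b"abc", b"caf\xe9", b"", b"x"]
share = ascii_share(sample)
```

Everything counted here can be left alone; only the remainder needs the guessing described next.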

    On to what's left.

    UTF-8 has some nice properties: each character is either a single ASCII byte, or a multi-byte sequence in which every byte after the first has the form 10xxxxxx in binary. So: run your remaining fields through a strict UTF-8 decoder (one that will choke if you give it garbage). On the fields it doesn't choke on, my experience has been that they're probably valid UTF-8. (It is possible to get a false positive here: we could have a tricky ISO-8859-1 field that is also valid UTF-8.)
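As a rough illustration, Python's built-in decoder is strict by default and raises on malformed input. The helper name `is_valid_utf8` and the sample strings are assumptions for this sketch, not part of any tool mentioned here.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8 in strict mode."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# "é" in UTF-8 is the two-byte sequence 0xC3 0xA9.
utf8_bytes = "café".encode("utf-8")

# In Windows-1252 the same "é" is a lone 0xE9, which is not valid UTF-8.
cp1252_bytes = "café".encode("cp1252")

# False positive: the ISO-8859-1 text "Ã©" is also the bytes 0xC3 0xA9,
# so it passes the check even though it was not written as UTF-8.
tricky_bytes = "Ã©".encode("latin-1")
```

The last case is why the check can only say "probably UTF-8": the bytes alone cannot tell you which text the writer meant.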

    Last, if it's not ASCII, and it doesn't decode as UTF-8, Windows-1252 seems to be the next good choice to try. Almost everything is valid Windows-1252 though, so it's hard to get failures here.

    You might do this:

    • Attempt to decode as ASCII. If successful, assume ASCII.
    • Otherwise, attempt to decode as UTF-8.
    • Otherwise, attempt to decode as Windows-1252.

    For the UTF-8 and Windows-1252, output the table's PK and the "guess" decoded text to a text file (convert the Windows-1252 to UTF-8 before outputting). Have a human look over it, see if they see anything out of place. If there's not too much non-ASCII data (and like I said, ASCII tends to dominate, if you're in America...), then a human could look over the whole thing.
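The three-step guess and the review file could be sketched like this. It assumes rows arrive as `(primary_key, raw_bytes)` pairs; `guess_encoding`, `write_review_report`, and the tab-separated layout are hypothetical choices, and the "unknown" fallback is my own addition for the few byte values Python's `cp1252` codec rejects.

```python
import io
from typing import Iterable, TextIO, Tuple


def guess_encoding(raw: bytes) -> Tuple[str, str]:
    """Try ASCII, then UTF-8, then Windows-1252; return (guess, decoded text)."""
    for name in ("ascii", "utf-8", "cp1252"):
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Python's cp1252 codec rejects 0x81, 0x8D, 0x8F, 0x90 and 0x9D.
    # Assumption: mark such fields "unknown" and show replacement characters.
    return "unknown", raw.decode("cp1252", errors="replace")


def write_review_report(rows: Iterable[Tuple[object, bytes]], out: TextIO) -> int:
    """Write one line per non-ASCII row for a human to review; return the count."""
    flagged = 0
    for pk, raw in rows:
        encoding, text = guess_encoding(raw)
        if encoding != "ascii":
            out.write(f"{pk}\t{encoding}\t{text}\n")
            flagged += 1
    return flagged


# Hypothetical rows: one ASCII, one UTF-8, one Windows-1252.
rows = [(1, b"plain"), (2, "café".encode("utf-8")), (3, b"caf\xe9")]
report = io.StringIO()
flagged = write_review_report(rows, report)
```

For a real run, pass a file opened with `encoding="utf-8"` instead of the `io.StringIO` buffer, so the Windows-1252 guesses are written out as UTF-8.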

    Also, if you have some idea about what your data looks like, you could restrict decodings to certain characters. For example, if a field decodes as valid UTF-8 text, but contains a "©", and the field is a person's name, then it was probably a false positive, and should be looked at more closely.
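One way to sketch that kind of restriction for a name field is shown below. The allowed set (letters, combining marks, and a little punctuation) is an assumption you would tune to your own data, and `unexpected_chars` is a hypothetical helper.

```python
import unicodedata
from typing import List

# Assumption: punctuation that may plausibly appear in a person's name.
ALLOWED_PUNCTUATION = set(" -'.")


def unexpected_chars(name: str) -> List[str]:
    """Return the characters in name that are neither letters, marks, nor allowed punctuation."""
    return [
        ch
        for ch in name
        # Unicode categories starting with "L" are letters, "M" are combining marks.
        if not (
            unicodedata.category(ch).startswith(("L", "M"))
            or ch in ALLOWED_PUNCTUATION
        )
    ]


clean = unexpected_chars("José O'Neil-Smith")
suspect = unexpected_chars("Jos© Smith")
```

Any field with a non-empty result goes to the top of the human review pile rather than being rejected outright.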

    Lastly, be aware that when you change to a UTF-8 database, whatever has been inserting this garbage data in the past is probably still there: you'll need to track down this system and teach it character encoding.

  • This is exactly the problem that Encoding::FixLatin was written to solve*.

    If you install the Perl module then you'll also get the fix_latin command-line utility which you can use like this:

    pg_restore -O dump_file | fix_latin | psql -d database
    

    Read the 'Limitations' section of the documentation to understand how it works.

    [*] Note I'm assuming that when you say ISO-8859-x you mean ISO-8859-1 and when you say CP125x you mean CP1252 - because the mix of ASCII, UTF-8, Latin-1 and WinLatin-1 is a common case. But if you really do have a mixture of eastern and western encodings then sorry but you're screwed :-(

  • 2021-01-13 14:43

    It is impossible without some knowledge of the data first. Do you know if it is a text message or people's names or places? In some particular language?

    You can try decoding a line of the dump with each candidate encoding and apply some heuristic. For example, run an automatic spell checker on each result and choose the encoding that produces the fewest errors, or the most known words, etc.

    For example, you can use "aspell list -l en" (en for English, pl for Polish, fr for French, etc.) to get a list of misspelled words, then choose the encoding that produces the fewest of them. You'd need to install the corresponding dictionary package, for example "aspell-en" on my Fedora 13 Linux system.
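The scoring idea can be sketched without calling aspell at all. In this rough example a tiny in-memory word set stands in for a real dictionary; `KNOWN_WORDS`, `best_encoding`, and the candidate list are all made up for illustration, and a real run would need a proper word list for the language in question.

```python
import re
from typing import Optional, Sequence

# Assumption: a stand-in for a real dictionary such as the one aspell uses.
KNOWN_WORDS = {"złoty", "café", "the"}


def best_encoding(
    raw: bytes,
    candidates: Sequence[str] = ("utf-8", "cp1252", "iso-8859-2"),
) -> Optional[str]:
    """Return the candidate encoding whose decoding yields the most known words."""
    scores = {}
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # Not decodable at all, so not a candidate.
        words = re.findall(r"\w+", text.lower())
        scores[encoding] = sum(word in KNOWN_WORDS for word in words)
    if not scores:
        return None
    # On a tie, the earliest candidate in the list wins.
    return max(scores, key=scores.get)


polish = best_encoding("złoty".encode("iso-8859-2"))
french = best_encoding("café".encode("utf-8"))
```

Ties are common with short strings (many Latin encodings share the same accented letters), so this works better on whole lines or records than on single words.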

  • 2021-01-13 14:51

    I resolved this using the following commands:

    1) Export

    pg_dump --username=postgres --encoding=ISO88591 database -f database.sql
    

    and then

    2) Import

    psql -U postgres -d database < database.sql
    

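    These commands solved the SQL_ASCII to UTF-8 conversion problem for me. Note that the export labels every non-ASCII byte as ISO-8859-1, so this only converts cleanly if the data really is all Latin-1; any bytes that were already UTF-8 would likely end up double-encoded.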
