I have a PostgreSQL database I would like to convert to UTF-8.
The problem is that it is currently SQL_ASCII, so it hasn't been doing any kind of encoding conversion on the data.
I've seen exactly this problem myself, actually. The short answer: there's no straightforward algorithm. But there is some hope.
First, in my experience, the data tends to be:

- plain ASCII,
- otherwise, valid UTF-8,
- otherwise, Windows-1252.

So let's use that. You'll want to analyze your own dataset to see whether it follows this pattern. (I'm in America, so this is typical. I imagine a DB containing data based in Europe might not be so lucky, and something further east even less so.)
First, nearly every encoding in use today contains ASCII as a subset: UTF-8 does, ISO-8859-1 does, etc. Thus, if a field contains only octets in the range [0, 0x7F] (i.e., ASCII characters), then it's effectively encoded in ASCII/UTF-8/ISO-8859-1/etc. If you're dealing with American English, this will probably take care of 99% of your data.
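This first check can be sketched in a few lines of Python (the function name is my own, for illustration):

```python
def is_ascii(raw: bytes) -> bool:
    """True if every octet is in [0, 0x7F] -- plain ASCII,
    which is already valid UTF-8 byte-for-byte."""
    return all(b <= 0x7F for b in raw)
```

Fields that pass this test need no conversion at all; their bytes are identical in ASCII, UTF-8, and ISO-8859-1.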
On to what's left.
UTF-8 has some nice properties: a character is either a single ASCII byte, or a multi-byte sequence in which every byte after the first has the form 10xxxxxx in binary. So: attempt to run your remaining fields through a strict UTF-8 decoder (one that will choke if you give it garbage). The fields it doesn't choke on are, in my experience, probably valid UTF-8. (A false positive is possible here: a tricky ISO-8859-1 field can also happen to be valid UTF-8.)
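Python's built-in decoder is strict by default, so it works as exactly this kind of "choke on garbage" check (helper name is mine):

```python
def try_utf8(raw: bytes):
    """Return the decoded text if raw is valid UTF-8, else None.
    bytes.decode is strict by default and raises on malformed
    sequences, e.g. a continuation byte that never arrives."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None
```

A lone 0xE9 (ISO-8859-1 "é") fails, because in UTF-8 it announces a three-byte sequence that never materializes.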
Last, if it's not ASCII, and it doesn't decode as UTF-8, Windows-1252 seems to be the next good choice to try. Almost everything is valid Windows-1252 though, so it's hard to get failures here.
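Putting the three steps together, a guessing function might look like this sketch (names are mine; note that Python's strict cp1252 codec does reject a handful of undefined bytes such as 0x81 and 0x8D, so even the fallback can occasionally raise):

```python
def guess_decode(raw: bytes) -> tuple[str, str]:
    """Apply the guess chain described above: ASCII first,
    then strict UTF-8, then Windows-1252 as the fallback.
    Returns (encoding_guess, decoded_text)."""
    if all(b <= 0x7F for b in raw):
        return ("ascii", raw.decode("ascii"))
    try:
        return ("utf-8", raw.decode("utf-8"))
    except UnicodeDecodeError:
        # Almost every byte value is defined in cp1252, so this
        # rarely fails -- which is also why it proves so little.
        return ("windows-1252", raw.decode("cp1252"))
```

The order matters: everything that is valid UTF-8 is also "valid" Windows-1252, so the permissive decoder must come last.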
You might do this:
For the UTF-8 and Windows-1252 guesses, output the table's PK and the decoded text to a text file (convert the Windows-1252 to UTF-8 before outputting). Have a human look it over and see if anything seems out of place. If there's not too much non-ASCII data (and like I said, ASCII tends to dominate if you're in America...), a human could look over the whole thing.
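Producing that review file is straightforward; here is one way to do it, assuming you've already built (pk, guess, text) tuples from your table (function name is illustrative):

```python
import csv

def write_review_file(rows, path):
    """rows: iterable of (pk, encoding_guess, decoded_text).
    Writes a UTF-8 TSV a human can scan for mojibake."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        w = csv.writer(f, delimiter="\t")
        w.writerow(["pk", "guess", "text"])
        w.writerows(rows)
```

Tab-separated output keeps the decoded text readable in any editor, and writing the file itself as UTF-8 means what the reviewer sees is what the converted database will contain.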
Also, if you have some idea about what your data looks like, you could restrict decodings to certain characters. For example, if a field decodes as valid UTF-8 text, but contains a "©", and the field is a person's name, then it was probably a false positive, and should be looked at more closely.
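A sanity check like that is easy to automate with a per-field blocklist; the characters below are only an example of symbols implausible in a name field:

```python
# Illustrative blocklist: symbols unlikely to appear in a person's
# name, but common in mis-decoded Windows-1252/ISO-8859-1 bytes.
SUSPECT_CHARS = set("©®™¤¦§¶")

def looks_suspicious(text: str) -> bool:
    """Flag a decoded value containing characters implausible
    for the field, so a human can give it a closer look."""
    return any(ch in SUSPECT_CHARS for ch in text)
```

A hit doesn't prove the guess wrong; it just moves that row to the top of the human-review pile.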
Lastly, be aware that when you change to a UTF-8 database, whatever has been inserting this garbage data in the past is probably still there: you'll need to track down this system and teach it character encoding.