Question
I am trying to turn this text:
×וויר. העתיד של רשתות חברתיות והתקשורת ×©×œ× ×•
Into this text:
אוויר. העתיד של רשתות חברתיות והתקשורת שלנו
Somehow, this website:
http://www.pixiesoft.com/flip/
Can do it, and I would like to know how I might be able to do it myself (with whatever programming language or software).
Just saving the file as UTF8 won't do it.
My motivation for this question is that I have a friend's exported XML file with the garbled text, which I want to turn into a corrected Hebrew text file.
The XML export was originally garbled by MySQL imports and exports, but I don't have the information needed to fix it or trace the problem back.
Thanks.
Answer 1:
Since the issue was a MySQL fault with double-encoded UTF8 strings, MySQL is the right way to solve it.
Running the following commands will solve it -
mysqldump $DB_NAME -u $DB_USER -p -h $DB_HOST.EXAMPLE.NET --add-drop-table --default-character-set=latin1 > export.sql
- latin1 is used here to force MySQL not to split the characters, and should not be used otherwise.
cp export{,.utf8}.sql
- making a backup copy.
sed -i -e 's/latin1/utf8/g' export.utf8.sql
- Replacing the latin1 with utf8 in the file, in order to import it as UTF-8 instead of 8859-1.
mysql $DB_NAME -u $DB_USER -p -h $DB_HOST.EXAMPLE.NET < export.utf8.sql
- import everything back to the database.
This will solve the issue in about ten minutes.
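To see what "double-encoded" means here, the following Python sketch reproduces the damage on a single word. It assumes the garbling happened because UTF-8 bytes were read as latin1 (which MySQL treats as Windows-1252) and then written out as UTF-8 again; the variable names are illustrative only.

```python
# Sketch: how a double-encoded UTF-8 string comes about (assumed scenario).
original = "של"

# Step 1: the text is stored as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# Step 2: those bytes are misread as latin1 (cp1252 in MySQL's case) ...
misread = utf8_bytes.decode("cp1252")

# Step 3: ... and the misread characters are encoded as UTF-8 again.
double_encoded = misread.encode("utf-8")

print(utf8_bytes.hex())       # d7a9d79c
print(double_encoded.hex())   # c397c2a9c397c593

# Undoing both steps in reverse order restores the text.
restored = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")
print(restored == original)   # True
```

Dumping with latin1 and re-importing as utf8 plays the same role as the last line: it removes the extra encoding layer.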
Answer 2:
You might want to look here - the accepted answer to this question shows a way to guess the encoding of a byte[]. All you then have to ensure is that you get the proper bytes from the gibberish.
Guessing might always fail, of course...
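Guessing can be as simple as trying a short list of candidate encodings and keeping the first one whose bytes decode cleanly as UTF-8. The sketch below assumes the gibberish is UTF-8 that was misread through some single-byte encoding; `guess_and_fix` and the candidate list are hypothetical and not taken from the linked answer.

```python
def guess_and_fix(garbled, candidates=("latin-1", "cp1252")):
    """Return (encoding, fixed_text) for the first candidate that round-trips."""
    for name in candidates:
        try:
            # Turn the gibberish back into bytes, then read them as UTF-8.
            fixed = garbled.encode(name).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this candidate cannot explain the gibberish
        return name, fixed
    return None, garbled  # no candidate worked; leave the text untouched


# Simulated gibberish: UTF-8 bytes of a Hebrew word read as Windows-1252.
sample = "שלנו".encode("utf-8").decode("cp1252")
print(guess_and_fix(sample))
```

Here latin-1 is rejected because the gibberish contains characters it cannot encode, so the function falls through to cp1252.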
Answer 3:
If you look closely at the gibberish, you can tell that each Hebrew character is encoded as 2 characters - it appears that של is encoded as ×©×œ.
This suggests that you are looking at UTF8 or UTF16 as ASCII. Converting to UTF8 will not help because it is already ASCII and will keep that encoding.
You can read each pair of bytes and reconstruct the original UTF8 from them.
Here is some C# I came up with - this is very simplistic (doesn't fully work - too many assumptions), but I could see some of the characters converted properly:
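// Requires: using System.Text;
private string ToProperHebrew(string gibberish)
{
    // Encode as UTF-16LE: each character becomes two bytes, low byte first.
    byte[] orig = Encoding.Unicode.GetBytes(gibberish);
    byte[] heb = new byte[orig.Length / 2];
    for (int i = 0; i < orig.Length / 2; i++)
    {
        // Keep only the low byte of each UTF-16 code unit.
        heb[i] = orig[i * 2];
    }
    // Reinterpret the collected bytes as UTF-8.
    return Encoding.UTF8.GetString(heb);
}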
It appears that each byte was re-encoded as two bytes - not sure what encoding was used for this, but discarding one byte seemed to be the right thing for most doubled up characters.
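A likely explanation, assuming the gibberish is UTF-8 read as Windows-1252: taking the low byte of each UTF-16 code unit gives the right byte only for characters whose code point equals their Windows-1252 byte value (such as × and ©), and the wrong byte for characters like œ (U+0153, byte 0x9C). A small Python sketch of that difference:

```python
# Simulated gibberish for "של": the four characters × © × œ.
garbled = "של".encode("utf-8").decode("cp1252")

for ch in garbled:
    # Code point vs. the byte Windows-1252 uses for the same character.
    print(f"U+{ord(ch):04X}", ch.encode("cp1252").hex())

# Low byte of each UTF-16LE code unit, as in the C# loop above.
low_bytes = garbled.encode("utf-16-le")[::2]
print(low_bytes.hex())                              # d7a9d753
print(low_bytes.decode("utf-8", errors="replace"))  # first letter survives, second does not

# Encoding through Windows-1252 recovers the original bytes instead.
print(garbled.encode("cp1252").hex())               # d7a9d79c
```

This would account for only some of the characters converting properly.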
Answer 4:
You can use the meta tag to set the proper encoding for your page. Here is an example of how you can do that:
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1255" />
I suppose that this encoding would do the work.
Answer 5:
Based on Oded's and Teddy's answers, I came up with this method, which worked for me:
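// Requires java.nio.charset.Charset and java.io.UnsupportedEncodingException.
public String getProperHebrew(String gibberish) {
    // Recover the raw bytes by encoding the garbled text as windows-1252.
    byte[] orig = gibberish.getBytes(Charset.forName("windows-1252"));
    try {
        // Decode those bytes as the UTF-8 they originally were.
        return new String(orig, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
        return "";
    }
}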
Answer 6:
gibberish.encode('windows-1252').decode('utf-8', 'replace')
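One caveat with the one-liner: Python's windows-1252 codec leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, and Hebrew letters such as א (UTF-8 bytes D7 90) hit one of them, so `encode` can raise `UnicodeEncodeError`. The sketch below is one possible workaround, not a definitive fix; it assumes the tool that produced the gibberish passed such bytes through as the matching control characters, and `fix_mojibake` is a hypothetical helper name.

```python
def fix_mojibake(garbled: str) -> str:
    """Re-encode garbled text character by character, then decode it as UTF-8."""
    raw = bytearray()
    for ch in garbled:
        try:
            raw += ch.encode("cp1252")
        except UnicodeEncodeError:
            if ord(ch) > 0xFF:
                raise  # not a single-byte character; not this kind of gibberish
            # Assumption: an undefined cp1252 byte was passed through unchanged.
            raw.append(ord(ch))
    return raw.decode("utf-8", errors="replace")


# Simulated input: "א" is D7 90 in UTF-8, and 0x90 is undefined in cp1252.
sample = "\u00d7\u0090"
print(fix_mojibake(sample))  # א
```

If the undefined characters were dropped entirely rather than passed through, the original bytes are gone and no re-encoding can bring them back.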
Source: https://stackoverflow.com/questions/2840028/how-is-this-website-fixing-the-encoding