问题
I am parsing some web content in a response from a HttpWebRequest
.
This web content is using charset ISO-8859-1
and when parsing it and finally getting the word needed from the response, I am receiving a string
with a question mark like this �
and I want to know which is the right way to transform it back into a readable string
.
So, what I've tried is to convert the current word encoding
into UTF-8
like this:
(I am wondering if UTF-8
could solve my problem)
string word = "ESPA�OL";
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf = Encoding.GetEncoding("UTF-8");
byte[] isoBytes = iso.GetBytes(word);
byte[] utfBytes = Encoding.Convert(iso, utf, isoBytes);
string utfWord = utf.GetString(utfBytes);
Console.WriteLine(utfWord);
However, utfWord
variable outputs ESPA?OL
which is still wrong. The correct output is supposed to be ESPAÑOL
.
Can someone please give me the right directions to solve this, if possible?
回答1:
The word in question is "ESPAÑOL". This can be encoded correctly in ISO-8859-1 since all characters in the word are represented in ISO-8859-1.
You can see this for yourself using the following simple program:
using System;
using System.Diagnostics;
using System.Text;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
Encoding enc = Encoding.GetEncoding("ISO-8859-1");
string original = "ESPAÑOL";
byte[] iso_8859_1 = enc.GetBytes(original);
string roundTripped = enc.GetString(iso_8859_1);
Debug.Assert(original == roundTripped);
Console.WriteLine(roundTripped);
}
}
}
What this tells you is that you need to properly diagnose where the erroneous character comes from. By the time that you have a � character, it is too late. The information has been lost. The presence of the � character indicates that, at some point, a conversion was performed into a character set that did not contain the character Ñ.
A conversion from ISO-8859-1 to a Unicode encoding will correctly handle "ESPAÑOL" because that word can be encoded in ISO-8859-1.
The most likely explanation is that somewhere along the way, the text "ESPAÑOL" is being converted to a character set that does not contain the letter Ñ.
来源:https://stackoverflow.com/questions/22451154/encoding-issue-when-handling-a-string-that-contains-question-mark