Encoding issue when handling a string that contains “question mark” (�)

Posted by 五迷三道 on 2019-12-11 05:49:40

Question


I am parsing some web content in a response from a HttpWebRequest.

This web content uses the ISO-8859-1 charset. When I parse it and extract the word I need from the response, I get a string containing a question mark character (�), and I want to know the right way to turn it back into a readable string.

So, what I've tried is to convert the current word encoding into UTF-8 like this:

(I am wondering if UTF-8 could solve my problem)

string word = "ESPA�OL";

Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf = Encoding.GetEncoding("UTF-8");

byte[] isoBytes = iso.GetBytes(word);
byte[] utfBytes = Encoding.Convert(iso, utf, isoBytes);

string utfWord = utf.GetString(utfBytes);

Console.WriteLine(utfWord);

However, the utfWord variable prints ESPA?OL, which is still wrong. The correct output is supposed to be ESPAÑOL.

Can someone please give me the right directions to solve this, if possible?


Answer 1:


The word in question is "ESPAÑOL". This can be encoded correctly in ISO-8859-1 since all characters in the word are represented in ISO-8859-1.

You can see this for yourself using the following simple program:

using System;
using System.Diagnostics;
using System.Text;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            Encoding enc = Encoding.GetEncoding("ISO-8859-1");
            string original = "ESPAÑOL";
            // Encode to ISO-8859-1 bytes, then decode with the same encoding.
            byte[] iso_8859_1 = enc.GetBytes(original);
            string roundTripped = enc.GetString(iso_8859_1);
            // The round trip is lossless because Ñ exists in ISO-8859-1.
            // Note: Debug.Assert is only checked in debug builds.
            Debug.Assert(original == roundTripped);
            Console.WriteLine(roundTripped);
        }
    }
}

What this tells you is that you need to diagnose where the erroneous character first appears. By the time you have a � character (U+FFFD, the Unicode replacement character), it is too late: the information has been lost. Its presence indicates that, at some earlier point, the text went through a conversion that could not represent the character Ñ, for example the ISO-8859-1 bytes of the response being decoded with a different encoding such as UTF-8.
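To make the loss concrete, here is a minimal sketch. It assumes the damage came from decoding ISO-8859-1 bytes as UTF-8; that cause is only a guess, since the question does not show the code that reads the response. The class and method names are made up for illustration.

```csharp
using System;
using System.Text;

static class ReplacementDemo
{
    // Decodes ISO-8859-1 bytes with the wrong encoding (UTF-8), which is
    // one assumed way the replacement character could have been introduced.
    public static string DecodeWrongly(string text)
    {
        Encoding iso = Encoding.GetEncoding("ISO-8859-1");
        byte[] isoBytes = iso.GetBytes(text);      // Ñ becomes the single byte 0xD1
        return Encoding.UTF8.GetString(isoBytes);  // 0xD1 followed by 'O' is not valid UTF-8
    }

    // Re-encodes the damaged string the way the code in the question does.
    public static string ReencodeAsIso(string damaged)
    {
        Encoding iso = Encoding.GetEncoding("ISO-8859-1");
        // U+FFFD does not exist in ISO-8859-1, so the encoder substitutes '?'.
        return iso.GetString(iso.GetBytes(damaged));
    }

    static void Main()
    {
        string damaged = DecodeWrongly("ESPA\u00D1OL");  // "ESPAÑOL"
        Console.WriteLine(damaged.IndexOf('\uFFFD'));    // 4: the Ñ is now U+FFFD
        Console.WriteLine(ReencodeAsIso(damaged));       // ESPA?OL
    }
}
```

This also shows why the code in the question prints ESPA?OL: the string already contains U+FFFD before any conversion runs, and no later conversion can turn it back into Ñ.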

A conversion from ISO-8859-1 to a Unicode encoding will correctly handle "ESPAÑOL" because that word can be encoded in ISO-8859-1.
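As a sketch of that conversion applied to the raw bytes rather than to an already decoded string (the byte array stands in for the response body and is an assumption, as is the ConvertDemo name):

```csharp
using System;
using System.Text;

static class ConvertDemo
{
    // Converts ISO-8859-1 bytes to UTF-8 bytes and decodes the result.
    public static string IsoBytesToString(byte[] isoBytes)
    {
        Encoding iso = Encoding.GetEncoding("ISO-8859-1");
        byte[] utf8Bytes = Encoding.Convert(iso, Encoding.UTF8, isoBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    static void Main()
    {
        // "ESPAÑOL" as ISO-8859-1 bytes; 0xD1 is Ñ.
        byte[] isoBytes = { 0x45, 0x53, 0x50, 0x41, 0xD1, 0x4F, 0x4C };
        string word = IsoBytesToString(isoBytes);
        Console.WriteLine(word == "ESPA\u00D1OL");   // True
    }
}
```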

So the most likely explanation is that, somewhere between the HTTP response and your parsing code, the text "ESPAÑOL" passes through an encoding that cannot represent the letter Ñ. That step, not the final string, is what needs fixing.
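If the loss happens while reading the response, one possible fix is to tell the reader which encoding to use. This is a minimal sketch, assuming the body really is ISO-8859-1. A MemoryStream stands in for the stream returned by HttpWebResponse.GetResponseStream() so the example runs without a network call, and ResponseReader is a hypothetical helper name.

```csharp
using System;
using System.IO;
using System.Text;

static class ResponseReader
{
    // Reads the whole stream as text using an explicit encoding instead of
    // StreamReader's UTF-8 default.
    public static string ReadAll(Stream body, Encoding encoding)
    {
        using (var reader = new StreamReader(body, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        Encoding iso = Encoding.GetEncoding("ISO-8859-1");
        // Stand-in for response.GetResponseStream(): "ESPAÑOL" in ISO-8859-1.
        byte[] raw = { 0x45, 0x53, 0x50, 0x41, 0xD1, 0x4F, 0x4C };
        using (var body = new MemoryStream(raw))
        {
            string word = ReadAll(body, iso);
            Console.WriteLine(word == "ESPA\u00D1OL");   // True
        }
    }
}
```

In real code the encoding would ideally be taken from the charset the server reports (for example via HttpWebResponse.CharacterSet) rather than hard-coded.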



Source: https://stackoverflow.com/questions/22451154/encoding-issue-when-handling-a-string-that-contains-question-mark
