Simplest way to get rid of zero-width space in a C# string

Question

I am parsing emails using a regex in a C# VSTO project. Once in a while, the regex does not seem to work (although if I paste the text and the regex into RegexBuddy, the regex correctly matches the text). If I look at the email in Gmail, I see

=E2=80=8B

at the beginning and end of some lines (which I understand is the UTF-8 zero-width space); this appears to be what is messing up the regex. This seems to be the only such sequence showing up.

What is the easiest way to get rid of this exact sequence? I cannot do the obvious

MailItem.Body.Replace("=E2=80=8B", "")

because those characters don't show up in the C# string.

I also tried

byte[] bytes = Encoding.Default.GetBytes(MailItem.TextBody);
string myString = Encoding.UTF8.GetString(bytes);

But the zero-width spaces just show up as ?. I suppose I could go through the byte array and remove the bytes comprising the zero-width space, but I don't know what the bytes would look like (it does not seem as simple as converting E2 80 8B to decimal and searching for that).

Answer 1:

As strings in C# are stored as UTF-16 (not UTF-8), the zero-width space is the single character U+200B rather than the three bytes E2 80 8B, so the following might do the trick:

string body = MailItem.Body.Replace("\u200B", "");  // Replace returns a new string, so keep the result
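
To make the link between the =E2=80=8B seen in the raw email and the "\u200B" used above concrete, here is a minimal sketch. The class and method names (ZeroWidthSpaceDemo, Utf8Hex, StripZeroWidthSpaces) are made up for illustration and are not part of the answer; it only assumes the text is already a normal .NET string.

```csharp
using System;
using System.Text;

public static class ZeroWidthSpaceDemo
{
    // Hypothetical helper: shows the UTF-8 bytes of a string as hex,
    // e.g. to confirm which character a quoted-printable sequence stands for.
    public static string Utf8Hex(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return BitConverter.ToString(utf8);
    }

    // Hypothetical helper: returns a copy of the text without U+200B.
    // string.Replace never modifies the original string.
    public static string StripZeroWidthSpaces(string text)
    {
        return text.Replace("\u200B", "");
    }
}
```

With these helpers, ZeroWidthSpaceDemo.Utf8Hex("\u200B") returns "E2-80-8B", which matches the sequence shown in the raw message, and ZeroWidthSpaceDemo.StripZeroWidthSpaces("\u200Bfoo@example.com\u200B") returns "foo@example.com".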

Answer 2:

As all the Regex.Replace() methods operate on strings, that's not going to be useful here.

The string indexer returns a char, so for want of a better solution (and if you can't predict where these characters are going to be), as long-winded as it seems, you may be best off with:

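        // Copy every character except the zero-width space (U+200B).
        string a = MailItem.Body;
        StringBuilder newText = new StringBuilder(a.Length);

        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != '\u200b')
            {
                newText.Append(a[i]);
            }
        }

        string cleaned = newText.ToString();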
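
If the same filtering is needed in more than one place, the loop could be wrapped in a small method. This is only a sketch of that idea; CharFilter and RemoveChar are hypothetical names, not something from the answer or the .NET libraries.

```csharp
using System.Text;

public static class CharFilter
{
    // Hypothetical helper: returns a copy of text with every
    // occurrence of the unwanted character left out.
    public static string RemoveChar(string text, char unwanted)
    {
        // Pre-size the builder; the result is never longer than the input.
        var result = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (c != unwanted)
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

It would then be called as CharFilter.RemoveChar(MailItem.Body, '\u200B'), with the returned string used for the regex match.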

Answer 3:

Use System.Web.HttpUtility.HtmlDecode(string); Quite simple.

Source: https://stackoverflow.com/questions/24942167/simplest-way-to-get-rid-of-zero-width-space-in-c-sharp-string

Tags

regex

utf-8

character-encoding