Simplest way to get rid of zero-width-space in c# string

Submitted by 一个人想着一个人 on 2021-02-04 22:37:07

Question


I am parsing emails using a regex in a C# VSTO project. Once in a while, the regex does not seem to work (although if I paste the text and regex into RegexBuddy, the regex correctly matches the text). If I look at the raw email in Gmail, I see

=E2=80=8B

at the beginning and end of some lines (which I understand is the UTF-8 encoding of the zero-width space, U+200B); this appears to be what is messing up the regex. This seems to be the only such sequence showing up.

What is the easiest way to get rid of this exact sequence? I cannot do the obvious

MailItem.Body.Replace("=E2=80=8B", "")

because those characters don't show up in the C# string.

I also tried

byte[] bytes = Encoding.Default.GetBytes(MailItem.TextBody);
string myString = Encoding.UTF8.GetString(bytes);

But the zero-width spaces just show up as ?. I suppose I could go through the bytes array and remove the bytes comprising the zero width space, but I don't know what the bytes would look like (it does not seem as simple as converting E2 80 8B to decimal and searching for that).
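To make the byte question concrete, here is a minimal sketch of how the three bytes relate to the character. It assumes nothing about the mail API; the class name is made up for illustration. It shows that U+200B is one char in a C# string and only becomes E2 80 8B when encoded as UTF-8.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the original question.
static class ZeroWidthSpaceBytes
{
    // Returns the UTF-8 bytes of the text as a hex string such as "E2-80-8B".
    public static string Utf8Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }

    static void Main()
    {
        string zwsp = "\u200B";

        // In memory the zero-width space is a single UTF-16 code unit.
        Console.WriteLine(zwsp.Length);                    // 1
        Console.WriteLine(((int)zwsp[0]).ToString("X4"));  // 200B

        // Encoded as UTF-8 it becomes the three bytes seen in the raw mail.
        Console.WriteLine(Utf8Hex(zwsp));                  // E2-80-8B
    }
}
```

So the =E2=80=8B in the raw message is just the transport encoding of those three bytes; once the body has been decoded into a string, the thing to search for is the single character '\u200B'.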


Answer 1:


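As strings in C# are stored as UTF-16 (not UTF-8), the zero-width space is the single character U+200B, so the following might do the trick. Note that Replace returns a new string, so the result has to be assigned:

string body = MailItem.Body.Replace("\u200B", "");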
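As a rough illustration of why this helps the regex, the sketch below strips U+200B from a sample line before matching. The helper name, the sample text, and the pattern are all hypothetical and only stand in for the real email parsing.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration only.
static class ZeroWidthCleanup
{
    // Returns a copy of the text with every U+200B removed.
    public static string StripZeroWidthSpace(string text)
    {
        return text.Replace("\u200B", "");
    }

    static void Main()
    {
        // Assumed sample: a line wrapped in zero-width spaces, as in the question.
        string line = "\u200BOrder: 12345\u200B";
        var pattern = new Regex(@"^Order: (\d+)$");

        // The invisible characters sit between the anchors and the text.
        Console.WriteLine(pattern.IsMatch(line));     // False

        string cleaned = StripZeroWidthSpace(line);
        Console.WriteLine(pattern.IsMatch(cleaned));  // True
    }
}
```

Text pasted into a regex tester may well lose the invisible characters along the way, which would explain why the same pattern appears to work there.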



Answer 2:


As all the Regex.Replace() methods operate on strings, that's not going to be useful here.

The string indexer returns a char, so for want of a better solution (and if you can't predict where these characters are going to be), as long-winded as it seems, you may be best off with:

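        string a = MailItem.Body; // read the body once
        StringBuilder newText = new StringBuilder();

        for (int i = 0; i < a.Length; i++)
        {
            // copy every character except the zero-width space (U+200B)
            if (a[i] != '\u200b')
            {
                newText.Append(a[i]);
            }
        }
        string cleaned = newText.ToString();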
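The loop above could be wrapped in a small reusable method, sketched here under the assumption that only U+200B needs to be dropped. The class and method names are invented for the example.

```csharp
using System;
using System.Text;

// Hypothetical helper class for illustration only.
static class ZeroWidthFilter
{
    // Copies every character of the input except U+200B.
    public static string RemoveZeroWidthSpace(string text)
    {
        var result = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (c != '\u200B')
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }

    static void Main()
    {
        string sample = "\u200BHello\u200B world";
        string cleaned = RemoveZeroWidthSpace(sample);

        Console.WriteLine(cleaned);         // Hello world
        Console.WriteLine(cleaned.Length);  // 11

        // For this single character the result equals a plain Replace call.
        Console.WriteLine(cleaned == sample.Replace("\u200B", ""));  // True
    }
}
```

A character-by-character filter like this is mainly worth the extra lines if the condition later grows to cover more than one character.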



Answer 3:


Use System.Web.HttpUtility.HtmlDecode(string); Quite simple.



Source: https://stackoverflow.com/questions/24942167/simplest-way-to-get-rid-of-zero-width-space-in-c-sharp-string
