Compare strings with non-English characters?

再見小時候 2021-01-19 07:20

I need to compare strings for a search mechanism on a web site. I use C#. I tried two ways:

consultants.Where(x => 
    x.Description.ToLower().Contains(v         


        
7 answers
  • 2021-01-19 08:10

    Here is an introduction to the character set problem by Joel Spolsky. A very interesting read.

    In short, a web page needs to declare which character set it uses at the very beginning of the page. C# strings are Unicode (UTF-16 encoded by default); an explanation of what that means can be found in C# in Depth.

    Hope this will help you.
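
    As a small illustration of the UTF-16 point, the sketch below prints the length of a string in UTF-16 code units next to its size in bytes under two encodings. The sample word is an assumption chosen for illustration, not something from the question.

```csharp
using System;
using System.Text;

public static class Utf16Demo
{
    public static void Main()
    {
        // Hypothetical sample text; \u00F6 is "ö", a single UTF-16 code unit.
        string word = "sjuksk\u00F6terska";

        // Length counts UTF-16 code units, not bytes.
        Console.WriteLine(word.Length);                         // 13

        // The same text occupies a different number of bytes per encoding.
        Console.WriteLine(Encoding.UTF8.GetByteCount(word));    // 14
        Console.WriteLine(Encoding.Unicode.GetByteCount(word)); // 26
    }
}
```

    The string itself is never "in" UTF-8 inside the program; an encoding only comes into play when the text is read from or written to bytes, such as a web page or a database column.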

  • 2021-01-19 08:10

    Thanks to all who offered suggestions, but unfortunately they turned out to be irrelevant. Contains() has no problem with non-English characters at all. The problem was that the database field in question held HTML-encoded text, so I needed to use HtmlDecode before comparing the strings in the controller:

    if (vm.Description != "")
    {
        // HttpUtility.HtmlDecode is needed because the text in the Description field is HTML-encoded!
        consultants = consultants
            .Where(x => HttpUtility.HtmlDecode(x.Description).ContainsCaseInsensitive(vm.Description))
            .ToList();
    }
    

    I discovered this because the Contains() code worked fine when searching another field with non-English characters.
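
    ContainsCaseInsensitive is not a built-in string method, and its implementation is not shown here. The sketch below is one plausible version of such an extension, together with a tiny demonstration of why decoding matters. The sample text is an assumption, and WebUtility.HtmlDecode (System.Net) is swapped in for HttpUtility.HtmlDecode only so that the example runs without a System.Web reference.

```csharp
using System;
using System.Net;

public static class SearchExtensions
{
    // Hypothetical stand-in for the ContainsCaseInsensitive helper;
    // the real implementation is not shown in the answer.
    public static bool ContainsCaseInsensitive(this string source, string value)
    {
        if (source == null || value == null) return false;
        // OrdinalIgnoreCase keeps the result independent of the current culture.
        return source.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}

public static class HtmlDecodeDemo
{
    public static void Main()
    {
        // Hypothetical stored value: "ö" was saved as the entity &ouml;.
        string stored = "Erfaren sjuksk&ouml;terska";
        string term = "SK\u00D6TERSKA"; // "SKÖTERSKA"

        // The raw field contains "&ouml;", so the search term is not found.
        Console.WriteLine(stored.ContainsCaseInsensitive(term));                        // False
        // After decoding, the entity becomes "ö" and the term matches.
        Console.WriteLine(WebUtility.HtmlDecode(stored).ContainsCaseInsensitive(term)); // True
    }
}
```

    Decoding every row at query time forces the filter to run in memory; storing the text decoded, if that is an option, would avoid it.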

  • 2021-01-19 08:11

    Indexing is a big part of searching. I think you would be best served by using something ready and solid, like Lucene or Solr.

    If you still insist on searching with regexes over non-ASCII text, learn about Unicode categories and use them to strip accent marks before searching for the word in the text (for example, \p{M} matches combining marks and \p{P} matches punctuation).

    Note: You will probably also need to normalize your strings first. Use NormalizationForm.FormD to decompose accented characters into a base letter plus combining marks so the marks can be stripped; FormC does the opposite and composes them.
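
    The normalize-then-strip idea can be sketched roughly as follows: decompose with NormalizationForm.FormD, remove everything in the \p{M} category, then recompose. The helper name StripAccents and the sample strings are assumptions made for illustration.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class AccentFolding
{
    // \p{M} matches every Unicode combining mark (accents, umlauts, ...).
    private static readonly Regex CombiningMarks = new Regex(@"\p{M}");

    // Hypothetical helper name, not taken from the answer.
    public static string StripAccents(string text)
    {
        // FormD splits "é" into "e" followed by a combining acute accent.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        string stripped = CombiningMarks.Replace(decomposed, "");
        // Recompose whatever is left so the result is in FormC again.
        return stripped.Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        // Fold both the text and the search term, then search as usual.
        string text = StripAccents("Caf\u00E9 i G\u00F6teborg"); // "Café i Göteborg"
        string term = StripAccents("goteborg");

        Console.WriteLine(text); // Cafe i Goteborg
        Console.WriteLine(Regex.IsMatch(text, Regex.Escape(term), RegexOptions.IgnoreCase)); // True
    }
}
```

    Whether folding accents is desirable depends on the language: in Swedish, for instance, å, ä and ö are separate letters rather than accented variants, so this makes the search deliberately looser.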

  • 2021-01-19 08:14

    To compare non-English characters properly you should use the appropriate culture rules. For example, you could take the CompareInfo for Swedish and run a case-insensitive IndexOf in place of Contains:

    var swedish = new CultureInfo("sv-SE").CompareInfo;
    
    // string.Contains has no overload that takes a StringComparer,
    // so use a culture-aware, case-insensitive IndexOf instead.
    consultants = consultants
        .Where(x => 
            swedish.IndexOf(x.Description, vm.Description, CompareOptions.IgnoreCase) >= 0
        ).ToList();
    
  • 2021-01-19 08:15

    What are you searching: an XML file, a db4o file, or a SQL database? The character encoding of your data store matters. With XML you can handle it by setting the file's UTF encoding; db4o is already safe because it works on objects; on the SQL side you have to set the character encoding yourself.

    If your database holds the values as char(50) or varchar(50), it may lose characters that the column's code page cannot represent; to store such characters you should use nchar or nvarchar in your SQL database. Do not forget to check your database's character encoding, even if it turns out not to be strictly necessary.

  • 2021-01-19 08:17

    What kind of list are you working on? A plain list or an ORM? Use string.Compare() if it's a plain list.
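
    For the plain-list case, a minimal sketch of string.Compare with an explicit culture might look like this. The names in the list are made-up sample data, and note that string.Compare tests whole-string equality and ordering, not substring containment.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class CompareDemo
{
    public static void Main()
    {
        var swedish = new CultureInfo("sv-SE");

        // Hypothetical in-memory data: "Åsa", "Anna", "Örjan".
        var names = new List<string> { "\u00C5sa", "Anna", "\u00D6rjan" };

        // string.Compare returns 0 when the two strings are equal under the
        // given culture; the third argument turns on case-insensitivity.
        bool found = names.Any(n => string.Compare(n, "\u00E5sa", true, swedish) == 0);

        Console.WriteLine(found); // True
    }
}
```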
