How do I split a phrase into words using Regex in C#

孤街浪徒 2020-12-18 01:25

I am trying to split a sentence/phrase into words using Regex.

using System.Linq;                      // for ToList()
using System.Text.RegularExpressions;   // for Regex

var phrase = "This isn't a test.";
var words = Regex.Split(phrase, @"\W+").ToList();
         


        
8 Answers
  •  醉梦人生
    2020-12-18 02:13

    Because many languages use complex rules to string words together into phrases and sentences, you can't rely on a simple regular expression to get all the words from a piece of text. Even for a language as 'simple' as English you'll run into a number of corner cases, such as:

    • How to handle contractions like you're and isn't, where two words are combined and some characters are replaced with an apostrophe (').
    • How to handle abbreviations such as Mr., Mrs. and i.e.
    • How to handle compound words joined with '-'.
    • How to handle hyphenated words at the end of a sentence.
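
    To make the first and third corner cases concrete, here is a minimal sketch that compares splitting on \W+ with matching words directly. The class and method names (WordSplitter, SplitNaive, SplitKeepingJoiners) are made up for illustration, and the second pattern is only one possible choice: it assumes ASCII apostrophes and hyphens, and it still drops the period from abbreviations such as Mr.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class WordSplitter
{
    // A run of word characters, optionally continued by an ASCII
    // apostrophe or hyphen followed by more word characters.
    private static readonly Regex WordPattern = new Regex(@"\w+(?:['-]\w+)*");

    // Splits on runs of non-word characters and drops empty entries.
    // The apostrophe counts as a non-word character, so "isn't" is cut in two.
    public static List<string> SplitNaive(string phrase) =>
        Regex.Split(phrase, @"\W+").Where(w => w.Length > 0).ToList();

    // Matches words instead of splitting on separators, so contractions
    // and hyphenated compounds stay in one piece.
    public static List<string> SplitKeepingJoiners(string phrase) =>
        WordPattern.Matches(phrase).Cast<Match>().Select(m => m.Value).ToList();
}
```

    With the phrase from the question, SplitNaive("This isn't a test.") gives This, isn, t, a, test, while SplitKeepingJoiners gives This, isn't, a, test. Neither approach handles abbreviations, typographic apostrophes or languages without spaces.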

    Chinese and Japanese (among others) are notoriously hard to parse this way, because these languages do not put spaces between words.

    You might want to read up on text segmentation. If segmentation is important to you, invest in a spell checker that can parse a whole text, or in a text segmentation engine that can split your sentences into words according to the rules of the language.

    I couldn't find a .NET-based multilingual segmentation engine with a quick Google search, though. Sorry.
