How do I split a phrase into words using Regex in C#

前端未结


I am trying to split a sentence/phrase into words using Regex.

using System.Linq;
using System.Text.RegularExpressions;

var phrase = "This isn't a test.";
var words = Regex.Split(phrase, @"\W+").ToList();


                      
8 Answers

灰色年华, 2020-12-18 02:03

I'm not a java person, but you could try to exclude punctuation while splitting on spaces at the same time. Something like this, maybe.

These are raw, expanded regexes; the words are in capture group 1. Do a global search.

Unicode (doesn't account for graphemes)

[\s\pP]* ([\pL\pN_-] (?: [\pL\pN_-] | \pP(?=[\pL\pN\pP_-]) )* )


ASCII

[\s[:punct:]]* (\w (?: \w | [[:punct:]](?=[\w[:punct:]]) )* )
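The two patterns above are written in a PCRE-like dialect. As far as I know, .NET does not accept the brace-less `\pP` shorthand or POSIX bracket classes such as `[[:punct:]]`, so the following is only a sketch of how the Unicode variant might be adapted for C#. The class and method names (`WordExtractor`, `Extract`) are made up for illustration, and `RegexOptions.IgnorePatternWhitespace` stands in for the "expanded" notation.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordExtractor
{
    // Same structure as the Unicode pattern above, with .NET-style \p{..} categories.
    // IgnorePatternWhitespace lets the spaces in the pattern act as layout only.
    private static readonly Regex WordPattern = new Regex(
        @"[\s\p{P}]* ( [\p{L}\p{N}_-] (?: [\p{L}\p{N}_-] | \p{P}(?=[\p{L}\p{N}\p{P}_-]) )* )",
        RegexOptions.IgnorePatternWhitespace);

    // "Global search": collect capture group 1 from every match.
    public static string[] Extract(string text) =>
        WordPattern.Matches(text)
                   .Cast<Match>()
                   .Select(m => m.Groups[1].Value)
                   .ToArray();
}
```

Under these assumptions, `WordExtractor.Extract("This isn't a test.")` should yield `This`, `isn't`, `a`, `test`: the apostrophe is kept because a word character follows it, while the final period is dropped because nothing follows it.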

                                                                        
                                                        
            
            
              
                
无人及你, 2020-12-18 02:04

If you want to split into words for spell checking purposes, this is a good solution:

new Regex(@"[^\p{L}]*\p{Z}[^\p{L}]*")


Basically, you can pass the regex above to Regex.Split.
It uses Unicode categories, so it works in many languages (though not for most Asian languages).
And it won't break words with apostrophes or hyphens.
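A minimal sketch of how this separator might be used with Regex.Split. The wrapper class `SpellCheckSplitter` and the filtering of empty entries are my own additions, not part of the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SpellCheckSplitter
{
    // A separator is a Unicode space (\p{Z}) plus any non-letters around it.
    private static readonly Regex Separator =
        new Regex(@"[^\p{L}]*\p{Z}[^\p{L}]*");

    public static string[] Split(string text) =>
        Separator.Split(text)
                 .Where(w => w.Length > 0)   // drop empty entries from leading/trailing separators
                 .ToArray();
}
```

With the phrase from the question, `SpellCheckSplitter.Split("This isn't a test.")` should give `This`, `isn't`, `a`, `test.`. Note that the final period stays attached to the last word, because no space follows it; trailing punctuation at the very end of the text would need separate trimming.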
                                                                        
                                                        
            
            
              
                
灰色年华, 2020-12-18 02:08

This worked for me: [^(\d|\s|\W)]*
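To see what this pattern actually matches, here is a small sketch that runs it through Regex.Matches. The helper name `LetterRuns` is hypothetical. Inside the square brackets, `(`, `|` and `)` are literal characters rather than grouping or alternation, so the class amounts to "anything that is not a digit, whitespace, a non-word character, or one of `(|)`". Because of the `*`, the pattern also produces empty matches, which are filtered out here.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LetterRuns
{
    public static string[] Find(string text) =>
        Regex.Matches(text, @"[^(\d|\s|\W)]*")
             .Cast<Match>()
             .Select(m => m.Value)
             .Where(v => v.Length > 0)   // the * quantifier also yields empty matches
             .ToArray();
}
```

For `"This isn't a test."` this should return `This`, `isn`, `t`, `a`, `test`, so contractions are still broken at the apostrophe.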
                                                                        
                                                        
            
            
              
                
不思量自难忘°, 2020-12-18 02:12

It doesn't really seem like you need a regex. You could just do:

phrase.Split(' ');
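A plain split on spaces leaves punctuation attached to the words (for example `test.`). As a rough sketch, assuming only a handful of common punctuation marks need to be stripped from the ends of each word, something like the following could be used. The class name `SimpleSplitter` and the list of trimmed characters are illustrative choices.

```csharp
using System;
using System.Linq;

public static class SimpleSplitter
{
    public static string[] Split(string phrase) =>
        phrase.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
              .Select(w => w.Trim('.', ',', '!', '?', ';', ':'))  // strip punctuation at word edges only
              .Where(w => w.Length > 0)
              .ToArray();
}
```

`SimpleSplitter.Split("This isn't a test.")` should return `This`, `isn't`, `a`, `test`. Apostrophes and hyphens inside a word are untouched because `Trim` only looks at the ends.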

                                                                        
                                                        
            
            
              
                
醉梦人生, 2020-12-18 02:13

Because many languages use very complex rules to string words together into phrases and sentences, you can't rely on a simple regular expression to get all the words from a piece of text. Even for a language as 'simple' as English you'll run into a number of corner cases, such as:


How to handle contractions like you're and isn't, where two words are combined and some characters are replaced with an apostrophe.
How to handle abbreviations such as Mr., Mrs. and i.e.
Compound words joined with '-'.
Hyphenated words at the end of a sentence.


Chinese and Japanese (among others) are notoriously hard to parse this way, as these languages do not separate words with spaces.

You might want to read up on text segmentation. If the segmentation is important to you, invest in a spell checker that can parse a whole text, or in a text segmentation engine that can split your sentences into words according to the rules of the language.

I couldn't find a .NET-based multilingual segmentation engine with a quick Google search, though. Sorry.
                                                                        
                                                        
            
            
              
                
借酒劲吻你, 2020-12-18 02:13

What do you want to split on? Spaces? Punctuation? You have to decide what the stop characters are. A simple regex that uses whitespace and a few punctuation characters would be "[^.?!\s]+". Used with Regex.Matches, each match is a run of characters containing no period, question mark, exclamation mark, or whitespace, so the text is effectively split on those characters.
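A short sketch of that idea, collecting the matches of the pattern rather than calling Regex.Split. The class name `TokenMatcher` is an illustrative assumption.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenMatcher
{
    // Each match is a maximal run of characters that are not stop characters
    // (period, question mark, exclamation mark, whitespace).
    public static string[] Tokens(string text) =>
        Regex.Matches(text, @"[^.?!\s]+")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();
}
```

`TokenMatcher.Tokens("This isn't a test.")` should return `This`, `isn't`, `a`, `test`. Other punctuation, such as commas, is not in the stop set and would stay attached to the neighbouring word unless added to the character class.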
                                                                        
                                                        
            
            
              
                