Special character problems using Python unicode


#!/usr/bin/env python
# -*- coding: utf_8 -*-

def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''


                      
4 Answers

抹茶落季, 2021-01-15 20:10

p = "While other species..."


should be changed to

p = u"While other species..."


Notice the u in front of the quote.

What you need is a so-called Unicode literal. In Python 2, string literals are not Unicode by default.
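
To make the difference concrete, here is a minimal sketch. It is written so that it also runs on Python 3, where the u prefix is accepted but redundant. The variable names are only illustrative, and the byte string is built explicitly to stand in for what a plain Python 2 literal would hold, assuming the source file is saved as UTF-8.

```python
# -*- coding: utf-8 -*-
# Unicode literal: a sequence of code points (\u00a3 is the pound sign).
text = u"valued at \u00a39.2 billion"

# Assumption: the source file is saved as UTF-8, so a plain Python 2 literal
# "valued at £9.2 billion" would hold these raw bytes rather than characters.
raw = text.encode("utf-8")

print(len(text))  # counts characters
print(len(raw))   # counts bytes; the pound sign takes two bytes in UTF-8
```

The two lengths differ by one because the pound sign is a single character but two UTF-8 bytes.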
                                                                        
                                                        
            
            
              
                
盖世英雄少女心, 2021-01-15 20:14

I have found the solution to this.

The following piece of code works just fine.

p = p.encode('utf-8') if isinstance(p, unicode) else p
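
The same idea can be wrapped in a small helper. This is only a sketch: to_utf8_bytes is a hypothetical name, and it uses type(u"") instead of the Python 2 only name unicode so that it also runs on Python 3.

```python
def to_utf8_bytes(value):
    """Return value as UTF-8 bytes; byte strings pass through unchanged."""
    text_type = type(u"")  # `unicode` on Python 2, `str` on Python 3
    if isinstance(value, text_type):
        return value.encode("utf-8")
    return value


print(repr(to_utf8_bytes(u"\u00a39.2 billion")))  # text gets encoded
print(repr(to_utf8_bytes(b"already bytes")))      # bytes are left alone
```

Note that this only guarantees the result is UTF-8 bytes. Whether those bytes display correctly still depends on the encoding the terminal expects.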

                                                                        
                                                        
            
            
              
                
一整个雨季, 2021-01-15 20:16

That looks like cp437.  Try this:

import codecs, sys
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
print u"valued at £9.2 billion."


This works for me in Python 2.6.
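
To see what the wrapper actually does without touching the real sys.stdout, here is a rough sketch that wraps an in-memory byte buffer instead. The buffer is my stand-in for a byte-oriented stdout, not something from the code above.

```python
import codecs
import io

buffer = io.BytesIO()                       # stands in for a byte-oriented stdout
writer = codecs.getwriter("utf-8")(buffer)  # StreamWriter that encodes on write

# The writer accepts Unicode text and passes UTF-8 bytes to the stream.
writer.write(u"valued at \u00a39.2 billion.\n")
print(repr(buffer.getvalue()))
```

So the wrapper makes Python emit UTF-8 regardless of the terminal encoding. The terminal still has to interpret those bytes as UTF-8 for the pound sign to show up correctly.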
                                                                        
                                                        
            
            
              
                
滥情空心, 2021-01-15 20:17

Using Unicode string literals as Nam suggested is correct, but if your terminal is using the cp437 code page, as your output suggests, it will not be able to display some of the Unicode characters you want to use.  The Windows console doesn't support UTF-8, which is what you are sending if you declare coding: utf-8 in your source file and do not use Unicode literals.  Note that coding: utf-8 only declares the encoding of your source file, so make sure you are actually saving your source in UTF-8.

When you use a Unicode literal, Python decodes the literal using the declared source encoding and stores it as a Unicode string.  When printing a Unicode string, Python encodes it in the terminal encoding or, lacking a terminal encoding, falls back to the default encoding, which is ascii in Python 2.

An example:

# coding: utf8

print '£9.2 billion'  # Sends UTF-8 to cp437 terminal (gibberish)
print u'£9.2 billion' # Correctly prints on cp437 terminal.
print 'Sheffield’s'   # Sends UTF-8 to cp437 terminal (gibberish)

# Replaces Unicode characters that are unsupported in cp437.
print u'Sheffield’s £9.2 billion'.encode('cp437','xmlcharrefreplace')

print u'Sheffield’s'  # UnicodeEncodeError.


Output

┬ú9.2 billion
£9.2 billion
SheffieldΓÇÖs
Sheffield&#8217;s £9.2 billion
Traceback (most recent call last):
  File "C:\Documents and Settings\metolone\Desktop\x.py", line 10, in <module>
    print u'SheffieldΓÇÖs'  # UnicodeEncodeError.
  File "C:\dev\python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 9: character maps to <undefined>
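
The gibberish and the error in that output can be reproduced without a Windows console by doing the encode and decode steps by hand. This is a sketch that runs on Python 2 and 3. The variable names are mine, and the characters are written as escapes so the result does not depend on how the file is saved.

```python
pound = u"\u00a3"        # POUND SIGN
apostrophe = u"\u2019"   # RIGHT SINGLE QUOTATION MARK

# UTF-8 bytes misread as cp437 produce the gibberish from the output.
garbled_pound = pound.encode("utf-8").decode("cp437")
garbled_apostrophe = apostrophe.encode("utf-8").decode("cp437")
print(garbled_pound == u"\u252c\u00fa")             # box-drawing char, u-acute
print(garbled_apostrophe == u"\u0393\u00c7\u00d6")  # Gamma, C-cedilla, O-umlaut

# cp437 has a pound sign (byte 0x9c) but no U+2019, so only the apostrophe
# falls back to a character reference.
text = u"Sheffield\u2019s \u00a39.2 billion"
print(repr(text.encode("cp437", "xmlcharrefreplace")))

try:
    text.encode("cp437")  # strict error handling, the default
except UnicodeEncodeError as exc:
    print(exc.reason)
```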


So, don't expect the Windows console to print all Unicode characters correctly.  Use a Python IDE that supports UTF-8, such as PythonWin (available in the pywin32 extension).

You need two things to display Unicode characters properly in the Windows console:  An encoding that maps the Unicode characters you want to display, and a font that supports the correct glyph for those characters.  For your example, if you change the console code page to Windows-1252 (chcp 1252) and change the console font to Consolas or Lucida Console instead of Raster Fonts, your original program will work if you use Unicode literals (p = u"...").
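
The encoding half of that requirement can be checked from Python before you print anything. A small sketch, where can_encode is a hypothetical helper name; the font half cannot be checked this way.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given code page."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


sample = u"Sheffield\u2019s \u00a39.2 billion"
print(can_encode(sample, "cp437"))   # code page the console output suggests
print(can_encode(sample, "cp1252"))  # code page selected by chcp 1252
```

Windows-1252 contains both the pound sign and the curly apostrophe, while cp437 lacks the apostrophe, which is why switching the code page helps for this particular text.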
                                                                        
                                                        
            
            
              
                