UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-6: invalid data

后端未结

关注

 8  1199

how does the unicode thing works on python2? i just dont get it.

here i download data from a server and parse it for JSON.

Traceback (most recent cal


                      
              相关标签:


      
      
        
          8条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  悲&欢浪女        
                
              
                            
                2020-11-30 01:50
              
            
            
                                                                       
The solution to change the encoding to Latin1 / ISO-8859-1 solves an issue I observed with html2text.py as invoked on an output of tex4ht. I use that for an automated word count on LaTeX documents: tex4ht converts them to HTML, and then html2text.py strips them down to pure text for further counting through wc -w. Now, if, for example, a German "Umlaut" comes in through a literature database entry, that process would fail as html2text.py would complain e.g.

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 32243-32245: invalid data

Now these errors would then subsequently be particularly hard to track down, and essentially you want to have the Umlaut in your references section. A simple change inside html2text.py from

data = data.decode(encoding)

to

data = data.decode("ISO-8859-1")

solves that issue; if you're calling the script using the HTML file as first parameter, you can also pass the encoding as second parameter and spare the modification.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  春和景丽        
                
              
                            
                2020-11-30 01:51
              
            
            
                                                                       
Temporary workaround: unicode(urllib2.urlopen(url).read(), 'utf8') - this should work if what is returned  is UTF-8.

urlopen().read() return bytes and you have to decode them to unicode strings. Also it would be helpful to check the patch from http://bugs.python.org/issue4733
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     上一页
1
2
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复