I'm currently downloading an HTML page using the following code:
Try
    Dim req As System.Net.HttpWebRequest = DirectCast(WebRequest.Create(URL), HttpWebRequest)
    ...
Gap's site is wrong. The specific problem is that their page claims an encoding of Latin1 (ISO-8859-1), while the page uses character #146 which is not valid in ISO-8859-1.
That character is, however, valid in the Windows CP-1252 encoding (a superset of ISO-8859-1), where code #146 is the right single quotation mark. You'll see it as the apostrophe in "You'll find Petites and small sizes" in today's text on the Gap.com home page.
You can read http://en.wikipedia.org/wiki/Windows-1252 for more details. It turns out this kind of thing is a common problem on web pages whose content was originally saved in the CP-1252 encoding (e.g. copy/pasted from Word).
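The thread's code is VB.NET, but the byte-level point is language-independent; a quick Python sketch shows what byte 146 (0x92) becomes under each encoding:

```python
# Byte 0x92 (decimal 146) decoded under both encodings.
# Under CP-1252 it is the right single quotation mark (U+2019);
# under ISO-8859-1 / Latin-1 it is an invisible C1 control character (U+0092),
# which is why the page renders incorrectly when decoded as Latin-1.
as_cp1252 = b"\x92".decode("cp1252")    # '\u2019'
as_latin1 = b"\x92".decode("latin-1")   # '\x92'
print(repr(as_cp1252), repr(as_latin1))
```

This is why browsers commonly treat a declared ISO-8859-1 as CP-1252 in practice.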
Moral of the story here: always store internationalized text as Unicode in your database, and always emit HTML as UTF-8 from your web server!
Daniel,
Some pages don't even return a value in CharacterSet, so this approach is not entirely reliable.
Sometimes not even browsers can guess which encoding to use, so I don't think 100% encoding recognition is possible.
In my particular case, dealing with Spanish and Portuguese pages, I use the UTF-7 encoding and it works fine for me (áéíóúñÑêã... etc.).
Maybe you could first load a table mapping CharacterSet values to their corresponding Encoding, and fall back to a default encoding when CharacterSet is empty.
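The table-plus-default idea above can be sketched in a few lines; this Python version stands in for the .NET Encoding lookup, and the choice of cp1252 as the fallback is my assumption, not from the thread:

```python
import codecs

DEFAULT = "cp1252"  # assumed fallback; forgiving for most western pages

def pick_encoding(charset_from_header):
    """Map a server-reported charset name to a codec, with a default
    for missing or unrecognized values."""
    if not charset_from_header:
        return DEFAULT
    try:
        # codecs.lookup resolves aliases like "ISO-8859-1" to a canonical codec.
        return codecs.lookup(charset_from_header).name
    except LookupError:
        return DEFAULT

print(pick_encoding("ISO-8859-1"))  # a known charset resolves normally
print(pick_encoding(None))          # missing header -> default
print(pick_encoding("bogus"))       # unknown name -> default
```

In .NET the equivalent lookup would be System.Text.Encoding.GetEncoding, wrapped in the same try/fallback.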
The detectEncodingFromByteOrderMarks parameter in the StreamReader constructor may help a little, as it automatically detects some encodings from the very first bytes (the byte order mark).
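What that StreamReader flag does under the hood is simple BOM sniffing; a minimal Python sketch of the same check (the function name is mine):

```python
import codecs

def sniff_bom(data: bytes):
    """Return an encoding name inferred from a leading byte order mark,
    or None if no BOM is present."""
    # Check the longer UTF-32 BOMs first: the UTF-32-LE BOM starts with
    # the same two bytes as the UTF-16-LE BOM.
    for bom, name in [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]:
        if data.startswith(bom):
            return name
    return None
```

Note the limitation the answer hints at: most Latin-1/CP-1252 pages carry no BOM at all, so this only "helps a little".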
I believe HttpWebResponse has a CharacterSet property for this (its ContentEncoding property reflects the Content-Encoding header, e.g. gzip); map the reported character set to an Encoding and pass that to the constructor of your StreamReader.
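The server-reported character set comes from the Content-Type response header; a small Python sketch of extracting it, using the stdlib email-header machinery that http.client itself relies on (the function name is mine):

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type header value,
    e.g. 'text/html; charset=ISO-8859-1' -> 'iso-8859-1'.
    Returns None when no charset parameter is present."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lowercased, or None

print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))  # no charset declared
```

As the earlier answers note, the value may be absent or wrong (Gap.com's Latin-1 claim being the example), so this should feed the table-with-default lookup rather than be trusted outright.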