’ is getting converted as “\u0092” by nokogiri in ruby on rails

前端未结

关注

 2  512

I have html page which has following line with some html entities like \"’\".

#Here I am not pasting whole html page content. just putting issue lin


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  醉酒成梦        
                
              
                            
                2021-01-02 16:11
              
            
            
                                                                       
they&#146;re


is wrong and should be avoided. If you want to use a close-single-quote there, to reproduce the typographical practice of rendering apostrophes as a slanted quote, then the correct character is U+2019 RIGHT SINGLE QUOTATION MARK, which can be written as &#x2019; or &#8217;. Or, if you're using UTF-8, just included verbatim as ’.

&#146; should refer to character U+0092, a little-used and pointless control character that typically renders as blank or a missing-glyph box. And indeed in XML, it does.

But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.

The problem is that Nokogiri doesn't know about this quirk, and takes character reference 146 at its word, ending up with the character 146 (\u0092) that you don't really want. I think Nokogiri is using libxml2 to parse HTML, so ultimately the proper fix would be to libxml2's htmlParseCharRef function, to substitute characters 128–159.

In the meantime you could perhaps try ‘fixing up’ character references manually with crude string substitution like &#146;->&#x2019; before parsing. It's a bit wrong, but at least in HTML the only other place you can have the markup sequence &#146; without it being a character reference would be in a comment, so hopefully it wouldn't matter if you changed the content there accidentally too.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  暖寄归人        
                
              
                            
                2021-01-02 16:25
              
            
            
                                                                       
Have you tried changing 

&amp;#146;


into

&#146;


i think the parser parses the ampersand first then concats it with "#146" and then parse them both. it's just an opinion though..I want this to be just a comment IDK how..lol 

Well I got the idea from focos in his answer post here, and the unicode from here.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复