I can't remove whitespaces from a string parsed by Nokogiri


I can't remove whitespace from a string.

My HTML is:


Cena pro Vás: 139 Kč


                      
2 Answers

闹比i, 2021-02-09 07:14

If I wanted to remove non-breaking spaces ("\u00A0", a.k.a. &nbsp;), I'd do something like:

require 'nokogiri'

doc = Nokogiri::HTML("&nbsp;")

s = doc.text # => " "

# s is the NBSP
s.ord.to_s(16)                   # => "a0"

# and here's the translate changing the NBSP to a SPACE
s.tr("\u00A0", ' ').ord.to_s(16) # => "20"


So tr("\u00A0", ' ') gets you where you want to be: at this point, the NBSP is an ordinary space.

tr is extremely fast and easy to use.
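
Applied to the price string from the question, a minimal sketch might look like the following. It assumes the extracted text has NBSPs around and inside the price, which is a guess because the original markup isn't shown; the variable names are only illustrative.

```ruby
# Assumption: the text extracted from the page contains NBSPs (U+00A0).
# They are written as explicit escapes here so they are visible.
raw = "\u00A0Cena pro Vás:\u00A0139\u00A0Kč\u00A0"

# Translate every NBSP to an ordinary space, then let strip and squeeze
# handle the (now ASCII) whitespace as usual.
clean = raw.tr("\u00A0", ' ').strip.squeeze(' ')

puts clean
```

If the NBSPs should disappear entirely rather than become spaces, String#delete accepts the same kind of character argument as tr.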

An alternative is to pre-process the encoded entity "&nbsp;" before the text has been extracted from the HTML. This is simplified, but it would work for an entire HTML file just as well as for a single entity in a string:

s = "&nbsp;"

s.gsub('&nbsp;', ' ') # => " "


Using a fixed string for the target is faster than using a regular expression:

s = "&nbsp;" * 10000

require 'fruity'

compare do
  fixed { s.gsub('&nbsp;', ' ') }
  regex { s.gsub(/&nbsp;/, ' ') }
end

# >> Running each test 4 times. Test will take about 1 second.
# >> fixed is faster than regex by 2x ± 0.1


Regular expressions are useful if you need their capability, but they can slow code down drastically.
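
For anyone without the fruity gem installed, roughly the same comparison can be sketched with the standard library's Benchmark module. The iteration count of 100 is an arbitrary choice, and the timings will differ between machines and Ruby versions, so treat the numbers as indicative only.

```ruby
require 'benchmark'

s = "&nbsp;" * 10_000

fixed = nil
regex = nil

# Time 100 runs of each variant; the loop count is an arbitrary choice.
fixed_time = Benchmark.realtime { 100.times { fixed = s.gsub('&nbsp;', ' ') } }
regex_time = Benchmark.realtime { 100.times { regex = s.gsub(/&nbsp;/, ' ') } }

puts format("fixed: %.4fs  regex: %.4fs", fixed_time, regex_time)
```

Both variants produce the same string, so the choice between them is purely about speed and readability.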
                                                                        
                                                        
            
            
              
                
慢半拍i, 2021-02-09 07:15

strip only removes ASCII whitespace and the character you've got here is a Unicode non-breaking space. 
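
A quick way to confirm this is to check whether strip changes the string and to look at the code point of the offending character. The sample string below is an assumption standing in for the text extracted from the page.

```ruby
# Assumption: a stand-in for the extracted text, with an NBSP at each end.
s = "\u00A0139 Kč\u00A0"

stripped = s.strip
unchanged = (stripped == s)

puts unchanged                     # => true, strip left the NBSPs alone
puts s.codepoints.first.to_s(16)   # => a0
```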

Removing the character is easy. You can use gsub by providing a regex with the character code:

gsub(/\u00a0/, '')


You could also call 

gsub(/[[:space:]]/, '')


to remove all Unicode whitespace. For details, check the Regexp documentation.
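
The difference from \s is worth spelling out, since \s in Ruby matches only ASCII whitespace. The sketch below uses an assumed sample string and also shows a strip-like variant that trims only the ends and leaves inner whitespace alone.

```ruby
# Assumption: a sample string with NBSPs and ordinary spaces mixed together.
s = "\u00A0 139\u00A0Kč \u00A0"

# \s is ASCII-only, so it does not see the leading NBSP.
ascii_match   = s.match?(/\A\s/)
# The POSIX bracket class is Unicode-aware and does match it.
unicode_match = s.match?(/\A[[:space:]]/)

# Remove every whitespace character, Unicode included.
no_space = s.gsub(/[[:space:]]/, '')

# Trim only leading and trailing whitespace, like a Unicode-aware strip.
trimmed = s.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')

puts ascii_match, unicode_match
puts no_space
puts trimmed.inspect
```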
                                                                        
                                                        
            
            
              
                