I can't remove whitespaces from a string parsed by Nokogiri

有刺的猬 2021-02-09 06:34

I can't remove whitespaces from a string.

My HTML is:

Cena pro Vás: 139 

2 answers
  •  闹比i (OP)
    2021-02-09 07:14

    If I wanted to remove non-breaking spaces ("\u00A0", AKA &nbsp;), I'd do something like:

    require 'nokogiri'
    
    doc = Nokogiri::HTML("&nbsp;")
    
    s = doc.text # => " "
    
    # s is the NBSP
    s.ord.to_s(16)                   # => "a0"
    
    # and here's the translate changing the NBSP to a SPACE
    s.tr("\u00A0", ' ').ord.to_s(16) # => "20"
    

    So tr("\u00A0", ' ') gets you where you want to be: at this point, the NBSP is now a regular space.

    tr is extremely fast and easy to use.
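
    Since the question is about removing the whitespace rather than only converting it, here is a minimal sketch of two possible follow-ups. String#strip on its own leaves an NBSP in place, so the NBSP is translated to a space first; alternatively, the POSIX class [[:space:]] matches an NBSP in a UTF-8 string, which \s does not. The helper names normalize_nbsp and remove_all_whitespace and the sample string are made up for illustration:

    ```ruby
    # Hypothetical helper: turn NBSPs into ordinary spaces, then trim both ends.
    def normalize_nbsp(text)
      text.tr("\u00A0", ' ').strip
    end

    # Hypothetical helper: delete every whitespace character, NBSP included.
    def remove_all_whitespace(text)
      text.gsub(/[[:space:]]+/, '')
    end

    # Sample text ending in an NBSP, as Nokogiri would return it for &nbsp;
    price = "Cena pro Vás: 139\u00A0"

    normalize_nbsp(price)        # => "Cena pro Vás: 139"
    remove_all_whitespace(price) # => "CenaproVás:139"
    ```

    The first form keeps the spaces between words and only cleans up the ends; the second removes all whitespace, which is probably only what you want for a bare number.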

    An alternative is to pre-process the actual encoded entity "&nbsp;" before it's been extracted from the HTML. This is simplified, but it'd work for an entire HTML file just as well as for a single entity in a string:

    s = "&nbsp;"
    
    s.gsub('&nbsp;', ' ') # => " "
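
    One thing to watch for when pre-processing raw HTML: the same character can also be written as a numeric entity such as &#160; or &#xA0;. A rough sketch that covers those spellings with fixed-string replacements follows; the helper name replace_nbsp_entities is hypothetical, and it only handles the exact spellings listed:

    ```ruby
    # Hypothetical helper: replace a few common spellings of the NBSP entity
    # in raw HTML with a plain space, using fixed strings rather than a regex.
    def replace_nbsp_entities(html)
      entities = ['&nbsp;', '&#160;', '&#xA0;', '&#xa0;']
      entities.reduce(html) { |result, entity| result.gsub(entity, ' ') }
    end

    replace_nbsp_entities('139&nbsp;')       # => "139 "
    replace_nbsp_entities('a&#160;b&#xA0;c') # => "a b c"
    ```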
    

    Using a fixed string for the target is faster than using a regular expression:

    s = "&nbsp;" * 10000
    
    require 'fruity'
    
    compare do
      fixed { s.gsub('&nbsp;', ' ') }
      regex { s.gsub(/&nbsp;/, ' ') }
    end
    
    # >> Running each test 4 times. Test will take about 1 second.
    # >> fixed is faster than regex by 2x ± 0.1
    

    Regular expressions are useful if you need their capability, but they can slow code noticeably when a fixed string would do.
