I can't remove whitespace from a string parsed by Nokogiri

有刺的猬 2021-02-09 06:34

I can't remove whitespace from a string.

My HTML is:

Cena pro Vás: 139 

2 Answers
  • 2021-02-09 07:14

    If I wanted to remove non-breaking spaces ("\u00A0", AKA &nbsp;), I'd do something like:

    require 'nokogiri'
    
    doc = Nokogiri::HTML("&nbsp;")
    
    s = doc.text # => " "
    
    # s is the NBSP
    s.ord.to_s(16)                   # => "a0"
    
    # and here's the translate changing the NBSP to a SPACE
    s.tr("\u00A0", ' ').ord.to_s(16) # => "20"
    

    So tr("\u00A0", ' ') gets you where you want to be: at this point, the NBSP is now a regular space, which strip can remove.

    tr is extremely fast and easy to use.
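
Applied to text like the price in the question, the same idea might look like the sketch below. The sample string is an assumption (the visible text followed by an NBSP), since the original markup isn't shown. `tr` turns the NBSP into a plain space so that `strip` can remove it, and `delete` drops it outright.

```ruby
# Hypothetical sample: the question's text with a trailing NBSP (U+00A0).
price = "Cena pro Vás: 139\u00A0"

# strip alone leaves the NBSP in place, as the question describes.
stripped_only = price.strip

# Translate NBSP to an ordinary space, then strip as usual.
cleaned = price.tr("\u00A0", ' ').strip

# Or delete every NBSP outright, wherever it appears in the string.
deleted = price.delete("\u00A0")

p stripped_only.end_with?("\u00A0") # the NBSP survives strip
p cleaned
p deleted
```

Whether to use `tr` plus `strip` or `delete` depends on whether NBSPs inside the text should be kept as ordinary spaces or dropped entirely.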

    An alternative is to pre-process the actual encoded entity "&nbsp;" before it's been extracted from the HTML. This is simplified, but it'd work for an entire HTML file just as well as a single entity in the string:

    s = "&nbsp;"
    
    s.gsub('&nbsp;', ' ') # => " "
    

    Using a fixed string for the target is faster than using a regular expression:

    s = "&nbsp;" * 10000
    
    require 'fruity'
    
    compare do
      fixed { s.gsub('&nbsp;', ' ') }
      regex { s.gsub(/&nbsp;/, ' ') }
    end
    
    # >> Running each test 4 times. Test will take about 1 second.
    # >> fixed is faster than regex by 2x ± 0.1
    

    Regular expressions are useful if you need their capability, but they can drastically slow code.

  • 2021-02-09 07:15

    strip only removes ASCII whitespace and the character you've got here is a Unicode non-breaking space.

    Removing the character is easy. You can use gsub by providing a regex with the character code:

    gsub(/\u00a0/, '')
    

    You could also call

    gsub(/[[:space:]]/, '')
    

    to remove all Unicode whitespace. For details, check the Regexp documentation.
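
To see how these options differ, here is a small sketch on a made-up string that mixes ordinary spaces with an NBSP. The input and the variable names are assumptions for illustration. The last line is a variant not mentioned above: it anchors `[[:space:]]` to the ends of the string so that inner spaces are kept.

```ruby
# Hypothetical input: a leading space, an inner space, and a trailing NBSP.
s = " 1 390\u00A0"

# strip removes only the leading ASCII space; the NBSP stays.
ascii_stripped = s.strip

# Removes only the NBSP; ordinary spaces stay.
nbsp_removed = s.gsub(/\u00a0/, '')

# Removes every whitespace character, the NBSP included.
all_removed = s.gsub(/[[:space:]]/, '')

# Trims whitespace (including NBSP) at both ends, keeping inner spaces.
ends_trimmed = s.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')

p ascii_stripped, nbsp_removed, all_removed, ends_trimmed
```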
