I can't remove whitespace from a string.
My HTML is:
Cena pro Vás: 139 Kč
If I wanted to remove non-breaking spaces ("\u00A0", AKA &nbsp;), I'd do something like:
require 'nokogiri'
doc = Nokogiri::HTML("&nbsp;")
s = doc.text # => " "
# s is the NBSP
s.ord.to_s(16) # => "a0"
# and here's the translate changing the NBSP to a SPACE
s.tr("\u00A0", ' ').ord.to_s(16) # => "20"
So tr("\u00A0", ' ') gets you where you want to be: at this point, the NBSP is now a regular space. tr is extremely fast and easy to use.
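For the string in the question, the same idea can be sketched without Nokogiri. This assumes the text has already been extracted and that the gaps around and inside the price are NBSPs; the posted HTML does not show which spaces are non-breaking, so the sample string is a guess. The helper name `normalize_nbsp` is hypothetical.

```ruby
# Hypothetical helper: translate every NBSP (U+00A0) into a regular space,
# then let strip trim the now-ordinary leading/trailing whitespace.
def normalize_nbsp(text)
  text.tr("\u00A0", ' ').strip
end

# Assumed sample: the price text with NBSPs around it and before the currency.
price = "\u00A0Cena pro Vás: 139\u00A0Kč\u00A0"
cleaned = normalize_nbsp(price)

puts cleaned.inspect
puts cleaned.include?("\u00A0") # false once the NBSPs are translated
```

Translating first and stripping second matters: strip on its own would leave the NBSPs at the ends in place.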
An alternative is to pre-process the actual encoded entity "&nbsp;" before it's been extracted from the HTML. This is simplified, but it would work for an entire HTML file just as well as for a single entity in a string:
s = "&nbsp;"
s.gsub('&nbsp;', ' ') # => " "
Using a fixed string for the target is faster than using a regular expression:
s = "&nbsp;" * 10000
require 'fruity'
compare do
  fixed { s.gsub('&nbsp;', ' ') }
  regex { s.gsub(/&nbsp;/, ' ') }
end
# >> Running each test 4 times. Test will take about 1 second.
# >> fixed is faster than regex by 2x ± 0.1
Regular expressions are useful if you need their capability, but they can drastically slow code.
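If the fruity gem isn't available, a rough version of the same comparison can be run with the Benchmark module from Ruby's standard library. This is only a sketch: the repeat count `n` is an arbitrary choice, and the timings depend on the Ruby version and the machine, so no expected output is shown.

```ruby
require 'benchmark'

s = "&nbsp;" * 10_000
n = 100 # arbitrary repeat count, chosen only to get measurable timings

# Both forms must produce the same string; only the speed differs.
fixed_result = s.gsub('&nbsp;', ' ')
regex_result = s.gsub(/&nbsp;/, ' ')

Benchmark.bm(6) do |x|
  x.report('fixed') { n.times { s.gsub('&nbsp;', ' ') } }
  x.report('regex') { n.times { s.gsub(/&nbsp;/, ' ') } }
end
```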
strip only removes ASCII whitespace, and the character you've got here is a Unicode non-breaking space.
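A quick way to see the difference, using made-up sample strings rather than the asker's actual data:

```ruby
ascii  = " 139 "           # ordinary spaces on both sides
padded = "\u00A0139\u00A0" # NBSPs on both sides

ascii_stripped  = ascii.strip  # the ASCII spaces are removed
padded_stripped = padded.strip # the NBSPs are left untouched

puts ascii_stripped.length  # 3
puts padded_stripped.length # 5
```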
Removing the character is easy. You can use gsub by providing a regex with the character code:
gsub(/\u00a0/, '')
You could also call
gsub(/[[:space:]]/, '')
to remove all Unicode whitespace. For details, check the Regexp documentation.
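To compare the options side by side, here is a small sketch on an invented sample string. It also shows that \s, unlike [[:space:]], does not match the NBSP, and adds an anchored variant (my own addition, not something the question asked for) that trims Unicode whitespace only at the ends, the way strip does for ASCII.

```ruby
# Invented sample: NBSP, space, "139", NBSP, "Kč", tab, NBSP.
text = "\u00A0 139\u00A0Kč\t\u00A0"

only_nbsp = text.gsub(/\u00a0/, '')      # removes just the NBSPs
ascii_ws  = text.gsub(/\s/, '')          # \s is ASCII-only, so the NBSPs remain
all_ws    = text.gsub(/[[:space:]]/, '') # removes ASCII and Unicode whitespace

# Trim only leading/trailing whitespace, keeping the NBSP inside the text.
trimmed = text.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')

puts only_nbsp.inspect
puts ascii_ws.inspect
puts all_ws.inspect
puts trimmed.inspect
```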