Question
I am using rvest to parse a website. I'm hitting a wall with these little non-breaking spaces. How does one remove the whitespace that is created by the &nbsp; entity in a parsed HTML document?
library("rvest")
library("stringr")
minimal <- html("<!doctype html><title>blah</title> <p>&nbsp;foo")
bodytext <- minimal %>%
html_node("body") %>%
html_text
Now I have extracted the body text:
bodytext
[1] " foo"
However, I can't remove that pesky bit of whitespace! Neither of these has any effect:
str_trim(bodytext)
gsub(pattern = " ", "", bodytext)
Answer 1:
jdharrison answered:
gsub("\\W", "", bodytext)
and that will work, but you can also use:
gsub("[[:space:]]", "", bodytext)
which will remove all space characters: tab, newline, vertical tab, form feed, carriage return, space, and possibly other locale-dependent characters. It's a very readable alternative to other, cryptic regex classes.
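A minimal check of this, assuming a UTF-8 session; whether the POSIX class catches U+00A0 is locale- and platform-dependent, so a fixed-string substitution is shown alongside it as a portable fallback (the test string x is illustrative):
x <- "\u00a0foo"                      # a non-breaking space followed by "foo"
gsub("[[:space:]]", "", x)            # "foo" on many UTF-8 systems
gsub("\u00a0", "", x, fixed = TRUE)   # "foo" everywhere; matches only the NBSP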
Answer 2:
I have run into the same problem, and have settled on the simple substitution of
gsub(intToUtf8(160), '', bodytext)
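A short sketch of where the 160 comes from: it is the decimal code point of the non-breaking space, U+00A0, whose UTF-8 encoding is the c2 a0 byte pair seen in bodytext above (the test string is illustrative):
nbsp <- intToUtf8(160)                     # 160 == 0xA0, the code point of U+00A0
utf8ToInt(nbsp)                            # 160
charToRaw(nbsp)                            # c2 a0, the NBSP's UTF-8 bytes
gsub(nbsp, "", "\u00a0foo", fixed = TRUE)  # "foo"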
Answer 3:
The &nbsp; stands for "non-breaking space", which, in Unicode, is a character distinct from a "regular" space (i.e., " "). Compare:
charToRaw(" foo")
# [1] 20 66 6f 6f
charToRaw(bodytext)
# [1] c2 a0 66 6f 6f
So you'd want to use one of the special character classes for whitespace. You can remove all whitespace with
gsub("\\s", "", bodytext)
On Windows, I needed to make sure the encoding of the string was set properly:
Encoding(bodytext) <- "UTF-8"
gsub("\\s", "", bodytext)
Answer 4:
Posting this since I think it's the most robust approach.
I scraped a Wikipedia page and got this in my output (not sure if it'll copy-paste properly):
x <- " California"
and gsub("\\s", "", x) didn't change anything, which raised the flag that something fishy was going on.
To investigate, I did:
dput(charToRaw(strsplit(x, "")[[1]][1]))
# as.raw(c(0xc2, 0xa0))
to figure out how exactly that character is stored and recognized in memory.
With this in hand, we can use gsub
a bit more robustly than in the other solutions:
gsub(rawToChar(as.raw(c(0xc2, 0xa0))), "", x)
# [1] "California"
(@MrFlick's suggestion to set the encoding didn't work for me, and it's not clear where @shabbychef got the input 160 for intToUtf8; this approach can be generalized to other similar situations.)
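As a hedged generalization of the diagnostic step above, a small helper can dump the raw bytes of every character so invisible characters stand out (inspect_chars is this writeup's illustrative name, not part of the original answer):
# Hypothetical helper: show the raw bytes of each character in a string
inspect_chars <- function(s) {
  chars <- strsplit(s, "")[[1]]
  setNames(lapply(chars, charToRaw), chars)
}
inspect_chars("\u00a0California")  # the first element shows bytes c2 a0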
Answer 5:
Using rex may make this type of task a little simpler. Also, I am not able to reproduce your encoding problems; the following correctly substitutes the space regardless of encoding on my machine. (It is the same solution as [[:space:]], though, so it likely has the same issue for you.)
library("rex")
re_substitutes(bodytext, rex(spaces), "", global = TRUE)
#> [1] "foo"
Answer 6:
I was able to remove &nbsp; spaces at the beginning and end of strings with mystring %>% stringr::str_trim().
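A hedged note on scope: str_trim() only strips leading and trailing whitespace, so for non-breaking spaces inside a string, something like str_replace_all() with a fixed pattern is an option (the example string is illustrative):
library(stringr)
x <- "\u00a0foo\u00a0bar"                 # NBSP at the start and in the middle
str_trim(x)                               # removes only the leading NBSP
str_replace_all(x, fixed("\u00a0"), "")   # removes every NBSP: "foobar"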
Source: https://stackoverflow.com/questions/27237233/parsing-html-containing-nbsp-non-breaking-space