Parsing HTML containing &nbsp; (non-breaking space)


Question


I am using rvest to parse a website. I'm hitting a wall with these little non-breaking spaces. How does one remove the whitespace that is created by the &nbsp; entity in a parsed HTML document?

library("rvest")
library("stringr")  

minimal <- read_html("<!doctype html><title>blah</title> <p>&nbsp;foo")

bodytext <- minimal %>%
  html_node("body") %>% 
  html_text()

Now I have extracted the body text:

bodytext
[1] " foo"

However, I can't remove that pesky bit of whitespace!

str_trim(bodytext)

gsub(pattern = " ", "", bodytext)

Answer 1:


jdharrison answered:

gsub("\\W", "", bodytext)

and that will work, but you can also use:

gsub("[[:space:]]", "", bodytext)

which will remove all space characters: tab, newline, vertical tab, form feed, carriage return, space, and possibly other locale-dependent characters. It's a very readable alternative to more cryptic regex classes.
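
As a quick illustration (my example, not part of the original answer), the [[:space:]] class catches tabs, newlines, and ordinary spaces in a single pass:

# tab, newline, and space are all removed by the POSIX space class
gsub("[[:space:]]", "", "a\tb\nc d")
# [1] "abcd"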




Answer 2:


I have run into the same problem, and have settled on the simple substitution of

gsub(intToUtf8(160), "", bodytext)

(Edited to correct case.)
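
For context (my note, not part of the answer): 160 is the decimal Unicode code point of the non-breaking space, U+00A0, which is why intToUtf8(160) produces exactly the character that needs to be stripped:

# 160 (hex A0) is the code point of NO-BREAK SPACE
intToUtf8(160) == "\u00a0"
# [1] TRUE
utf8ToInt("\u00a0")
# [1] 160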




Answer 3:


The &nbsp; entity stands for "non-breaking space", which, in Unicode, is its own distinct character (U+00A0), different from a "regular" space (i.e. " "). Compare:

charToRaw(" foo")
# [1] 20 66 6f 6f
charToRaw(bodytext)
# [1] c2 a0 66 6f 6f

So you'd want to use one of the special character classes for whitespace. You can remove all whitespace with:

gsub("\\s", "", bodytext)

On Windows, I needed to make sure the encoding of the string was set properly:

Encoding(bodytext) <- "UTF-8"
gsub("\\s", "", bodytext)



Answer 4:


Posting this since I think it's the most robust approach.

I scraped a Wikipedia page and got this in my output (the leading character doesn't copy-paste reliably, so it is written with an explicit escape here):

x <- " California"

And gsub("\\s", "", x) didn't change anything, which raised a flag that something fishy was going on.

To investigate how exactly that character is stored/recognized in memory, I did:

dput(charToRaw(strsplit(x, "")[[1]][1]))
# as.raw(c(0xc2, 0xa0))

With this in hand, we can use gsub a bit more robustly than in the other solutions:

gsub(rawToChar(as.raw(c(0xc2, 0xa0))), "", x)
# [1] "California"

(@MrFlick's suggestion to set the encoding didn't work for me, and it's not clear where @shabbychef got the input 160 for intToUtf8; this approach can be generalized to other similar situations)
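
The byte-inspection step generalizes nicely. A small hypothetical helper (the name show_bytes is mine, not from the answer) lists the raw bytes of every character, which makes it easy to spot non-ASCII whitespace hiding in scraped text:

# hypothetical helper: list the raw bytes of each character in a string
show_bytes <- function(s) lapply(strsplit(s, "")[[1]], charToRaw)

show_bytes("\u00a0Cal")[1:2]
# [[1]]
# [1] c2 a0
#
# [[2]]
# [1] 43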




Answer 5:


Using rex may make this type of task a little simpler. Also, I am not able to reproduce your encoding problems; the following correctly substitutes the space regardless of encoding on my machine. (It is the same solution as [[:space:]], though, so it likely has the same issue for you.)

library("rex")

re_substitutes(bodytext, rex(spaces), "", global = TRUE)

#> [1] "foo"



Answer 6:


I was able to remove &nbsp; spaces at the beginning and end of strings with mystring %>% stringr::str_trim().
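
A quick check (my example, consistent with this answer): str_trim() treats U+00A0 as trimmable whitespace, so leading and trailing non-breaking spaces disappear, while interior ones are left alone:

library("stringr")
str_trim("\u00a0foo\u00a0")
# [1] "foo"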



Source: https://stackoverflow.com/questions/27237233/parsing-html-containing-nbsp-non-breaking-space
