Parsing HTML containing &nbsp; (non-breaking space)

轻奢々 2020-12-16 23:19

I am using rvest to parse a website. I'm hitting a wall with these little non-breaking spaces. How does one remove the whitespace that is created by the &nbsp;?

6 answers
  • 2020-12-16 23:56

    Using rex may make this type of task a little simpler. I am not able to reproduce your encoding problems: the following correctly substitutes the space regardless of encoding on my machine. (It is the same solution as [[:space:]], though, so it will likely have the same issue for you.)

    library(rex)
    re_substitutes(bodytext, rex(spaces), "", global = TRUE)
    
    #> [1] "foo"
    
  • 2020-12-17 00:10

    Posting this since I think it's the most robust approach.

    I scraped a Wikipedia page and got this in my output (not sure if it'll copy-paste properly):

    x <- "\u00a0California"  # leading character is U+00A0, written as an escape so it survives copy-paste
    

    And gsub("\\s", "", x) didn't change anything, which raised the flag that something fishy is going on.

    To investigate, I did:

    dput(charToRaw(strsplit(x, "")[[1]][1]))
    # as.raw(c(0xc2, 0xa0))
    

    This shows exactly how that character is stored in memory: the bytes 0xc2 0xa0 are the UTF-8 encoding of U+00A0, the non-breaking space.

    With this in hand, we can use gsub a bit more robustly than in the other solutions:

    gsub(rawToChar(as.raw(c(0xc2, 0xa0))), "", x)
    # [1] "California"
    

    (@MrFlick's suggestion to set the encoding didn't work for me, and it's not clear where @shabbychef got the input 160 for intToUtf8; this approach can be generalized to other similar situations)
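
    As a rough sketch of how the byte-level approach and the intToUtf8(160) approach relate: 160 is the decimal value of the code point U+00A0 (0xA0 in hex), and c2 a0 is that code point encoded as UTF-8. The variable names nbsp, x and cleaned below are illustrative only, and fixed = TRUE is an added assumption so the character is matched literally rather than as a regex.

    ```r
    # U+00A0 (NO-BREAK SPACE): decimal 160, hex 0xA0.
    nbsp <- intToUtf8(160)

    utf8ToInt(nbsp)             # code point as an integer: 160
    charToRaw(enc2utf8(nbsp))   # UTF-8 bytes: c2 a0

    # Hypothetical input with a leading non-breaking space.
    x <- "\u00a0California"

    # fixed = TRUE matches the character literally, with no regex involved.
    cleaned <- gsub(nbsp, "", x, fixed = TRUE)
    cleaned
    ```

    The same pattern should carry over to other invisible characters: inspect the bytes with charToRaw, look up the code point, and substitute it literally.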

  • 2020-12-17 00:11

    jdharrison answered:

    gsub("\\W", "", bodytext)
    

    That will work (note that \\W strips every non-word character, punctuation included), but you can also use:

    gsub("[[:space:]]", "", bodytext)
    

    which will remove all Space characters: tab, newline, vertical tab, form feed, carriage return, space and possibly other locale-dependent characters. It's a very readable alternative to other, cryptic regex classes.

  • 2020-12-17 00:11

    &nbsp; stands for "non-breaking space", which in Unicode is a distinct character from a "regular" space (i.e. " "). Compare

    charToRaw(" foo")
    # [1] 20 66 6f 6f
    charToRaw(bodytext)
    # [1] c2 a0 66 6f 6f
    

    So you'd want to use one of the special character classes for whitespace. You can remove all whitespace with

    gsub("\\s", "", bodytext)
    

    On Windows, I needed to make sure the encoding of the string was set properly:

    Encoding(bodytext) <- "UTF-8"
    gsub("\\s", "", bodytext)
    
  • 2020-12-17 00:12

    I have run into the same problem, and have settled on the simple substitution of

    gsub(intToUtf8(160), "", bodytext)
    

    (Edited to correct case.)

  • 2020-12-17 00:13

    I was able to remove &nbsp; spaces at the beginning and end of strings with mystring %>% stringr::str_trim().
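
    A minimal sketch of what that looks like, assuming the stringr package is installed; the string padded is a made-up example. Note that str_trim only touches the ends of the string, so a non-breaking space in the middle is left in place and needs a separate replacement.

    ```r
    library(stringr)

    # Hypothetical input: non-breaking spaces at both ends and one in the middle.
    padded <- "\u00a0foo\u00a0bar\u00a0"

    # Trims whitespace (including U+00A0) from the start and end only.
    trimmed <- str_trim(padded)

    # Removes every non-breaking space, wherever it occurs.
    stripped <- str_replace_all(padded, "\u00a0", "")
    ```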
