Removing data with tags from a vector

遇见更好的自我 2021-01-17 03:13

I have a string vector which contains HTML tags, e.g.:

  abc <- "welcome <span class=\"r\"><a href=\"abc\">abc</a></span> Have fun!"

2 answers
  • 2021-01-17 03:32

    Try

    gsub("<[^>]*>", "", abc)
    

    What this says is: substitute every instance of a < followed by anything that isn't a >, up to and including the next >, with nothing.

    You can't just do gsub("<.*>","",abc) because regex quantifiers are greedy by default: the .* would match all the way to the last > in your text (and you'd lose the 'abc' in your example).
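
    A quick sketch of the difference, using the same example string that the other answer uses. The variable names negated, greedy, and lazy are only for illustration, and the lazy quantifier .*? is an alternative that the text above does not mention:

    ```r
    abc <- "welcome <span class=\"r\"><a href=\"abc\">abc</a></span> Have fun!"

    # Negated character class: each match stops at the first > after a <.
    negated <- gsub("<[^>]*>", "", abc)
    # "welcome abc Have fun!"

    # Greedy .*: a single match runs from the first < to the last >,
    # so the 'abc' between the tags is removed as well.
    greedy <- gsub("<.*>", "", abc)
    # "welcome  Have fun!"   (note the two spaces)

    # Lazy .*?: matches as little as possible, so it behaves like the
    # negated class on this input.
    lazy <- gsub("<.*?>", "", abc, perl = TRUE)
    # "welcome abc Have fun!"
    ```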

    This solution might fail if you've got a > inside your tags, but is <foo class=">" > legal? Doubtless someone will come up with another answer that involves parsing the HTML with a heavyweight XML package.
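
    For what it's worth, a literal > inside a quoted attribute value is allowed in HTML, so the failure case is real, if rare. A small sketch of what goes wrong (the string tricky is a made-up input, not something from the question):

    ```r
    # Hypothetical input: the attribute value itself contains a >.
    tricky <- "<foo class=\">\">text</foo>"

    # The pattern treats the > inside the quotes as the end of the tag,
    # so the rest of the opening tag is left behind in the output.
    stripped <- gsub("<[^>]*>", "", tricky)
    # "\">text"
    ```

    If your input can look like this, a real parser is probably the safer route.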

  • 2021-01-17 03:46

    You can convert your piece of HTML to an XML document with htmlParse or htmlTreeParse. You can then convert it to text, i.e., strip all the tags, with xmlValue.

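    abc <- "welcome <span class=\"r\"><a href=\"abc\">abc</a></span> Have fun!"
    library(XML)
    # asText=TRUE: abc holds the markup itself, not a file name or URL
    #doc <- htmlParse(abc, asText=TRUE)
    doc <- htmlTreeParse(abc, asText=TRUE)
    # Collect the text of the whole tree, which leaves the tags out
    xmlValue( xmlRoot(doc) )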
    

    If you also want to remove the contents of the links, you can use xmlDOMApply to transform the XML tree.

    # Replace each <span> node (and everything inside it) with a single space
    f <- function(x) if(xmlName(x) == "span") xmlTextNode(" ") else x
    d <- xmlDOMApply( xmlRoot(doc), f )
    xmlValue(d)
    