Question
I have a list of URLs and have extracted the content as follows:
library(httr)
library(stringr)  # provides str_extract_all
link="http://www.workerspower.net/disposable-workers-the-real-price-of-sweat-shop-labor"
get.link=GET(link)
get.content=content(get.link,as="text")
extract.content=str_extract_all(get.content,"<p>(.*?)</p>")
This gives a "list of 1" containing the text. The length of each list element varies with the URL. I would like to bind the URL [link] to the content [extract.content], turn the result into a data frame, and then import that into a Corpus. My attempts fail; e.g., this does not work because the row lengths differ:
all=data.frame(url.vec=c(link1,link2),text.vec=c(extract.content1,extract.content2))
Does anyone know how to combine a character vector with a list of character vectors?
Answer 1:
I would do this using the XML package. You should avoid using regular expressions on HTML/XML documents; use XPath instead. Here is a small function that, given a link, creates the corpus.
library(XML)
library(tm)
create.corpus <- function(link){
  doc <- htmlParse(link)                      # parse the HTML page
  parag <- xpathSApply(doc,'//p',xmlValue)    # text of every <p> node
  cc <- Corpus(VectorSource(parag))           # one document per paragraph
  meta(cc,type='corpus','link') <- link       # store the URL as corpus metadata
  cc
}
## call it
cc <- create.corpus(link)
Inspecting the result:
meta(cc,type='corpus')
# $create_date
# [1] "2014-01-03 17:40:50 GMT"
#
# $creator
# [1] ""
#
# $link
# [1] "http://www.workerspower.net/disposable-workers-the-real-price-of-sweat-shop-labor"
cc
# A corpus with 36 text documents
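If you still want the data frame from the question, one option is to repeat each URL once per extracted paragraph, so both columns have the same length, and then stack the per-URL pieces. This is only a minimal base-R sketch: `bind.url.text`, the `example.com` URLs and the toy content lists are made up for illustration and stand in for your `link1`/`link2` and `extract.content1`/`extract.content2`.

```r
# Hypothetical helper (not part of the original code): builds one row per
# paragraph and repeats the URL so both columns have equal length.
bind.url.text <- function(url, paragraphs) {
  paragraphs <- as.character(unlist(paragraphs))  # flatten the "list of 1"
  data.frame(url.vec = rep(url, length(paragraphs)),
             text.vec = paragraphs,
             stringsAsFactors = FALSE)
}

# Toy inputs standing in for the real URLs and str_extract_all() results
links <- c("http://example.com/page1", "http://example.com/page2")
contents <- list(list(c("<p>first</p>", "<p>second</p>")),
                 list(c("<p>third</p>")))

# Build one data frame per URL, then stack them
all.df <- do.call(rbind, Map(bind.url.text, links, contents))
rownames(all.df) <- NULL
```

With the toy inputs, `all.df` has one row per paragraph, and `all.df$text.vec` can be passed to `VectorSource()` if you want a corpus out of it.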
Source: https://stackoverflow.com/questions/20909357/bind-character-vector-to-list-into-dataframe