bind character vector to list into dataframe

只愿长相守 submitted on 2019-12-14 03:18:30

Question


I have a list of URLs and have extracted the content as follows:

library(httr)
library(stringr)  # provides str_extract_all
link="http://www.workerspower.net/disposable-workers-the-real-price-of-sweat-shop-labor"
get.link=GET(link)
get.content=content(get.link,as="text")
extract.content=str_extract_all(get.content,"<p>(.*?)</p>")

This gives a "list of 1" containing a character vector of the matched text. The length of that vector varies with the URL. I would like to bind the URL [link] to the content [extract.content], turn the result into a dataframe, and then import that into a Corpus. My attempts fail; e.g., this does not work because the columns have different numbers of rows:

all=data.frame(url.vec=c(link1,link2),text.vec=c(extract.content1,extract.content2))

Does anyone know how to combine a character vector with a list of character vectors?


Answer 1:


I would do this using the XML package. You should avoid using regular expressions on HTML/XML documents; use XPath instead. Here I create a small function that, given a link, creates the corpus.

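library(XML)
library(tm)
create.corpus <- function(link){
  ## parse the page and extract the text of every <p> node
  doc <- htmlParse(link)
  parag <- xpathSApply(doc,'//p',xmlValue)
  ## one document per paragraph; store the URL as corpus-level metadata
  cc <- Corpus(VectorSource(parag))
  meta(cc,type='corpus','link') <- link
  cc
}
## call it
cc <- create.corpus(link)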
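
To illustrate why XPath is preferable to the regular expression from the question, here is a minimal sketch on a made-up inline HTML string (the `html` value and the names `regex.hits` and `xpath.hits` are assumptions for illustration, not part of the original page). It assumes the XML package is installed.

```r
library(XML)

# Hypothetical HTML snippet: the second paragraph has an attribute
# and its text spans two lines.
html <- "<html><body><p>one</p><p class='x'>two\nlines</p></body></html>"

# The pattern from the question: '<p>' does not match '<p class=...>',
# and '.' does not match a newline under PCRE defaults,
# so only the first paragraph is found.
regex.hits <- regmatches(html, gregexpr("<p>(.*?)</p>", html, perl = TRUE))[[1]]

# XPath on the parsed document finds both paragraphs
# and returns their text without the surrounding tags.
doc <- htmlParse(html, asText = TRUE)
xpath.hits <- xpathSApply(doc, "//p", xmlValue)

length(regex.hits)  # 1
length(xpath.hits)  # 2
```

The regular expression silently drops the second paragraph, while the XPath query returns both.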

Inspecting the result:

meta(cc,type='corpus')
# $create_date
# [1] "2014-01-03 17:40:50 GMT"
# 
# $creator
# [1] ""
# 
# $link
# [1] "http://www.workerspower.net/disposable-workers-the-real-price-of-sweat-shop-labor"

cc
# A corpus with 36 text documents
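
If a dataframe is still wanted, as in the question, one possible approach is to repeat each URL once per extracted paragraph so that both columns end up the same length. The sketch below uses base R only; the URLs, the `paragraphs` list, and the name `url.text` are made-up placeholders standing in for the question's `link` and `extract.content` values.

```r
# Hypothetical inputs: two URLs and the paragraphs extracted from each page.
links <- c("http://example.com/a", "http://example.com/b")
paragraphs <- list(
  c("first", "second", "third"),  # page a yielded three paragraphs
  c("only one")                   # page b yielded one paragraph
)

# Repeat each URL as many times as its page has paragraphs,
# then flatten the list so every row pairs one URL with one paragraph.
url.text <- data.frame(
  url.vec  = rep(links, times = lengths(paragraphs)),
  text.vec = unlist(paragraphs),
  stringsAsFactors = FALSE
)

nrow(url.text)  # 4
```

This gives a long-format dataframe with one row per paragraph, which avoids the differing-row-lengths error from `data.frame` in the question.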


Source: https://stackoverflow.com/questions/20909357/bind-character-vector-to-list-into-dataframe
