creating corpus from multiple html text files

Submitted by 旧街凉风 on 2019-12-02 20:38:23

Question


I have a list of HTML files: I took some texts from the web and read them in with read_html.

My file names look like this:

a1 <- read_html(link of the text) 
a2 <- read_html(link of the text) 
.
.
. ## until:
a100 <- read_html(link of the text)

I am trying to create a corpus with these.

Any ideas on how I can do it?

Thanks.


Answer 1:


You could preallocate a list beforehand (a plain atomic vector of NA cannot hold the xml_document objects that read_html returns):

text <- vector("list", 100)
text[[1]] <- read_html(link1)
...
text[[100]] <- read_html(link100)

Even better, organize your links as a vector. Then, as suggested in the comments, you can use lapply:

text <- lapply(links, read_html)

(here links is a character vector of the URLs).
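From there, one common way to build the corpus is with the tm package: extract the plain text from each parsed page and wrap it in a VCorpus. This is a sketch, not part of the original answer; two inline HTML strings stand in for the asker's real URLs so it runs on its own.

```r
library(rvest)  # read_html(), html_text()
library(tm)     # VCorpus(), VectorSource()

# 'links' would be your character vector of URLs; inline HTML
# strings stand in for them here so the example is self-contained.
links <- c("<p>first document</p>", "<p>second document</p>")

pages  <- lapply(links, read_html)                # list of xml_document objects
texts  <- vapply(pages, html_text, character(1))  # plain text of each page
corpus <- VCorpus(VectorSource(texts))            # one tm document per element
```

Each element of `texts` becomes one document in the corpus, ready for the usual tm transformations.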

It would be rather bad coding style to use assign:

# not a good idea
for (i in 1:100) assign(paste0("text", i), read_html(get(paste0("link", i))))

since it is slow and makes further processing awkward.




Answer 2:


I would suggest using purrr for this solution:

library(tidyverse)
library(purrr)
library(rvest)

files <- list.files("path/to/html_links", full.names = TRUE)

all_html <- tibble(file_path = files) %>% 
  mutate(filenames = basename(files)) %>% 
  mutate(text = map(file_path, read_html))

This is a nice way to keep track of which piece of text belongs to which file. It also makes document-level analyses, such as sentiment analysis, easy.
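To get from the parsed HTML in that tibble to plain text, one more map step suffices. This is a sketch assuming the column names from the answer above; two inline HTML strings stand in for files read from disk so it runs on its own.

```r
library(tidyverse)  # tibble(), mutate(), map(), map_chr()
library(rvest)      # read_html(), html_text()

# Stand-in for the tibble built above: inline HTML strings play
# the role of files read from "path/to/html_links".
all_html <- tibble(filenames = c("a1.html", "a2.html"),
                   text = map(c("<p>one</p>", "<p>two</p>"), read_html))

# Extract the plain text of each document into a character column.
all_html <- all_html %>%
  mutate(plain = map_chr(text, html_text))
```

The `plain` column can then feed straight into tm, tidytext, or any other text-analysis workflow, with the file name kept alongside each document.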



Source: https://stackoverflow.com/questions/53390843/creating-corpus-from-multiple-html-text-files
