Creating a corpus from multiple HTML text files

Submitted on 2019-12-02 10:21:54

You could preallocate a container beforehand. Note that read_html() returns an xml_document, so use a list rather than an atomic vector:

text <- vector("list", 100)
text[[1]] <- read_html(link1)
...
text[[100]] <- read_html(link100)

Even better, organize your links as a vector. Then, as suggested in the comments, you can use lapply:

text <- lapply(links, read_html)

(here links is a character vector of URLs).
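
A minimal sketch of the whole pattern, assuming two hypothetical URLs (replace them with your own); html_text() then flattens each parsed page to plain text:

library(rvest)

# hypothetical URLs, for illustration only
links <- c("https://example.com/page1",
           "https://example.com/page2")

docs  <- lapply(links, read_html)               # one xml_document per link
texts <- vapply(docs, html_text, character(1))  # plain text of each page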

It would be rather bad coding style to use assign:

# not a good idea
for (i in 1:100) assign(paste0("text", i), read_html(get(paste0("link", i))))

since this is slow and leaves you with 100 separate objects that are hard to process further.
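
If you already have such numbered objects, a quick sketch of collecting them back into a single list with base R's mget() (assuming text1 through text100 exist in the calling environment):

# gather text1 ... text100 into one list, then work with the list instead
text_list <- mget(paste0("text", 1:100))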

I would suggest using purrr for this solution:

library(tidyverse)
library(purrr)   # already loaded by tidyverse; kept here for clarity
library(rvest)

# all saved HTML files, with their full paths
files <- list.files("path/to/html_links", full.names = TRUE)

all_html <- tibble(file_path = files) %>% 
  mutate(filenames = basename(files)) %>%    # keep the bare file name
  mutate(text = map(file_path, read_html))   # parse each file into a list-column

This is a nice way to keep track of which piece of text belongs to which file. It also makes document-level analyses, such as sentiment analysis, easy.
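
As a follow-up sketch building on the all_html tibble above, you can flatten each parsed document into plain text in one more step (html_text() is from rvest, map_chr() from purrr):

# one plain-text string per document, alongside its file name
all_html <- all_html %>% 
  mutate(plain_text = map_chr(text, html_text))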
