Question
I have a list of HTML files: I took some texts from the web and read them in with read_html.
My objects are named like this:
a1 <- read_html(link of the text)
a2 <- read_html(link of the text)
.
.
. ## until:
a100 <- read_html(link of the text)
I am trying to create a corpus with these.
Any ideas how I can do it?
Thanks.
Answer 1:
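You could allocate a list beforehand (a list rather than an atomic vector, because read_html returns a parsed document, not a single value):
text <- vector("list", 100)
text[[1]] <- read_html(link1)
...
text[[100]] <- read_html(link100)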
Even better is to organize your links as a vector. Then you can use lapply, as suggested in the comments:
text <- lapply(links, read_html)
(here links is a vector of the links).
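The list returned by lapply holds parsed documents, not text, so a corpus still needs the text pulled out of each one. Below is a minimal sketch of that step, assuming the rvest package is installed. The two literal HTML strings are hypothetical stand-ins for the real links (read_html also accepts literal markup), and selecting the "body" element is an assumption about where the text of interest lives.

```r
library(rvest)

# Hypothetical stand-ins for the real links: literal HTML strings,
# so the sketch runs without network access.
links <- c(
  a1 = "<html><body><p>First text.</p></body></html>",
  a2 = "<html><body><p>Second text.</p></body></html>"
)

# Parse every source; the result is a named list of parsed documents.
pages <- lapply(links, read_html)

# Extract the text of each page's <body> into a named character vector.
texts <- vapply(
  pages,
  function(p) html_text(html_element(p, "body")),
  character(1)
)
```

A named character vector like texts is a convenient starting point for most corpus tools. With the tm package, for instance, it could be passed on as VCorpus(VectorSource(texts)).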
Using assign for this would be rather bad coding style:
# not a good idea
for (i in 1:100) assign(paste0("text", i), read_html(get(paste0("link", i))))
It is rather slow, and the 100 separate objects are hard to process further.
Answer 2:
I would suggest using purrr for this solution:
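library(tidyverse)  # attaches tibble, dplyr and purrr
library(rvest)
# full paths of the saved HTML files
files <- list.files("path/to/html_links", full.names = TRUE)
# one row per file: path, file name, and the parsed document
all_html <- tibble(file_path = files) %>%
  mutate(filenames = basename(file_path)) %>%
  mutate(text = map(file_path, read_html))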
This is a nice way to keep track of which piece of text belongs to which file. It also makes sentiment analysis, or any other type of analysis, easy at the document level.
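As a rough illustration of document-level work on such a tibble, the sketch below stores plain text instead of parsed documents and adds a per-file word count. The temporary directory, the two sample files, the column name n_words, and the choice of the "body" element are all hypothetical, chosen only to keep the example self-contained.

```r
library(tibble)
library(dplyr)
library(purrr)
library(rvest)

# Hypothetical sample files in a temporary directory.
html_dir <- file.path(tempdir(), "html_links")
dir.create(html_dir, showWarnings = FALSE)
writeLines("<html><body><p>Good day.</p></body></html>",
           file.path(html_dir, "a1.html"))
writeLines("<html><body><p>Bad weather today.</p></body></html>",
           file.path(html_dir, "a2.html"))

files <- list.files(html_dir, pattern = "\\.html$", full.names = TRUE)

corpus <- tibble(file_path = files) %>%
  mutate(
    filename = basename(file_path),
    # Keep the extracted text rather than the parsed document.
    text = map_chr(file_path, ~ html_text(html_element(read_html(.x), "body"))),
    # Rough whitespace-based word count per document.
    n_words = lengths(strsplit(trimws(text), "\\s+"))
  )
```

Each row stays tied to its file name, so later per-document results can be joined back or summarised without extra bookkeeping.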
Source: https://stackoverflow.com/questions/53390843/creating-corpus-from-multiple-html-text-files