Question
I have multiple txt files and I want to turn them into tidy data. To do that, I first create a corpus (I am not sure whether this is the right way to do it). I wrote the following code to build the corpus:
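folder <- "C:\\Users\\user\\Desktop\\text analysis\\doc"
# pattern is a regular expression, so escape the dot and anchor the extension
filelist <- list.files(path = folder, pattern = "\\.txt$")
# build the full path of each file
filelist <- file.path(folder, filelist)
typeof(filelist)  # "character"
# read each file as a vector of lines, then collapse each file into one string
a <- lapply(filelist, FUN = readLines)
corpus <- lapply(a, FUN = paste, collapse = " ")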
When I check class(corpus), it returns list. From that point, how can I create tidy data?
Answer 1:
If you have text files and you want tidy data, I would go straight from one to the other and not bother with the tm package in between.
To find all the text files within a working directory, you can use list.files with a pattern argument:
all_txts <- list.files(pattern = "\\.txt$")
The all_txts object will then be a character vector that contains all your filenames.
Then, you can set up a pipe to read in all the text files and unnest them using tidytext with a map function from purrr. You can use a mutate() within the map() to annotate each row with the filename, if you'd like.
library(tidyverse)
library(tidytext)
# tibble() replaces the deprecated data_frame()
map_df(all_txts, ~ tibble(txt = read_file(.x)) %>%
  mutate(filename = basename(.x)) %>%
  unnest_tokens(word, txt))
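To make the "one token per row" shape concrete, here is a rough base-R approximation of what the pipe above produces. It is only a sketch: the two example texts are made up, and the splitting rule (lowercase, then split on non-alphanumeric characters) is a simplification and not tidytext's actual tokenizer.

```r
# Made-up example text standing in for the contents of two files
docs <- data.frame(filename = c("a.txt", "b.txt"),
                   txt = c("Hello, tidy world!", "One token per row."),
                   stringsAsFactors = FALSE)

# Lowercase, then split on runs of non-alphanumeric characters
# (a simplification of real tokenization)
tokens <- strsplit(tolower(docs$txt), "[^[:alnum:]]+")

# Repeat each filename once per token so every row holds a single word
tidy_demo <- data.frame(filename = rep(docs$filename, lengths(tokens)),
                        word = unlist(tokens),
                        stringsAsFactors = FALSE)
tidy_demo
```

Each row of tidy_demo pairs one word with the file it came from, which is the layout that functions such as count() or group_by() expect.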
Answer 2:
Looking at your other question as well, you need to read up on text mining and on how to read in files. Your result is currently a list. That is not a bad object in itself, but it is not what you need here. Instead of lapply, use sapply in your last line, like this:
corpus <- sapply(a , FUN = paste, collapse = " ")
This will return a character vector. Next you need to turn this into a data.frame. I added the filelist to the data.frame to keep track of which text belongs to which document.
my_data <- data.frame(files = filelist, text = corpus, stringsAsFactors = FALSE)
and then use tidytext to continue:
library(tidytext)
tidy_text <- unnest_tokens(my_data, words, text)
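The lapply versus sapply difference can be checked with a minimal base-R sketch. The two files below are throwaway examples written to a temporary directory, and all the demo_ names are hypothetical, not part of the original question.

```r
# Hypothetical demo files in a temporary directory (not the asker's data)
demo_dir <- file.path(tempdir(), "corpus_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("first line", "second line"), file.path(demo_dir, "a.txt"))
writeLines("only line", file.path(demo_dir, "b.txt"))

# full.names = TRUE returns ready-to-use paths, so no paste() is needed
demo_files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
demo_lines <- lapply(demo_files, readLines)

# lapply always returns a list; sapply simplifies to a character vector here
as_list   <- lapply(demo_lines, paste, collapse = " ")
as_vector <- sapply(demo_lines, paste, collapse = " ")

class(as_list)    # "list"
class(as_vector)  # "character"

# The character vector drops straight into a data.frame column
demo_data <- data.frame(files = basename(demo_files), text = as_vector,
                        stringsAsFactors = FALSE)
```

The character vector gives the data.frame one plain text column with one string per document, which is the kind of input unnest_tokens is applied to above; the list from lapply would first have to be flattened, for example with unlist().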
Using the tm and tidytext packages
If you use the tm package, you can read everything in like this:
library(tm)
folder <- getwd() # <-- here goes your folder
corpus <- VCorpus(DirSource(directory = folder,
                            pattern = "\\.txt$"))  # pattern is a regular expression
which you could turn into tidytext like this:
library(tidytext)
tidy_corpus <- tidy(corpus)
tidy_text <- unnest_tokens(tidy_corpus, words, text)
Source: https://stackoverflow.com/questions/54850258/creating-corpus-from-multiple-txt-files