Creating a corpus from multiple txt files

Question

I have multiple txt files and I want to turn them into tidy data. To do that, I first create a corpus (I am not sure whether this is the right way to do it). I wrote the following code to build the corpus:

folder <- "C:\\Users\\user\\Desktop\\text analysis\\doc"
list.files(path = folder)
filelist <- list.files(path = folder, pattern = "*.txt")
paste(folder, "\\", filelist)
filelist <- paste(folder, "\\", filelist, sep = "")
typeof(filelist)
a <- lapply(filelist, FUN = readLines)
corpus <- lapply(a, FUN = paste, collapse = " ")

When I check class(corpus), it returns list. From that point, how can I create tidy data?

Answer 1:

If you have text files and you want tidy data, I would go straight from one to the other and not bother with the tm package in between.

To find all the text files in the working directory, you can use list.files with a pattern argument. The pattern is a regular expression, so the dot is escaped to match a literal ".":

all_txts <- list.files(pattern = "\\.txt$")

The all_txts object will then be a character vector that contains all your filenames.
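If the files live outside the working directory, as in the question, a sketch along the following lines may help. It uses a hypothetical temporary folder and made-up file names purely for illustration; `path` and `full.names` are standard arguments of list.files.

```r
# Hypothetical folder with sample files, created only for this illustration.
folder <- tempfile("docs_")
dir.create(folder)
writeLines("some text", file.path(folder, "a.txt"))
writeLines("more text", file.path(folder, "b.txt"))
writeLines("no extension", file.path(folder, "summarytxt"))

# "\\.txt$" is a regular expression: a literal dot, "txt", then end of name.
# An unescaped "." would match any character, so ".txt$" would also pick up
# "summarytxt". full.names = TRUE returns paths that can be read directly,
# without pasting the folder and file names together by hand.
all_txts <- list.files(path = folder, pattern = "\\.txt$", full.names = TRUE)

basename(all_txts)
```

With full.names = TRUE, the resulting vector can be passed straight to a reading function regardless of the current working directory.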

Then, you can set up a pipe that reads in all the text files and unnests them with tidytext, using a map function from purrr. You can use a mutate() within the map() to annotate each row with the filename, if you'd like.

library(tidyverse)
library(tidytext)

# tibble() replaces the deprecated data_frame(); read_file() comes from readr
map_df(all_txts, ~ tibble(txt = read_file(.x)) %>%
        mutate(filename = basename(.x)) %>%
        unnest_tokens(word, txt))
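To show the shape of the result, here is a minimal end-to-end sketch of the same pipeline. It assumes the tidyverse and tidytext packages are installed and runs on two made-up files in a temporary folder; the folder, the file contents, and the name `tidy_words` are all hypothetical.

```r
library(tidyverse)
library(tidytext)

# Hypothetical sample files, created only for this illustration.
folder <- tempfile("docs_")
dir.create(folder)
writeLines(c("First file.", "Second line."), file.path(folder, "a.txt"))
writeLines("Another file here.", file.path(folder, "b.txt"))

all_txts <- list.files(path = folder, pattern = "\\.txt$", full.names = TRUE)

# One row per word, labelled with the file it came from.
tidy_words <- map_df(all_txts, ~ tibble(txt = read_file(.x)) %>%
  mutate(filename = basename(.x)) %>%
  unnest_tokens(word, txt))

# Number of words per file.
count(tidy_words, filename)
```

By default, unnest_tokens() lowercases the words and drops punctuation, so the two sample files should yield four and three rows respectively.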

Answer 2:

Looking at your other question as well, you need to read up on text mining and on how to read in files. Your result is currently a list object. That is not a bad object in itself, but it is not what you need here. Instead of lapply, use sapply in your last line, like this:

corpus <- sapply(a , FUN = paste, collapse = " ")

This will return a character vector. Next you need to turn this into a data.frame. I added the filelist to the data.frame to keep track of which text belongs to which document.

my_data <- data.frame(files = filelist, text = corpus, stringsAsFactors = FALSE)

Then continue with tidytext:

library(tidytext)
tidy_text <- unnest_tokens(my_data, words, text)
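The difference between lapply and sapply in this answer can be checked with base R alone. The sketch below uses a hypothetical stand-in for `a` and made-up file names rather than reading real files; `as_list` and `as_vector` are names introduced only for this comparison.

```r
# Hypothetical stand-in for lapply(filelist, readLines):
# one character vector of lines per file.
a <- list(c("first line", "second line"), c("only line"))
filelist <- c("doc1.txt", "doc2.txt")  # assumed file names

as_list <- lapply(a, FUN = paste, collapse = " ")
as_vector <- sapply(a, FUN = paste, collapse = " ")

class(as_list)    # "list"
class(as_vector)  # "character"

# A character vector fits directly into a data.frame column,
# giving one row per document.
my_data <- data.frame(files = filelist, text = as_vector,
                      stringsAsFactors = FALSE)
dim(my_data)
```

If a guaranteed return type is preferred, vapply(a, paste, character(1), collapse = " ") is a stricter alternative to sapply.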

Using the tm and tidytext packages

If you use the tm package, you can read everything in like this:

library(tm)
folder <- getwd() # <-- here goes your folder

# pattern is a regular expression, so match a literal ".txt" at the end
corpus <- VCorpus(DirSource(directory = folder,
                            pattern = "\\.txt$"))

You can then turn the corpus into tidy text like this:

library(tidytext)
tidy_corpus <- tidy(corpus)
tidy_text <- unnest_tokens(tidy_corpus, words, text)
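As a rough check of the tm route, the following sketch runs it on two made-up files in a temporary folder. It assumes tm and tidytext are installed; the folder and file contents are hypothetical. The `id` column produced by tidy() should hold each document's file name.

```r
library(tm)
library(tidytext)

# Hypothetical sample files, created only for this illustration.
folder <- tempfile("docs_")
dir.create(folder)
writeLines("The first document.", file.path(folder, "a.txt"))
writeLines("A second document here.", file.path(folder, "b.txt"))

corpus <- VCorpus(DirSource(directory = folder, pattern = "\\.txt$"))

# One row per document, with metadata columns plus the text column.
tidy_corpus <- tidy(corpus)

# One row per word; keep only the document id and the word.
tidy_text <- unnest_tokens(tidy_corpus, words, text)
tidy_text[, c("id", "words")]
```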

Source: https://stackoverflow.com/questions/54850258/creating-corpus-from-multiple-txt-files

Tags

tidytext