Question
I have a folder with more than 2,000 RTF documents. I want to import them into R (preferably into a data frame that can be used in combination with the tidytext package). In addition, I need an extra column holding the filename so that I can link the content of each RTF document to its file (later, I will also have to extract information from the filename and save it into separate columns of my data set).
I came across a solution by Jens Leerssen that I tried to adapt to my requirements:
require(textreadr)
read_plus <- function(flnm) {
  read_rtf(flnm) %>%
    mutate(filename = flnm)
}
tbl_with_sources <-
  list.files(path = "./data", pattern = "*.rtf",
             full.names = TRUE) %>%
  map_df(~read_plus(.))
However, I get the following error message:
Error in UseMethod("mutate_") : no applicable method for 'mutate_' applied to an object of class "character"
Can anyone tell me why this error occurs or propose another solution to my problem?
Answer 1:
I finally solved the problem with a workaround.
1) I converted the *.rtf files to *.txt files using the textutil command in the macOS terminal:
find . -name \*.rtf -print0 | xargs -0 textutil -convert txt
This also strips the formatting.
2) I then used the read_plus function by Jens Leerssen. However, I now use read.delim instead of read_rtf and set two options (stringsAsFactors and quote) to get rid of warnings and/or errors:
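library(dplyr)  # provides %>% and mutate()
# Read one text file (one row per line) and record which file it came from
read_plus <- function(flnm) {
  read.delim(flnm, header = FALSE, stringsAsFactors = FALSE, quote = "") %>%
    mutate(filename = flnm)
}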
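The reason this version works, while the read_rtf version in the question fails, appears to be the return type: the error message says mutate was applied to an object of class "character", whereas read.delim returns a data frame. A minimal check of that, assuming dplyr is installed and using a temporary sample file created only for illustration:

```r
library(dplyr)

read_plus <- function(flnm) {
  read.delim(flnm, header = FALSE, stringsAsFactors = FALSE, quote = "") %>%
    mutate(filename = flnm)
}

# Hypothetical sample file, written to a temporary location for this check only
tmp <- tempfile(fileext = ".txt")
writeLines(c("First paragraph.", "Second paragraph."), tmp)

res <- read_plus(tmp)

class(res)  # a data frame, so mutate() has a method for it
names(res)  # the text column V1 plus the added filename column
```

Each line of the text file becomes one row, which is why the text column is later renamed to paragraph.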
3) Finally, I read in all the *.txt files and renamed the column V1 at the end:
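library(purrr)  # provides map_df()
# pattern is a regular expression, so match the literal ".txt" extension
df <- list.files(path = "./data", pattern = "\\.txt$",
                 full.names = TRUE) %>%
  map_df(~read_plus(.)) %>%
  rename(paragraph = V1)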
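If converting the files first is not an option, it may also be possible to stay with read_rtf by wrapping its character output in a data frame before adding the filename. The following is only a sketch under assumptions: read_with_source, read_folder and the reader parameter are hypothetical names introduced here, and the default reader assumes that textreadr is installed and that read_rtf returns a character vector, as the error message in the question suggests.

```r
library(dplyr)
library(purrr)

# `reader` is a hypothetical parameter: any function that takes a file path
# and returns a character vector (assumed here: textreadr::read_rtf)
read_with_source <- function(flnm, reader = textreadr::read_rtf) {
  # Build the data frame explicitly instead of calling mutate() on a vector
  tibble(paragraph = reader(flnm), filename = flnm)
}

# Apply the reader to every matching file and stack the results row-wise
read_folder <- function(path, pattern = "\\.rtf$",
                        reader = textreadr::read_rtf) {
  list.files(path = path, pattern = pattern, full.names = TRUE) %>%
    map_df(~ read_with_source(.x, reader = reader))
}

# Usage, assuming the RTF files live in ./data:
# df <- read_folder("./data")
```

Because the reader is passed in as an argument, the same helper can be tried on plain text files with reader = readLines before pointing it at the RTF folder.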
Source: https://stackoverflow.com/questions/50002129/read-multiple-rtf-files-in-r