I have nearly one thousand PDF journal articles in a folder. I need to text mine the abstracts of all the articles in that folder. Now I am doing the following:
We can use the pdftools library:
library(pdftools)
# you can use a URL or a local path
pdf_url <- "https://cran.r-project.org/web/packages/pdftools/pdftools.pdf"
# `pdf_text` returns a character vector with one element per page
list_output <- pdftools::pdf_text(pdf_url)
# you get an element by page
length(list_output) # 5 elements for a 5 page pdf
# let's print the 5th
cat(list_output[[5]])
# Index
# pdf_attachments (pdf_info), 2
# pdf_convert (pdf_render_page), 3
# pdf_fonts (pdf_info), 2
# pdf_info, 2, 3
# pdf_render_page, 2, 3
# pdf_text, 2
# pdf_text (pdf_info), 2
# pdf_toc (pdf_info), 2
# pdftools (pdf_info), 2
# poppler_config (pdf_render_page), 3
# render (pdf_render_page), 3
# suppressMessages, 2
# 5
To extract abstracts from articles, the OP chooses to extract the content between Abstract and Introduction.
We'll take a list of CRAN PDFs and extract the author(s) as the text between Author and Maintainer (I handpicked a few that had a compatible format).
For this we loop over our URL list, extract the content, collapse all the page texts into one string for each PDF, and then extract the relevant info with a regex.
urls <- c(pdftools = "https://cran.r-project.org/web/packages/pdftools/pdftools.pdf",
          Rcpp = "https://cran.r-project.org/web/packages/Rcpp/Rcpp.pdf",
          jpeg = "https://cran.r-project.org/web/packages/jpeg/jpeg.pdf")
lapply(urls, function(url) {
  # one string per page
  list_output <- pdftools::pdf_text(url)
  # collapse all pages into one string and squeeze runs of whitespace
  text_output <- gsub('(\\s|\r|\n)+', ' ', paste(unlist(list_output), collapse = " "))
  # keep the first match between "Author" and "Maintainer"
  trimws(regmatches(text_output, gregexpr("(?<=Author).*?(?=Maintainer)", text_output, perl = TRUE))[[1]][1])
})
# $pdftools
# [1] "Jeroen Ooms"
#
# $Rcpp
# [1] "Dirk Eddelbuettel, Romain Francois, JJ Allaire, Kevin Ushey, Qiang Kou, Nathan Russell, Douglas Bates and John Chambers"
#
# $jpeg
# [1] "Simon Urbanek <Simon.Urbanek@r-project.org>"
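The same idea could be carried over to the OP's case, a folder of local PDFs where the abstract sits between Abstract and Introduction. What follows is only a sketch under assumptions: `extract_between()` and `extract_abstracts()` are hypothetical helper names, not part of pdftools, and real articles will not always contain both marker words.

```r
# Hypothetical helper: return the first chunk of text found between two
# marker words, or NA if no such chunk exists.
extract_between <- function(text, start, end) {
  pattern <- paste0("(?<=", start, ").*?(?=", end, ")")
  hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  if (length(hits) == 0) NA_character_ else trimws(hits[1])
}

# Hypothetical wrapper: read every PDF in `folder` and pull out the abstract.
# Requires the pdftools package; it is only touched when this function is called.
extract_abstracts <- function(folder) {
  files <- list.files(folder, pattern = "\\.pdf$", full.names = TRUE,
                      ignore.case = TRUE)
  names(files) <- basename(files)
  lapply(files, function(f) {
    pages <- pdftools::pdf_text(f)
    # collapse pages and squeeze all whitespace into single spaces
    text <- gsub("\\s+", " ", paste(pages, collapse = " "))
    extract_between(text, "Abstract", "Introduction")
  })
}

# Quick check on a plain string, no PDF needed
sample_text <- "Title Abstract We study things. Introduction Things are..."
extract_between(sample_text, "Abstract", "Introduction")
# [1] "We study things."
```

Calling `extract_abstracts()` with the path to the article folder should give a named list with one entry per file, with NA for any file where the markers were not found.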
Yes, not really an R question as IShouldBuyABoat notes, but something that R can do with only minor contortions...
Use R to convert PDF files to txt files...
# folder with 1000s of PDFs
dest <- "C:\\Users\\Desktop"
# make a vector of PDF file names
# (the anchored pattern only matches names ending in .pdf)
myfiles <- list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE)
# convert each PDF file that is named in the vector into a text file
# text file is created in the same directory as the PDFs
# note that your pdftotext.exe may be in a different location to mine
# wait = FALSE starts the conversions without waiting for them to finish,
# so make sure they are done before reading the txt files
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"',
                                         paste0('"', i, '"')), wait = FALSE))
Extract only abstracts from txt files...
# if you just want the abstracts, we can use regex to extract that part of
# each txt file. Assumes that the abstract is always between the words 'Abstract'
# and 'Introduction'
# (the anchored pattern only matches names ending in .txt)
mytxtfiles <- list.files(path = dest, pattern = "\\.txt$", full.names = TRUE)
abstracts <- lapply(mytxtfiles, function(i) {
  # read the file as whitespace-separated words and rejoin them into one string
  j <- paste0(scan(i, what = character()), collapse = " ")
  regmatches(j, gregexpr("(?<=Abstract).*?(?=Introduction)", j, perl = TRUE))
})
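The regex step assumes both marker words are present. Where they are not, regmatches() returns an empty match for that file, and where the words occur more than once it returns several matches. Here is a small sketch of how the nested `abstracts` list might be flattened and the misses located. It uses a made-up stand-in list rather than real files, and `first_match` is a hypothetical name.

```r
# Stand-in for the `abstracts` list: one element per file, each holding
# the list returned by regmatches()
abstracts <- list(
  list(" First abstract. "),
  list(character(0)),                                  # markers not found
  list(" Third abstract. ", " a second, spurious match ")
)

# keep the first match per file, NA where nothing matched
first_match <- vapply(abstracts, function(a) {
  hits <- unlist(a)
  if (length(hits) == 0) NA_character_ else trimws(hits[1])
}, character(1))

# positions of the files that need a manual look
which(is.na(first_match))
# [1] 2
```

Indexing the vector of txt file names with that result would show which articles do not fit the Abstract ... Introduction layout.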
Write abstracts into separate txt files...
# write abstracts as txt files
# (or use them in the list for whatever you want to do next)
lapply(seq_along(abstracts), function(i) {
  write.table(abstracts[i], file = paste(mytxtfiles[i], "abstract", "txt", sep = "."),
              quote = FALSE, row.names = FALSE, col.names = FALSE, eol = " ")
})
And now you're ready to do some text mining on the abstracts.