Use R to convert PDF files to text files for text mining

Yes, not really an R question as IShouldBuyABoat notes, but something that R can do with only minor contortions...

Use R to convert PDF files to txt files...

# folder with 1000s of PDFs
dest <- "C:\\Users\\Desktop"

# make a vector of PDF file names
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

# convert each PDF file that is named in the vector into a text file 
# text file is created in the same directory as the PDFs
# note that my pdftotext.exe is in a different location to yours
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"', 
             paste0('"', i, '"')), wait = FALSE) )

Extract only abstracts from txt files...

# if you just want the abstracts, we can use regex to extract that part of
# each txt file; this assumes that the abstract always sits between the words
# 'Abstract' and 'Introduction'
mytxtfiles <- list.files(path = dest, pattern = "txt",  full.names = TRUE)
abstracts <- lapply(mytxtfiles, function(i) {
  j <- paste0(scan(i, what = character()), collapse = " ")
  regmatches(j, gregexpr("(?<=Abstract).*?(?=Introduction)", j, perl=TRUE))
})
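
Not every paper will match the 'Abstract' ... 'Introduction' pattern (scanned PDFs, unusual headings), so some elements of abstracts can come back empty. A small optional check, assuming you just want to flag those files for manual inspection:

# each element of abstracts is a one-element list holding the character vector of matches
got_abstract <- sapply(abstracts, function(x) length(x[[1]]) > 0)
sum(!got_abstract)         # how many txt files had no extractable abstract
mytxtfiles[!got_abstract]  # which files to look at by hand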

Write abstracts into separate txt files...

# write abstracts as txt files 
# (or use them in the list for whatever you want to do next)
lapply(seq_along(abstracts), function(i) {
  write.table(abstracts[i], file = paste(mytxtfiles[i], "abstract", "txt", sep = "."),
              quote = FALSE, row.names = FALSE, col.names = FALSE, eol = " ")
})

And now you're ready to do some text mining on the abstracts.
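
For example, here is a minimal sketch of a starting point with the tm package (my assumption about which text-mining toolkit you use); it reads the *.abstract.txt files written above and builds a document-term matrix:

library(tm)
# the abstract files created above all contain "abstract" in their names
corp <- VCorpus(DirSource(dest, pattern = "abstract"))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeWords, stopwords("english"))
dtm  <- DocumentTermMatrix(corp)
findFreqTerms(dtm, lowfreq = 10)  # terms appearing at least 10 times across the abstracts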

We can also use the pdftools library.

library(pdftools)
# you can use an url or a path
pdf_url <- "https://cran.r-project.org/web/packages/pdftools/pdftools.pdf"

# `pdf_text` returns a character vector with one element per page
list_output <- pdftools::pdf_text(pdf_url)

# you get one element per page
length(list_output) # 5 elements for a 5-page pdf

# let's print the 5th
cat(list_output[[5]])
# Index
# pdf_attachments (pdf_info), 2
# pdf_convert (pdf_render_page), 3
# pdf_fonts (pdf_info), 2
# pdf_info, 2, 3
# pdf_render_page, 2, 3
# pdf_text, 2
# pdf_text (pdf_info), 2
# pdf_toc (pdf_info), 2
# pdftools (pdf_info), 2
# poppler_config (pdf_render_page), 3
# render (pdf_render_page), 3
# suppressMessages, 2
# 5
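
If you only need metadata rather than the full text, pdftools also provides pdf_info() and pdf_toc() (both appear in the index above):

info <- pdftools::pdf_info(pdf_url)
info$pages                   # 5 for this pdf
pdftools::pdf_toc(pdf_url)   # nested list with the table of contents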

To extract abstracts from articles, the OP chooses to extract the content between 'Abstract' and 'Introduction'.

We'll take a list of CRAN pdfs and extract the author(s) as the text between Author and Maintainer (I handpicked a few that had a compatible format).

For this we loop over our URL list, extract the content of each PDF, collapse all of its pages into a single string, and then extract the relevant info with a regex.

urls <- c(pdftools = "https://cran.r-project.org/web/packages/pdftools/pdftools.pdf",
          Rcpp     = "https://cran.r-project.org/web/packages/Rcpp/Rcpp.pdf",
          jpeg     = "https://cran.r-project.org/web/packages/jpeg/jpeg.pdf")

lapply(urls,function(url){
  list_output <- pdftools::pdf_text(url)
  text_output <- gsub('(\\s|\r|\n)+',' ',paste(unlist(list_output),collapse=" "))
  trimws(regmatches(text_output, gregexpr("(?<=Author).*?(?=Maintainer)", text_output, perl=TRUE))[[1]][1])
})

# $pdftools
# [1] "Jeroen Ooms"
# 
# $Rcpp
# [1] "Dirk Eddelbuettel, Romain Francois, JJ Allaire, Kevin Ushey, Qiang Kou, Nathan Russell, Douglas Bates and John Chambers"
# 
# $jpeg
# [1] "Simon Urbanek <Simon.Urbanek@r-project.org>"