Extracting text data from PDF files

[愿得一人] 2020-12-02 11:24

Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?

7 answers
  • 2020-12-02 11:42

    I used an external utility for the conversion and called it from R. Every file had a leading table containing the desired information.

    Set the path to pdftotext.exe and convert each PDF to text:

    library(stringr)  # provides str_sub()

    # pdfFracList (PDF file names) and reportDir (their folder) must already be defined
    exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"

    for (i in seq_along(pdfFracList)) {
        # Strip the ".pdf" extension to get the base file name
        fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)
        pdfSource <- paste0(reportDir, "/", fileNumber, ".pdf")
        txtDestination <- paste0(reportDir, "/", fileNumber, ".txt")
        print(paste0("File number ", i, ", Processing file ", pdfSource))
        # -table asks pdftotext to keep the table layout; shQuote() guards paths with spaces
        system2(exeFile, args = c("-table", shQuote(pdfSource), shQuote(txtDestination)), wait = TRUE)
    }
    
  • 2020-12-02 11:43

    A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install it, upload the PDF, and select the table you want converted to data. Not a direct solution in R, but certainly better than manual labor.

  • 2020-12-02 11:47

    A purely R solution could be:

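    library('tm')
    file <- 'namefile.pdf'
    # Build a PDF reader that passes "-layout" to the text-extraction engine
    Rpdf <- readPDF(control = list(text = "-layout"))
    # Read the file into a one-document corpus
    corpus <- VCorpus(URISource(file), 
          readerControl = list(reader = Rpdf))
    # Pull the text out of the first (and only) document
    corpus.array <- content(content(corpus)[[1]])

    Then you'll have the PDF's lines in a character vector.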

  • 2020-12-02 11:51

    The Tabula PDF table extractor app is built around a command-line application, tabula-extractor, which is distributed as a Java JAR package.

    The R tabulizer package provides an R wrapper around it: you pass in the path to a PDF file and get back the data extracted from its tables.

    Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.

    Data can be extracted from multiple pages, and a different area can be specified for each page, if required.

    For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
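
    As a rough sketch of the workflow described above, the helper below wraps tabulizer's extract_tables(). It assumes tabulizer and a working Java setup are installed; the helper name extract_report_tables, the file name "report.pdf", and the area coordinates are hypothetical placeholders, not taken from the thread.

```r
# Sketch only: assumes the tabulizer package (and Java) is installed.
# extract_report_tables is a hypothetical helper name.
extract_report_tables <- function(pdf_path, pages = NULL, areas = NULL) {
  if (!requireNamespace("tabulizer", quietly = TRUE)) {
    stop("Package 'tabulizer' is required for this sketch.")
  }
  if (is.null(areas)) {
    # No target area given: let Tabula guess where the tables are
    tabulizer::extract_tables(pdf_path, pages = pages, guess = TRUE)
  } else {
    # areas: a list of c(top, left, bottom, right) coordinate vectors,
    # one per requested page; guessing is switched off
    tabulizer::extract_tables(pdf_path, pages = pages, area = areas,
                              guess = FALSE)
  }
}

# Hypothetical usage: a different region on each of pages 1 and 2
# tables <- extract_report_tables(
#   "report.pdf",
#   pages = c(1, 2),
#   areas = list(c(100, 50, 400, 550), c(80, 50, 300, 550))
# )
```

    Each element of the returned list should correspond to one extracted table, so it is worth inspecting the result page by page before combining anything.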

  • 2020-12-02 11:55

    This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.

  • 2020-12-02 11:55
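    install.packages("pdftools")
    library(pdftools)
    
    # Download in binary mode so the PDF is not corrupted on Windows
    download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf", 
                  "56901.DEN.Gamebook", mode = "wb")
    
    # pdf_text() returns a character vector with one element per page
    txt <- pdf_text("56901.DEN.Gamebook")
    cat(txt[1])  # print the first page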
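
    Since pdf_text() returns each page as one long string, a common next step is to break a page into lines. The helper below is a minimal base-R sketch of that step; the name page_lines and the sample string are made up for illustration and are not part of pdftools.

```r
# Split one page of extracted text into non-empty lines.
# page_lines is a hypothetical helper, not a pdftools function.
page_lines <- function(page_text) {
  # Accept both Unix (\n) and Windows (\r\n) line endings
  lines <- strsplit(page_text, "\r?\n")[[1]]
  # Drop trailing whitespace but keep leading spaces, which carry layout
  lines <- trimws(lines, which = "right")
  # Discard blank lines
  lines[nzchar(lines)]
}

# Made-up sample standing in for one element of pdf_text() output
sample_page <- "Header line\n\nName   Value\nalpha  1  \n"
page_lines(sample_page)
```

    With real output you would call page_lines(txt[1]) for the first page, or lapply(txt, page_lines) to process every page.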
    