Question
I’ve recently gotten into scraping (and programming in general) for my internship, and I came across PDF scraping. Every time I try to read a scanned PDF with R, I can never get it to work. I’ve tried using the file.choose() function to no avail. Do I need to change my directory, or how can I get the PDF from my files into R?
The code looks something like this:
> library(pdftools)
> text=pdf_text("C:/Users/myname/Documents/renewalscan.pdf")
> text
[1] ""
Also, using pdftables leads me here:
> library(pdftables)
> convert_pdf("C:/Users/myname/Documents/renewalscan.pdf","my.csv")
Error in get_content(input_file, format, api_key) :
Bad Request (HTTP 400).
Answer 1:
You should use the packages pdftools and pdftables.
If you are trying to read the text inside the PDF, use the pdf_text() function. Its argument is the path to the PDF, either on your computer or on the web. For example:
tt = pdf_text("C:/Users/Smith/Documents/my_file.pdf")
It would be nice if you were more specific and also gave us a reproducible example.
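Since the question mentions a scanned PDF, the empty string returned by pdf_text() is most likely because the file contains only page images and no text layer. The following is a minimal sketch of one way to handle that, not a definitive implementation: it reads the text layer first and falls back to pdftools::pdf_ocr_text(), which needs the tesseract package installed. The helper names page_is_blank and read_pdf_pages are hypothetical and used only for illustration.

```r
# TRUE for each page whose extracted text is empty or whitespace only.
page_is_blank <- function(pages) {
  !nzchar(trimws(pages))
}

# Hypothetical helper: read the text layer of a PDF and fall back to OCR
# when every page comes back blank (typical for scanned documents).
read_pdf_pages <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  pages <- pdftools::pdf_text(path)  # one character string per page
  if (length(pages) > 0 && all(page_is_blank(pages))) {
    message("No text layer found; trying OCR (requires the tesseract package).")
    pages <- pdftools::pdf_ocr_text(path)
  }
  pages
}
```

With the file from the question, this would be called as read_pdf_pages("C:/Users/myname/Documents/renewalscan.pdf"). A full path like this works regardless of the working directory, so changing the directory should not be necessary.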
Answer 2:
To use the PDFTables R package, you need to run the following command, supplying your own API key:
convert_pdf('test/index.pdf', output_file = NULL, format = "xlsx-single", message = TRUE, api_key = "insert_API_key")
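As a hedged sketch of how the key might be supplied without hard-coding it in a script, the wrapper below reads it from an environment variable and stops early if it is missing. The variable name PDFTABLES_API_KEY and the helper names are assumptions made for illustration; only convert_pdf() and its output_file, format, and api_key arguments come from the command above. Whether a missing key explains the HTTP 400 in the question is also an assumption.

```r
# Hypothetical helper: fetch the API key from an environment variable
# (the variable name is an assumption, not something pdftables mandates).
get_pdftables_key <- function(var = "PDFTABLES_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("No API key found in environment variable ", var)
  }
  key
}

# Hypothetical wrapper: check the input file and the key, then convert.
convert_with_key <- function(input, output, format = "csv") {
  if (!file.exists(input)) {
    stop("File not found: ", input)
  }
  pdftables::convert_pdf(
    input,
    output_file = output,
    format = format,
    api_key = get_pdftables_key()
  )
}
```

For the file in the question, the call would look like convert_with_key("C:/Users/myname/Documents/renewalscan.pdf", "my.csv") after setting the key with Sys.setenv(PDFTABLES_API_KEY = "insert_API_key").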
Answer 3:
If you are looking to get tabular data, you might try tabulizer. Here is a full code tutorial: https://www.business-science.io/code-tools/2019/09/23/tabulizer-pdf-scraping.html
Basically, you can use this code from the tutorial:
library(tabulizer)
extract_tables(
  file = "2019-09-23-tabulizer/endangered_species.pdf",
  method = "decide",
  output = "data.frame"
)
Source: https://stackoverflow.com/questions/50749759/how-to-scrape-a-downloaded-pdf-file-with-r