I am working on a Scrapy spider, trying to extract the text from multiple PDF files in a directory, using slate. I have no interest in saving the actual PDF to disk, and so I'v
When you call in_memory_pdf.read(response.body), you are passing the data where read() expects the number of bytes to read. You want to initialize the buffer with the data, not read from it.
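To make the difference concrete, here is a minimal sketch. The name fake_body and its contents are made up for illustration and stand in for response.body; the behaviour shown is that of Python 3's io.BytesIO.

```python
from io import BytesIO

# Hypothetical stand-in for response.body; not a real PDF.
fake_body = b"%PDF-1.4 example bytes"

# read() takes a byte count, so handing it the data fails in Python 3.
raised = False
try:
    BytesIO().read(fake_body)
except TypeError:
    raised = True

# An empty buffer has nothing to give back, whatever count you ask for.
empty_result = BytesIO().read(4)

# Passing the data to the constructor initializes the buffer.
in_memory_pdf = BytesIO(fake_body)
header = in_memory_pdf.read(4)  # the first four bytes
rest = in_memory_pdf.read()     # everything after the current position
```

After this runs, header and rest together make up the original bytes, which is what a PDF parser reading from in_memory_pdf would see.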
In Python 2, just initialize BytesIO with the body:
in_memory_pdf = BytesIO(response.body)
In Python 3, you cannot initialize BytesIO with a string, because it expects bytes. The error message shows that response.body is of type str, so we have to encode it first:
in_memory_pdf = BytesIO(bytes(response.body, 'ascii'))
But since a PDF can contain binary data, I suppose that response.body would be bytes, not str. In that case, the simple in_memory_pdf = BytesIO(response.body) works.
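If you are not sure which type you will get, a small helper can cover both cases. This is only a sketch for Python 3: body_to_buffer is a hypothetical name, not part of Scrapy or slate, and the 'ascii' default simply mirrors the encoding used above, so it will raise UnicodeEncodeError on a str containing non-ASCII characters.

```python
from io import BytesIO


def body_to_buffer(body, encoding="ascii"):
    """Wrap a response body in a BytesIO, encoding it first if it is text.

    Hypothetical helper for illustration; the encoding default is an
    assumption carried over from the answer, not a recommendation.
    """
    if isinstance(body, str):
        # str must be turned into bytes before BytesIO will accept it.
        body = body.encode(encoding)
    return BytesIO(body)


# Both calls produce a buffer holding the same bytes.
from_bytes = body_to_buffer(b"%PDF-1.4").read()
from_text = body_to_buffer("%PDF-1.4").read()
```

In the spider you would then write in_memory_pdf = body_to_buffer(response.body) and pass in_memory_pdf to whatever expects a file-like object.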