Question
I am using pypdf to extract text from PDF files. The problem is that the tables in the PDF files are not extracted. I have also tried pdfminer, but I am having the same issue.
Answer 1:
The problem is that tables in PDFs are generally made up of absolutely positioned lines and characters, and it is non-trivial to convert this into a sensible table representation.
In Python, PDFMiner is probably your best bet. It gives you a tree structure of layout objects, but you will have to do the table interpreting yourself by looking at the positions of lines (LTLine) and text boxes (LTTextBox). There's a little bit of documentation here.
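As a rough illustration of what "interpreting the table yourself" can mean, here is a minimal sketch that groups positioned text fragments into rows by their vertical position and orders each row left to right. It deliberately works on plain `(x0, y0, x1, y1, text)` tuples so it runs without any PDF library; the function name `group_into_rows`, the `y_tolerance` parameter, and the sample coordinates are all hypothetical, and real tables (merged cells, wrapped text, ruling lines) will need more care than this.

```python
def group_into_rows(boxes, y_tolerance=3.0):
    """Group positioned text fragments into table rows.

    boxes: iterable of (x0, y0, x1, y1, text) tuples in PDF coordinates
    (origin at the bottom-left, so a larger y is higher on the page).
    y_tolerance: fragments whose y0 is within this distance of the
    first fragment of a row are treated as part of that row (an assumption).
    """
    rows = []  # each row: {"y": reference y0, "cells": [(x0, text), ...]}
    # Walk the fragments from the top of the page to the bottom.
    for x0, y0, x1, y1, text in sorted(boxes, key=lambda b: -b[1]):
        if rows and abs(rows[-1]["y"] - y0) <= y_tolerance:
            rows[-1]["cells"].append((x0, text.strip()))
        else:
            rows.append({"y": y0, "cells": [(x0, text.strip())]})
    # Within each row, order the cells left to right by x0.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Made-up coordinates standing in for text boxes found on a page.
boxes = [
    (50, 700, 100, 712, "Name"),
    (200, 701, 250, 713, "Qty"),
    (50, 680, 100, 692, "Apple"),
    (200, 680, 250, 692, "3"),
]
table = group_into_rows(boxes)
```

With pdfminer.six, tuples like these could probably be built from the layout objects returned by `pdfminer.high_level.extract_pages`, using each text object's `bbox` attribute and `get_text()` method; the positions of `LTLine` objects could then be used to refine the column boundaries.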
Alternatively, PDFX attempts this (and often succeeds), but you have to use it as a web service (not ideal, but fine for the occasional job). To do this from Python, you could do something like the following:
import urllib.request  # Python 3 replacement for urllib2
import xml.etree.ElementTree as ET
# Make request to PDFX
with open('example.pdf', 'rb') as f:
    pdfdata = f.read()
request = urllib.request.Request('http://pdfx.cs.man.ac.uk', data=pdfdata, headers={'Content-Type': 'application/pdf'})
response = urllib.request.urlopen(request).read()
# Parse the response
tree = ET.fromstring(response)
# Each table region holds the table itself, some metadata, and a caption
for tbox in tree.findall('.//region[@class="DoCO:TableBox"]'):
    src = ET.tostring(tbox.find('content/table'))
    info = ET.tostring(tbox.find('region[@class="TableInfo"]'))
    caption = ET.tostring(tbox.find('caption'))
Source: https://stackoverflow.com/questions/17523193/pypdf-not-extracting-tables-from-pdf