问题
I use Pdfplumber to extract the table on page 2, section 3 (normally). But it only works on some pdf, others do not work. For failed pdf files, it seems like Pdfplumber read the button table instead of the table I want.
How can I get the table? link of the pdf which doesn't work: pdfA
link of the pdf which works: pdfB
Here is my code:
import pdfplumber
pdf = pdfplumber.open("/Users/chueckingmok/Desktop/selenium/Shell Omala 68.pdf")
page = pdf.pages[1]
table=page.extract_table()
import pandas as pd
df = pd.DataFrame(table[1:], columns=table[0])
df
and the result is
But the table I want in page 2 is
However, this code works for pdfB (which I mentioned above).
Btw, the table I want in each pdf is in section 3.
Anyone can help?
Many thanks Joan
回答1:
Hey Here is the proper solution for that problem but first please read some of my points below
- Well, you used pdfplumber for table extraction but i think you should have read about settings of tables, there are so many settings of table when you read them according to your need you surely find your answers from there. PdfPlumber API - for Table Extraction is Here
- As of now i give perfect solution for your problem in below, but first check documentation of pdfplumber API properly you can surely find all your answers from there, and i am sure that in future you don't need to ask question regarding table extraction using pdfplumber because you will surely find all your solution from there regarding table extraction and also other things like text extraction, word extraction, etc.
- For better understanding of the tables settings you can also use Visual Debugging, this is very best feature of pdfplumber for knowing what exactly table settings does with table and how it extract the tables using table settings.Visual Debugging of Tables
Below Is the solution of your problem,
import pandas as pd
import pdfplumber
pdf = pdfplumber.open("GSAP_msds_01259319.pdf")
p1 = pdf.pages[1]
table = p1.extract_table(table_settings={"vertical_strategy": "lines",
"horizontal_strategy": "text",
"snap_tolerance": 4,})
df = pd.DataFrame(table[1:], columns=table[0])
df
See the output of the Above Code
来源:https://stackoverflow.com/questions/63000396/pdfplumber-cannot-recognise-table-python