Question
I am using Camelot for table data extraction, but the table headers are not being extracted from the PDF.
The target PDF is linked below; the tables that need to be extracted are on pages 3 and 4.
https://drive.google.com/file/d/1xniTIwpnNIdA_k4xvEARlVH97Lk-K2Yr/view?usp=sharing
One of the tables looks like the one below.
I have read the Camelot documentation and I think the problem is related to the "Detect short lines" section:
https://camelot-py.readthedocs.io/en/master/user/advanced.html#detect-short-lines
However, I have not been able to resolve the problem by tweaking the line_size_scaling parameter.
Please assist.
Answer 1:
I plotted the detected table boundary on page 3 using $ camelot -p 3 lattice -plot contour 007.pdf. It looks like Camelot isn't including the header row in the detected table boundary [bug 1] (see image below). I then tried the table_areas keyword argument with flavor='lattice', but it didn't include the lines in the specified table boundary [bug 2]. I've added both to the issue tracker as #200 and #201.
You can still use the table_areas keyword argument with flavor='stream' to get the table out.
Using CLI: $ camelot -p 3 --output 007.csv --format csv stream -T 60,770,520,400 007.pdf
Using API: tables = camelot.read_pdf('007.pdf', pages='3', flavor='stream', table_areas=['60,770,520,400'])
You can find the table boundary coordinates using the steps described here: https://camelot-py.readthedocs.io/en/master/user/advanced.html#visual-debugging
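Once you have read the coordinates off the plot, the API call above can be wrapped as in the rough sketch below. The helper names table_area and extract_with_area are made up for illustration and are not part of Camelot. The sketch assumes the documented table_areas convention: an "x1,y1,x2,y2" string where (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner in PDF coordinate space, whose origin is at the bottom-left of the page.

```python
def table_area(x1, y1, x2, y2):
    """Format a bounding box as the "x1,y1,x2,y2" string used by table_areas.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner.
    PDF coordinates start at the bottom-left of the page, so the top edge
    has the larger y value. This helper is hypothetical, not Camelot API.
    """
    if x1 >= x2:
        raise ValueError("x1 (left) must be smaller than x2 (right)")
    if y1 <= y2:
        raise ValueError("y1 (top) must be larger than y2 (bottom)")
    return f"{x1},{y1},{x2},{y2}"


def extract_with_area(pdf_path, page, area):
    """Run the stream-flavor workaround and return one DataFrame per table."""
    # Imported lazily so table_area() can be used without camelot installed.
    import camelot

    tables = camelot.read_pdf(
        pdf_path,
        pages=str(page),
        flavor="stream",
        table_areas=[area],
    )
    return [table.df for table in tables]


if __name__ == "__main__":
    # Coordinates taken from the answer above; adjust them for other pages.
    area = table_area(60, 770, 520, 400)
    for df in extract_with_area("007.pdf", 3, area):
        print(df.head())
```

Checking the corner order up front is meant to catch the common mistake of passing the bottom-left corner first, which would describe an empty region.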
Hope that helps!
Source: https://stackoverflow.com/questions/53203779/headers-are-not-getting-extracted-from-pdf-while-extracting-the-table-data-from