Headers are not getting extracted from PDF while extracting the table data from PDF using camelot

问题

I am using camelot for table data extraction, however header are not getting extracted as part of the PDF.

Attaching the target PDF link below and target table are at page number 3 and 4, which need to extracted.

https://drive.google.com/file/d/1xniTIwpnNIdA_k4xvEARlVH97Lk-K2Yr/view?usp=sharing

One of the tables looks like below

I have seen the the camelot documentation and I think the problem is related to the "Detect short lines"

https://camelot-py.readthedocs.io/en/master/user/advanced.html#detect-short-lines

However not able to resolve the problem by tweaking the line_size_scaling parameter.

Please assist.

回答1:

I plotted the detected table boundary on page 3 using $ camelot -p 3 lattice -plot contour 007.pdf. Looks like Camelot isn't including the header row in the detected table boundary [bug 1] (see image below). Then I tried using the table_areas keyword argument with flavor='lattice' but then it didn't include the lines in the the specified table boundary [bug 2]. I've added these on the issue tracker as #200 and #201.

You can still use the table_areas keyword argument with flavor='stream' to get the table out.

Using CLI: $ camelot -p 3 --output 007.csv --format csv stream -T 60,770,520,400 007.pdf

Using API: tables = camelot.read_pdf('007.pdf', pages='3', flavor='stream', table_areas=['60,770,520,400'])

You can find the table boundary coordinates using the steps described here: https://camelot-py.readthedocs.io/en/master/user/advanced.html#visual-debugging

Hope that helps!

来源：https://stackoverflow.com/questions/53203779/headers-are-not-getting-extracted-from-pdf-while-extracting-the-table-data-from

标签

pdf-scraping

python-camelot