Identify and extract tables from a PDF using Java

Question

I have different types of PDFs that contain a mix of content such as text and tables. A table may appear anywhere on a page (top, middle, or bottom). I want to extract only the table data (number of columns, number of rows, and the cell contents) from such a PDF using Java, without passing in the table's location.

What I have tried so far:

1. I used the iText Java API to read and extract the content, with the following call:

PdfTextExtractor.getTextFromPage

but it only returns the data as plain text. I found no way to identify where a table exists in the PDF or how to extract the data from that table.

2. I also tried the PDFBox Java API, but it didn't solve my problem either.

3. I also followed this Stack Overflow question: "PDF table extraction". It does not give me the expected output, though, because that algorithm needs the line positions and so on.

I am not able to work out where the table is located in the PDF.

Can anybody tell me how to solve this problem using the iText or PDFBox API, or is there an open-source API that can help?

Or can we convert the PDF to HTML, so that we can identify the table by its table tags and read it ;)?

Answer 1:

You can try Tabula, an open-source tool that detects and extracts tables from PDF documents. You can build on tabula-java to extract the table details. More can be found here.

If you are also looking to extract the text from the document, you can use PDFBox or Apache Tika for that.

Answer 2:

It basically depends on your input document, and how much effort you're willing to put into this project.

A PDF does not work like an HTML document. In HTML documents you have logical tags like "table" or "paragraph". A PDF document (in the most basic case) contains only the instructions needed to render the document. So instead of getting "table" you might get "draw a line here, and another one a bit further away, and then another one that crosses both, and so on".

Also, according to the PDF specification, these instructions don't even have to appear in logical (reading) order.
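Because the drawing instructions can arrive in any order, extracted text usually has to be re-sorted by position before it reads sensibly. The following is a minimal sketch of that idea in plain Java, not taken from any library: `TextChunk`, `ReadingOrderSketch`, and the `LINE_TOLERANCE` value are hypothetical names and assumptions chosen for illustration, and the coordinates are assumed to be in PDF user space (origin at the bottom-left).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical holder for a piece of text and the position where it starts. */
final class TextChunk {
    final String text;
    final double x;
    final double y;

    TextChunk(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class ReadingOrderSketch {
    /** Assumed tolerance: chunks within this vertical distance share a line. */
    static final double LINE_TOLERANCE = 2.0;

    /** Groups chunks into lines (top to bottom) and orders each line left to right. */
    static List<String> toLines(List<TextChunk> chunks) {
        List<TextChunk> byY = new ArrayList<>(chunks);
        // Larger y is higher on the page, so sort descending to read top-down.
        byY.sort(Comparator.comparingDouble((TextChunk c) -> c.y).reversed());

        List<List<TextChunk>> lines = new ArrayList<>();
        for (TextChunk c : byY) {
            List<TextChunk> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - c.y) <= LINE_TOLERANCE) {
                last.add(c); // close enough to the current line's first chunk
            } else {
                List<TextChunk> line = new ArrayList<>();
                line.add(c);
                lines.add(line);
            }
        }

        List<String> result = new ArrayList<>();
        for (List<TextChunk> line : lines) {
            line.sort(Comparator.comparingDouble((TextChunk c) -> c.x));
            StringBuilder sb = new StringBuilder();
            for (TextChunk c : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(c.text);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

The chunks themselves would have to come from whatever parser you use; this sketch only covers the re-ordering step, and a fixed tolerance will not suit every font size.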

If you are lucky, your input PDF might be a tagged PDF. Tagged PDFs contain an internal representation of the underlying structure of the document. A tagged PDF might be able to tell you exactly which objects in the document make up the table.

Now, to get back to an actual answer. If you want a solution that always works, you can implement the iText 7 IEventListener interface. It has a method eventOccurred() that gets called every time the parser has finished dealing with an object (like a piece of text, a line, etc.).

If you then watch for lines, and build some heuristic to determine when a collection of lines constitutes a table, you should be able to detect tables.
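To make the idea of a line-based heuristic concrete, here is a deliberately simple sketch in plain Java. It assumes the line segments have already been collected elsewhere (for example by a listener like the one described above) and only counts distinct horizontal and vertical ruling positions. `Segment`, `GridGuess`, `TableHeuristicSketch`, and the `EPS` tolerance are all hypothetical names and values, not part of iText or any other library.

```java
import java.util.List;
import java.util.Optional;
import java.util.TreeSet;

/** Hypothetical straight line segment in page coordinates. */
final class Segment {
    final double x1, y1, x2, y2;

    Segment(double x1, double y1, double x2, double y2) {
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }
}

/** Hypothetical result: the guessed number of rows and columns. */
final class GridGuess {
    final int rows;
    final int columns;

    GridGuess(int rows, int columns) {
        this.rows = rows;
        this.columns = columns;
    }
}

final class TableHeuristicSketch {
    /** Assumed tolerance for "same position" and for "axis-aligned". */
    static final double EPS = 1.0;

    /** Returns a guess when the segments look like a ruled grid, otherwise empty. */
    static Optional<GridGuess> detect(List<Segment> segments) {
        TreeSet<Long> rowRulings = new TreeSet<>();    // snapped y of horizontal lines
        TreeSet<Long> columnRulings = new TreeSet<>(); // snapped x of vertical lines

        for (Segment s : segments) {
            boolean flat = Math.abs(s.y1 - s.y2) < EPS;
            boolean upright = Math.abs(s.x1 - s.x2) < EPS;
            if (flat && !upright) {
                rowRulings.add(Math.round(s.y1 / EPS));
            } else if (upright && !flat) {
                columnRulings.add(Math.round(s.x1 / EPS));
            }
            // Diagonal and zero-length segments are ignored.
        }

        // A grid with r rows and c columns needs r + 1 horizontal
        // and c + 1 vertical rulings, so at least two of each.
        if (rowRulings.size() < 2 || columnRulings.size() < 2) {
            return Optional.empty();
        }
        return Optional.of(new GridGuess(rowRulings.size() - 1, columnRulings.size() - 1));
    }
}
```

This sketch does not check that the lines actually cross, that they belong to the same region of the page, or that cells are merged, and it finds nothing for tables drawn without rulings. A usable heuristic would need at least those checks, plus a step that assigns text to the detected cells.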

iText also plans to release a pdf2Data add-on, which will basically do the heavy lifting for you.

Source: https://stackoverflow.com/questions/43138481/identify-and-extract-table-from-pdf-using-java

Tags

java

pdf

itext

pdfbox

java-api