Extract Images and Words with coordinates and sizes from PDF


I've read a lot about PDF extraction and libraries (such as iText), but I just haven't found a solution to extract images and text (with coordinates) from a PDF.

The tas


3 Answers

闹比i

2021-01-02 16:59

Several Java libraries can do this. Have you looked at JPedal or PDFBox?

北恋

2021-01-02 17:00
If a commercial library is an option for you, you could try Amyuni PDF Creator .Net or Amyuni PDF Creator ActiveX. You can use the method IacDocument.GetObjectsInRectangle to retrieve all the "graphic objects" you are interested in, then use the ObjectType attribute to separate images from text. The library already provides an algorithm for grouping nearby text together. From the documentation:
```
IacDocument.GetObjectsInRectangle Method

The GetObjectsInRectangle method gets all the objects that are in the specified rectangle.
```
Usual disclaimer applies.
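
For readers who want to see what "grouping nearby text" can look like, here is a minimal Python sketch. It is not Amyuni's algorithm and does not use its API. It assumes each word is a plain dict with hypothetical keys `text`, `x`, `y`, `width` and `height` (top-left origin), and the function name `group_words` and the `max_gap` threshold are made up for illustration.

```python
def group_words(words, max_gap=5.0):
    """Group word boxes into runs of text that sit close together on one line."""
    lines = []
    # Pass 1: assign each word to a line whose vertical extent contains its midpoint.
    for w in sorted(words, key=lambda w: w["y"] + w["height"] / 2):
        mid = w["y"] + w["height"] / 2
        for line in lines:
            if line["top"] <= mid <= line["bottom"]:
                line["words"].append(w)
                line["top"] = min(line["top"], w["y"])
                line["bottom"] = max(line["bottom"], w["y"] + w["height"])
                break
        else:
            lines.append({"top": w["y"], "bottom": w["y"] + w["height"], "words": [w]})

    runs = []
    # Pass 2: within a line, split wherever the horizontal gap exceeds max_gap.
    for line in lines:
        row = sorted(line["words"], key=lambda w: w["x"])
        current = [row[0]]
        for prev, w in zip(row, row[1:]):
            if w["x"] - (prev["x"] + prev["width"]) <= max_gap:
                current.append(w)
            else:
                runs.append(" ".join(x["text"] for x in current))
                current = [w]
        runs.append(" ".join(x["text"] for x in current))
    return runs
```

A fixed `max_gap` is the simplest possible choice; a gap scaled by font height would probably behave better on documents that mix text sizes.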
天涯浪人

2021-01-02 17:13

Use XPDF (http://www.foolabs.com/xpdf/)

It can extract all the characters in the PDF with co-ordinates (pdftotext -bbox [sourcefile] [outputfile]) and also all the images and SVGs in the PDF.

It's open source (GPLv2) and supports a lot of additional extraction functionalities as well.
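
If you want to post-process that output, a rough Python sketch could look like the following. It assumes the `-bbox` output is XHTML containing `<page>` elements with `<word xMin=… yMin=… xMax=… yMax=…>` children, which is how the Poppler build of pdftotext writes it; other builds may differ or lack the flag entirely. The function names and dict keys are my own.

```python
import subprocess
import xml.etree.ElementTree as ET


def parse_bbox_xhtml(xhtml):
    """Turn `pdftotext -bbox` output into one dict per word."""
    root = ET.fromstring(xhtml)
    words = []
    page_no = 0
    for elem in root.iter():  # document order: each page, then its words
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XHTML namespace prefix
        if tag == "page":
            page_no += 1
        elif tag == "word":
            x0, y0 = float(elem.get("xMin")), float(elem.get("yMin"))
            x1, y1 = float(elem.get("xMax")), float(elem.get("yMax"))
            words.append({
                "page": page_no,
                "text": elem.text or "",
                "x": x0, "y": y0,
                "width": x1 - x0, "height": y1 - y0,
            })
    return words


def extract_words(pdf_path):
    # Assumption: "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-bbox", pdf_path, "-"],
        check=True, capture_output=True,
    )
    return parse_bbox_xhtml(result.stdout)
```

Calling `extract_words("input.pdf")` (file name hypothetical) would give a list of words with position and size. For the images, the same toolset ships a separate `pdfimages` command.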
