How to find word coordinates using CGPDFScanner in a PDF page on iPhone?

被撕碎了的回忆

I am parsing a PDF page using CGPDFScanner, but I am not able to find the coordinates of the search result.

In the void Tm1(CGPDFScannerRef scanner, void *inf

1 answer

灰色年华

2021-01-26 03:51

You're drastically underestimating the complexity of converting PDF to text. I made that mistake as well, and it took months to write a text extraction engine that works with most PDFs. My code is commercial, but just to give you an idea:

Td, TD, Tm, and T* all move the text position, so you have to track every one of them to know where a string lands (the strings themselves are drawn by Tj, TJ, ' and "). d0 and d1 belong to Type3 fonts, which are less common, but Microsoft Word really likes them. Text can also sit in any object inside XObjects (also recursively). But you also need to parse the fonts, since many PDFs have CMaps attached to fonts that translate "random numbers" to the character (or characters, since PDF can have ligatures as well). Beware, XObjects might also contain fonts, and it's critical to parse them in the right order, since fonts can have parent fonts.
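To make the positioning part concrete, here is a minimal sketch in plain C (no Core Graphics) of how Td, TD, Tm and T* update the text matrix as the PDF reference describes it. The names TextState, op_Td and so on are my own, not part of any Apple API, and the sketch deliberately ignores the CTM, font size, glyph advances and the TL operator, all of which a real extractor also needs.

```c
#include <assert.h>
#include <stddef.h>

/* A PDF matrix [a b c d e f]; (e, f) is the translation part. */
typedef struct { double a, b, c, d, e, f; } Matrix;

/* Hypothetical text state: only the pieces needed for positioning. */
typedef struct {
    Matrix tm;       /* text matrix */
    Matrix tlm;      /* text line matrix */
    double leading;  /* set by TD (and by TL, not modelled here) */
} TextState;

static const Matrix kIdentity = {1, 0, 0, 1, 0, 0};

void text_state_init(TextState *ts) {
    assert(ts != NULL);
    ts->tm = kIdentity;
    ts->tlm = kIdentity;
    ts->leading = 0.0;
}

/* BT: start a text object, resetting both matrices to identity. */
void op_BT(TextState *ts) {
    ts->tm = kIdentity;
    ts->tlm = kIdentity;
}

/* Tm: set the text matrix and text line matrix directly. */
void op_Tm(TextState *ts, double a, double b, double c,
           double d, double e, double f) {
    Matrix m = {a, b, c, d, e, f};
    ts->tm = m;
    ts->tlm = m;
}

/* Td: move to the next line, offset (tx, ty) from the start of the
   current line: Tlm = [1 0 0 1 tx ty] x Tlm, then Tm = Tlm. */
void op_Td(TextState *ts, double tx, double ty) {
    Matrix *m = &ts->tlm;
    m->e += tx * m->a + ty * m->c;
    m->f += tx * m->b + ty * m->d;
    ts->tm = ts->tlm;
}

/* TD: same as Td, but also sets the leading to -ty. */
void op_TD(TextState *ts, double tx, double ty) {
    ts->leading = -ty;
    op_Td(ts, tx, ty);
}

/* T*: move to the start of the next line, i.e. "0 -leading Td". */
void op_Tstar(TextState *ts) {
    op_Td(ts, 0.0, -ts->leading);
}
```

In a CGPDFScanner callback you would pop the operands with CGPDFScannerPopNumber (they come off the stack in reverse order) and pass them to functions like these; after each call, (tm.e, tm.f) is the origin of the next string in text space, before the CTM is applied.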

Adobe's ToUnicode PDF gives you some idea of how to start, but just a warning: the spec is very incomplete. There's a bit more in the official PDF reference, but you will still find documents that should not work (when looking at the spec) but still DO work (when you try them in Adobe Acrobat).
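As a rough illustration of what a ToUnicode mapping does once it has been parsed, here is a minimal sketch in plain C. The table kSampleToUnicode and the functions cmap_lookup and cmap_decode are made up for this example; a real CMap is read from the font's ToUnicode stream, supports ranges as well as single entries, and can use multi-byte codes, none of which is handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One mapping entry: a character code and the text it stands for.
   The text may be more than one character (for example a ligature). */
typedef struct {
    unsigned code;
    const char *text;
} CMapEntry;

/* Hypothetical table, as it might look after parsing a ToUnicode CMap. */
static const CMapEntry kSampleToUnicode[] = {
    {0x01, "o"}, {0x02, "f"}, {0x03, "fi"}, {0x04, "c"}, {0x05, "e"},
};

/* Return the text for a code, or NULL if the code is not mapped. */
const char *cmap_lookup(const CMapEntry *map, size_t count, unsigned code) {
    for (size_t i = 0; i < count; i++) {
        if (map[i].code == code) return map[i].text;
    }
    return NULL;
}

/* Decode n single-byte codes into out (always NUL-terminated).
   Unmapped codes become "?". Stops early if out is too small.
   Returns the number of bytes written, not counting the NUL. */
size_t cmap_decode(const CMapEntry *map, size_t count,
                   const unsigned char *codes, size_t n,
                   char *out, size_t out_size) {
    assert(out != NULL && out_size > 0);
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        const char *text = cmap_lookup(map, count, codes[i]);
        if (text == NULL) text = "?";
        size_t len = strlen(text);
        if (used + len + 1 > out_size) break;
        memcpy(out + used, text, len);
        used += len;
    }
    out[used] = '\0';
    return used;
}
```

With this table, the five codes 01 02 03 04 05 decode to the six letters of "office", because code 03 is a ligature that expands to two characters. That is why you cannot search the raw string operands directly: the bytes in the content stream are only meaningful through the font that is active at that point.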
