How to find word coordinates using CGPDFScanner in a PDF page on iPhone?

被撕碎了的回忆 2021-01-26 02:50

I am parsing a PDF page using CGPDFScanner, but I am not able to find the coordinates of the search result.

In the void Tm1(CGPDFScannerRef scanner, void *inf

1 Answer
  • 2021-01-26 03:51

    You're drastically underestimating the complexity of converting PDF to text. I made that mistake as well, and it took months to write a text extraction engine that works with most PDFs. My code is commercial, but just to give you an idea:

    Td, TD, Tm, and T* all change where text lands (the strings themselves arrive through Tj, TJ, ' and "), so you have to handle every one of them. d0 and d1 matter too: they belong to Type3 fonts, which are less common, but Microsoft Word really likes them. Objects inside XObjects can also contain text, recursively. You also need to parse the fonts, since many PDFs attach CMaps to fonts that translate "random numbers" to the character (or characters, since PDF can have ligatures as well). Beware: XObjects might also contain fonts, and it's critical to parse them in the right order, since fonts can have parent fonts.
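
To make the coordinate bookkeeping concrete, here is a minimal sketch in plain C of how the text positioning operators could be tracked. It is not part of the CGPDFScanner API: `Matrix`, `TextState`, and the `op_*` functions are hypothetical names, and in a real scanner they would presumably be called from callbacks registered with `CGPDFOperatorTableSetCallback`, with the operands popped via `CGPDFScannerPopNumber`. The sketch deliberately ignores glyph advances after a string is shown, font size, text rise, and horizontal scaling, so it only yields the origin of the next string, not the box around a word.

```c
#include <assert.h>
#include <stddef.h>

/* PDF matrix [a b c d e f], applied to row vectors: [x y 1] * M. */
typedef struct { double a, b, c, d, e, f; } Matrix;

/* Hypothetical bookkeeping struct, not a CoreGraphics type. */
typedef struct {
    Matrix tm;       /* text matrix */
    Matrix tlm;      /* text line matrix (start of the current line) */
    double leading;  /* TL parameter, survives across BT/ET */
} TextState;

static const Matrix kIdentity = {1, 0, 0, 1, 0, 0};

/* Returns m * n, i.e. m is applied first, then n. */
static Matrix matrix_concat(Matrix m, Matrix n) {
    Matrix r;
    r.a = m.a * n.a + m.b * n.c;
    r.b = m.a * n.b + m.b * n.d;
    r.c = m.c * n.a + m.d * n.c;
    r.d = m.c * n.b + m.d * n.d;
    r.e = m.e * n.a + m.f * n.c + n.e;
    r.f = m.e * n.b + m.f * n.d + n.f;
    return r;
}

void text_state_init(TextState *ts) {
    assert(ts != NULL);
    ts->tm = ts->tlm = kIdentity;
    ts->leading = 0.0;
}

/* BT: both matrices start out as identity in every text object. */
void op_BT(TextState *ts) { ts->tm = ts->tlm = kIdentity; }

/* a b c d e f Tm: replace (not concatenate) both matrices. */
void op_Tm(TextState *ts, double a, double b, double c,
           double d, double e, double f) {
    Matrix m = {a, b, c, d, e, f};
    ts->tm = ts->tlm = m;
}

/* tx ty Td: move to the next line, offset from the current line start. */
void op_Td(TextState *ts, double tx, double ty) {
    Matrix t = {1, 0, 0, 1, tx, ty};
    ts->tlm = matrix_concat(t, ts->tlm);
    ts->tm = ts->tlm;
}

/* tx ty TD: like Td, but also sets the leading to -ty. */
void op_TD(TextState *ts, double tx, double ty) {
    ts->leading = -ty;
    op_Td(ts, tx, ty);
}

/* T*: same as "0 -leading Td". */
void op_Tstar(TextState *ts) { op_Td(ts, 0.0, -ts->leading); }

/* User-space origin of the next glyph: text-space (0,0) through Tm * CTM. */
void text_origin(const TextState *ts, Matrix ctm, double *x, double *y) {
    Matrix m = matrix_concat(ts->tm, ctm);
    *x = m.e;
    *y = m.f;
}
```

For example, after `BT`, `1 0 0 1 72 720 Tm`, `0 -14 TD`, `T*`, and `10 0 Td`, calling `text_origin` with an identity CTM reports (82, 692). The coordinates of a search hit would then be this origin plus the accumulated widths of the glyphs shown before it, which requires the font metrics mentioned above.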

    Adobe's ToUnicode PDF gives you some idea of how to start, but be warned: the spec is very incomplete. There's a bit more in the official PDF reference, but you will still find documents that should not work (according to the spec) but still DO work (when you try them in Adobe Acrobat).
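
As a rough illustration of what a ToUnicode mapping does once it has been parsed, the sketch below decodes a string of 2-byte character codes through a small lookup table. Everything in it is an assumption for illustration: `CodeMapping`, `lookup_code`, and `decode_string` are hypothetical names, the table contents are made up, and real fonts may use 1-byte or variable-length codes and range entries that this sketch does not handle. Note how a single code can expand to several characters, which is the ligature case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One parsed mapping entry (hypothetical struct, not a CoreGraphics type). */
typedef struct {
    unsigned code;     /* character code as it appears in the content stream */
    const char *utf8;  /* replacement text, possibly several characters */
} CodeMapping;

/* Returns the mapped text, or NULL when the code is not in the table. */
const char *lookup_code(const CodeMapping *map, size_t count, unsigned code) {
    for (size_t i = 0; i < count; i++) {
        if (map[i].code == code) return map[i].utf8;
    }
    return NULL;
}

/* Decodes 2-byte big-endian codes into a NUL-terminated UTF-8 string.
 * Unmapped codes become U+FFFD. Returns the number of bytes written
 * (excluding the NUL), or -1 if the output buffer is too small. */
int decode_string(const CodeMapping *map, size_t count,
                  const unsigned char *bytes, size_t len,
                  char *out, size_t out_size) {
    size_t used = 0;
    assert(map != NULL && out != NULL);
    for (size_t i = 0; i + 1 < len; i += 2) {
        unsigned code = ((unsigned)bytes[i] << 8) | bytes[i + 1];
        const char *text = lookup_code(map, count, code);
        if (text == NULL) text = "\xEF\xBF\xBD";  /* U+FFFD in UTF-8 */
        size_t n = strlen(text);
        if (used + n + 1 > out_size) return -1;   /* keep room for the NUL */
        memcpy(out + used, text, n);
        used += n;
    }
    if (used + 1 > out_size) return -1;
    out[used] = '\0';
    return (int)used;
}
```

With a table mapping code 1 to "o", code 2 to "ffi", and code 3 to "ce", the three codes 1, 2, 3 decode to "office": three glyphs on the page, six characters in the extracted text. This mismatch is one reason a search hit cannot be located by counting characters alone.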
