podofo

Extract text from array TJ in PDF operator using PoDoFo lib

假如想象 submitted on 2019-12-03 22:21:37
I am trying to extract text from a PDF file using the PoDoFo library. It works for the Tj operator but fails for the (array) TJ operator. I found this piece of code (with my small modification) here:

    const char* pszToken = NULL;
    PdfVariant var;
    EPdfContentsType eType;
    PdfContentsTokenizer tokenizer( pPage );
    double dCurPosX = 0.0;
    double dCurPosY = 0.0;
    double dCurFontSize = 0.0;
    bool bTextBlock = false;
    PdfFont* pCurFont = NULL;
    std::stack<PdfVariant> stack;
    while( tokenizer.ReadNext( eType, pszToken, var ) )
    {
        if( eType == ePdfContentsType_Keyword )
        {
            // support 'l' and 'm'
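For the TJ case raised in this question, it may help to recall that the operand of TJ is an array mixing strings and numbers: each string is shown as with Tj, and each number is a position adjustment in thousandths of a text-space unit that is subtracted from the current horizontal position. The following is only a minimal, library-independent sketch of that logic. The names `TJElement` and `joinTJ` and the 200-unit word-gap threshold are assumptions made for illustration, not part of PoDoFo or of the PDF specification. In a real extractor the strings would also have to be decoded through the current font's encoding, which this sketch leaves out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of one element of a TJ operand array.
struct TJElement {
    bool isString;       // true: a string to show; false: a numeric adjustment
    std::string text;    // used when isString is true
    double adjustment;   // used when isString is false (thousandths of a text-space unit)
};

// Concatenates the strings of a TJ array in order. An adjustment more
// negative than -spaceThreshold moves the next glyph far to the right, so it
// is treated as a word gap. The threshold is a heuristic chosen for this
// sketch, not a value taken from the PDF specification.
std::string joinTJ(const std::vector<TJElement>& elements,
                   double spaceThreshold = 200.0)
{
    assert(spaceThreshold > 0.0);
    std::string out;
    for (const TJElement& e : elements) {
        if (e.isString) {
            out += e.text;
        } else if (e.adjustment < -spaceThreshold &&
                   !out.empty() && out.back() != ' ') {
            out += ' ';
        }
    }
    return out;
}
```

In a tokenizer loop like the one quoted above, the array would presumably have to be taken from the operand stack when the TJ keyword is seen and walked element by element, instead of reading a single string as for Tj.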

PDF parsing in C++ (PoDoFo)

偶尔善良 submitted on 2019-11-30 10:51:53
Question: Hi, I'm trying to parse some text from some PDFs and I would like to use PoDoFo. I have tried searching for examples of how to use PoDoFo to parse a PDF, but all I can come up with is examples of how to create and write a PDF file, which is not what I really need. If anyone has a tutorial or example of parsing a PDF file with PoDoFo, or has suggestions for a different library that I can use, please let me know. Also, I know there is pdftotext on Linux, however, not only can I not use

PDF parsing in C++ (PoDoFo)

雨燕双飞 submitted on 2019-11-29 22:39:37
Hi, I'm trying to parse some text from some PDFs and I would like to use PoDoFo. I have tried searching for examples of how to use PoDoFo to parse a PDF, but all I can come up with is examples of how to create and write a PDF file, which is not what I really need. If anyone has a tutorial or example of parsing a PDF file with PoDoFo, or has suggestions for a different library that I can use, please let me know. Also, I know there is pdftotext on Linux, however, not only can I not use that, but I would much rather be able to do everything I need to internally and not rely on outside

Extract text from PDF document based on position c++

谁说胖子不能爱 submitted on 2019-11-27 02:00:16
Question: I am trying to extract text from a PDF document based on its coordinates, and I have come across two notions in the Adobe PDF Reference (chap. 5.3): text-positioning operators and text-showing operators. For now I am interested in the Td and Tm positioning operators. With Td you have tx and ty, relative to the start of the current line, which is clearly specified in a PDF document as tx ty Td. I have used this method to extract text by the tx and ty coordinates. The problem is that I don't know how
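To make the difference between the two positioning operators concrete: Td translates the text line matrix by (tx, ty) relative to the start of the current line, while Tm replaces the text matrix and text line matrix outright with its six operands. The sketch below tracks both while operators are replayed. The names `Matrix`, `TextState`, and `applyPositioning` are hypothetical, and the resulting coordinates are in text space only; mapping them to page coordinates would additionally require the current transformation matrix, which is not modeled here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// PDF matrix [a b 0; c d 0; e f 1]; (e, f) is the translation part.
struct Matrix {
    double a = 1, b = 0, c = 0, d = 1, e = 0, f = 0;
};

// Hypothetical text state: the text matrix and the text line matrix.
struct TextState {
    Matrix tm;
    Matrix tlm;
};

// Applies one text-object or positioning operator to the state.
// Operators other than BT, Td and Tm are ignored in this sketch.
void applyPositioning(TextState& st, const std::string& op,
                      const std::vector<double>& args)
{
    if (op == "BT") {
        // A new text object resets both matrices to identity.
        st.tm = Matrix();
        st.tlm = Matrix();
    } else if (op == "Td") {
        assert(args.size() == 2);
        const double tx = args[0], ty = args[1];
        // Tlm = [1 0 0; 0 1 0; tx ty 1] x Tlm, then Tm = Tlm.
        st.tlm.e += tx * st.tlm.a + ty * st.tlm.c;
        st.tlm.f += tx * st.tlm.b + ty * st.tlm.d;
        st.tm = st.tlm;
    } else if (op == "Tm") {
        assert(args.size() == 6);
        // Tm sets the matrix directly; it is not relative to the old one.
        st.tlm.a = args[0]; st.tlm.b = args[1];
        st.tlm.c = args[2]; st.tlm.d = args[3];
        st.tlm.e = args[4]; st.tlm.f = args[5];
        st.tm = st.tlm;
    }
}
```

After replaying operators up to a text-showing operator, `st.tm.e` and `st.tm.f` give the text-space origin at which that string starts, which is the value one would compare against a target position.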