PoDoFo Extract text + coords from a pdf

Submitted by 柔情痞子 on 2019-12-09 13:56:19

Question


I have been trying for a while to use the PoDoFo C++ library to extract text and lines (with their respective coordinates) from a PDF, but I have not found a way to do this.

This is what I have so far:

#include <iostream>
#include <stdio.h>
#include <vector>
#include <podofo/podofo.h>
using namespace PoDoFo;
using namespace std;

int main( int argc, char* argv[] )
{
    const char* filename = "hello.pdf";
    PdfVecObjects *x = new PdfVecObjects();
    PdfParser parser(x, filename);
    parser.ParseFile("hello.pdf");

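    for (TIVecObjects obj = x->begin(); obj != x->end(); ++obj){
        // Dereference the iterator instead of calling RemoveObject(), which
        // would erase from the vector while it is being iterated.
        PdfObject * a = *obj;
        // THIS IS MY PROBLEM VVVVVVVVVV
        cout << a->Reference().ToString() << endl;
    }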

    return 0;
}

However, this only gives me very basic information (it seems to be the object number):

DEBUG: Size=12
DEBUG: Reading numbers: 0 12
DEBUG: Reading XRef Section: 0 with 12 Objects.
DEBUG: Size=12
DEBUG: Reading numbers: 0 12
DEBUG: Reading XRef Section: 0 with 12 Objects.
1 0 R
2 0 R
3 0 R
4 0 R
5 0 R
6 0 R
7 0 R
8 0 R
9 0 R
10 0 R
11 0 R

I want to print out the coordinates of each object and whether it is a line or text. If it is text, I would also like to print out the text itself. Does anyone who knows this library better than I do know how I could do this?


Answer 1:


This answer will show you how to extract the text.

To get text positioning information, you will also have to process the following content stream operators: Tc, Tw, Tz, TL, T*, Tr and Tm.
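As a rough illustration of what "processing" these operators means, here is a minimal sketch that keeps a text state and updates it as operators arrive. It does not use PoDoFo at all: the names TextState, applyOperator and runContentStream are hypothetical, and the input is assumed to be an already decompressed content stream made only of numbers and operators separated by whitespace (no strings, arrays or names), which a real page will not satisfy.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for the text state parameters named above.
struct TextState {
    double charSpacing = 0.0;     // Tc
    double wordSpacing = 0.0;     // Tw
    double horizScaling = 100.0;  // Tz, in percent
    double leading = 0.0;         // TL
    int renderMode = 0;           // Tr
    std::array<double, 6> tm{{1.0, 0.0, 0.0, 1.0, 0.0, 0.0}};   // text matrix [a b c d e f]
    std::array<double, 6> tlm{{1.0, 0.0, 0.0, 1.0, 0.0, 0.0}};  // text line matrix
};

// Applies one operator, taking its operands from the end of the stack.
// Returns false for operators this sketch does not handle.
bool applyOperator(const std::string& op, std::vector<double>& operands, TextState& ts) {
    const std::array<double, 6> identity{{1.0, 0.0, 0.0, 1.0, 0.0, 0.0}};
    auto need = [&](std::size_t n) { return operands.size() >= n; };
    bool handled = true;
    if (op == "BT") {                       // begin text: reset both matrices
        ts.tm = identity;
        ts.tlm = identity;
    } else if (op == "Tc" && need(1)) {
        ts.charSpacing = operands.back();
    } else if (op == "Tw" && need(1)) {
        ts.wordSpacing = operands.back();
    } else if (op == "Tz" && need(1)) {
        ts.horizScaling = operands.back();
    } else if (op == "TL" && need(1)) {
        ts.leading = operands.back();
    } else if (op == "Tr" && need(1)) {
        ts.renderMode = static_cast<int>(operands.back());
    } else if (op == "Tm" && need(6)) {     // set text matrix and line matrix
        std::copy(operands.end() - 6, operands.end(), ts.tm.begin());
        ts.tlm = ts.tm;
    } else if (op == "T*") {                // next line: translate by (0, -TL) in text space
        ts.tlm[4] -= ts.leading * ts.tlm[2];
        ts.tlm[5] -= ts.leading * ts.tlm[3];
        ts.tm = ts.tlm;
    } else {
        handled = false;
    }
    operands.clear();
    return handled;
}

// Splits on whitespace; numeric tokens become operands, everything else
// is treated as an operator.
TextState runContentStream(const std::string& stream) {
    TextState ts;
    std::vector<double> operands;
    std::istringstream in(stream);
    std::string tok;
    while (in >> tok) {
        char* end = nullptr;
        const double value = std::strtod(tok.c_str(), &end);
        if (end != tok.c_str() && *end == '\0') {
            operands.push_back(value);
        } else {
            applyOperator(tok, operands, ts);
        }
    }
    return ts;
}
```

Calling runContentStream on a string such as "BT 14 TL 1 0 0 1 72 720 Tm T* ET" leaves the text matrix translated one line (14 units) below its starting position. Other positioning operators (Td, TD, ', ") and the text-showing operators would need the same kind of treatment in a complete extractor.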

You definitely need to download the PDF spec from Adobe to get all the details. There is a chapter devoted entirely to text processing. It is well worth your time to print out that chapter as you will be referring to it a lot. Everything you need to know is in there, but it's not always obvious.

You will also need to use a bit of linear algebra. Nothing too complicated, though.
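The linear algebra in question is mostly multiplying 3x3 affine matrices, which PDF writes as six numbers [a b c d e f]. The sketch below is independent of PoDoFo; Matrix, Point, multiply, transform and textOrigin are illustrative names of my own, and it assumes you have already tracked the text matrix (Tm) and the current transformation matrix (CTM) yourself.

```cpp
#include <array>

// PDF transformation matrix [a b c d e f], standing for
//   | a b 0 |
//   | c d 0 |
//   | e f 1 |
using Matrix = std::array<double, 6>;

struct Point {
    double x;
    double y;
};

// Returns m1 x m2 in the row-vector convention used by PDF:
// a point is transformed by m1 first, then by m2.
Matrix multiply(const Matrix& m1, const Matrix& m2) {
    return Matrix{{
        m1[0] * m2[0] + m1[1] * m2[2],
        m1[0] * m2[1] + m1[1] * m2[3],
        m1[2] * m2[0] + m1[3] * m2[2],
        m1[2] * m2[1] + m1[3] * m2[3],
        m1[4] * m2[0] + m1[5] * m2[2] + m2[4],
        m1[4] * m2[1] + m1[5] * m2[3] + m2[5]
    }};
}

// Computes [x' y' 1] = [x y 1] x m.
Point transform(const Matrix& m, Point p) {
    return Point{p.x * m[0] + p.y * m[2] + m[4],
                 p.x * m[1] + p.y * m[3] + m[5]};
}

// Position of the text-space origin after applying Tm and then the CTM.
Point textOrigin(const Matrix& tm, const Matrix& ctm) {
    return transform(multiply(tm, ctm), Point{0.0, 0.0});
}
```

For example, a text matrix of [1 0 0 1 72 720] under a CTM of [2 0 0 2 10 20] puts the text origin at (154, 1460). The order of the multiplication matters: swapping tm and ctm gives a different result whenever either matrix scales or rotates.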

Since there are many ways to achieve the same results, it is important to implement all the commands thoroughly, even if the documents you are going to process might not seem to need certain features. For example: I ran across a document which set all text sizes to one point, which threw off all my calculations until I realized it was using the text scaling factor to set the actual font sizes.
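To make that pitfall concrete, here is a small sketch of how an effective glyph size could be derived instead of trusting the font size operand alone. I am assuming the scaling arrives through the text matrix (Tm) and the horizontal scaling percentage (Tz); the anecdote above does not say which mechanism that document used, and EffectiveSize and effectiveSize are hypothetical names.

```cpp
#include <array>
#include <cmath>

using Matrix = std::array<double, 6>;  // [a b c d e f]

struct EffectiveSize {
    double vertical;    // scaled height of one text-space em
    double horizontal;  // scaled width of one text-space em
};

// fontSize is the size operand of Tf, horizScalingPercent the operand of Tz,
// and tm the current text matrix. The CTM is ignored in this sketch.
EffectiveSize effectiveSize(double fontSize, double horizScalingPercent, const Matrix& tm) {
    const double th = horizScalingPercent / 100.0;
    // Lengths of the text-space vectors (fontSize * th, 0) and (0, fontSize)
    // after the linear part of tm has been applied.
    const double horizontal = fontSize * th * std::hypot(tm[0], tm[1]);
    const double vertical = fontSize * std::hypot(tm[2], tm[3]);
    return EffectiveSize{vertical, horizontal};
}
```

With a font size of 1 and a text matrix of [12 0 0 12 72 720], this reports an effective size of 12 in both directions, which is the kind of document that breaks calculations based on the font size operand alone.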




Answer 2:


Use the PoDoFo tool "podofotxtextract" (in the tools folder of the PoDoFo package). It extracts text from a PDF and gives you the x,y coordinates.



Source: https://stackoverflow.com/questions/11455081/podofo-extract-text-coords-from-a-pdf
