Question
Another question about PDF parsing. I have just read section 5.3.1, "Text-Positioning Operators", of the PDF Reference, version 1.7, and I am a little confused.
I wrote some code to get the transformation matrix and the initial text position:
CGPDFOperatorTableSetCallback (table, "MP", &op_MP);//Define marked-content point
CGPDFOperatorTableSetCallback (table, "DP", &op_DP);//Define marked-content point with property list
CGPDFOperatorTableSetCallback (table, "BMC", &op_BMC);//Begin marked-content sequence
CGPDFOperatorTableSetCallback (table, "BDC", &op_BDC);//Begin marked-content sequence with property list
CGPDFOperatorTableSetCallback (table, "EMC", &op_EMC);//End marked-content sequence
//Text State operators
CGPDFOperatorTableSetCallback(table, "Tc", &op_Tc);
CGPDFOperatorTableSetCallback(table, "Tw", &op_Tw);
CGPDFOperatorTableSetCallback(table, "Tz", &op_Tz);
CGPDFOperatorTableSetCallback(table, "TL", &op_TL);
CGPDFOperatorTableSetCallback(table, "Tf", &op_Tf);
CGPDFOperatorTableSetCallback(table, "Tr", &op_Tr);
CGPDFOperatorTableSetCallback(table, "Ts", &op_Ts);
//text showing operators
CGPDFOperatorTableSetCallback(table, "TJ", &op_TJ);
CGPDFOperatorTableSetCallback(table, "Tj", &op_Tj);
CGPDFOperatorTableSetCallback(table, "'", &op_apostrof);
CGPDFOperatorTableSetCallback(table, "\"", &op_double_apostrof);
//text positioning operators
CGPDFOperatorTableSetCallback(table, "Td", &op_Td);
CGPDFOperatorTableSetCallback(table, "TD", &op_TD);
CGPDFOperatorTableSetCallback(table, "Tm", &op_Tm);
CGPDFOperatorTableSetCallback(table, "T*", &op_T);
//text object operators
CGPDFOperatorTableSetCallback(table, "BT", &op_BT);//Begin text object
CGPDFOperatorTableSetCallback(table, "ET", &op_ET);//End text object
This is the output after launching the application:
2010-09-02 15:09:23.041 testSearch[8251:207] op_BT begin
Integer value: 0
2010-09-02 15:09:23.043 testSearch[8251:207] op_BT end
2010-09-02 15:09:23.043 testSearch[8251:207] op_Tf begin
Integer value: 1
2010-09-02 15:09:23.044 testSearch[8251:207] op_Tf end
2010-09-02 15:09:23.044 testSearch[8251:207] op_Tm begin
Float value: 557.364197
2010-09-02 15:09:23.045 testSearch[8251:207] op_Tm end
2010-09-02 15:09:23.045 testSearch[8251:207] op_TJ begin
2010-09-02 15:09:23.046 testSearch[8251:207] Array string value [0]: F
2010-09-02 15:09:23.046 testSearch[8251:207] Array integer value [1]: 94985208
2010-09-02 15:09:23.047 testSearch[8251:207] Array string value [2]: r
2010-09-02 15:09:23.047 testSearch[8251:207] Array integer value [3]: 94985208
2010-09-02 15:09:23.048 testSearch[8251:207] Array string value [4]: o
2010-09-02 15:09:23.048 testSearch[8251:207] Array integer value [5]: 94985208
2010-09-02 15:09:23.049 testSearch[8251:207] Array string value [6]: m s
2010-09-02 15:09:23.049 testSearch[8251:207] Array integer value [7]: 94985208
2010-09-02 15:09:23.049 testSearch[8251:207] Array string value [8]: a
2010-09-02 15:09:23.050 testSearch[8251:207] Array integer value [9]: 94985208
2010-09-02 15:09:23.050 testSearch[8251:207] Array string value [10]: m
2010-09-02 15:09:23.051 testSearch[8251:207] Array integer value [11]: 94985208
2010-09-02 15:09:23.051 testSearch[8251:207] Array string value [12]: p
2010-09-02 15:09:23.052 testSearch[8251:207] Array integer value [13]: 94985208
2010-09-02 15:09:23.053 testSearch[8251:207] Array string value [14]: l
2010-09-02 15:09:23.054 testSearch[8251:207] Array integer value [15]: 94985208
2010-09-02 15:09:23.055 testSearch[8251:207] Array string value [16]: e t
2010-09-02 15:09:23.055 testSearch[8251:207] Array integer value [17]: 94985208
2010-09-02 15:09:23.057 testSearch[8251:207] Array string value [18]: o r
2010-09-02 15:09:23.057 testSearch[8251:207] Array integer value [19]: 94985208
2010-09-02 15:09:23.058 testSearch[8251:207] Array string value [20]: e
2010-09-02 15:09:23.058 testSearch[8251:207] Array integer value [21]: 94985208
2010-09-02 15:09:23.059 testSearch[8251:207] Array string value [22]: s
2010-09-02 15:09:23.059 testSearch[8251:207] Array integer value [23]: 94985208
2010-09-02 15:09:23.060 testSearch[8251:207] Array string value [24]: u
2010-09-02 15:09:23.061 testSearch[8251:207] Array integer value [25]: 94985208
2010-09-02 15:09:23.061 testSearch[8251:207] Array string value [26]: l
2010-09-02 15:09:23.062 testSearch[8251:207] Array integer value [27]: 94985208
2010-09-02 15:09:23.062 testSearch[8251:207] Array string value [28]: t
2010-09-02 15:09:23.063 testSearch[8251:207] op_TJ end
If anyone is familiar with the text matrix and the text-positioning operators, it would be great if you could explain how they all work.
How do I calculate the position of the text (or of a glyph?) using Tm (the transformation matrix) and the other data?
Answer 1:
@Koteg: Hi! Did you finally manage to get it working? For Tm, I'm able to get all six values, but so far I can't see how to get the position of a word within a line. I have an idea: for Tj, take the spacing between letters (hoping it is the same every time) and combine it with Tm to get the position of a word. For TJ it is more complicated: each numeric element of the array is a horizontal translation that has to be subtracted from the Tm position, and searching for a word in that array is harder than for Tj.
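To make that idea concrete, here is a minimal sketch in plain C of how the text matrix could be advanced, based on my reading of the text-space formula in the PDF Reference: tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th. The `Matrix` and `TextState` types and the three function names are made up for illustration and are not part of Quartz. The glyph width w0 has to come from the font's width table, which this sketch does not read.

```c
/* PDF matrix [a b c d e f], in the same order as the six Tm operands. */
typedef struct { double a, b, c, d, e, f; } Matrix;

/* Text state values used by the formula (hypothetical container). */
typedef struct {
    double font_size;   /* Tfs, second operand of Tf */
    double char_space;  /* Tc */
    double word_space;  /* Tw, applied only to the single-byte space, code 32 */
    double h_scale;     /* Th, the Tz operand divided by 100 */
} TextState;

/* Translate the text matrix by tx in text space: Tm' = [1 0 0 1 tx 0] x Tm. */
void advance_text_matrix(Matrix *tm, double tx)
{
    tm->e += tx * tm->a;
    tm->f += tx * tm->b;
}

/* Horizontal advance for one glyph.
   glyph_width is w0 in text-space units (font width units divided by 1000). */
double glyph_advance(const TextState *ts, double glyph_width, int is_space)
{
    double tw = is_space ? ts->word_space : 0.0;
    return (glyph_width * ts->font_size + ts->char_space + tw) * ts->h_scale;
}

/* Horizontal advance for a numeric TJ element.
   The number is in thousandths of a text-space unit and is subtracted. */
double tj_adjustment(const TextState *ts, double adjustment)
{
    return (-adjustment / 1000.0) * ts->font_size * ts->h_scale;
}
```

For example, with Tm = [1 0 0 1 100 700] and a 12-point font, a glyph with an assumed width of 0.5 moves the text origin from x = 100 to x = 106, and a following TJ adjustment of -250 moves it on to x = 109. The origin (e, f) is still in text space, so it has to be multiplied by the current transformation matrix to get a position on the page.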
BTW, for anyone else, here is how I walk the array:
// A TJ array mixes strings and numeric adjustments in any order,
// so inspect every element instead of assuming strict alternation.
size_t count = CGPDFArrayGetCount(array);
for (size_t n = 0; n < count; n++)
{
    CGPDFStringRef string;
    CGPDFReal real;
    if (CGPDFArrayGetString(array, n, &string))
    {
        // A piece of the text to show.
        NSString *data = (NSString *)CGPDFStringCopyTextString(string);
        NSLog(@"array data : %@", data);
        [searcher.currentData appendFormat:@"%@", data];
        [data release];
    }
    else if (CGPDFArrayGetNumber(array, n, &real))
    {
        // A position adjustment, in thousandths of a text-space unit.
        NSLog(@"array real : %f", real);
    }
}
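Searching for a word is harder with TJ because the word can be split across several string elements. One possible approach, sketched below in plain C, is to join the strings while recording the x offset at which each character starts, then search the joined text. The `TJElement` type and `find_word_offset` are hypothetical names, not Quartz API. The sketch also assumes that every glyph has the same width and that Tc = Tw = 0 and Th = 1, which is not true for real fonts; real code would look up each glyph's width.

```c
#include <stddef.h>
#include <string.h>

/* One TJ array element: either a string or a numeric adjustment. */
typedef struct {
    const char *text;   /* NULL for a numeric element */
    double adjustment;  /* thousandths of a text-space unit; used when text is NULL */
} TJElement;

/*
 * Returns the text-space x offset, relative to the start of the TJ array,
 * at which `word` begins, or -1.0 if the word is empty or not found.
 * ASSUMPTION: all glyphs share one width `glyph_width` (in thousandths).
 */
double find_word_offset(const TJElement *elems, size_t count,
                        double font_size, double glyph_width,
                        const char *word)
{
    char joined[256];     /* all string elements concatenated */
    double offsets[256];  /* x offset of each character in joined */
    size_t len = 0;
    double x = 0.0;

    if (word[0] == '\0')
        return -1.0;

    for (size_t i = 0; i < count; i++) {
        if (elems[i].text == NULL) {
            /* Numeric element: subtract the scaled adjustment. */
            x -= elems[i].adjustment / 1000.0 * font_size;
            continue;
        }
        for (const char *p = elems[i].text;
             *p != '\0' && len < sizeof joined - 1; p++) {
            joined[len] = *p;
            offsets[len] = x;
            len++;
            x += glyph_width / 1000.0 * font_size;
        }
    }
    joined[len] = '\0';

    const char *hit = strstr(joined, word);
    return hit ? offsets[hit - joined] : -1.0;
}
```

The returned offset is a distance along the text line, so it still has to be applied to Tm (and then the current transformation matrix) to get a page position.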
Thanks
Source: https://stackoverflow.com/questions/3627745/getting-text-position-while-parsing-pdf-with-quartz-2d