Extract text from array TJ in PDF operator using PoDoFo lib

Posted by 假如想象 on 2019-12-03 22:21:37

1. The original question centered on this piece of code:

else if( strcmp( pszToken, "TJ" ) == 0 ) 
{
    PdfArray array = stack.top().GetArray();
    stack.pop();

    for( int i=0; i<static_cast<int>(array.GetSize()); i++ ) 
    {
        if( array[i].IsString() )
        {
            AddTextElement( dCurPosX, dCurPosY, pCurFont, array[i].GetString() );
        }
    }
}

and the question was:

I've noticed that array[i].IsString() never gets to be true. Is this the right way to get the text from a TJ operator?

The short answer:

Hexadecimal strings in PoDoFo PdfVariants are recognized by IsHexString() instead of IsString(). Thus, you have to test for both string flavors:

if( array[i].IsString() || array[i].IsHexString() )

The long answer:

There are two basic flavors of strings in PDF:

String objects shall be written in one of the following two ways:

  • As a sequence of literal characters enclosed in parentheses ( ) (using LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)); see 7.3.4.2, "Literal Strings."

  • As hexadecimal data enclosed in angle brackets < > (using LESS-THAN SIGN (3Ch) and GREATER-THAN SIGN (3Eh)); see 7.3.4.3, "Hexadecimal Strings."

(section 7.3.4 in ISO 32000-1)
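
To make the second flavor concrete, here is a minimal sketch of how the body of a hexadecimal string (the text between < and >) turns into bytes. It is not PoDoFo code; the function names DecodeHexStringBody, IsPdfWhiteSpace and HexNibble are made up for this illustration. It follows the rules of 7.3.4.3: white-space between digits is ignored, and a missing final digit is assumed to be 0.

```cpp
#include <stdexcept>
#include <string>

// White-space characters as listed in ISO 32000-1, Table 1.
bool IsPdfWhiteSpace( unsigned char c )
{
    return c == 0x00 || c == 0x09 || c == 0x0A ||
           c == 0x0C || c == 0x0D || c == 0x20;
}

// Value of one hexadecimal digit, or -1 if c is not a hex digit.
int HexNibble( unsigned char c )
{
    if( c >= '0' && c <= '9' ) return c - '0';
    if( c >= 'A' && c <= 'F' ) return c - 'A' + 10;
    if( c >= 'a' && c <= 'f' ) return c - 'a' + 10;
    return -1;
}

// Decode the body of a hexadecimal string (without the enclosing < >).
// Hypothetical helper for illustration; not part of PoDoFo.
std::string DecodeHexStringBody( const std::string & body )
{
    std::string bytes;
    int pending = -1; // high nibble waiting for its partner, or -1

    for( unsigned char c : body )
    {
        if( IsPdfWhiteSpace( c ) )
            continue;

        int nibble = HexNibble( c );
        if( nibble < 0 )
            throw std::invalid_argument( "not a hexadecimal digit" );

        if( pending < 0 )
        {
            pending = nibble;
        }
        else
        {
            bytes.push_back( static_cast<char>( ( pending << 4 ) | nibble ) );
            pending = -1;
        }
    }

    // Odd number of digits: the final digit is assumed to be followed by 0.
    if( pending >= 0 )
        bytes.push_back( static_cast<char>( pending << 4 ) );

    return bytes;
}
```

For example, DecodeHexStringBody( "48656C6C6F" ) yields the five bytes of "Hello". Whether those bytes are readable text depends entirely on the font's encoding, which is what the second part of this answer is about.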

PoDoFo models both flavors with the PdfString class, which during parsing is often wrapped in a PdfVariant or, more specifically, in a PdfObject.

When determining the type of the object contained in it, though, the PdfVariant differentiates between literal strings and hexadecimal strings:

/** \returns true if this variant is a string (i.e. GetDataType() == ePdfDataType_String)
 */
inline bool IsString() const { return GetDataType() == ePdfDataType_String; }

/** \returns true if this variant is a hex-string (i.e. GetDataType() == ePdfDataType_HexString)
 */
inline bool IsHexString() const { return GetDataType() == ePdfDataType_HexString; }

(PdfVariant.h)

The type of the PdfString inside a PdfVariant is determined when wrapped:

PdfVariant::PdfVariant( const PdfString & rsString )
{
    Init();
    Clear();

    m_eDataType  = rsString.IsHex() ? ePdfDataType_HexString : ePdfDataType_String;
    m_Data.pData = new PdfString( rsString );
}

(PdfVariant.cpp)

In the case of the elements of your TJ argument array, the strings in question are read as hexadecimal strings.

In your code, therefore, you have to consider both IsHexString() and IsString():

if( array[i].IsString() || array[i].IsHexString() )
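
The effect of that check can be shown without PoDoFo at hand. The following is only a toy model: ToyDataType, ToyVariant and CollectTjStrings are stand-ins invented for this sketch and are not PoDoFo classes. It mimics a variant that tags literal and hexadecimal strings differently, and a loop over a TJ array that accepts both while skipping the numeric position adjustments.

```cpp
#include <string>
#include <vector>

// Stand-in types for illustration only; these are NOT PoDoFo classes.
enum class ToyDataType { Number, String, HexString };

struct ToyVariant
{
    ToyDataType type;
    std::string text;   // string payload; unused for numbers
    double      number; // position adjustment; unused for strings

    bool IsString() const    { return type == ToyDataType::String; }
    bool IsHexString() const { return type == ToyDataType::HexString; }
};

// Collect the string operands of a TJ array in order,
// skipping the numbers between them.
std::vector<std::string> CollectTjStrings( const std::vector<ToyVariant> & array )
{
    std::vector<std::string> result;
    for( const ToyVariant & item : array )
    {
        // Testing IsString() alone would silently drop every hex string.
        if( item.IsString() || item.IsHexString() )
            result.push_back( item.text );
    }
    return result;
}
```

With an array holding a literal string, a number and a hex string, the function returns both strings; with only the IsString() test it would return just the first one, which matches the behavior described in the question when all strings are hexadecimal.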

2. After the code was revised to also check IsHexString(), the question turned to this code:

PdfString s = array[i].GetString();
_RPT1(_CRT_WARN, " : valid :%s   ", s.IsValid()?"yes":"not");
_RPT1(_CRT_WARN, " ;hex :%s   ", s.IsHex()?"yes":"not");
_RPT1(_CRT_WARN, " ;unicode: %s   ", s.IsUnicode()?"yes":"not");

PdfString unicode = pCurFont->GetEncoding()->ConvertToUnicode(s,pCurFont);
const char* szText = unicode.GetStringUtf8().c_str();
_RPT1(_CRT_WARN, " : %s\n", strlen(szText)> 0? szText: "nothing");

and the problem (as stated in comments) that

the s.GetLength() returns 2 and unicode.GetLength() returns 0, the conversion didn't work?

An analysis of the example document Document2.pdf shows that it does contain the information required for text extraction. The only font in that document which is used with hexadecimal strings is /F1, and its font dictionary does contain an appropriate /ToUnicode map for reliable text extraction.

Unfortunately, though, PoDoFo does not yet seem to use that map properly for parsing purposes. I do not see it retrieving the /ToUnicode map anywhere to make the information it contains available for text parsing. It looks like PoDoFo cannot be used to properly parse the text of documents using Type0 (aka composite) fonts.
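
For orientation, this is roughly what a /ToUnicode-based conversion has to do once the map has been read. The sketch is not PoDoFo code and makes several assumptions: every character code is exactly two bytes (big-endian), each code maps to a single UTF-16 code unit outside the surrogate range, and the table is already available as a std::map. Parsing the CMap stream itself is left out, and the names AppendUtf8 and ApplyToUnicode are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Append one code point from the Basic Multilingual Plane as UTF-8.
void AppendUtf8( std::string & out, std::uint16_t cp )
{
    if( cp < 0x80 )
    {
        out.push_back( static_cast<char>( cp ) );
    }
    else if( cp < 0x800 )
    {
        out.push_back( static_cast<char>( 0xC0 | ( cp >> 6 ) ) );
        out.push_back( static_cast<char>( 0x80 | ( cp & 0x3F ) ) );
    }
    else
    {
        out.push_back( static_cast<char>( 0xE0 | ( cp >> 12 ) ) );
        out.push_back( static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) ) );
        out.push_back( static_cast<char>( 0x80 | ( cp & 0x3F ) ) );
    }
}

// Translate the raw bytes of a string operand into UTF-8 using a
// code -> Unicode table. Assumes two-byte, big-endian character codes;
// a trailing odd byte is ignored.
std::string ApplyToUnicode( const std::string & raw,
                            const std::map<std::uint16_t, std::uint16_t> & toUnicode )
{
    std::string utf8;
    for( std::size_t i = 0; i + 1 < raw.size(); i += 2 )
    {
        std::uint16_t code = static_cast<std::uint16_t>(
            ( static_cast<unsigned char>( raw[i] ) << 8 ) |
              static_cast<unsigned char>( raw[i + 1] ) );

        auto it = toUnicode.find( code );
        // Unmapped codes become U+FFFD REPLACEMENT CHARACTER.
        std::uint16_t cp = ( it != toUnicode.end() )
                               ? it->second
                               : static_cast<std::uint16_t>( 0xFFFD );
        AppendUtf8( utf8, cp );
    }
    return utf8;
}
```

Under these assumptions, a string of length 2 like the one reported in the comments would be a single two-byte code, and the result is non-empty whenever the table has an entry for it. The table values used to try this out would have to come from the /ToUnicode stream of the font in question.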
