Merging Tagged PDF without ruining the tags

Submitted by 故事扮演 on 2019-12-02 06:54:11

This looks like a bug in the current iText versions.

@Bruno maybe someone should look into this

PdfCopy has a method, fixTaggedStructure, which tries to repair the tag structure after it has been somewhat garbled by copying tagged pages. In all versions up to and including the current iText 5.4.6-SNAPSHOT, it contains the following code:

PdfDictionary dict = (PdfDictionary)iobj.object;
PdfIndirectReference pg = (PdfIndirectReference)dict.get(PdfName.PG);
//if pg is real page - do nothing, else set correct pg and remove first MCID if exists
if (!pageReferences.contains(pg) && !pg.equals(currPage)){
    dict.put(PdfName.PG, currPage);
    PdfArray kids = dict.getAsArray(PdfName.K);
    if (kids != null) {
        PdfObject firstKid = kids.getDirectObject(0);
        if (firstKid.isNumber()) kids.remove(0);
    }
}

Here dict is a StructElem dictionary referenced from some array. By calling pg.equals(currPage), the code implicitly assumes that dict has an entry for the key PdfName.PG. Unfortunately that entry is optional: the sample document provided by the OP contains StructElem dictionaries, referenced from some array, that have no Pg entry. For those, pg is null, and the equals call throws the NPE in question.

In this case it suffices to change the order in the equals call, i.e. instead of

if (!pageReferences.contains(pg) && !pg.equals(currPage)){

one should use

if (!pageReferences.contains(pg) && !currPage.equals(pg)){

or

if (pg != null && !pageReferences.contains(pg) && !pg.equals(currPage)){

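depending on the actual program logic here: with the first variant a dictionary without a Pg entry still enters the if block and gets currPage assigned, while with the second variant it is left untouched.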
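To make the difference between the two variants concrete, here is a minimal stand-alone sketch. It does not use iText at all: plain Strings stand in for PdfIndirectReference, and the class name PgCheckDemo and its method names are made up for illustration. It only mirrors the boolean conditions quoted above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; Strings stand in for iText's PdfIndirectReference.
class PgCheckDemo {

    // Condition as found in fixTaggedStructure: dereferences pg without a null check.
    static boolean original(Set<String> pageReferences, String pg, String currPage) {
        return !pageReferences.contains(pg) && !pg.equals(currPage);
    }

    // Variant 1: equals call swapped, so a null pg is tolerated.
    static boolean swappedEquals(Set<String> pageReferences, String pg, String currPage) {
        return !pageReferences.contains(pg) && !currPage.equals(pg);
    }

    // Variant 2: explicit null guard, so a null pg short-circuits to false.
    static boolean nullGuard(Set<String> pageReferences, String pg, String currPage) {
        return pg != null && !pageReferences.contains(pg) && !pg.equals(currPage);
    }

    public static void main(String[] args) {
        Set<String> pageReferences = new HashSet<>(Arrays.asList("page1", "page2"));
        String currPage = "page3";
        String missingPg = null; // a StructElem without a Pg entry

        try {
            original(pageReferences, missingPg, currPage);
        } catch (NullPointerException e) {
            System.out.println("original: NullPointerException");
        }
        // Variant 1 would run the fix-up block, variant 2 would skip it.
        System.out.println("swappedEquals: " + swappedEquals(pageReferences, missingPg, currPage));
        System.out.println("nullGuard: " + nullGuard(pageReferences, missingPg, currPage));
    }
}
```

For a non-null pg all three conditions agree; they only diverge when the Pg entry is missing, which is exactly the case that needs a decision from someone who knows the intended semantics.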

@Bruno Please check which variant is semantically correct; I'm not really into this tagged structure stuff after all...

The code was written in C#:

    // Requires: using System.IO; using iTextSharp.text; using iTextSharp.text.pdf;
    public static byte[] mergeTest(byte[] pdf) {
        PdfReader reader = null;

        try {
            reader = new PdfReader(pdf);

            using (MemoryStream stream = new MemoryStream()) {
                Document doc = new Document();
                PdfCopy copy = new PdfCopy(doc, stream);

                // Enable tagged mode before the document is opened.
                bool tagged = reader.IsTagged();
                if (tagged)
                    copy.SetTagged();

                doc.Open();

                // Copy every page; keep the tag structure if the source is tagged.
                for (int x = 1; x <= reader.NumberOfPages; x++) {
                    copy.AddPage(copy.GetImportedPage(reader, x, tagged));
                }

                copy.FreeReader(reader);
                doc.Close();
                copy.Close();

                // ToArray still works after the MemoryStream has been closed.
                return stream.ToArray();
            }
        } finally {
            // Exceptions are no longer swallowed, so failures such as the NPE
            // reach the caller; the reader is closed in either case.
            if (reader != null)
                reader.Close();
        }
    }