Reading PDF File Attachment Annotations with iTextSharp

Question

I have the following issue. I have a PDF with an XML file attached inside it as an annotation, not as an embedded file. I tried to read it with the code from the following link:

iTextSharp - how to open/read/extract a file attachment?

It works for embedded files but not for file attachments stored as annotations.

I googled for extracting annotations from PDFs and found the following link: Reading PDF Annotations with iText

So the annotation type is "File Attachment Annotation".

Could someone show a working example?

Thanks in advance for any help

Answer 1:

As so often in questions concerning iText and iTextSharp, one would first look at the keyword list on itextpdf.com. There you used to find the entry "File attachment, extract attachments", which referenced two Java samples from iText in Action, 2nd Edition.

The old keyword list is no more; the itextpdf.com site now offers other ways for searching examples but I won't describe them lest the site changes again and I have dead links once more...

The relevant iText examples based on iText in Action — Second Edition are:

part4.chapter16.KubrickDvds
- Java, iText 5.x
- Java, iText 7.x
- .Net, iText 5.x
part4.chapter16.KubrickDocumentary
- Java, iText 5.x
- Java, iText 7.x
- .Net, iText 5.x

(I haven't found ports of the samples to .Net and iText 7 but based on the other sources this port should not be too difficult...)

KubrickDvds contains the following method extractAttachments/ExtractAttachments to extract File Attachment Annotations:

Java, iText 5.x:

/**
 * Extracts attachments from an existing PDF.
 * PATH is a format string constant of the sample class with one %s for the file name.
 * @param src   the path to the existing PDF
 */
public void extractAttachments(String src) throws IOException {
    PdfReader reader = new PdfReader(src);
    PdfArray array;
    PdfDictionary annot;
    PdfDictionary fs;
    PdfDictionary refs;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        array = reader.getPageN(i).getAsArray(PdfName.ANNOTS);
        if (array == null) continue;
        for (int j = 0; j < array.size(); j++) {
            annot = array.getAsDict(j);
            if (PdfName.FILEATTACHMENT.equals(annot.getAsName(PdfName.SUBTYPE))) {
                fs = annot.getAsDict(PdfName.FS);
                refs = fs.getAsDict(PdfName.EF);
                for (PdfName name : refs.getKeys()) {
                    FileOutputStream fos
                        = new FileOutputStream(String.format(PATH, fs.getAsString(name).toString()));
                    fos.write(PdfReader.getStreamBytes((PRStream)refs.getAsStream(name)));
                    fos.flush();
                    fos.close();
                }
            }
        }
    }
    reader.close();
}

Java, iText 7.x:

public void extractAttachments(String src) throws IOException {
    // One PdfReader, owned by the PdfDocument, is enough.
    PdfDocument pdfDoc = new PdfDocument(new PdfReader(src));
    PdfArray array;
    PdfDictionary annot;
    PdfDictionary fs;
    PdfDictionary refs;
    for (int i = 1; i <= pdfDoc.getNumberOfPages(); i++) {
        array = pdfDoc.getPage(i).getPdfObject().getAsArray(PdfName.Annots);
        if (array == null) continue;
        for (int j = 0; j < array.size(); j++) {
            annot = array.getAsDictionary(j);
            if (PdfName.FileAttachment.equals(annot.getAsName(PdfName.Subtype))) {
                // FS is the file specification, EF its dictionary of embedded file streams.
                fs = annot.getAsDictionary(PdfName.FS);
                refs = fs.getAsDictionary(PdfName.EF);
                for (PdfName name : refs.keySet()) {
                    // PATH is a format string constant of the sample class.
                    FileOutputStream fos
                            = new FileOutputStream(String.format(PATH, fs.getAsString(name).toString()));
                    fos.write(refs.getAsStream(name).getBytes());
                    fos.flush();
                    fos.close();
                }
            }
        }
    }
    // Close the document (and with it the underlying reader).
    pdfDoc.close();
}

C#, iText 5.x:

/**
 * Extracts attachments from an existing PDF.
 * @param src the bytes of the existing PDF
 * @param zip the ZipFile object to add the extracted attachments to
 */
public void ExtractAttachments(byte[] src, ZipFile zip) {
  PdfReader reader = new PdfReader(src);
  for (int i = 1; i <= reader.NumberOfPages; i++) {
    PdfArray array = reader.GetPageN(i).GetAsArray(PdfName.ANNOTS);
    if (array == null) continue;
    for (int j = 0; j < array.Size; j++) {
      PdfDictionary annot = array.GetAsDict(j);
      if (PdfName.FILEATTACHMENT.Equals(
          annot.GetAsName(PdfName.SUBTYPE)))
      {
        PdfDictionary fs = annot.GetAsDict(PdfName.FS);
        PdfDictionary refs = fs.GetAsDict(PdfName.EF);
        foreach (PdfName name in refs.Keys) {
          zip.AddEntry(
            fs.GetAsString(name).ToString(), 
            PdfReader.GetStreamBytes((PRStream)refs.GetAsStream(name))
          );
        }
      }
    }
  }
}
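The samples above take the output file name straight from the file specification inside the PDF. If the PDFs come from an untrusted source, one might want to reduce that name to its last path segment before writing anything to disk. The following is only a minimal sketch under that assumption: `AttachmentNames` and `safeName` are hypothetical names of my own, not part of iText, and the check is not exhaustive.

```java
/** Hypothetical helper, not part of iText. */
final class AttachmentNames {
    private AttachmentNames() {}

    /**
     * Reduces a file name read from a PDF to its last path segment.
     * Returns the fallback if nothing usable remains.
     */
    static String safeName(String rawName, String fallback) {
        if (rawName == null) {
            return fallback;
        }
        // Treat both separators alike so Windows-style names are handled on any OS.
        String normalized = rawName.replace('\\', '/');
        int slash = normalized.lastIndexOf('/');
        String last = slash >= 0 ? normalized.substring(slash + 1) : normalized;
        last = last.trim();
        // Reject empty names and the special directory entries.
        if (last.isEmpty() || last.equals(".") || last.equals("..")) {
            return fallback;
        }
        return last;
    }
}
```

In the Java loops the result would take the place of `fs.getAsString(name).toString()` inside the `String.format(PATH, ...)` call.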

KubrickDocumentary contains the following method extractDocLevelAttachments/ExtractDocLevelAttachments to extract document level attachments:

Java, iText 5.x:

/**
 * Extracts document level attachments
 * @param filename     a file from which document level attachments will be extracted
 * @throws IOException
 */
public void extractDocLevelAttachments(String filename) throws IOException {
    PdfReader reader = new PdfReader(filename);
    PdfDictionary root = reader.getCatalog();
    PdfDictionary documentnames = root.getAsDict(PdfName.NAMES);
    PdfDictionary embeddedfiles = documentnames.getAsDict(PdfName.EMBEDDEDFILES);
    PdfArray filespecs = embeddedfiles.getAsArray(PdfName.NAMES);
    PdfDictionary filespec;
    PdfDictionary refs;
    FileOutputStream fos;
    PRStream stream;
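    // The Names array alternates name string and file specification;
    // this assumes a flat name tree (Names directly in the root node, no Kids).
    for (int i = 0; i < filespecs.size(); ) {
      filespecs.getAsString(i++);   // skip the name
      filespec = filespecs.getAsDict(i++);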
      refs = filespec.getAsDict(PdfName.EF);
      for (PdfName key : refs.getKeys()) {
        fos = new FileOutputStream(String.format(PATH, filespec.getAsString(key).toString()));
        stream = (PRStream) PdfReader.getPdfObject(refs.getAsIndirectObject(key));
        fos.write(PdfReader.getStreamBytes(stream));
        fos.flush();
        fos.close();
      }
    }
    reader.close();
}

Java, iText 7.x:

public void extractDocLevelAttachments(String src) throws IOException {
    PdfDocument pdfDoc = new PdfDocument(new PdfReader(src));
    PdfDictionary root = pdfDoc.getCatalog().getPdfObject();
    PdfDictionary documentnames = root.getAsDictionary(PdfName.Names);
    PdfDictionary embeddedfiles = documentnames.getAsDictionary(PdfName.EmbeddedFiles);
    PdfArray filespecs = embeddedfiles.getAsArray(PdfName.Names);
    PdfDictionary filespec;
    PdfDictionary refs;
    FileOutputStream fos;
    PdfStream stream;
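    // The Names array alternates name string and file specification;
    // this assumes a flat name tree (Names directly in the root node, no Kids).
    for (int i = 0; i < filespecs.size(); ) {
        filespecs.getAsString(i++);   // skip the name
        filespec = filespecs.getAsDictionary(i++);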
        refs = filespec.getAsDictionary(PdfName.EF);
        for (PdfName key : refs.keySet()) {
            fos = new FileOutputStream(String.format(PATH, filespec.getAsString(key).toString()));
            stream = refs.getAsStream(key);
            fos.write(stream.getBytes());
            fos.flush();
            fos.close();
        }
    }
    pdfDoc.close();
}

C#, iText 5.x:

/**
 * Extracts document level attachments
 * @param pdf the bytes of the PDF from which document level attachments will be extracted
 * @param zip the ZipFile object to add the extracted attachments to
 */
public void ExtractDocLevelAttachments(byte[] pdf, ZipFile zip) {
  PdfReader reader = new PdfReader(pdf);
  PdfDictionary root = reader.Catalog;
  PdfDictionary documentnames = root.GetAsDict(PdfName.NAMES);
  PdfDictionary embeddedfiles = 
      documentnames.GetAsDict(PdfName.EMBEDDEDFILES);
  PdfArray filespecs = embeddedfiles.GetAsArray(PdfName.NAMES);
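  // The Names array alternates name string and file specification;
  // this assumes a flat name tree (Names directly in the root node, no Kids).
  for (int i = 0; i < filespecs.Size; ) {
    filespecs.GetAsString(i++);   // skip the name
    PdfDictionary filespec = filespecs.GetAsDict(i++);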
    PdfDictionary refs = filespec.GetAsDict(PdfName.EF);
    foreach (PdfName key in refs.Keys) {
      PRStream stream = (PRStream) PdfReader.GetPdfObject(
        refs.GetAsIndirectObject(key)
      );
      zip.AddEntry(
        filespec.GetAsString(key).ToString(), 
        PdfReader.GetStreamBytes(stream)
      );
    }
  }
}

(For some reason the C# examples put the extracted files into a ZIP file while the Java versions write them to the file system... oh well...)
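If you prefer the ZIP behaviour of the C# versions in Java, too, the standard `java.util.zip` classes should suffice. Here is a minimal sketch, assuming the attachment bytes have already been collected into a map from file name to content; `AttachmentZip` and its methods are hypothetical names of my own, not part of iText or of the book samples.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper, not part of iText. */
final class AttachmentZip {
    private AttachmentZip() {}

    /** Packs name -> content pairs into a ZIP archive held in memory. */
    static byte[] zip(Map<String, byte[]> attachments) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> e : attachments.entrySet()) {
                // putNextEntry rejects a name that was already added.
                zos.putNextEntry(new ZipEntry(e.getKey()));
                zos.write(e.getValue());
                zos.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    /** Reads the archive back, only to show the round trip. */
    static Map<String, byte[]> unzip(byte[] zipBytes) {
        Map<String, byte[]> result = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            byte[] chunk = new byte[4096];
            while ((entry = zis.getNextEntry()) != null) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = zis.read(chunk)) != -1) {
                    content.write(chunk, 0, n);
                }
                result.put(entry.getName(), content.toByteArray());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return result;
    }
}
```

In the Java extraction loops one would then put the bytes of each stream into the map instead of writing them with a `FileOutputStream`, and call `AttachmentZip.zip` once at the end.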

Source: https://stackoverflow.com/questions/14947829/reading-pdf-file-attachment-annotations-with-itextsharp

Tags

pdf

annotations

itextsharp

attachment