How to extract embedded files from a PDF using MuPDF

Posted by 谁说胖子不能爱 on 2019-12-06 10:48:56

This solution, based on pdfextract.c, is brute force, but it works:

  1. iterate through all PDF objects (pdf_load_object)
  2. determine whether the object is an embedded file (isembed)
  3. if it is, access its stream and save the file (saveembed)

In most test cases the embedded files were stored at the end of the file, so iterating in reverse makes sense.
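
The scan order described above can be sketched without MuPDF, so that it compiles on its own. Everything in this block is a hypothetical stand-in: mock_obj replaces pdf_obj, mock_isembed mirrors the isembed check, and scan_reverse plays the role of the loop that would call pdf_load_object. It only illustrates the control flow, not the library API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a PDF object: only the /Type name matters here. */
typedef struct {
    const char *type;   /* value of /Type, or NULL when the key is absent */
} mock_obj;

/* Same test as isembed(), applied to the mock object. */
static int mock_isembed(const mock_obj *obj)
{
    return obj->type != NULL && strcmp(obj->type, "Filespec") == 0;
}

/*
 * Walk the object table from the highest object number down to 0 and
 * record the numbers of file specifications in `found`.
 * Stops early once `max_found` entries have been recorded.
 * Returns how many entries were written.
 */
static size_t scan_reverse(const mock_obj *table, size_t count,
                           size_t *found, size_t max_found)
{
    size_t n = 0;

    assert(table != NULL && found != NULL);

    /* `num-- > 0` tests before decrementing, so the body sees count-1 .. 0. */
    for (size_t num = count; num-- > 0; ) {
        if (n == max_found)
            break;
        if (mock_isembed(&table[num]))
            found[n++] = num;
    }
    return n;
}
```

In the real tool, the loop body would load the object by number, pass it to isembed, and call saveembed on a match. If the number of attachments is known in advance, passing it as max_found lets the reverse scan stop before it reaches the bulk of the page objects.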

static int isembed(pdf_obj *obj) {
    pdf_obj *type = pdf_dict_gets(obj, "Type");
    return pdf_is_name(type) && !strcmp(pdf_to_name(type), "Filespec");
}


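/* Relies on the file-scope `doc` and `ctx` variables, as pdfextract.c does. */
static void saveembed(pdf_obj *dict) {
    char *filename = NULL;
    FILE *f;
    fz_buffer *buf;
    int len;
    unsigned char *data;

    /* /F holds the file name of the file specification. */
    pdf_obj *obj = pdf_dict_gets(dict, "F");
    if (obj) filename = pdf_to_str_buf(obj);
    if (!filename) return;

    /* /EF refers to the embedded file stream, again under the key /F. */
    obj = pdf_dict_gets(dict, "EF");
    if (!obj) return;

    pdf_obj *stream = pdf_dict_gets(obj, "F");
    if (!stream) return;

    buf = pdf_load_stream(doc, pdf_to_num(stream), pdf_to_gen(stream));

    printf("extracting embedded file %s\n", filename);

    f = fopen(filename, "wb");
    if (f) {
        len = fz_buffer_storage(ctx, buf, &data);
        if (fwrite(data, 1, len, f) != (size_t)len)
            fprintf(stderr, "short write to %s\n", filename);
        fclose(f);
    } else {
        fprintf(stderr, "cannot create %s\n", filename);
    }
    fz_drop_buffer(ctx, buf);
}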
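
saveembed passes the name stored in the PDF straight to fopen, so a name such as ../../something would be written outside the working directory. A minimal sketch of one way to guard against that follows. safe_basename is a hypothetical helper, not part of MuPDF, and it assumes that keeping only the last path component is acceptable.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: return the part of `name` after the last '/' or '\\'.
 * Falls back to `fallback` when nothing usable remains (NULL or empty name,
 * a trailing separator, "." or "..").
 * The result points into `name` or is `fallback`; nothing is allocated.
 */
static const char *safe_basename(const char *name, const char *fallback)
{
    const char *base;

    assert(fallback != NULL);
    if (name == NULL)
        return fallback;

    base = name;
    for (const char *p = name; *p != '\0'; p++) {
        if (*p == '/' || *p == '\\')
            base = p + 1;   /* remember the character after each separator */
    }

    if (*base == '\0' || strcmp(base, ".") == 0 || strcmp(base, "..") == 0)
        return fallback;
    return base;
}
```

With this in place, the fopen call in saveembed could take safe_basename(filename, "embedded.bin") instead of filename; the fallback name here is only an illustration.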