Comparison of two PDF files

感动是毒 2020-12-16 19:49

I need to compare the contents of two nearly identical PDF files and highlight the differing portions in the corresponding PDF file. I am using PDFBox. Please help me at least with

4 answers
  • 2020-12-16 20:09

    If you prefer a tool with a GUI, you could try this one: diffpdf. It is by Mark Summerfield, and since it is written with Qt, it should be available (or at least buildable) on every platform that Qt supports.

  • 2020-12-16 20:13

    I had this very problem myself and the quickest way that I've found is to use PHP and its bindings for ImageMagick (Imagick).

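    <?php
    // Load both PDFs; ImageMagick rasterizes them through its Ghostscript delegate.
    $im1 = new \Imagick("file1.pdf");
    $im2 = new \Imagick("file2.pdf");

    // compareImages() returns an array: [0] the difference image, [1] the metric value.
    $result = $im1->compareImages($im2, \Imagick::METRIC_MEANSQUAREERROR);

    if ($result[1] > 0.0) {
        // Files are DIFFERENT
    } else {
        // Files are IDENTICAL
    }

    // Release the resources held by both objects.
    $im1->destroy();
    $im2->destroy();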
    

    Of course, you need to install the ImageMagick bindings first (on newer Debian/Ubuntu releases the package is named php-imagick rather than php5-imagick):

    sudo apt-get install php5-imagick # Ubuntu/Debian
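
    For intuition about what a mean-squared-error metric measures, here is a minimal Java sketch that computes it over two already rasterized pages of equal size. It is an illustration only, not ImageMagick's implementation, and its normalization may differ from the value Imagick reports. The names MseSketch and meanSquaredError are made up for this example. A result of 0 means every pixel matches.

```java
import java.awt.image.BufferedImage;

public class MseSketch {

    /** Mean squared error over the R, G and B channels, with channel values scaled to [0, 1]. */
    public static double meanSquaredError(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("images must have the same size");
        }
        double sum = 0.0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                int p = a.getRGB(x, y);
                int q = b.getRGB(x, y);
                // Shifts 0, 8 and 16 select the blue, green and red channels.
                for (int shift = 0; shift <= 16; shift += 8) {
                    double d = (((p >> shift) & 0xFF) - ((q >> shift) & 0xFF)) / 255.0;
                    sum += d * d;
                }
            }
        }
        return sum / (3.0 * a.getWidth() * a.getHeight());
    }

    public static void main(String[] args) {
        // Two hypothetical 4x4 "pages"; TYPE_INT_RGB images start out all black.
        BufferedImage a = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        BufferedImage b = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        b.setRGB(0, 0, 0xFFFFFF); // one white pixel out of 16
        System.out.println(meanSquaredError(a, a)); // 0.0
        System.out.println(meanSquaredError(a, b)); // 0.0625
    }
}
```

    In practice a small threshold is often more useful than an exact comparison with 0, because rasterization can introduce tiny anti-aliasing differences.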
    
  • 2020-12-16 20:18

    You can do the same thing with a shell script on Linux. The script wraps 3 components:

    1. ImageMagick's compare command
    2. the pdftk utility
    3. Ghostscript

    It's rather easy to translate this into a .bat Batch file for DOS/Windows...

    Here are the building blocks:

    pdftk

    Use these commands to split multi-page PDF files into multiple single-page PDFs:

    pdftk  first.pdf  burst  output  somewhere/firstpdf_page_%03d.pdf
    pdftk  2nd.pdf    burst  output  somewhere/2ndpdf_page_%03d.pdf
    

    compare

    Use this command to create a "diff" PDF page for each of the pages:

    compare \
           -verbose \
           -debug coder -log "%u %m:%l %e" \
            somewhere/firstpdf_page_001.pdf \
            somewhere/2ndpdf_page_001.pdf \
           -compose src \
            somewhereelse/diff_page_001.pdf
    

    Note that compare is part of ImageMagick. For PDF processing, however, it needs Ghostscript as a 'delegate', because it cannot read PDF natively.

    Once more, pdftk

    Now you can again concatenate your "diff" PDF pages with pdftk:

    pdftk \
          somewhereelse/diff_page_*.pdf \
          cat \
          output somewhereelse/diff_allpages.pdf
    

    Ghostscript

    Ghostscript automatically inserts metadata (such as the current date and time) into its PDF output. Therefore its output does not work well with MD5-hash-based file comparisons.

    If you want to automatically detect all pages that are purely white (meaning there are no visible differences between the input pages), you can instead convert to a metadata-free bitmap format using the bmp256 output device. You can do that for the original PDFs (first.pdf and 2nd.pdf) or for the diff PDF pages:

     gs \
       -o diff_page_001.bmp \
       -r72 \
       -g595x842 \
       -sDEVICE=bmp256 \
        diff_page_001.pdf
    
     md5sum diff_page_001.bmp
    

    For reference, create an all-white BMP page of the same size and compute its MD5 sum like this:

     gs \
       -o reference-white-page.bmp \
       -r72 \
       -g595x842 \
       -sDEVICE=bmp256 \
       -c "showpage quit"
    
     md5sum reference-white-page.bmp
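
    The MD5 check described above can also be automated. The following is a rough Java sketch, not part of the original script: it hashes each diff bitmap and reports the pages whose hash differs from that of the all-white reference. The class name WhitePageCheck and its method names are placeholders, and it assumes the diff pages and the reference were rendered with the same -r and -g settings.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class WhitePageCheck {

    /** Returns the MD5 digest of a file as a lowercase hex string, the form md5sum prints. */
    public static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** True if the diff bitmap hashes to the same value as the all-white reference bitmap. */
    public static boolean isUnchanged(Path diffBmp, Path whiteReferenceBmp)
            throws IOException, NoSuchAlgorithmException {
        return md5Hex(diffBmp).equals(md5Hex(whiteReferenceBmp));
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: java WhitePageCheck reference-white-page.bmp diff_page_001.bmp [...]");
            return;
        }
        Path reference = Paths.get(args[0]);
        for (int i = 1; i < args.length; i++) {
            Path page = Paths.get(args[i]);
            String verdict = isUnchanged(page, reference) ? "no visible differences" : "DIFFERENT";
            System.out.println(page + ": " + verdict);
        }
    }
}
```

    Comparing hashes only answers whether a page changed at all; to see where it changed, open the corresponding diff page.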
    
  • 2020-12-16 20:30

    I have written a jar, built on Apache PDFBox, to compare PDF files. It can compare them pixel by pixel and highlight the differences.

    See my blog for examples and the download: http://www.testautomationguru.com/introducing-pdfutil-to-compare-pdf-files-extract-resources/


    To get page count

    import com.taguru.utility.PDFUtil;
    
    PDFUtil pdfUtil = new PDFUtil();
    pdfUtil.getPageCount("c:/sample.pdf"); //returns the page count
    

    To get page content as plain text

    //returns the pdf content - all pages
    pdfUtil.getText("c:/sample.pdf");
    
    // returns the pdf content from page number 2
    pdfUtil.getText("c:/sample.pdf",2);
    
    // returns the pdf content from page number 5 to 8
    pdfUtil.getText("c:/sample.pdf", 5, 8);
    

    To extract embedded images from a PDF

    // set the path where the images should be stored
    pdfUtil.setImageDestinationPath("c:/imgpath");
    pdfUtil.extractImages("c:/sample.pdf");

    // extracts & saves the images from page number 3
    pdfUtil.extractImages("c:/sample.pdf", 3);

    // extracts & saves the images from page 2 only
    pdfUtil.extractImages("c:/sample.pdf", 2, 2);
    

    To store PDF pages as images

    // set the path where the images should be stored
    pdfUtil.setImageDestinationPath("c:/imgpath");
    pdfUtil.savePdfAsImage("c:/sample.pdf");
    

    To compare PDF files in text mode (faster, but it does not compare formatting, images, etc. in the PDF)

    String file1="c:/files/doc1.pdf";
    String file2="c:/files/doc2.pdf";
    
    // compares the pdf documents & returns a boolean
    // true if both files have same content. false otherwise.
    pdfUtil.comparePdfFilesTextMode(file1, file2);
    
    // compare the 3rd page alone
    pdfUtil.comparePdfFilesTextMode(file1, file2, 3, 3);
    
    // compare the pages from 1 to 5
    pdfUtil.comparePdfFilesTextMode(file1, file2, 1, 5);
    

    To compare PDF files in binary mode (slower; compares the PDF documents pixel by pixel, highlights the differences and stores the result as an image)

    String file1="c:/files/doc1.pdf";
    String file2="c:/files/doc2.pdf";
    
    // compares the pdf documents & returns a boolean
    // true if both files have same content. false otherwise.
    pdfUtil.comparePdfFilesBinaryMode(file1, file2);
    
    // compare the 3rd page alone
    pdfUtil.comparePdfFilesBinaryMode(file1, file2, 3, 3);
    
    // compare the pages from 1 to 5
    pdfUtil.comparePdfFilesBinaryMode(file1, file2, 1, 5);
    
    //if you need to store the result
    pdfUtil.highlightPdfDifference(true);
    pdfUtil.setImageDestinationPath("c:/imgpath");
    pdfUtil.comparePdfFilesBinaryMode(file1, file2);
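
    The pixel-by-pixel idea can be illustrated without the jar. The sketch below is an illustration only, not PDFUtil's actual implementation: it assumes both pages have already been rendered to BufferedImage objects of the same size (for example with PDFBox 2.x's PDFRenderer) and paints every differing pixel in a highlight color. PixelDiffSketch and highlightDifferences are hypothetical names.

```java
import java.awt.Color;
import java.awt.image.BufferedImage;
import java.util.Optional;

public class PixelDiffSketch {

    /**
     * Returns a copy of {@code actual} in which every pixel that differs from
     * {@code expected} is painted in the highlight color, or an empty Optional
     * if the two images are identical.
     */
    public static Optional<BufferedImage> highlightDifferences(
            BufferedImage expected, BufferedImage actual, Color highlight) {
        if (expected.getWidth() != actual.getWidth() || expected.getHeight() != actual.getHeight()) {
            throw new IllegalArgumentException("pages must be rendered at the same size");
        }
        BufferedImage out = new BufferedImage(
                actual.getWidth(), actual.getHeight(), BufferedImage.TYPE_INT_RGB);
        boolean different = false;
        for (int y = 0; y < actual.getHeight(); y++) {
            for (int x = 0; x < actual.getWidth(); x++) {
                int pixel = actual.getRGB(x, y);
                if (pixel != expected.getRGB(x, y)) {
                    out.setRGB(x, y, highlight.getRGB()); // mark the changed pixel
                    different = true;
                } else {
                    out.setRGB(x, y, pixel); // keep unchanged pixels as they are
                }
            }
        }
        return different ? Optional.of(out) : Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-ins for two rendered pages; TYPE_INT_RGB images start out all black.
        BufferedImage page1 = new BufferedImage(3, 3, BufferedImage.TYPE_INT_RGB);
        BufferedImage page2 = new BufferedImage(3, 3, BufferedImage.TYPE_INT_RGB);
        page2.setRGB(1, 1, 0xFFFFFF); // simulate one changed pixel

        System.out.println(highlightDifferences(page1, page2, Color.MAGENTA).isPresent()); // true
        System.out.println(highlightDifferences(page1, page1, Color.MAGENTA).isPresent()); // false
    }
}
```

    The returned image could then be written out with javax.imageio.ImageIO.write to keep a visual record of the differences for each page.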
    