I need to compare the contents of two almost similar files and highlight the dissimilar portions in the corresponding pdf file. Am using pdfbox. Please help me atleast with
If you prefer a tool with a GUI, you could try this one: diffpdf. It's by Mark Summerfield, and since it's written with Qt, it should be available (or should be buildable) on all platforms where Qt runs on.
Here's a screenshot:
I had this very problem myself and the quickest way that I've found is to use PHP and its bindings for ImageMagick (Imagick).
<?php
$im1 = new \Imagick("file1.pdf");
$im2 = new \Imagick("file2.pdf");
$result = $im1->compareImages($im2, \Imagick::METRIC_MEANSQUAREERROR);
if($result[1] > 0.0){
// Files are DIFFERENT
}
else{
// Files are IDENTICAL
}
$im1->destroy();
$im2->destroy();
Of course, you need to install the ImageMagick bindings first:
sudo apt-get install php5-imagick # Ubuntu/Debian
You can do the same thing with a shell script on Linux. The script wraps 3 components:
compare
commandpdftk
utilityIt's rather easy to translate this into a .bat
Batch file for DOS/Windows...
Here are the building blocks:
Use this command to split multipage PDF files into multiple singlepage PDFs:
pdftk first.pdf burst output somewhere/firstpdf_page_%03d.pdf
pdftk 2nd.pdf burst output somewhere/2ndpdf_page_%03d.pdf
Use this command to create a "diff" PDF page for each of the pages:
compare \
-verbose \
-debug coder -log "%u %m:%l %e" \
somewhere/firstpdf_page_001.pdf \
somewhere/2ndpdf_page_001.pdf \
-compose src \
somewhereelse/diff_page_001.pdf
Note, that compare
is part of ImageMagick. But for PDF processing it needs Ghostscript as a 'delegate', because it cannot do so natively itself.
Now you can again concatenate your "diff" PDF pages with pdftk
:
pdftk \
somewhereelse/diff_page_*.pdf \
cat \
output somewhereelse/diff_allpages.pdf
Ghostscript automatically inserts meta data (such as the current date+time) into its PDF output. Therefore this is not working well for MD5hash-based file comparisons.
If you want to automatically discover all cases which consist of purely white pages (that means: there are no visible differences in your input pages), you could also convert to a meta-data free bitmap format using the bmp256
output device. You can do that for the original PDFs (first.pdf and 2nd.pdf), or for the diff-PDF pages:
gs \
-o diff_page_001.bmp \
-r72 \
-g595x842 \
-sDEVICE=bmp256 \
diff_page_001.pdf
md5sum diff_page_001.bmp
Just create an all-white BMP page with its MD5sum (for reference) like this:
gs \
-o reference-white-page.bmp \
-r72 \
-g595x842 \
-sDEVICE=bmp256 \
-c "showpage quit"
md5sum reference-white-page.bmp
I have come up with a jar using apache pdfbox to compare pdf files - this can compare pixel by pixel
& highlight the differences.
Check my blog : http://www.testautomationguru.com/introducing-pdfutil-to-compare-pdf-files-extract-resources/ for example & download.
To get page count
import com.taguru.utility.PDFUtil;
PDFUtil pdfUtil = new PDFUtil();
pdfUtil.getPageCount("c:/sample.pdf"); //returns the page count
To get page content as plain text
//returns the pdf content - all pages
pdfUtil.getText("c:/sample.pdf");
// returns the pdf content from page number 2
pdfUtil.getText("c:/sample.pdf",2);
// returns the pdf content from page number 5 to 8
pdfUtil.getText("c:/sample.pdf", 5, 8);
To extract attached images from PDF
//set the path where we need to store the images
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.extractImages("c:/sample.pdf");
// extracts & saves the pdf content from page number 3
pdfUtil.extractImages("c:/sample.pdf", 3);
// extracts & saves the pdf content from page 2
pdfUtil.extractImages("c:/sample.pdf", 2, 2);
To store PDF pages as images
//set the path where we need to store the images
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.savePdfAsImage("c:/sample.pdf");
To compare PDF files in text mode (faster – But it does not compare the format, images etc in the PDF)
String file1="c:/files/doc1.pdf";
String file1="c:/files/doc2.pdf";
// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesTextMode(file1, file2);
// compare the 3rd page alone
pdfUtil.comparePdfFilesTextMode(file1, file2, 3, 3);
// compare the pages from 1 to 5
pdfUtil.comparePdfFilesTextMode(file1, file2, 1, 5);
To compare PDF files in Binary mode (slower – compares PDF documents pixel by pixel – highlights pdf difference & store the result as image)
String file1="c:/files/doc1.pdf";
String file1="c:/files/doc2.pdf";
// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesBinaryMode(file1, file2);
// compare the 3rd page alone
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 3, 3);
// compare the pages from 1 to 5
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 1, 5);
//if you need to store the result
pdfUtil.highlightPdfDifference(true);
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.comparePdfFilesBinaryMode(file1, file2);