Question
I have a large number of files; some are scanned images saved as PDF and some are full or partial text PDFs.
Is there a way to check these files so that we only process the ones that are scanned images, and not those that are full or partial text PDFs?
Environment: Python 3.6
Answer 1:
The code below extracts the text layer from a PDF. A searchable PDF yields its text, while a purely scanned (non-searchable) PDF yields an empty string, which lets you tell the two apart.
import fitz  # provided by the PyMuPDF package

text = ""
path = "Your_scanned_or_partial_scanned.pdf"
doc = fitz.open(path)
for page in doc:
    # get_text() was named getText() in older PyMuPDF releases
    text += page.get_text()
If you don't have the fitz module, install it with:
pip install --upgrade pymupdf
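To turn the extracted text into a per-page answer, one option is to count the non-whitespace characters found on each page. The following is a minimal sketch, assuming PyMuPDF's page.get_text() (getText() in older releases); the helper names and the 20-character cutoff are illustrative assumptions, not part of the answer above.

```python
from typing import List


def pages_without_text(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based numbers of pages whose text layer is (nearly) empty."""
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        visible = "".join(text.split())  # drop all whitespace
        if len(visible) < min_chars:  # min_chars is an arbitrary, tunable cutoff
            flagged.append(number)
    return flagged


def extract_page_texts(path: str) -> List[str]:
    """Read the text layer of every page with PyMuPDF (imported lazily)."""
    import fitz  # provided by the PyMuPDF package

    doc = fitz.open(path)
    try:
        return [page.get_text() for page in doc]
    finally:
        doc.close()
```

Calling pages_without_text(extract_page_texts("my.pdf")) would list the pages that look scanned; if every page is listed, the file is probably image-only.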
Answer 2:
Building on Rahul Agarwal's solution, along with some snippets I found at this link, here is a possible algorithm that should solve your problem.
You need to install the PyMuPDF package, which provides the fitz module. (The separately published fitz package on PyPI is an unrelated project and is not needed.) You can install it with pip:
pip3 install PyMuPDF
And here is the Python 3 implementation:
import fitz


def get_text_percentage(file_name: str) -> float:
    """
    Calculate the share of the document that is covered by (searchable) text,
    returned as a fraction (0.01 means 1%).

    If the returned value is very low, the document is
    most likely a scanned PDF.
    """
    total_page_area = 0.0
    total_text_area = 0.0

    doc = fitz.open(file_name)
    for page in doc:
        total_page_area += abs(page.rect)  # abs() of a fitz.Rect is its area
        text_area = 0.0
        # get_text("blocks") was named getTextBlocks() in older PyMuPDF releases
        for b in page.get_text("blocks"):
            r = fitz.Rect(b[:4])  # rectangle where block text appears
            text_area = text_area + abs(r)
        total_text_area += text_area
    doc.close()
    if total_page_area == 0:  # no pages, or only zero-sized pages
        return 0.0
    return total_text_area / total_page_area


if __name__ == "__main__":
    text_perc = get_text_percentage("my.pdf")
    print(text_perc)
    if text_perc < 0.01:
        print("fully scanned PDF - no relevant text")
    else:
        print("not fully scanned PDF - text is present")
Although this answers your question (i.e., it distinguishes fully scanned PDFs from full or partial text PDFs), it cannot distinguish fully textual PDFs from scanned PDFs that also contain some text.
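One way to narrow that gap is to apply the same ratio per page instead of over the whole document, so that files mixing scanned pages with text pages get their own label. This is only a sketch: the function names and the "mixed" label are hypothetical, and the rectangles are assumed to be the first four values of each block returned by page.get_text("blocks"). It still cannot tell an OCRd scan from a natively digital page.

```python
from typing import Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def rect_area(rect: Rect) -> float:
    """Area of an axis-aligned rectangle; degenerate rectangles count as 0."""
    x0, y0, x1, y1 = rect
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def page_text_ratio(page_rect: Rect, text_rects: Sequence[Rect]) -> float:
    """Fraction of one page covered by text block rectangles."""
    page_area = rect_area(page_rect)
    if page_area == 0:
        return 0.0
    return sum(rect_area(r) for r in text_rects) / page_area


def classify_document(ratios: Sequence[float], threshold: float = 0.01) -> str:
    """Label a document from its per-page text ratios."""
    if not ratios:
        return "empty"
    text_pages = sum(1 for r in ratios if r >= threshold)
    if text_pages == 0:
        return "scanned"
    if text_pages == len(ratios):
        return "text"
    return "mixed"  # some pages carry text, others do not
```

The 0.01 threshold is taken from the answer above; a real corpus may need a different value.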
Answer 3:
Try OCRmyPDF. You can use this command to convert a scanned PDF to a digital (searchable) PDF:
ocrmypdf input_scanned.pdf output_digital.pdf
If the input PDF is already digital, the command fails with the error "PriorOcrFoundError: page already has text!". You can check for that message from Python:
import subprocess as sp
import re

output = sp.getoutput("ocrmypdf input.pdf output.pdf")
if not re.search("PriorOcrFoundError: page already has text!", output):
    print("Uploaded scanned pdf")
else:
    print("Uploaded digital pdf")
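Matching on the error message ties the check to the exact wording OCRmyPDF prints. OCRmyPDF's documentation also lists distinct exit statuses, with 6 meaning that the file already contains text. The sketch below checks the status first and falls back to the message. The function names are hypothetical, and the status value is an assumption that should be verified against your installed version.

```python
import subprocess

# Assumption: OCRmyPDF's documented exit status for "file already has text".
ALREADY_HAS_TEXT_STATUS = 6
ALREADY_HAS_TEXT_MESSAGE = "page already has text"


def interpret_ocrmypdf(returncode: int, output: str) -> str:
    """Map an OCRmyPDF run to 'digital', 'scanned', or 'error'."""
    if returncode == ALREADY_HAS_TEXT_STATUS or ALREADY_HAS_TEXT_MESSAGE in output:
        return "digital"
    if returncode == 0:
        return "scanned"  # OCR ran to completion, so there was no prior text
    return "error"  # anything else: bad input, missing dependency, ...


def check_pdf(input_pdf: str, output_pdf: str) -> str:
    """Run ocrmypdf on one file and interpret the result."""
    proc = subprocess.run(
        ["ocrmypdf", input_pdf, output_pdf],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,  # text mode; this spelling works on Python 3.6
    )
    return interpret_ocrmypdf(proc.returncode, proc.stdout)
```

Note that for a scanned file this still runs a full OCR pass, so it is a slow way to classify a large batch.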
Answer 4:
How about checking the page's '/Resources' dictionary?
I believe any text in a PDF (an electronic document) is very likely to come with a font, especially since PDF aims to be a portable format and therefore keeps the font definitions.
If you are a PyPDF2 user, try:
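import PyPDF2

input_file_location = "my.pdf"  # placeholder: path to your PDF
page_num = 0  # placeholder: zero-based index of the page to inspect

# PdfFileReader/getPage are the PyPDF2 1.x/2.x names;
# PyPDF2 3.0 renamed them to PdfReader and reader.pages[page_num]
pdf_reader = PyPDF2.PdfFileReader(input_file_location)
page_data = pdf_reader.getPage(page_num)
x_object = page_data["/Resources"].get("/XObject", {})  # XObjects used by the page

if "/Font" in page_data["/Resources"]:
    print(
        "[Info]: Looks like there is text in the PDF, contains:",
        page_data["/Resources"].keys(),
    )
else:
    if len(x_object) != 1:
        print("[Info]: PDF Contains:", page_data["/Resources"].keys())
    # no fonts on this page: report it as image-only if it holds an image
    for obj in x_object:
        obj_ = x_object[obj]
        if obj_["/Subtype"] == "/Image":
            print("[Info]: PDF is image only")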
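The same '/Resources' test can be wrapped in a small function and run over every page. A minimal sketch: classify_resources is a hypothetical helper that accepts any dict-like resources object, such as page_data["/Resources"] from the snippet above, and the three labels are illustrative.

```python
from typing import Mapping


def classify_resources(resources: Mapping) -> str:
    """Label one page from its /Resources dictionary."""
    if "/Font" in resources:
        return "text"  # a font is declared, so the page probably carries text
    x_objects = resources.get("/XObject", {})
    has_image = any(
        x_objects[name].get("/Subtype") == "/Image" for name in x_objects
    )
    return "image-only" if has_image else "unknown"
```

Keep in mind that a font entry does not prove the page is natively digital: an OCR text layer added to a scan also declares a font.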
Answer 5:
def get_pdf_searchable_pages(fname):
    # pip install pdfminer.six (the maintained fork of pdfminer)
    from pdfminer.pdfpage import PDFPage

    searchable_pages = []
    non_searchable_pages = []
    page_num = 0
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile):
            page_num += 1
            # a page that declares fonts is treated as searchable
            if 'Font' in page.resources.keys():
                searchable_pages.append(page_num)
            else:
                non_searchable_pages.append(page_num)
    if page_num > 0:
        if len(searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is non-searchable")
        elif len(non_searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is searchable")
        else:
            print(f"searchable_pages : {searchable_pages}")
            print(f"non_searchable_pages : {non_searchable_pages}")
    else:
        print("Not a valid document")


if __name__ == '__main__':
    get_pdf_searchable_pages("1.pdf")
    get_pdf_searchable_pages("1Scanned.pdf")
Output:
Document '1.pdf' has 1 page(s). Complete document is searchable
Document '1Scanned.pdf' has 1 page(s). Complete document is non-searchable
Answer 6:
I created a script to detect whether a PDF was OCRd. The main idea: in OCRd PDFs, the text is invisible.
Algorithm to test whether a given PDF (f1) was OCRd:
- create a copy of f1, denoted f2
- delete all text on f2
- create images (PNG) for all (or just a few) pages of f1 and f2
- f1 was OCRd if all the images of f1 and f2 are identical.
https://github.com/jfilter/pdf-scripts/blob/master/is_ocrd_pdf.sh
#!/usr/bin/env bash
set -e
set -x
################################################################################
# Check if a PDF was scanned or created digitally, works on OCRd PDFs
#
# Usage:
# bash is_scanned_pdf.sh [-p] file
#
# Exit 0: Yes, file is a scanned PDF
# Exit 99: No, file was created digitally
#
# Arguments:
# -p or --pages: pos. integer, only consider first N pages
#
# Please report issues at https://github.com/jfilter/pdf-scripts/issues
#
# GPLv3, Copyright (c) 2020 Johannes Filter
################################################################################
# parse arguments
# h/t https://stackoverflow.com/a/33826763/4028896
max_pages=-1
# skip over positional argument of the file(s), thus -gt 1
while [[ "$#" -gt 1 ]]; do
  case $1 in
  -p | --pages)
    max_pages="$2"
    shift
    ;;
  *)
    echo "Unknown parameter passed: $1"
    exit 1
    ;;
  esac
  shift
done
command_exists() {
  if ! command -v "$1" &>/dev/null; then
    echo "error: $1 is not installed." >&2
    exit 1
  fi
}

command_exists mutool && command_exists gs && command_exists compare
command_exists pdfinfo

# resolve the input to an absolute path, because we cd into a temp dir below
orig=$PWD
case "$1" in
  /*) input="$1" ;;
  *) input="$orig/$1" ;;
esac

num_pages=$(pdfinfo "$input" | grep Pages | awk '{print $2}')

echo $num_pages
echo $max_pages

# honor -p/--pages: only consider the first N pages
if ((max_pages > 0 && max_pages < num_pages)); then
  num_pages=$max_pages
fi

cd "$(mktemp -d)"

for ((i = 1; i <= num_pages; i++)); do
  mkdir -p output/$i && echo $i
done

# important to filter text on output of GS (tmp1), cuz GS alters input PDF...
gs -o tmp1.pdf -sDEVICE=pdfwrite -dLastPage=$num_pages "$input" &>/dev/null
gs -o tmp2.pdf -sDEVICE=pdfwrite -dFILTERTEXT tmp1.pdf &>/dev/null
mutool convert -o output/%d/1.png tmp1.pdf 2>/dev/null
mutool convert -o output/%d/2.png tmp2.pdf 2>/dev/null
for ((i = 1; i <= num_pages; i++)); do
  echo $i
  # difference in pixels, if 0 there are the same pictures
  # discard diff image
  if ! compare -metric AE output/$i/1.png output/$i/2.png null: 2>&1; then
    echo " pixels difference, not a scanned PDF, mismatch on page $i"
    exit 99
  fi
done
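For readers who prefer Python, the same steps (copy, strip the text, render, compare) can be sketched as follows. The gs and mutool invocations are taken from the script above; the function names, the output file naming, and the choice to compare the rendered PNG files by hash instead of by pixel difference are assumptions of this sketch.

```python
import hashlib
import subprocess
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """SHA-256 of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def renders_identical(originals: List[Path], stripped: List[Path]) -> bool:
    """True if both lists are non-empty, equally long, and pairwise byte-identical."""
    if not originals or len(originals) != len(stripped):
        return False
    return all(file_digest(a) == file_digest(b) for a, b in zip(originals, stripped))


def run_quiet(args: List[str]) -> None:
    """Run an external tool, discarding its output and raising on failure."""
    subprocess.run(args, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)


def is_ocrd_pdf(pdf: Path, workdir: Path) -> bool:
    """Render the PDF with and without its text layer and compare the images."""
    copy_pdf = workdir / "tmp1.pdf"      # f1 after a pass through Ghostscript
    stripped_pdf = workdir / "tmp2.pdf"  # f2: same file with all text removed
    run_quiet(["gs", "-o", str(copy_pdf), "-sDEVICE=pdfwrite", str(pdf)])
    run_quiet(["gs", "-o", str(stripped_pdf), "-sDEVICE=pdfwrite", "-dFILTERTEXT", str(copy_pdf)])
    run_quiet(["mutool", "convert", "-o", str(workdir / "a-%d.png"), str(copy_pdf)])
    run_quiet(["mutool", "convert", "-o", str(workdir / "b-%d.png"), str(stripped_pdf)])
    # Both lists are sorted by the same rule, so page N of "a" pairs with page N of "b".
    originals = sorted(workdir.glob("a-*.png"))
    stripped = sorted(workdir.glob("b-*.png"))
    return renders_identical(originals, stripped)
```

Byte-for-byte comparison is stricter than the pixel comparison in the script: it assumes the renderer writes identical files for identical page content, which should be checked before relying on it.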
Source: https://stackoverflow.com/questions/55704218/how-to-check-if-pdf-is-scanned-image-or-contains-text