How to check if PDF is scanned image or contains text

问题

I have a large number of files, some of them are scanned images into PDF and some are full/partial text PDF.

Is there a way to check these files to ensure that we are only processing files which are scanned images and not those that are full/partial text PDF files?

environment: PYTHON 3.6

回答1:

The below code will work, to extract data text data from both searchable and non-searchable PDF's.

import fitz

text = ""
path = "Your_scanned_or_partial_scanned.pdf"

doc = fitz.open(path)
for page in doc:
    text += page.getText()

If you don't have fitz module you need to do this:

pip install --upgrade pymupdf

回答2:

Building on top of Rahul Agarwal's solution, along with some snippets I found at this link, here is a possible algorithm that should solve your problem.

You need to install fitz and PyMuPDF modules. You can do it by means of pip:

pip3 install fitz PyMuPDF

And here is the Python3 implementation:

import fitz


def get_text_percentage(file_name: str) -> float:
    """
    Calculate the percentage of document that is covered by (searchable) text.

    If the returned percentage of text is very low, the document is
    most likely a scanned PDF
    """
    total_page_area = 0.0
    total_text_area = 0.0

    doc = fitz.open(file_name)

    for page_num, page in enumerate(doc):
        total_page_area += abs(page.rect)
        text_area = 0.0
        for b in page.getTextBlocks():
            r = fitz.Rect(b[:4])  # rectangle where block text appears
            text_area = text_area + abs(r)
        total_text_area += text_area
    doc.close()
    return total_text_area / total_page_area


if __name__ == "__main__":
    text_perc = get_text_percentage("my.pdf")
    print(text_perc)
    if text_perc < 0.01:
        print("fully scanned PDF - no relevant text")
    else:
        print("not fully scanned PDF - text is present")

Although this answers your question (i.e. distinguish between fully scanned and full/partial textual PDFs), this solution is not able to distinguish between full-textual PDFs and scanned PDFs that also have text within them.

回答3:

Try OCRmyPDF. You can use this command to convert a scanned pdf to digital pdf.

ocrmypdf input_scanned.pdf output_digital.pdf

If the input pdf is digital the command will throw an error "PriorOcrFoundError: page already has text!".

import subprocess as sp
import re

output = sp.getoutput("ocrmypdf input.pdf output.pdf")
if not re.search("PriorOcrFoundError: page already has text!",output):
   print("Uploaded scanned pdf")
else:
   print("Uploaded digital pdf")

回答4:

How about the PDF metadata check on '/Resources' ?!

I believe for any text in a PDF (electronic document) there are more chances of having a font, especially the PDF, whose objective is to make a portable file, so, it maintains the font definition.

If you are a PyPDF2 user, try

pdf_reader = PyPDF2.PdfFileReader(input_file_location)
page_data = pdf_reader.getPage(page_num)

if "/Font" in page_data["/Resources"]:
    print(
        "[Info]: Looks like there is text in the PDF, contains:",
        page_data["/Resources"].keys(),
    )
elif len(page_data["/Resources"].get("/XObject", {})) != 1:
    print("[Info]: PDF Contains:", page_data["/Resources"].keys())
for obj in x_object:
    obj_ = x_object[obj]
    if obj_["/Subtype"] == "/Image":
        print("[Info]: PDF is image only")

回答5:

def get_pdf_searchable_pages(fname):
    # pip install pdfminer
    from pdfminer.pdfpage import PDFPage
    searchable_pages = []
    non_searchable_pages = []
    page_num = 0
    with open(fname, 'rb') as infile:

        for page in PDFPage.get_pages(infile):
            page_num += 1
            if 'Font' in page.resources.keys():
                searchable_pages.append(page_num)
            else:
                non_searchable_pages.append(page_num)
    if page_num > 0:
        if len(searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is non-searchable")
        elif len(non_searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is searchable")
        else:
            print(f"searchable_pages : {searchable_pages}")
            print(f"non_searchable_pages : {non_searchable_pages}")
    else:
        print(f"Not a valid document")


if __name__ == '__main__':
    get_pdf_searchable_pages("1.pdf")
    get_pdf_searchable_pages("1Scanned.pdf")

Output:

Document '1.pdf' has 1 page(s). Complete document is searchable
Document '1Scanned.pdf' has 1 page(s). Complete document is non-searchable

回答6:

I created a script to detect whether a PDF was OCRd. The main idea: In OCRd PDFs is the text invisible.

Algorithm to test whether a given PDF (f1) was OCRd:

create a copy of f1 noted as f2
delete all text on f2
create images (PNG) for all (or just a few) pages for f1 and f2
f1 was OCRd if all the images of f1 and f2 are identical.

https://github.com/jfilter/pdf-scripts/blob/master/is_ocrd_pdf.sh

#!/usr/bin/env bash
set -e
set -x

################################################################################
# Check if a PDF was scanned or created digitally, works on OCRd PDFs
#
# Usage:
#   bash is_scanned_pdf.sh [-p] file
#
#   Exit 0: Yes, file is a scanned PDF
#   Exit 99: No, file was created digitally
#
# Arguments:
#   -p or --pages: pos. integer, only consider first N pages
#
# Please report issues at https://github.com/jfilter/pdf-scripts/issues
#
# GPLv3, Copyright (c) 2020 Johannes Filter
################################################################################

# parse arguments
# h/t https://stackoverflow.com/a/33826763/4028896
max_pages=-1
# skip over positional argument of the file(s), thus -gt 1
while [[ "$#" -gt 1 ]]; do
  case $1 in
  -p | --pages)
    max_pages="$2"
    shift
    ;;
  *)
    echo "Unknown parameter passed: $1"
    exit 1
    ;;
  esac
  shift
done

# increment to make it easier with page numbering
max_pages=$((max_pages++))

command_exists() {
  if ! [ -x $($(command -v $1 &>/dev/null)) ]; then
    echo $(error: $1 is not installed.) >&2
    exit 1
  fi
}

command_exists mutool && command_exists gs && command_exists compare
command_exists pdfinfo

orig=$PWD
num_pages=$(pdfinfo $1 | grep Pages | awk '{print $2}')

echo $num_pages

echo $max_pages

if ((($max_pages > 1) && ($max_pages < $num_pages))); then
  num_pages=$max_pages
fi

cd $(mktemp -d)

for ((i = 1; i <= num_pages; i++)); do
  mkdir -p output/$i && echo $i
done

# important to filter text on output of GS (tmp1), cuz GS alters input PDF...
gs -o tmp1.pdf -sDEVICE=pdfwrite -dLastPage=$num_pages $1 &>/dev/null
gs -o tmp2.pdf -sDEVICE=pdfwrite -dFILTERTEXT tmp1.pdf &>/dev/null
mutool convert -o output/%d/1.png tmp1.pdf 2>/dev/null
mutool convert -o output/%d/2.png tmp2.pdf 2>/dev/null

for ((i = 1; i <= num_pages; i++)); do
  echo $i
  # difference in pixels, if 0 there are the same pictures
  # discard diff image
  if ! compare -metric AE output/$i/1.png output/$i/2.png null: 2>&1; then
    echo " pixels difference, not a scanned PDF, mismatch on page $i"
    exit 99
  fi
done

来源：https://stackoverflow.com/questions/55704218/how-to-check-if-pdf-is-scanned-image-or-contains-text

标签

python

python-3.x

pypdf2

pdfminer

pdf-extraction