How to know if a PDF contains only images or has been OCR scanned for searching?


I have a bunch of PDF files that came from scanned documents. The files contain a mix of images and text. Some were scanned as images with no OCR, so each PDF page is one


7 answers

暖寄归人

2020-12-08 10:49

Use dtSearch to build an index of all the PDF files, then view the log file from the indexing run to see which PDF files were not indexed.

醉梦人生

2020-12-08 10:51

Open the document in Acrobat. Go to File -> Properties, look in the "Advanced" section, and find the PDF Producer. If it reads something like "Paper Capture..." then it has been OCR'd.

Hope this helps.

野的像风

2020-12-08 10:53

Various PDF tools can tell you if there's text. Some are available as COM controls, and maybe even native .NET ones.

一整个雨季

2020-12-08 11:03

Sorry to dig up an old thread, but if you found this, have a look at my thread:

Batch OCR Program for PDFs

You can get extra information about the PDF by catting it on Unix/Linux/OS X or by opening it in "rb" mode in Python. (Of course that's Python, which you didn't want to use, but maybe your language has something equivalent.)
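
As a rough illustration of the raw-bytes idea, here is a minimal Python sketch. The function name `raw_pdf_hints` and the tokens it counts are my own assumptions, not something from the linked thread. It only finds dictionaries stored uncompressed; PDFs that pack their objects into compressed object streams can show zero hits even when they do contain fonts, so treat the counts as a hint rather than proof.

```python
import re
from pathlib import Path


def raw_pdf_hints(path):
    """Scan the raw bytes of a PDF for a few telltale name tokens.

    Hypothetical helper: counts only tokens that appear uncompressed.
    """
    data = Path(path).read_bytes()  # same effect as open(path, "rb").read()
    return {
        # Every PDF should start with a "%PDF-" header.
        "is_pdf": data.startswith(b"%PDF-"),
        # "/Font" as a whole name (not /FontDescriptor, /FontName, ...).
        "font_refs": len(re.findall(rb"/Font\b", data)),
        # Image XObjects are declared with /Subtype /Image.
        "image_refs": len(re.findall(rb"/Subtype\s*/Image\b", data)),
    }
```

A file with many `image_refs` and no `font_refs` is a candidate for "image only, no OCR", but it is worth confirming with a real PDF parser before acting on it.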

清歌不尽

2020-12-08 11:12

Apago's pdfspy extracts information from PDF into an XML file. It includes information about the document including images and text. For your project, the useful information includes image count & size and where there is OCR (hidden) text.

http://www.apagoinc.com/pdfspy

醉酒成梦

2020-12-08 11:14
Scanned images converted to PDF that were OCR'ed afterwards to make the text searchable normally contain the text rendered as "invisible". So what you see on screen (or on paper when printed) is still the original image. But when a search succeeds, the highlighted hits are on the invisible text.

I'd recommend looking at the XPDF-derived command-line tools pdffonts(.exe), pdfinfo(.exe) and pdftotext(.exe). See here for downloads: http://www.foolabs.com/xpdf/download.html

Example usage of pdffonts:
```
C:\downloads\> pdffonts cisco-ip-phone-7911-guide6.1.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
LGOKFL+Univers-BlackOblique          Type 1C           yes yes no   13171  0
LGOKGM+Univers-Black                 Type 1C           yes yes no   13172  0
[....]
```
This PDF uses fonts (indicated by the 'name' column), has them embedded (indicated by the 'yes' in the 'emb' column) and uses subset fonts (indicated by the 'yes' in the 'sub' column).
```
C:\downloads\> pdffonts examle1.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Univers-BlackOblique                 Type 1C           yes no  no   14    0
Arial                                TrueType          no  no  no   15    0
```
This PDF uses 2 fonts (indicated by the 'name' column). The font 'Univers-BlackOblique' is embedded completely (indicated by the 'yes' in the 'emb' column and the 'no' in the 'sub' column). The font 'Arial' is also used, but is not embedded.
```
C:\downloads\> pdffonts examle2.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
```
This PDF does not use a single font, and hence has no text embedded (so no OCR either).
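
To run this check over a whole batch, the pdffonts output can be parsed by a script. The following is a minimal Python sketch, assuming `pdffonts` is on the PATH and prints the two header lines shown above; the helper names `parse_pdffonts` and `has_fonts` are made up for illustration.

```python
import subprocess


def parse_pdffonts(output):
    """Return the font rows of pdffonts output (the lines after the dashed rule)."""
    lines = output.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("----"):
            # Everything below the rule is one font per line; skip blank lines.
            return [row for row in lines[i + 1:] if row.strip()]
    return []  # no dashed rule found, so no table to read


def has_fonts(pdf_path):
    """Run pdffonts on one file and report whether it lists any font."""
    result = subprocess.run(
        ["pdffonts", str(pdf_path)],
        capture_output=True,
        text=True,
        check=True,  # raise if pdffonts exits with an error
    )
    return len(parse_pdffonts(result.stdout)) > 0
```

Files for which `has_fonts` returns `False` are the likely image-only scans.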

Example usage of pdftotext:
```
C:\downloads\> pdftotext ^
                   -layout ^
                   cisco-ip-phone-7911-guide6.1.pdf ^
                   cisco-ip-phone-7911-guide6.1.txt
```
This will extract all text strings from the PDF (trying to preserve some semblance of the original layout). If there is no text in the PDF, you'd know there was no OCR...
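
The same test can be scripted. Below is a minimal Python sketch, assuming `pdftotext` is on the PATH; passing `-` as the output file makes it write to stdout. The helper names and the `min_chars` threshold are illustrative assumptions, since a page with only a stray character or two may still be an unprocessed scan.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext and return the extracted text ("-" sends it to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        encoding="utf-8",
        errors="replace",  # do not fail on odd bytes in the output
        check=True,
    )
    return result.stdout


def looks_searchable(text, min_chars=20):
    """Heuristic: enough non-whitespace characters suggests a text layer.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    visible = "".join(text.split())  # drops spaces, newlines and form feeds
    return len(visible) >= min_chars
```

For example, `looks_searchable(extract_text("scan.pdf"))` would flag a hypothetical `scan.pdf` as OCR'd or not; an image-only PDF typically yields nothing but page-break characters.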
