How can I tell the resolution of a scanned PDF from within a shell script?

猫巷女王i 2021-02-03 11:25

I have a large collection of documents scanned into PDF format, and I wish to write a shell script that will convert each document to DjVu format. Some documents were scanned a

7 answers
  • 2021-02-03 11:41

    Too long to put into a comment, but neither ImageMagick nor GraphicsMagick is up to the job; every answer is wrong:

    : nr@yorkie 1932 ; gm identify -format "x=%x y=%y w=%w h=%h" drh*rec*pdf
    x=0 y=0 w=612 h=792
    x=0 y=0 w=612 h=792
    x=0 y=0 w=612 h=792
    x=0 y=0 w=612 h=792
    x=0 y=0 w=612 h=792
    x=0 y=0 w=612 h=792
    x=0 y=0 w=612 h=792
    x=0 y=0 w=612 h=792
    
    : nr@yorkie 1933 ; identify -format "x=%x y=%y w=%w h=%h" drh*rec*pdf   
    x=72 Undefined y=72 Undefined w=612 h=792x=72 Undefined y=72 Undefined     w=612 h=792x=72 Undefined y=72 Undefined w=612 h=792x=72 Undefined     y=72 Undefined w=612 h=792x=72 Undefined y=72 Undefined w=612     h=792x=72 Undefined y=72 Undefined w=612 h=792x=72 Undefined y=72     Undefined w=612 h=792x=72 Undefined y=72 Undefined w=612 h=792
    : nr@yorkie 1934 ; 
    

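    The correct parameters for this document are that each scanned page is 5100 pixels wide and 6600 pixels high, which is unsurprising because this was an 8.5-by-11-inch page scanned at 600 dpi. The w=612 h=792 reported above is the page size in points (72 per inch), not the pixel size of the scan. The output from ImageMagick is astoundingly unprofessional.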

    No downvotes because you were trying to be helpful, but *Magick don't work.
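    The arithmetic behind those numbers can be sketched in shell. This is a minimal sketch, assuming you already know an image dimension in pixels and the matching page dimension in points; the function name scan_dpi is hypothetical, not part of any tool mentioned here.

```shell
# scan_dpi PIXELS POINTS   (hypothetical helper, for illustration only)
# Prints the scan resolution in dots per inch, given an image dimension in
# pixels and the matching page dimension in PDF points (72 points = 1 inch).
scan_dpi() {
    awk -v px="$1" -v pt="$2" 'BEGIN { printf "%.0f\n", px * 72 / pt }'
}

scan_dpi 5100 612   # width:  5100 px across 612 pt (8.5 in)
scan_dpi 6600 792   # height: 6600 px across 792 pt (11 in)
```

    Both calls print 600, matching the 600 dpi scan described above.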

  • 2021-02-03 11:48

    Apago's PDF Spy will tell you the actual resolution of images in a PDF, along with lots of other information. It's a commercial product but has a 10-day demo.

  • 2021-02-03 11:51

    PDF is a resolution-independent format, so the question as asked is nonsensical. You may have scanned some bitmaps at a particular resolution, and those bitmaps are individually embedded inside the PDF, but the PDF itself may contain images at multiple resolutions, as well as resolution-independent vector graphics. There's no way to know without cracking open the PDF and examining every object inside it.

    Editing to continue expounding on the problem:

    You may have gotten lucky, and the software you used to scan the documents embedded some metadata about this, but don't bet on it. Such metadata is unlikely to be standard. As far as parsing the PDF goes, you'd want a prewritten library or tool, such as Ghostscript. The problem is that PDF isn't a simple image format: page content is described by a stream of drawing operators derived from the PostScript imaging model, packaged together with compressed binary objects. Thus reading a PDF is more complicated than reading other image formats, as it involves interpreting those operators, which is not so straightforward.

    The best approach is to either throw up your hands and give up, or look really hard at Ghostscript and see if you can get it to tell you the answer.
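    One way to check whether a given PDF really mixes resolutions is to list the distinct resolutions of its embedded images. This is a minimal sketch, assuming poppler's pdfimages and a -list output that has x-ppi and y-ppi in columns 13 and 14 (older versions may not print them); the function name distinct_ppi is hypothetical.

```shell
# distinct_ppi: read `pdfimages -list` output on stdin and print each
# distinct "x-ppi y-ppi" pair once. Assumes two header lines and the ppi
# values in columns 13 and 14.
distinct_ppi() {
    awk 'NR > 2 && $3 == "image" { print $13, $14 }' | sort -u
}

# Usage (requires pdfimages):
#   pdfimages -list scan.pdf | distinct_ppi
```

    A single output line suggests that every embedded image was placed at the same resolution; several lines mean the per-document answer is ambiguous, as argued above.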

  • 2021-02-03 11:52

    pdfimages has a -list option that gives the width and height of each image in pixels, and also its x-ppi and y-ppi.

     pdfimages -list tmp.pdf           
    page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
    --------------------------------------------------------------------------------------------
       1     0 image    3300  2550  gray    1   1  ccitt  no       477  0   389   232  172K  17%
       2     1 image    3300  2550  gray    1   1  ccitt  no         3  0   389   232  103K  10%
       3     2 image    3300  2550  gray    1   1  ccitt  no         7  0   389   232  236K  23%
       4     3 image    3300  2550  gray    1   1  ccitt  no        11  0   389   232  210K  20%
       5     4 image    3300  2550  gray    1   1  ccitt  no        15  0   389   232  250K  24%
       6     5 image    3300  2550  gray    1   1  ccitt  no        19  0   389   232  199K  19%
       7     6 image    3300  2550  gray    1   1  ccitt  no        23  0   389   232  503K  49%
       8     7 image    3300  2550  gray    1   1  ccitt  no        27  0   389   232  154K  15%
       9     8 image    3300  2550  gray    1   1  ccitt  no        31  0   389   232 21.5K 2.1%
      10     9 image    3300  2550  gray    1   1  ccitt  no        35  0   389   232  286K  28%
      11    10 image    3300  2550  gray    1   1  ccitt  no        39  0   389   232 46.8K 4.6%
      12    11 image    3300  2550  gray    1   1  ccitt  no        43  0   389   232 55.5K 5.4%
      13    12 image    3300  2550  gray    1   1  ccitt  no        47  0   389   232 35.0K 3.4%
      14    13 image    3300  2550  gray    1   1  ccitt  no        51  0   389   232 26.9K 2.6%
      15    14 image    3300  2550  gray    1   1  ccitt  no        55  0   389   232 66.5K 6.5%
      16    15 image    3300  2550  gray    1   1  ccitt  no        59  0   389   232 73.9K 7.2%
      17    16 image    3300  2550  gray    1   1  ccitt  no        63  0   389   232 47.0K 4.6%
      18    17 image    3300  2550  gray    1   1  ccitt  no        67  0   389   232 30.1K 2.9%
      19    18 image    3300  2550  gray    1   1  ccitt  no        71  0   389   232 70.3K 6.8%
      20    19 image    3300  2550  gray    1   1  ccitt  no        75  0   389   232 46.0K 4.5%
      21    20 image    3300  2550  gray    1   1  ccitt  no        79  0   389   232 28.9K 2.8%
      22    21 image    3300  2550  gray    1   1  ccitt  no        83  0   389   232 72.7K 7.1%
      23    22 image    3300  2550  gray    1   1  ccitt  no        87  0   389   232 47.5K 4.6%
      24    23 image    3300  2550  gray    1   1  ccitt  no        91  0   389   232 30.1K 2.9%
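
    For use in a script, the listing above can be reduced to the columns of interest. This is a minimal sketch, assuming the column layout shown above (two header lines, width and height in columns 4 and 5, x-ppi and y-ppi in columns 13 and 14); the function name list_image_ppi is hypothetical.

```shell
# list_image_ppi: read `pdfimages -list` output on stdin and print one line
# per embedded image: page, width (px), height (px), x-ppi, y-ppi.
# Rows whose type is not "image" (for example masks) are skipped.
list_image_ppi() {
    awk 'NR > 2 && $3 == "image" { print $1, $4, $5, $13, $14 }'
}

# Usage (requires pdfimages):
#   pdfimages -list tmp.pdf | list_image_ppi
```

    For the first row of the listing above, this prints "1 3300 2550 389 232".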
    
  • 2021-02-03 11:55

    If a PDF has been created by scanning, then there should be only one image associated with each page. You can easily find the resolution of each page image by parsing the PDF using the iText (Java) or iTextSharp (the .NET port) libraries.

    If you want to roll your own utility to do this, do something like the following in iTextSharp :

    using System;
    using iTextSharp.text.pdf;

    // filename: path to the scanned PDF, supplied by the caller
    PdfReader reader = new PdfReader(filename);
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        PdfDictionary pg = reader.GetPageN(i);
        // Skip pages that have no resources or no XObjects
        PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
        if (res == null) continue;
        PdfDictionary xobjs = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
        if (xobjs == null) continue;
        foreach (PdfName xObjectKey in xobjs.Keys)
        {
            // Resolve the indirect reference to the XObject's dictionary
            PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(xobjs.Get(xObjectKey));
            if (tg == null) continue;
            PdfName subtype = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
            if (PdfName.IMAGE.Equals(subtype))
            {
                // Width and Height are the pixel dimensions of the embedded image
                PdfNumber width = (PdfNumber)PdfReader.GetPdfObject(tg.Get(PdfName.WIDTH));
                PdfNumber height = (PdfNumber)PdfReader.GetPdfObject(tg.Get(PdfName.HEIGHT));
                Console.WriteLine("image on page [" + i + "] resolution=[" + width + "x" + height + "]");
            }
        }
    }
    reader.Close();
    

    Here, for each page, we read through each XObject of subtype Image and get the WIDTH and HEIGHT values. These are the pixel dimensions of the image that the scanner embedded in the PDF.

    Note that the scaling of this image to match the page size (as in the size of the page rendered in Acrobat: A4, Letter, etc.) is performed separately in the page content stream, which is written in a PostScript-like operator language, and is much harder to find without parsing that stream.

    Be aware that there are some scanners that will embed the scanned image as a grid of smaller images (for some kind of size optimization I assume). So if you see something like 50 small images popping up for each page, that could be why.

    Hope this helps in some way if you have to roll your own utility.
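
    A quick way to spot the tiled scans mentioned above from a shell script is to count images per page. This is a minimal sketch that assumes poppler's pdfimages and its -list output (page number in column 1, type in column 3, two header lines); the function name images_per_page is hypothetical.

```shell
# images_per_page: read `pdfimages -list` output on stdin and print
# "page count" for every page that contains at least one image.
images_per_page() {
    awk 'NR > 2 && $3 == "image" { count[$1]++ }
         END { for (p in count) print p, count[p] }' | sort -n
}

# Usage (requires pdfimages):
#   pdfimages -list scan.pdf | images_per_page
```

    A count of 1 on every page matches the simple one-image-per-page case; large counts suggest the scanner split each page into tiles.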

  • 2021-02-03 12:00

    I guess that the scans are included as images in the PDF, so you could use pdfimages to extract them first. Then, identify should be able to find the correct data.
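
    That two-step approach might look like the following. This is a minimal sketch, assuming a poppler build of pdfimages that supports -png, ImageMagick's identify on the PATH, and a mktemp that accepts -d; the function name list_scan_dimensions is hypothetical. It reports pixel dimensions only, so converting them to dpi still requires the page size.

```shell
# list_scan_dimensions PDF   (hypothetical helper)
# Extracts every embedded image to a temporary directory with pdfimages,
# then prints "width height" in pixels for each one using identify.
list_scan_dimensions() {
    pdf=$1
    tmpdir=$(mktemp -d) || return 1
    # pdfimages writes files named <prefix>-NNN.png
    if ! pdfimages -png "$pdf" "$tmpdir/img"; then
        rm -rf "$tmpdir"
        return 1
    fi
    for img in "$tmpdir"/img-*.png; do
        [ -e "$img" ] || continue      # no images were extracted
        identify -format '%w %h\n' "$img"
    done
    rm -rf "$tmpdir"
}

# Usage:
#   list_scan_dimensions scan.pdf
```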
