How can I tell the resolution of a scanned PDF from within a shell script?

猫巷女王i 2021-02-03 11:25

I have a large collection of documents scanned into PDF format, and I wish to write a shell script that will convert each document to DjVu format. Some documents were scanned a

7 answers
  •  小鲜肉 (original poster)
    2021-02-03 11:55

    If a PDF was created by scanning, there should be only one image associated with each page. You can easily find the resolution of each page image by parsing the PDF with the iText (Java) or iTextSharp (the .NET port) libraries.

    If you want to roll your own utility to do this, do something like the following in iTextSharp:

    // Requires: using System; using iTextSharp.text.pdf;
    PdfReader reader = new PdfReader(filename);
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        PdfDictionary pg = reader.GetPageN(i);
        // Guard against a page that has no resources dictionary.
        PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
        if (res == null) continue;
        PdfDictionary xobjs = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
        if (xobjs == null) continue;
        foreach (PdfName xObjectKey in xobjs.Keys)
        {
            // Resolve the (possibly indirect) reference to the XObject dictionary.
            PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(xobjs.Get(xObjectKey));
            PdfName subtype = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
            // Skip form XObjects and anything else that is not an image.
            if (!PdfName.IMAGE.Equals(subtype)) continue;
            PdfNumber width = (PdfNumber)PdfReader.GetPdfObject(tg.Get(PdfName.WIDTH));
            PdfNumber height = (PdfNumber)PdfReader.GetPdfObject(tg.Get(PdfName.HEIGHT));
            // Write to the console so the output can be consumed by a shell script.
            Console.WriteLine("image on page [" + i + "] resolution=[" + width + "x" + height + "]");
        }
    }
    reader.Close();
    

    Here, for each page, we read through each XObject of subtype Image and get its WIDTH and HEIGHT values. These are the pixel dimensions of the image that the scanner embedded in the PDF.

    Note that the scaling of this image to fit the page size (the size of the page as rendered in Acrobat: A4, Letter, etc.) is performed separately in the page content stream, which uses a PostScript-like operator syntax, and is much harder to find without parsing that stream.
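
    If the scanned image can be assumed to cover the whole page (typical for scans, but an assumption all the same), the content stream can be sidestepped: a DPI estimate is the pixel dimension divided by the page dimension in inches, and PDF page sizes are given in points at 72 points per inch. The sketch below shows only that arithmetic. The class and method names are made up for illustration, and the page size in points would have to be read from the page's MediaBox separately (iTextSharp's reader.GetPageSize(i) should provide it).

```csharp
using System;

static class ScanDpi
{
    // PDF page sizes are expressed in points; one inch is 72 points.
    const double PointsPerInch = 72.0;

    // Hypothetical helper: estimates the DPI along one axis, assuming the
    // image is stretched to cover the full page along that axis.
    public static double Estimate(int pixels, double pagePoints)
    {
        if (pagePoints <= 0)
            throw new ArgumentOutOfRangeException(nameof(pagePoints));
        return pixels * PointsPerInch / pagePoints;
    }

    static void Main()
    {
        // US Letter is 612 x 792 points (8.5 x 11 inches).
        Console.WriteLine(Estimate(2550, 612.0)); // horizontal DPI, prints 300
        Console.WriteLine(Estimate(3300, 792.0)); // vertical DPI, prints 300
    }
}
```

    If the horizontal and vertical estimates disagree noticeably, the image probably does not fill the page, and the estimate should not be trusted.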

    Be aware that some scanners embed the scanned image as a grid of smaller images (for some kind of size optimization, I assume). So if you see something like 50 small images popping up for each page, that could be why.

    Hope this helps in some way if you have to roll your own utility.
