How can I determine if a file is a PDF file?

暖寄归人 2020-12-24 11:57

I am using PdfBox in Java to extract text from PDF files. Some of the input files provided are not valid and PDFTextStripper halts on these files. Is there a clean way to ch

13 answers
  • 2020-12-24 12:12

    Here is an adapted Java version of NinjaCross's code.

    /**
     * Test if the data in the given byte array represents a PDF file.
     */
    public static boolean is_pdf(byte[] data) {
        if (data != null && data.length > 7 && // "%PDF-1.x" is 8 bytes; data[5..7] are read below
                data[0] == 0x25 && // %
                data[1] == 0x50 && // P
                data[2] == 0x44 && // D
                data[3] == 0x46 && // F
                data[4] == 0x2D) { // -
    
            // version 1.3 file terminator
            if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x33 &&
                    data[data.length - 7] == 0x25 && // %
                    data[data.length - 6] == 0x25 && // %
                    data[data.length - 5] == 0x45 && // E
                    data[data.length - 4] == 0x4F && // O
                    data[data.length - 3] == 0x46 && // F
                    data[data.length - 2] == 0x20 && // SPACE
                    data[data.length - 1] == 0x0A) { // EOL
                return true;
            }
    
            // version 1.4 file terminator
            if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x34 &&
                    data[data.length - 6] == 0x25 && // %
                    data[data.length - 5] == 0x25 && // %
                    data[data.length - 4] == 0x45 && // E
                    data[data.length - 3] == 0x4F && // O
                    data[data.length - 2] == 0x46 && // F
                    data[data.length - 1] == 0x0A) { // EOL
                return true;
            }
        }
        return false;
    }
    

    And some simple unit tests:

    @Test
    public void test_valid_pdf_1_3_data_is_pdf() {
        assertTrue(is_pdf("%PDF-1.3 CONTENT %%EOF \n".getBytes()));
    }
    
    @Test
    public void test_valid_pdf_1_4_data_is_pdf() {
        assertTrue(is_pdf("%PDF-1.4 CONTENT %%EOF\n".getBytes()));
    }
    
    @Test
    public void test_invalid_data_is_not_pdf() {
        assertFalse(is_pdf("Hello World".getBytes()));
    }
    

    If you come up with any failing unit tests, please let me know.
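
    For instance, the check above rejects any file whose version is not 1.3 or 1.4, and any file where %%EOF is followed by CR LF or by nothing at all. Below is a minimal sketch of a more lenient variant, not NinjaCross's code: it accepts any "%PDF-" header and searches the end of the data for "%%EOF". The class name, method names and the 1024-byte tail window are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

public class LenientPdfCheck {
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);
    // Assumption: only the last 1024 bytes are searched for the end marker.
    private static final int TAIL_WINDOW = 1024;

    /** Returns true if data starts with "%PDF-" and has "%%EOF" near its end. */
    public static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < HEADER.length + EOF_MARKER.length) {
            return false;
        }
        if (!matchesAt(data, 0, HEADER)) {
            return false;
        }
        // Scan backwards through the tail window, never overlapping the header.
        int from = Math.max(HEADER.length, data.length - TAIL_WINDOW);
        for (int i = data.length - EOF_MARKER.length; i >= from; i--) {
            if (matchesAt(data, i, EOF_MARKER)) {
                return true;
            }
        }
        return false;
    }

    /** Returns true if pattern occurs in data starting exactly at pos. */
    private static boolean matchesAt(byte[] data, int pos, byte[] pattern) {
        if (pos < 0 || pos + pattern.length > data.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[pos + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }
}
```

    This only inspects the first and last bytes, so it says nothing about whether the content in between can actually be parsed.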

  • 2020-12-24 12:12

    Here is a method that checks for the presence of %%EOF, with optional checks for the white-space characters that follow it. You can pass in either a File or a byte[]. Some PDF versions are less strict about the white-space after %%EOF.

    public boolean isPdf(byte[] data) {
        if (data == null || data.length < 8) return false; // the tail scan below reads the last 8 bytes
        // %PDF-
        if (data[0] == 0x25 && data[1] == 0x50 && data[2] == 0x44 && data[3] == 0x46 && data[4] == 0x2D) {
            int offset = data.length - 8, count = 0; // check last 8 bytes for %%EOF with optional white-space
            boolean hasSpace = false, hasCr = false, hasLf = false;
            while (offset < data.length) {
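                // else-if: a single byte may advance the match by at most one step,
                // otherwise one '%' would be counted as both characters of "%%"
                if (count == 0 && data[offset] == 0x25) count++; // %
                else if (count == 1 && data[offset] == 0x25) count++; // %
                else if (count == 2 && data[offset] == 0x45) count++; // E
                else if (count == 3 && data[offset] == 0x4F) count++; // O
                else if (count == 4 && data[offset] == 0x46) count++; // F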
                // Optional flags for meta info
                if (count == 5 && data[offset] == 0x20) hasSpace = true; // SPACE
                if (count == 5 && data[offset] == 0x0D) hasCr    = true; // CR
                if (count == 5 && data[offset] == 0x0A) hasLf    = true; // LF / EOL
                offset++;
            }
    
            if (count == 5) {
                String version = data.length > 13 ? String.format("%s%s%s", (char) data[5], (char) data[6], (char) data[7]) : "?";
                System.out.printf("Version : %s | Space : %b | CR : %b | LF : %b%n", version, hasSpace, hasCr, hasLf);
                return true;
            }
        }
    
        return false;
    }
    
    public boolean isPdf(File file) throws IOException {
        return isPdf(file, false);
    }
    
    // With version: 16 bytes, without version: 13 bytes.
    public boolean isPdf(File file, boolean includeVersion) throws IOException {
        if (file == null) return false;
        int offsetStart = includeVersion ? 8 : 5, offsetEnd = 8;
        byte[] bytes = new byte[offsetStart + offsetEnd];
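        if (file.length() < bytes.length) return false; // too short to hold header + tail
        // java.io.RandomAccessFile can seek straight to the tail, and readFully()
        // never returns a short read (InputStream.read/skip give no such guarantee).
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.readFully(bytes, 0, offsetStart); // %PDF-
            raf.seek(file.length() - offsetEnd); // Jump to the last 8 bytes
            raf.readFully(bytes, offsetStart, offsetEnd); // %%EOF,SP?,CR?,LF?
        }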
        return isPdf(bytes);
    }
    
  • 2020-12-24 12:16

    PDF files begin with "%PDF" (open one in TextPad or a similar editor and take a look).

    Is there any reason you can't just read the start of the file with a FileReader and check for this?
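
    A minimal sketch of that header check, reading raw bytes rather than characters so that no charset decoding is involved. The class and method names are hypothetical. It only looks at the first four bytes, so a truncated or corrupt file with a valid header still passes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class PdfHeaderCheck {
    private static final byte[] MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the first four bytes of the stream are "%PDF". */
    public static boolean startsWithPdfHeader(InputStream in) throws IOException {
        byte[] head = new byte[MAGIC.length];
        int filled = 0;
        // read() may return fewer bytes than requested, so loop until the buffer is full.
        while (filled < head.length) {
            int n = in.read(head, filled, head.length - filled);
            if (n < 0) {
                return false; // stream ended before four bytes were available
            }
            filled += n;
        }
        return Arrays.equals(head, MAGIC);
    }

    /** Convenience overload that opens and closes the file itself. */
    public static boolean startsWithPdfHeader(Path path) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            return startsWithPdfHeader(in);
        }
    }
}
```

    Files that fail this test can be skipped before they ever reach PDFTextStripper; files that pass may still be broken further in.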

  • 2020-12-24 12:16

    Relying on magic numbers does not really appeal to me. I ended up using the Preflight library from Apache PDFBox for this:

    compile group: 'org.apache.pdfbox', name: 'preflight', version: '2.0.19'

    private boolean isPdf(InputStream fileInputStream) {
        try {
            PreflightParser preflightParser = new PreflightParser(new ByteArrayDataSource(fileInputStream));
            preflightParser.parse();
            return true;
        } catch (Exception e) {
            return false;
        }
    }
    

    PreflightParser has constructors for files and other data sources.

  • 2020-12-24 12:17

    Try this:

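    public boolean isPDF(File file) throws FileNotFoundException {
        // Use the file passed in; try-with-resources closes the Scanner and its reader.
        try (Scanner input = new Scanner(new FileReader(file))) {
            // Note: this matches "%PDF-" on any line, not only at the start of the file.
            while (input.hasNextLine()) {
                final String checkline = input.nextLine();
                if (checkline.contains("%PDF-")) {
                    // a match!
                    return true;
                }
            }
        }
        return false;
    }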
    
  • 2020-12-24 12:20

    You can determine the MIME type of a file (or byte array), so you don't have to rely blindly on the extension. I do it with Aperture's MimeExtractor (http://aperture.sourceforge.net/); a few days ago I also came across a library built just for this (http://sourceforge.net/projects/mime-util).

    I use Aperture to extract text from a variety of file types, not only PDF, but I had to tweak things for PDFs: Aperture uses PDFBox, and I added another library as a fallback for when PDFBox fails.
