Reading a particular page from a PDF document using PDFBox

醉话见心 2020-12-03 10:41

How do I read a particular page (given a page number) from a PDF document using PDFBox?

6 Answers
  • 2020-12-03 10:57

    I thought I would add my answer here, as I found the other answers useful but not exactly what I needed.

    In my scenario I wanted to scan each page individually and look for a keyword; if the keyword appeared, I would then do something with that page (i.e. copy it or ignore it).

    I've tried to simplify the code and replace specific variables with placeholders in my answer:

    // Written against the PDFBox 1.x API (getAllPages(), PDDocument.load(String)).
    // Needs: java.util.List, org.apache.pdfbox.pdmodel.PDDocument,
    // org.apache.pdfbox.pdmodel.PDPage, org.apache.pdfbox.util.PDFTextStripper
    @SuppressWarnings("unchecked")
    public void extractImages() {
        String destinationDir = "OUTPUT DIR GOES HERE";
        String inputPdf = "INPUT PDF PATH GOES HERE";
        String fileName = "output.pdf";
        PDDocument document = null;
        PDDocument newDocument = null;
        try {
            // Load the source PDF and create the empty output document
            document = PDDocument.load(inputPdf);
            newDocument = new PDDocument();
            List<PDPage> list = document.getDocumentCatalog().getAllPages();
            // PDFTextStripper is used to read the text of one page at a time
            PDFTextStripper textStripper = new PDFTextStripper();
            // PDFTextStripper page numbers are 1-based; the page list is 0-based
            for (int pageNumber = 1; pageNumber <= list.size(); pageNumber++) {
                textStripper.setStartPage(pageNumber);
                textStripper.setEndPage(pageNumber);
                String pageText = textStripper.getText(document);
                // If "SEARCH STRING" is absent (i.e. this is an image page),
                // copy the page into the new output PDF
                if (!pageText.contains("SEARCH STRING")) {
                    PDPage page = list.get(pageNumber - 1);
                    System.out.println("Copying page " + pageNumber);
                    newDocument.importPage(page);
                }
            }
            // Save before closing the source document the pages came from
            newDocument.save(new java.io.File(destinationDir, fileName).getPath());
            System.out.println(fileName + " saved");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try { if (newDocument != null) newDocument.close(); } catch (Exception ignored) { }
            try { if (document != null) document.close(); } catch (Exception ignored) { }
        }
    }
    
  • 2020-12-03 10:58
    // Uses the PDFBox library, available from http://pdfbox.apache.org/ (1.x API)
    // Writes one page of an existing PDF to a new PDF file

    // Zero-based index of the page to copy (0 is the first page)
    int i = 0;
    // Read in the source PDF
    PDDocument pdDoc = PDDocument.load(file);
    // Create the new PDF document
    PDDocument document = new PDDocument();
    try {
        // Add page "i" to the new document, then save it
        document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(i));
        document.save("file path" + "new document title" + ".pdf");
    } catch (Exception e) {
        // Do not swallow failures silently
        e.printStackTrace();
    } finally {
        document.close();
        pdDoc.close();
    }
    
  • 2020-12-03 11:00

    This should work (PDFBox 1.x, where doc is a PDDocument):

    PDPage firstPage = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
    

    As seen in the bookmark section of the tutorial.

    Update 2015, version 2.0.0-SNAPSHOT

    It seems this was removed and then put back (?). getPage is in the 2.0.0 Javadoc; it takes a zero-based page index. To use it:

    PDDocument document = PDDocument.load(new File(filename));
    PDPage page = document.getPage(0);
    

    The getAllPages method has been replaced by getPages, which is called on the PDDocument itself:

    PDPage page = document.getPages().get(0);
    
  • 2020-12-03 11:03

    Here is the solution. Hope it will solve your issue.

    String fileName = "C:\\mypdf.pdf";
    // PDFBox 1.x; in 2.x pass new File(fileName) instead
    PDDocument doc = PDDocument.load(fileName);
    PDFTextStripper stripper = new PDFTextStripper();
    // Page numbers are 1-based, so pages 1 to 2 are parsed here.
    // To parse a single page, set both values to the same number,
    // e.g. setStartPage(1); setEndPage(1);
    stripper.setStartPage(1);
    stripper.setEndPage(2);
    String result = stripper.getText(doc);
    
    doc.close();
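
    One detail that is easy to trip over across these answers: PDFTextStripper counts pages from 1, while getPage and the page list count from 0. Below is a minimal sketch of a helper that converts between the two and rejects out-of-range input. The class name PageNumbers and the method toIndex are hypothetical names for illustration, not part of PDFBox, and the page count of 6 is a stand-in for what document.getNumberOfPages() would return.

```java
public final class PageNumbers {
    private PageNumbers() { }

    /**
     * Converts a 1-based page number (as used by PDFTextStripper) into a
     * 0-based index (as used by getPage and the page list).
     * Hypothetical helper, not part of PDFBox.
     */
    public static int toIndex(int pageNumber, int pageCount) {
        if (pageNumber < 1 || pageNumber > pageCount) {
            throw new IllegalArgumentException(
                "Page " + pageNumber + " is outside 1.." + pageCount);
        }
        return pageNumber - 1;
    }

    public static void main(String[] args) {
        // Stand-in for document.getNumberOfPages()
        int pageCount = 6;
        System.out.println(toIndex(1, pageCount)); // index of the first page
        System.out.println(toIndex(6, pageCount)); // index of the last page
    }
}
```

    With such a helper, the same user-facing page number can be passed unchanged to setStartPage/setEndPage and, after conversion, to getPage.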
    
  • 2020-12-03 11:14

    You can use the getPage method on a PDDocument instance:

    PDDocument pdDocument = PDDocument.load(inputStream);
    // The index is zero-based, so 0 is the first page
    PDPage pdPage = pdDocument.getPage(0);
    
  • 2020-12-03 11:18

    If you are using the PDFBox command-line tools, pass the page range to ExtractText:

    ExtractText -startPage 1 -endPage 1 filename.pdf
    

    Change 1 to the page number that you need.
