Question
Is there any way to extract the text of a specific region using ICEpdf? I was able to extract whole pages, but that's not what I want to do.
(I know PDFBox nicely extracts the text in a specific rectangular area of a page. However, since the image rendering works a lot better in ICEpdf, I'd like to use that library.)
Answer 1:
You can get the text of a page by calling this method on the Document object:
PageText pageText = document.getPageText(pageNumber);
This is similar to the bundled example ./examples/extraction/PageTextExtraction.java.
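The PageText object contains all the LineText->WordText->GlyphText objects for the page. LineText, WordText and GlyphText all extend AbstractText, which has a getBounds() method. The bounds of these objects are in PDF user space, the 1st geometric quadrant (origin at the bottom left, y increasing upward). Java2D is in the 4th geometric quadrant (origin at the top left, y increasing downward), so a rectangle selected on screen has to be converted to page space before it is compared with the text bounds. Assuming you already have the selectionRectangle, the code would be as follows: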
import java.awt.Rectangle;
import java.awt.geom.AffineTransform;
import java.awt.geom.GeneralPath;
import java.awt.geom.NoninvertibleTransformException;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.logging.Level;
import org.icepdf.core.pobjects.Page;
import org.icepdf.core.pobjects.graphics.text.LineText;

// clear the currently selected state, ignore highlighted.
currentPage.getViewText().clearSelected();

// get page transform, same for all calculations
AffineTransform pageTransform = currentPage.getPageTransform(
        Page.BOUNDARY_CROPBOX,
        documentViewModel.getViewRotation(),
        documentViewModel.getViewZoom());

// declared as Rectangle2D to match the return type of the helper below
Rectangle2D pageSpaceSelectRectangle =
        convertRectangleToPageSpace(selectionRectangle, pageTransform);

ArrayList<LineText> pageLines = pageText.getPageLines();
for (LineText pageLine : pageLines) {
    // check for intersection, if so break into words.
    if (pageSpaceSelectRectangle != null
            && pageLine.getBounds().intersects(pageSpaceSelectRectangle)) {
        // you have some selected text.
    }
}

/**
 * Converts the rectangle to the space specified by the page transform. This
 * is a utility method for converting a selection rectangle to page space
 * so that an intersection can be calculated to determine a selected state.
 *
 * @param mouseRect     rectangle to convert space of
 * @param pageTransform page transform
 * @return converted rectangle, or null if the transform cannot be inverted.
 */
private Rectangle2D convertRectangleToPageSpace(Rectangle mouseRect,
                                                AffineTransform pageTransform) {
    GeneralPath shapePath;
    try {
        AffineTransform transform = pageTransform.createInverse();
        shapePath = new GeneralPath(mouseRect);
        shapePath.transform(transform);
        return shapePath.getBounds2D();
    } catch (NoninvertibleTransformException e) {
        logger.log(Level.SEVERE, "Error converting mouse point to page space.", e);
    }
    return null;
}
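To see the coordinate conversion in isolation, here is a minimal sketch that uses only java.awt.geom and no ICEpdf classes. The Word class, the viewTransform helper, the 792-point page height and the zoom factor of 2 are all assumptions made up for illustration. In real code the bounds would come from getBounds() on the LineText/WordText objects and the transform from getPageTransform(), as in the answer above.

```java
import java.awt.Rectangle;
import java.awt.geom.AffineTransform;
import java.awt.geom.NoninvertibleTransformException;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RegionTextSketch {

    /** Hypothetical stand-in for a WordText: some text plus bounds in page space. */
    static final class Word {
        final String text;
        final Rectangle2D bounds;

        Word(String text, Rectangle2D bounds) {
            this.text = text;
            this.bounds = bounds;
        }
    }

    /** Assumed page-to-view transform: flip the y-axis around the page height, then zoom. */
    static AffineTransform viewTransform(double pageHeight, double zoom) {
        AffineTransform at = new AffineTransform();
        at.scale(zoom, -zoom);
        at.translate(0, -pageHeight);
        return at;
    }

    /** Maps a rectangle from view (Java2D) space back to page space via the inverse transform. */
    static Rectangle2D toPageSpace(Rectangle2D viewRect, AffineTransform pageTransform)
            throws NoninvertibleTransformException {
        return pageTransform.createInverse().createTransformedShape(viewRect).getBounds2D();
    }

    /** Returns the text of every word whose page-space bounds intersect the region. */
    static List<String> textInRegion(List<Word> words, Rectangle2D pageRegion) {
        List<String> hits = new ArrayList<>();
        for (Word word : words) {
            if (word.bounds.intersects(pageRegion)) {
                hits.add(word.text);
            }
        }
        return hits;
    }

    /** Runs the sketch on two made-up words and a made-up selection rectangle. */
    static List<String> demo() {
        List<Word> words = Arrays.asList(
                new Word("Hello", new Rectangle2D.Double(72, 700, 40, 12)),
                new Word("World", new Rectangle2D.Double(72, 100, 40, 12)));
        AffineTransform pageTransform = viewTransform(792, 2.0);
        Rectangle selection = new Rectangle(140, 150, 100, 40); // in view space
        try {
            return textInRegion(words, toPageSpace(selection, pageTransform));
        } catch (NoninvertibleTransformException e) {
            throw new IllegalStateException("Page transform is not invertible", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With these made-up numbers the selection maps to the page-space rectangle x 70..120, y 697..717, which overlaps only the first word. Testing against WordText bounds instead of LineText bounds would give a finer-grained result when the region cuts through the middle of a line.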
Answer 2:
Have you posted on the ICEpdf forums? They are usually very good at answering questions there.
Source: https://stackoverflow.com/questions/5854969/extracting-text-in-a-specific-region-of-pdf-page-using-icepdf