Search for text and get its position in a PDF with Java

粉色の甜心 2021-01-28 15:58

How can I search for text in a PDF and get its position with Java? I tried Apache PDFBox and PDF Clown, but whenever the text wraps to the next line or starts a new paragraph, the search doesn't work.

1 answer
  • 2021-01-28 16:45

    You referred to one of my earlier answers as a PDFBox example that did not work for you. Indeed, as that answer already explained, it was surprising that the code matched anything longer than a single word, because the callers of the overridden routine appear to invoke it word by word. A search term spanning more than one line could therefore hardly be expected to be found.

    But that example can be improved quite naturally to allow searches across line breaks, assuming lines are split at spaces. Replace the method findSubwords with this improved version:

    List<TextPositionSequence> findSubwordsImproved(PDDocument document, int page, String searchTerm) throws IOException
    {
        final List<TextPosition> allTextPositions = new ArrayList<>();
        PDFTextStripper stripper = new PDFTextStripper()
        {
            @Override
            protected void writeString(String text, List<TextPosition> textPositions) throws IOException
            {
                allTextPositions.addAll(textPositions);
                super.writeString(text, textPositions);
            }
    
            @Override
            protected void writeLineSeparator() throws IOException {
                if (!allTextPositions.isEmpty()) {
                    TextPosition last = allTextPositions.get(allTextPositions.size() - 1);
                    if (!" ".equals(last.getUnicode())) {
                        Matrix textMatrix = last.getTextMatrix().clone();
                        textMatrix.setValue(2, 0, last.getEndX());
                        textMatrix.setValue(2, 1, last.getEndY());
                        TextPosition separatorSpace = new TextPosition(last.getRotation(), last.getPageWidth(), last.getPageHeight(),
                                textMatrix, last.getEndX(), last.getEndY(), last.getHeight(), 0, last.getWidthOfSpace(), " ",
                                new int[] {' '}, last.getFont(), last.getFontSize(), (int) last.getFontSizeInPt());
                        allTextPositions.add(separatorSpace);
                    }
                }
                super.writeLineSeparator();
            }
        };
        
        stripper.setSortByPosition(true);
        stripper.setStartPage(page);
        stripper.setEndPage(page);
        stripper.getText(document);
    
        final List<TextPositionSequence> hits = new ArrayList<>();
        TextPositionSequence word = new TextPositionSequence(allTextPositions);
        String string = word.toString();
    
        int fromIndex = 0;
        int index;
        while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
        {
            hits.add(word.subSequence(index, index + searchTerm.length()));
            fromIndex = index + 1;
        }
    
        return hits;
    }
    

    (SearchSubword method)

    Here we collect all TextPosition entries of the page. In addition, whenever PDFBox adds a line break and the last collected entry is not already a space, we add a virtual TextPosition representing a space. Once the whole page has been processed, we search the collected text positions as a single sequence.
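The idea can be sketched without PDFBox. The following is only a minimal illustration, not the code from the answer: Glyph is a hypothetical stand-in for TextPosition, CrossLineSearch and its methods are made-up names, and it assumes every glyph carries exactly one character so that string indices and list indices coincide.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for PDFBox's TextPosition: one character and its origin. */
final class Glyph {
    final String text;
    final float x;
    final float y;

    Glyph(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class CrossLineSearch {

    /** Flattens lines into one list, adding a virtual space after any line not ending in one. */
    static List<Glyph> flatten(List<List<Glyph>> lines) {
        List<Glyph> all = new ArrayList<>();
        for (List<Glyph> line : lines) {
            all.addAll(line);
            if (!line.isEmpty()) {
                Glyph last = line.get(line.size() - 1);
                if (!" ".equals(last.text)) {
                    // Virtual separator; placed at the last glyph's origin for simplicity.
                    all.add(new Glyph(" ", last.x, last.y));
                }
            }
        }
        return all;
    }

    /** Returns the glyphs of every occurrence of term, including overlapping ones. */
    static List<List<Glyph>> find(List<Glyph> all, String term) {
        List<List<Glyph>> hits = new ArrayList<>();
        if (term.isEmpty()) {
            return hits;
        }
        StringBuilder pageText = new StringBuilder();
        for (Glyph g : all) {
            pageText.append(g.text);
        }
        int from = 0;
        int index;
        while ((index = pageText.indexOf(term, from)) > -1) {
            hits.add(new ArrayList<>(all.subList(index, index + term.length())));
            from = index + 1;
        }
        return hits;
    }

    /** Example helper: lays out a string as one-character glyphs, 6 units apart. */
    static List<Glyph> line(String s, float startX, float y) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            glyphs.add(new Glyph(String.valueOf(s.charAt(i)), startX + 6f * i, y));
        }
        return glyphs;
    }
}
```

With the two lines "see ${var" and "2} here", a search for "${var 2}" yields one hit whose first glyph lies on the first line and whose last glyph lies on the second; the space in between is the virtual one.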

    Applied to the example document from the original question, a search for "${var 2}" now returns all 8 occurrences, including those split across lines:

    * Looking for '${var 2}' (improved)
      Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
      Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
      Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
      Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81
      Page 1 at 164.39648, 357.28998 with width -46.081444 and last letter '}' at 112.46, 372.65
      Page 1 at 174.97762, 388.72998 with width -56.662575 and last letter '}' at 112.46, 404.09
      Page 1 at 153.74, 420.16998 with width -32.004005 and last letter '}' at 112.46, 435.65
      Page 1 at 162.99922, 451.61 with width -43.692017 and last letter '}' at 112.46, 467.21
    

    The negative widths occur for the matches split across lines: such a match ends on the following line, so the x coordinate of its end is less than that of its start.
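If one rectangle per line is needed, for example for highlighting, the glyphs of a match can be grouped by baseline. The sketch below is only an illustration under assumptions: PlacedGlyph, LineSpan and MatchSpans are hypothetical names rather than PDFBox classes, the width is assumed to be computed as the end of the last glyph minus the start of the first, and glyphs on the same line are assumed to share exactly the same y value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical glyph with origin and advance width; stands in for a TextPosition. */
final class PlacedGlyph {
    final float x;
    final float y;
    final float width;

    PlacedGlyph(float x, float y, float width) {
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

/** Horizontal extent of the part of a match that lies on one line. */
final class LineSpan {
    final float y;
    float startX;
    float endX;

    LineSpan(float y, float startX, float endX) {
        this.y = y;
        this.startX = startX;
        this.endX = endX;
    }

    float width() {
        return endX - startX;
    }
}

final class MatchSpans {

    /** Assumed naive width: end of the last glyph minus start of the first glyph. */
    static float naiveWidth(List<PlacedGlyph> match) {
        PlacedGlyph first = match.get(0);
        PlacedGlyph last = match.get(match.size() - 1);
        return last.x + last.width - first.x;
    }

    /** Groups glyphs by baseline (exact y) and returns one span per line, in reading order. */
    static List<LineSpan> perLine(List<PlacedGlyph> match) {
        Map<Float, LineSpan> byLine = new LinkedHashMap<>();
        for (PlacedGlyph g : match) {
            LineSpan span = byLine.get(g.y);
            if (span == null) {
                byLine.put(g.y, new LineSpan(g.y, g.x, g.x + g.width));
            } else {
                span.startX = Math.min(span.startX, g.x);
                span.endX = Math.max(span.endX, g.x + g.width);
            }
        }
        return new ArrayList<>(byLine.values());
    }
}
```

For a match that starts near the end of one line and ends near the start of the next, naiveWidth is negative, while perLine returns two spans, each with a non-negative width.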
