Search for text and get its position in a PDF with Java

How can I search for text in a PDF and get its position with Java? I tried Apache PDFBox and PDF Clown, but whenever the text wraps onto the next line or starts a new paragraph, the search doesn't work.

1 Answer

暖寄归人

2021-01-28 16:45

You referred to one of my earlier answers as a PDFBox example that did not work for you. Indeed, as that answer already explained, it was surprising that the code matched anything longer than a single word, because the callers of the routine overridden there gave the impression of calling it word by word. A match spanning more than a single line could therefore hardly be expected.

But one can improve that example quite naturally to allow searches across line breaks, assuming lines are split at spaces. Replace the method findSubwords with this improved version:

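// Imports needed (PDFBox 2.x package names); TextPositionSequence is the helper class from the earlier answer.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;

List<TextPositionSequence> findSubwordsImproved(PDDocument document, int page, String searchTerm) throws IOException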
{
    final List<TextPosition> allTextPositions = new ArrayList<>();
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            allTextPositions.addAll(textPositions);
            super.writeString(text, textPositions);
        }

        @Override
        protected void writeLineSeparator() throws IOException {
            if (!allTextPositions.isEmpty()) {
                TextPosition last = allTextPositions.get(allTextPositions.size() - 1);
                if (!" ".equals(last.getUnicode())) {
                    Matrix textMatrix = last.getTextMatrix().clone();
                    textMatrix.setValue(2, 0, last.getEndX());
                    textMatrix.setValue(2, 1, last.getEndY());
                    TextPosition separatorSpace = new TextPosition(last.getRotation(), last.getPageWidth(), last.getPageHeight(),
                            textMatrix, last.getEndX(), last.getEndY(), last.getHeight(), 0, last.getWidthOfSpace(), " ",
                            new int[] {' '}, last.getFont(), last.getFontSize(), (int) last.getFontSizeInPt());
                    allTextPositions.add(separatorSpace);
                }
            }
            super.writeLineSeparator();
        }
    };
    
    stripper.setSortByPosition(true);
    stripper.setStartPage(page);
    stripper.setEndPage(page);
    stripper.getText(document);

    final List<TextPositionSequence> hits = new ArrayList<TextPositionSequence>();
    TextPositionSequence word = new TextPositionSequence(allTextPositions);
    String string = word.toString();

    int fromIndex = 0;
    int index;
    while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
    {
        hits.add(word.subSequence(index, index + searchTerm.length()));
        fromIndex = index + 1;
    }

    return hits;
}

(SearchSubword method)

Here we collect all TextPosition entries and also add virtual entries representing a space whenever PDFBox adds a line break. Once the whole page has been rendered, we search the collection of all these text positions.
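
The same idea can be tried without PDFBox. The following is only a minimal sketch under stated assumptions: Glyph is a hypothetical stand-in for TextPosition that holds exactly one character, and the names line, flatten and findAll are made up for this illustration. It joins the lines with a virtual space and then maps each indexOf hit back to the glyphs it covers.

```java
import java.util.ArrayList;
import java.util.List;

class LineBreakSearchSketch {
    /** Hypothetical stand-in for a PDFBox TextPosition: one character plus its coordinates. */
    static final class Glyph {
        final char ch;
        final float x;
        final float y;

        Glyph(char ch, float x, float y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Builds one line of glyphs with a fixed advance per character (illustration only). */
    static List<Glyph> line(String text, float startX, float y, float advance) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), startX + i * advance, y));
        }
        return glyphs;
    }

    /** Concatenates the lines, adding a virtual space after any line that does not end in one. */
    static List<Glyph> flatten(List<List<Glyph>> lines) {
        List<Glyph> all = new ArrayList<>();
        for (List<Glyph> current : lines) {
            all.addAll(current);
            if (!all.isEmpty()) {
                Glyph last = all.get(all.size() - 1);
                if (last.ch != ' ') {
                    // For simplicity the virtual space reuses the previous glyph's position.
                    all.add(new Glyph(' ', last.x, last.y));
                }
            }
        }
        return all;
    }

    /** Returns every match as a sub-list of glyphs, so each hit keeps its coordinates. */
    static List<List<Glyph>> findAll(List<Glyph> glyphs, String term) {
        List<List<Glyph>> hits = new ArrayList<>();
        if (term.isEmpty()) {
            return hits;
        }
        StringBuilder text = new StringBuilder();
        for (Glyph g : glyphs) {
            text.append(g.ch);
        }
        // One character per glyph, so string indices equal list indices.
        int from = 0;
        int index;
        while ((index = text.indexOf(term, from)) > -1) {
            hits.add(glyphs.subList(index, index + term.length()));
            from = index + 1;
        }
        return hits;
    }

    public static void main(String[] args) {
        List<List<Glyph>> lines = new ArrayList<>();
        lines.add(line("a ${var", 100f, 100f, 10f));
        lines.add(line("2} b", 100f, 116f, 10f));
        for (List<Glyph> hit : findAll(flatten(lines), "${var 2}")) {
            Glyph first = hit.get(0);
            Glyph last = hit.get(hit.size() - 1);
            System.out.println("match starts at " + first.x + ", " + first.y
                    + " and its last letter '" + last.ch + "' is at " + last.x + ", " + last.y);
        }
    }
}
```

The one-character-per-glyph assumption is what lets string indices double as list indices; with ligatures or other multi-character entries that mapping would need extra bookkeeping.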

Applied to the example document from the original question, a search for "${var 2}" now returns all 8 occurrences, including those split across lines:

* Looking for '${var 2}' (improved)
  Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
  Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
  Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
  Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81
  Page 1 at 164.39648, 357.28998 with width -46.081444 and last letter '}' at 112.46, 372.65
  Page 1 at 174.97762, 388.72998 with width -56.662575 and last letter '}' at 112.46, 404.09
  Page 1 at 153.74, 420.16998 with width -32.004005 and last letter '}' at 112.46, 435.65
  Page 1 at 162.99922, 451.61 with width -43.692017 and last letter '}' at 112.46, 467.21

The negative widths occur for the matches that wrap onto the next line: there the x coordinate of the end of the match is less than that of its start.
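
If a single width per match is not useful for wrapped matches, one possible refinement is to report one box per line. The following is a hedged sketch rather than part of the code above: Glyph, naiveWidth and perLineBoxes are hypothetical names, the coordinates used with it would be made up, and glyphs are grouped by exact baseline y, where real code would probably want a tolerance.

```java
import java.util.ArrayList;
import java.util.List;

class MatchWidthSketch {
    /** Hypothetical stand-in for a TextPosition: start x, baseline y and glyph width. */
    static final class Glyph {
        final float x;
        final float y;
        final float width;

        Glyph(float x, float y, float width) {
            this.x = x;
            this.y = y;
            this.width = width;
        }

        float endX() {
            return x + width;
        }
    }

    /** End of the last glyph minus start of the first; negative if the match wraps to the left. */
    static float naiveWidth(List<Glyph> match) {
        Glyph first = match.get(0);
        Glyph last = match.get(match.size() - 1);
        return last.endX() - first.x;
    }

    /** Splits the match into runs sharing the same y and returns {x, y, width} for each run. */
    static List<float[]> perLineBoxes(List<Glyph> match) {
        List<float[]> boxes = new ArrayList<>();
        int runStart = 0;
        for (int i = 1; i <= match.size(); i++) {
            // Close the current run at the end of the match or when the baseline changes.
            if (i == match.size() || match.get(i).y != match.get(runStart).y) {
                Glyph first = match.get(runStart);
                Glyph last = match.get(i - 1);
                boxes.add(new float[] {first.x, first.y, last.endX() - first.x});
                runStart = i;
            }
        }
        return boxes;
    }
}
```

With such a split, every box has a non-negative width as long as the glyphs within a line run left to right, and each box can be highlighted separately.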
