How can I search for text in a PDF and get its position with Java? I tried Apache PDFBox and PDF Clown, but whenever the text wraps to the next line or starts a new paragraph, it doesn't work.
You referred to one of my earlier answers as a PDFBox example that did not work for you. Indeed, as already explained in that answer, it was a surprise that the code matched anything beyond single words, because the callers of the routine overridden there gave the impression of calling it word by word. Thus, a match spanning more than a single line could hardly be expected to be found.
But one can improve that example quite naturally to allow searches across line breaks, assuming lines are split at spaces. Replace the method findSubwords with this improved version:
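    // Imports assume PDFBox 2.x package names.
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;
    import org.apache.pdfbox.util.Matrix;

    // TextPositionSequence is the helper class used by the original findSubwords example.
    List<TextPositionSequence> findSubwordsImproved(PDDocument document, int page, String searchTerm) throws IOException
    {
        // Collects every TextPosition of the page in output order.
        final List<TextPosition> allTextPositions = new ArrayList<>();
        PDFTextStripper stripper = new PDFTextStripper()
        {
            @Override
            protected void writeString(String text, List<TextPosition> textPositions) throws IOException
            {
                allTextPositions.addAll(textPositions);
                super.writeString(text, textPositions);
            }

            @Override
            protected void writeLineSeparator() throws IOException {
                if (!allTextPositions.isEmpty()) {
                    TextPosition last = allTextPositions.get(allTextPositions.size() - 1);
                    if (!" ".equals(last.getUnicode())) {
                        // The line does not end in a space: add a virtual zero-width
                        // space positioned at the end of the last glyph of the line.
                        Matrix textMatrix = last.getTextMatrix().clone();
                        textMatrix.setValue(2, 0, last.getEndX());
                        textMatrix.setValue(2, 1, last.getEndY());
                        TextPosition separatorSpace = new TextPosition(last.getRotation(), last.getPageWidth(), last.getPageHeight(),
                                textMatrix, last.getEndX(), last.getEndY(), last.getHeight(), 0, last.getWidthOfSpace(), " ",
                                new int[] {' '}, last.getFont(), last.getFontSize(), (int) last.getFontSizeInPt());
                        allTextPositions.add(separatorSpace);
                    }
                }
                super.writeLineSeparator();
            }
        };

        stripper.setSortByPosition(true);
        stripper.setStartPage(page);
        stripper.setEndPage(page);
        stripper.getText(document);

        final List<TextPositionSequence> hits = new ArrayList<>();
        if (searchTerm.isEmpty())
        {
            // An empty term would match at every index and the loop below would never end.
            return hits;
        }

        // Search the text of the whole page, virtual spaces included.
        TextPositionSequence word = new TextPositionSequence(allTextPositions);
        String string = word.toString();

        int fromIndex = 0;
        int index;
        while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
        {
            hits.add(word.subSequence(index, index + searchTerm.length()));
            fromIndex = index + 1; // advance by one so overlapping matches are found too
        }

        return hits;
    }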
(SearchSubword method findSubwordsImproved)
Here we collect all TextPosition entries and even add virtual entries representing a space whenever PDFBox adds a line break. Once the whole page has been processed, we search the collection of all these text positions.
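The idea can also be illustrated without PDFBox. The following is only a minimal sketch under stated assumptions, not the method above: `Glyph` is a hypothetical stand-in for `TextPosition` holding exactly one character, the metrics produced by `line` are made up, and all class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CrossLineSearchSketch {
    /** Hypothetical stand-in for a TextPosition: one character and where it sits. */
    static final class Glyph {
        final char unicode;
        final float x, y, width;

        Glyph(char unicode, float x, float y, float width) {
            this.unicode = unicode;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /** Turns a string into glyphs 10 units wide on baseline y (made-up metrics). */
    static List<Glyph> line(String text, float y) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), 10f * i, y, 10f));
        }
        return glyphs;
    }

    /** Concatenates lines; at each line break a zero-width space is added unless the last glyph is a space. */
    static List<Glyph> flatten(List<List<Glyph>> lines) {
        List<Glyph> all = new ArrayList<>();
        for (List<Glyph> line : lines) {
            all.addAll(line);
            if (!all.isEmpty()) {
                Glyph last = all.get(all.size() - 1);
                if (last.unicode != ' ') {
                    all.add(new Glyph(' ', last.x + last.width, last.y, 0f));
                }
            }
        }
        return all;
    }

    /** Returns every match, overlapping ones included, as the glyphs it covers. */
    static List<List<Glyph>> find(List<Glyph> all, String term) {
        List<List<Glyph>> hits = new ArrayList<>();
        if (term.isEmpty()) {
            return hits; // an empty term would match at every index
        }
        StringBuilder text = new StringBuilder();
        for (Glyph glyph : all) {
            text.append(glyph.unicode);
        }
        int from = 0;
        int index;
        while ((index = text.indexOf(term, from)) > -1) {
            // String index i is glyph i because each glyph holds exactly one char.
            hits.add(all.subList(index, index + term.length()));
            from = index + 1;
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Glyph> all = flatten(Arrays.asList(line("see ${var", 100f), line("2} here", 115f)));
        for (List<Glyph> hit : find(all, "${var 2}")) {
            Glyph first = hit.get(0);
            Glyph last = hit.get(hit.size() - 1);
            System.out.printf("match from (%.1f, %.1f) to (%.1f, %.1f)%n",
                    first.x, first.y, last.x, last.y);
        }
    }
}
```

In this sketch the virtual space is what allows a search term containing a space to match across the line break, and because string indices and glyph indices coincide, each hit carries the coordinates of its first and last character.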
Applied to the example document in the original question, looking for "${var 2}" now returns all 8 occurrences, including those split across lines:
    * Looking for '${var 2}' (improved)
    Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
    Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
    Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
    Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81
    Page 1 at 164.39648, 357.28998 with width -46.081444 and last letter '}' at 112.46, 372.65
    Page 1 at 174.97762, 388.72998 with width -56.662575 and last letter '}' at 112.46, 404.09
    Page 1 at 153.74, 420.16998 with width -32.004005 and last letter '}' at 112.46, 435.65
    Page 1 at 162.99922, 451.61 with width -43.692017 and last letter '}' at 112.46, 467.21
The negative widths occur for the matches split across lines: such a match ends near the start of the following line, so the x coordinate of the end of the match is less than that of its start.
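If the hits are meant to be highlighted, one way to deal with this is to cut each match into one box per line. The following is a minimal sketch under stated assumptions and not part of the code above: `Glyph` and `Box` are hypothetical types, the coordinates in `main` are invented, and `naiveWidth` assumes the width is computed as the end of the last glyph minus the start of the first, which is consistent with the values printed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MatchBoxesSketch {
    /** Hypothetical stand-in for a TextPosition: left edge, baseline and advance width. */
    static final class Glyph {
        final float x, y, width;

        Glyph(float x, float y, float width) {
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /** One box per line: left edge, baseline and width of a run of glyphs. */
    static final class Box {
        final float x, y, width;

        Box(float x, float y, float width) {
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /** End of the last glyph minus start of the first; negative when the match wraps. */
    static float naiveWidth(List<Glyph> match) {
        Glyph first = match.get(0);
        Glyph last = match.get(match.size() - 1);
        return last.x + last.width - first.x;
    }

    /** Starts a new box whenever the baseline moves by more than the given tolerance. */
    static List<Box> boxesPerLine(List<Glyph> match, float tolerance) {
        List<Box> boxes = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= match.size(); i++) {
            boolean runEnds = i == match.size()
                    || Math.abs(match.get(i).y - match.get(start).y) > tolerance;
            if (runEnds) {
                Glyph first = match.get(start);
                Glyph last = match.get(i - 1);
                boxes.add(new Box(first.x, first.y, last.x + last.width - first.x));
                start = i;
            }
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Invented coordinates: two glyphs and a zero-width virtual space at the
        // end of one line, then two glyphs at the start of the next line.
        List<Glyph> match = Arrays.asList(
                new Glyph(160f, 357f, 6f), new Glyph(166f, 357f, 6f), new Glyph(172f, 357f, 0f),
                new Glyph(106f, 372f, 6f), new Glyph(112f, 372f, 6f));
        System.out.println("naive width: " + naiveWidth(match));
        for (Box box : boxesPerLine(match, 0.5f)) {
            System.out.println("box at " + box.x + ", " + box.y + " with width " + box.width);
        }
    }
}
```

The tolerance is there because baselines on the same line may differ slightly in real documents; 0.5 is an arbitrary value chosen for the sketch.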