iText PDFSweep RegexBasedCleanupStrategy not work in some case

…衆ロ難τιáo~ 提交于 2019-12-25 00:59:53

问题


I'm trying to use iText PDFSweep RegexBasedCleanupStrategy to redact some words from pdf, however I only want to redact the word but not appear in other word, eg. I want to redact "al" as single word, but I don't want to redact the "al" in "mineral". So I add the word boundary("\b") in the Regex as parameter to RegexBasedCleanupStrategy,

  new RegexBasedCleanupStrategy("\\bal\\b")

however the pdfAutoSweep.cleanUp not work if the word is at the end of line.


回答1:


In short

The cause of this issue is that the routine that flattens the extracted text chunks into a single String for applying the regular expression does not insert any indicator for a line break. Thus, in that String the last letter from one line is immediately followed by the first letter of the next which hides the word boundary. One can fix the behavior by adding an appropriate character to the String in case of a line break.

The problematic code

The routine that flattens the extracted text chunks into a single String is CharacterRenderInfo.mapString(List<CharacterRenderInfo>) in the package com.itextpdf.kernel.pdf.canvas.parser.listener. In case of a merely horizontal gap this routine inserts a space character but in case of a vertical offset, i.e. a line break, it adds nothing extra to the StringBuilder in which the String representation is generated:

if (chunk.sameLine(lastChunk)) {
    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
    if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
        sb.append(' ');
    }
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
} else {
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
}

A possible fix

One can extend the code above to insert a newline character in case of a line break:

if (chunk.sameLine(lastChunk)) {
    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
    if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
        sb.append(' ');
    }
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
} else {
    sb.append('\n');
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
}

This CharacterRenderInfo.mapString method is only called from the RegexBasedLocationExtractionStrategy method getResultantLocations() (package com.itextpdf.kernel.pdf.canvas.parser.listener), and only for the task mentioned, i.e. applying the regular expression in question. Thus, enabling it to properly allow recognition of word boundaries should not break anything but indeed should be considered a fix.

One merely might consider adding a different character for a line break, e.g. a plain space ' ' if one does not want to treat vertical gaps any different than horizontal ones. For a general fix one might, therefore, consider making this character a settable property of the strategy.

Versions

I tested with iText 7.1.4-SNAPSHOT and PDFSweep 2.0.3-SNAPSHOT.



来源:https://stackoverflow.com/questions/52635416/itext-pdfsweep-regexbasedcleanupstrategy-not-work-in-some-case

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!