Remove invisible text from PDF using PDFBox

逝去的感伤 2020-12-06 08:38

Link to pdf

When I try to extract the text from the pdf above, I get a mixture of text that was invisible in the evince viewer as well as text that was visible. In a

1 answer
  • 2020-12-06 09:20

    Most of the invisible text in the OP's sample PDF is hidden in one of two ways: by clip paths (the text lies outside their bounds) or by filled paths (which paint over the text underneath). To ignore that invisible text, text extraction therefore has to take path-related instructions into account.
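
    As a rough illustration of these two mechanisms, here is a minimal sketch that uses only java.awt.geom and no PDFBox. The class VisibilitySketch and its method isVisible are hypothetical names made up for this example, and the rectangles are arbitrary test shapes, not data from the sample PDF.

```java
import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Rectangle2D;

/** Toy model (not PDFBox code) of the two hiding mechanisms: clipping and later fills. */
final class VisibilitySketch {
    /**
     * Treats a glyph origin as visible if it lies inside the clip area
     * (null meaning "no clip") and is not inside a path that is filled afterwards.
     */
    static boolean isVisible(Area clip, GeneralPath filledLater, double x, double y) {
        boolean insideClip = clip == null || clip.contains(x, y);
        boolean paintedOver = filledLater != null && filledLater.contains(x, y);
        return insideClip && !paintedOver;
    }

    public static void main(String[] args) {
        // Hypothetical clip area and a smaller rectangle that is filled after the text is drawn.
        Area clip = new Area(new Rectangle2D.Double(0, 0, 100, 100));
        GeneralPath fill = new GeneralPath();
        fill.append(new Rectangle2D.Double(40, 40, 20, 20), false);

        System.out.println(isVisible(clip, fill, 10, 10));  // inside the clip, not painted over
        System.out.println(isVisible(clip, fill, 150, 10)); // outside the clip
        System.out.println(isVisible(clip, fill, 50, 50));  // inside the clip but painted over
    }
}
```

    The stripper shown further down applies the same two checks, but spreads them over two places: the clip check happens when a character arrives, the fill check whenever a fill instruction is processed.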

    Unfortunately, callbacks for these instructions are not declared in PDFTextStripper or in its parent classes LegacyPDFStreamEngine and PDFStreamEngine.

    They are declared, however, in the other major PDFStreamEngine subclass, PDFGraphicsStreamEngine, and they are sensibly implemented in PageDrawer.

    To make use of this, we can copy the PageDrawer implementation into a subclass of PDFTextStripper and adapt it, e.g. like this:

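    // Imports assume the PDFBox 2.0.x API (OperatorProcessor with a protected "context" field).
    import java.awt.geom.Area;
    import java.awt.geom.GeneralPath;
    import java.awt.geom.Point2D;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.pdfbox.contentstream.operator.MissingOperandException;
    import org.apache.pdfbox.contentstream.operator.Operator;
    import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
    import org.apache.pdfbox.cos.COSBase;
    import org.apache.pdfbox.cos.COSNumber;
    import org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;
    import org.apache.pdfbox.util.Matrix;
    import org.apache.pdfbox.util.Vector;

    public class PDFVisibleTextStripper extends PDFTextStripper {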
        public PDFVisibleTextStripper() throws IOException {
            addOperator(new AppendRectangleToPath());
            addOperator(new ClipEvenOddRule());
            addOperator(new ClipNonZeroRule());
            addOperator(new ClosePath());
            addOperator(new CurveTo());
            addOperator(new CurveToReplicateFinalPoint());
            addOperator(new CurveToReplicateInitialPoint());
            addOperator(new EndPath());
            addOperator(new FillEvenOddAndStrokePath());
            addOperator(new FillEvenOddRule());
            addOperator(new FillNonZeroAndStrokePath());
            addOperator(new FillNonZeroRule());
            addOperator(new LineTo());
            addOperator(new MoveTo());
            addOperator(new StrokePath());
        }
    
        @Override
        protected void processTextPosition(TextPosition text) {
            Matrix textMatrix = text.getTextMatrix();
            Vector start = textMatrix.transform(new Vector(0, 0));
            Vector end = new Vector(start.getX() + text.getWidth(), start.getY());
    
            PDGraphicsState gs = getGraphicsState();
            Area area = gs.getCurrentClippingPath();
            if (area == null || (area.contains(start.getX(), start.getY()) && area.contains(end.getX(), end.getY())))
                super.processTextPosition(text);
        }
    
        private GeneralPath linePath = new GeneralPath();
    
        void deleteCharsInPath() {
            for (List<TextPosition> list : charactersByArticle) {
                List<TextPosition> toRemove = new ArrayList<>();
                for (TextPosition text : list) {
                    Matrix textMatrix = text.getTextMatrix();
                    Vector start = textMatrix.transform(new Vector(0, 0));
                    Vector end = new Vector(start.getX() + text.getWidth(), start.getY());
                    if (linePath.contains(start.getX(), start.getY()) || linePath.contains(end.getX(), end.getY())) {
                        toRemove.add(text);
                    }
                }
                if (!toRemove.isEmpty()) {
                    // drop the characters covered by the path that is being filled
                    list.removeAll(toRemove);
                }
            }
        }
    
        public final class AppendRectangleToPath extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                if (operands.size() < 4) {
                    throw new MissingOperandException(operator, operands);
                }
                if (!checkArrayTypesClass(operands, COSNumber.class)) {
                    return;
                }
                COSNumber x = (COSNumber) operands.get(0);
                COSNumber y = (COSNumber) operands.get(1);
                COSNumber w = (COSNumber) operands.get(2);
                COSNumber h = (COSNumber) operands.get(3);
    
                float x1 = x.floatValue();
                float y1 = y.floatValue();
    
                // create a pair of coordinates for the transformation
                float x2 = w.floatValue() + x1;
                float y2 = h.floatValue() + y1;
    
                Point2D p0 = context.transformedPoint(x1, y1);
                Point2D p1 = context.transformedPoint(x2, y1);
                Point2D p2 = context.transformedPoint(x2, y2);
                Point2D p3 = context.transformedPoint(x1, y2);
    
                // to ensure that the path is created in the right direction, we have to create
                // it by combining single lines instead of creating a simple rectangle
                linePath.moveTo((float) p0.getX(), (float) p0.getY());
                linePath.lineTo((float) p1.getX(), (float) p1.getY());
                linePath.lineTo((float) p2.getX(), (float) p2.getY());
                linePath.lineTo((float) p3.getX(), (float) p3.getY());
    
                // close the subpath instead of adding the last line so that a possible set line
                // cap style isn't taken into account at the "beginning" of the rectangle
                linePath.closePath();
            }
    
            @Override
            public String getName() {
                return "re";
            }
        }
    
        public final class StrokePath extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.reset();
            }
    
            @Override
            public String getName() {
                return "S";
            }
        }
    
        public final class FillEvenOddRule extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
                deleteCharsInPath();
                linePath.reset();
            }
    
            @Override
            public String getName() {
                return "f*";
            }
        }
    
        public class FillNonZeroRule extends OperatorProcessor {
            @Override
            public final void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
                deleteCharsInPath();
                linePath.reset();
            }
    
            @Override
            public String getName() {
                return "f";
            }
        }
    
        public final class FillEvenOddAndStrokePath extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
                deleteCharsInPath();
                linePath.reset();
            }
    
            @Override
            public String getName() {
                return "B*";
            }
        }
    
        public class FillNonZeroAndStrokePath extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
                deleteCharsInPath();
                linePath.reset();
            }
    
            @Override
            public String getName() {
                return "B";
            }
        }
    
        public final class ClipEvenOddRule extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
                getGraphicsState().intersectClippingPath(linePath);
            }
    
            @Override
            public String getName() {
                return "W*";
            }
        }
    
        public class ClipNonZeroRule extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
                getGraphicsState().intersectClippingPath(linePath);
            }
    
            @Override
            public String getName() {
                return "W";
            }
        }
    
        public final class MoveTo extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                if (operands.size() < 2) {
                    throw new MissingOperandException(operator, operands);
                }
                COSBase base0 = operands.get(0);
                if (!(base0 instanceof COSNumber)) {
                    return;
                }
                COSBase base1 = operands.get(1);
                if (!(base1 instanceof COSNumber)) {
                    return;
                }
                COSNumber x = (COSNumber) base0;
                COSNumber y = (COSNumber) base1;
                Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());
                linePath.moveTo(pos.x, pos.y);
            }
    
            @Override
            public String getName() {
                return "m";
            }
        }
    
        public class LineTo extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                if (operands.size() < 2) {
                    throw new MissingOperandException(operator, operands);
                }
                COSBase base0 = operands.get(0);
                if (!(base0 instanceof COSNumber)) {
                    return;
                }
                COSBase base1 = operands.get(1);
                if (!(base1 instanceof COSNumber)) {
                    return;
                }
                // append straight line segment from the current point to the point
                COSNumber x = (COSNumber) base0;
                COSNumber y = (COSNumber) base1;
    
                Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());
    
                linePath.lineTo(pos.x, pos.y);
            }
    
            @Override
            public String getName() {
                return "l";
            }
        }
    
        public class CurveTo extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                if (operands.size() < 6) {
                    throw new MissingOperandException(operator, operands);
                }
                if (!checkArrayTypesClass(operands, COSNumber.class)) {
                    return;
                }
                COSNumber x1 = (COSNumber) operands.get(0);
                COSNumber y1 = (COSNumber) operands.get(1);
                COSNumber x2 = (COSNumber) operands.get(2);
                COSNumber y2 = (COSNumber) operands.get(3);
                COSNumber x3 = (COSNumber) operands.get(4);
                COSNumber y3 = (COSNumber) operands.get(5);
    
                Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
                Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
                Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());
    
                linePath.curveTo(point1.x, point1.y, point2.x, point2.y, point3.x, point3.y);
            }
    
            @Override
            public String getName() {
                return "c";
            }
        }
    
        public final class CurveToReplicateFinalPoint extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                if (operands.size() < 4) {
                    throw new MissingOperandException(operator, operands);
                }
                if (!checkArrayTypesClass(operands, COSNumber.class)) {
                    return;
                }
                COSNumber x1 = (COSNumber) operands.get(0);
                COSNumber y1 = (COSNumber) operands.get(1);
                COSNumber x3 = (COSNumber) operands.get(2);
                COSNumber y3 = (COSNumber) operands.get(3);
    
                Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
                Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());
    
                linePath.curveTo(point1.x, point1.y, point3.x, point3.y, point3.x, point3.y);
            }
    
            @Override
            public String getName() {
                return "y";
            }
        }
    
        public class CurveToReplicateInitialPoint extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                if (operands.size() < 4) {
                    throw new MissingOperandException(operator, operands);
                }
                if (!checkArrayTypesClass(operands, COSNumber.class)) {
                    return;
                }
                COSNumber x2 = (COSNumber) operands.get(0);
                COSNumber y2 = (COSNumber) operands.get(1);
                COSNumber x3 = (COSNumber) operands.get(2);
                COSNumber y3 = (COSNumber) operands.get(3);
    
                Point2D currentPoint = linePath.getCurrentPoint();
    
                Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
                Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());
    
                if (currentPoint == null) {
                    // no current point yet (malformed path): start a new subpath at the end point
                    linePath.moveTo(point3.x, point3.y);
                } else {
                    linePath.curveTo((float) currentPoint.getX(), (float) currentPoint.getY(), point2.x, point2.y, point3.x, point3.y);
                }
            }
    
            @Override
            public String getName() {
                return "v";
            }
        }
    
        public final class ClosePath extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.closePath();
            }
    
            @Override
            public String getName() {
                return "h";
            }
        }
    
        public final class EndPath extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.reset();
            }
    
            @Override
            public String getName() {
                return "n";
            }
        }
    }
    

    (PDFVisibleTextStripper)

    Make sure that the PDFVisibleTextStripper constructor registers the inner operator classes defined above, not the identically named classes used by PageDrawer. If in doubt, follow the link under the code.

    This reduces the output to

    REVERSE tEaSER caRd
    500
    elections
    er of Teams
    t Bet
    1,000
    MARK BOX AS SHOWN 
    DENOTES HOME TEAM
    PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016
     1 PANTHERS    nbc  - 10½ 8:30p 2 BRONCOS   - 3½
     PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016
     3 FALCONS     - 9½ 1:00p 4 BUCCANEERS  - 4½
     5 VIKINGS   - 9½ 1:00p 6 TITANS  - 4½
     7 EAGLES  - 10½ 1:00p 8 BROWNS  - 3½
     9 BENGALS - 9½ 1:00p 10 JETS  - 4½
     11 SAINTS    - 7½ 1:00p 12 RAIDERS   - 6½
     13 CHIEFS  - 14½ 1:00p 14 CHARGERS  + ½
     15 RAVENS  - 10½ 1:00p 16 BILLS - 3½
     17 TEXANS  - 14½ 1:00p 18 BEARS + ½
     19 PACKERS - 12½ 1:00p 20 JAGUARS  - 1½
     21 SEAHAWKS    - 17½ 4:05p 22 DOLPHINS + 3½
     23 COWBOYS    - 7½ 4:25p 24 GIANTS - 6½
     25 COLTS     - 10½ 4:25p 26 LIONS - 3½
     27 CARDINALS   nbc  - 14½ 8:30p 28 PATRIOTS + ½
     PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016
     29 STEELERS  espn  - 10½ 7:10p 30 REDSKINS  - 3½
     31 RAMS  espn  - 9½ 10:20p 32 49ERS  - 4½
    

    which drops most of the unwanted data.


    In the context of a related question it became apparent that the way processTextPosition and deleteCharsInPath calculate the end of a character's baseline implicitly assumes horizontal text without page rotation. If one relaxes the criterion for "visibility", though, one can treat a character as visible if and only if the start of its baseline is visible. In that case the calculated Vector end is no longer needed, and the code also works for rotated pages.
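
    To see why the horizontal-text assumption fails, here is a small sketch using only java.awt.geom. BaselineSketch, naiveEnd and rotatedEnd are hypothetical names for this illustration, and the width and angle are made-up values; it does not reproduce how PDFBox itself reports widths on rotated pages.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Area;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

/** Toy comparison (not PDFBox code) of two ways to locate the end of a glyph baseline. */
final class BaselineSketch {
    /** Baseline end as computed in the stripper above: assumes the text runs along the x axis. */
    static Point2D naiveEnd(Point2D start, double width) {
        return new Point2D.Double(start.getX() + width, start.getY());
    }

    /** Baseline end that follows the actual text direction, given as a rotation angle. */
    static Point2D rotatedEnd(Point2D start, double width, double angleRadians) {
        AffineTransform rotation = AffineTransform.getRotateInstance(angleRadians);
        Point2D offset = rotation.deltaTransform(new Point2D.Double(width, 0), null);
        return new Point2D.Double(start.getX() + offset.getX(), start.getY() + offset.getY());
    }

    public static void main(String[] args) {
        // Hypothetical tall, narrow clip area with text running upwards inside it.
        Area clip = new Area(new Rectangle2D.Double(0, 0, 20, 100));
        Point2D start = new Point2D.Double(10, 10);

        System.out.println(clip.contains(naiveEnd(start, 30)));                // naive end point
        System.out.println(clip.contains(rotatedEnd(start, 30, Math.PI / 2))); // end along the text direction
        System.out.println(clip.contains(start));                              // relaxed, start-only criterion
    }
}
```

    In this toy setup the naive end point falls outside the clip area although the whole baseline is inside it, so the two-point check would drop a visible character; checking only the start avoids that.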


    In the context of another related question it became apparent that glyph origin coordinates lying exactly on a clip path border can end up outside the clip path because of floating-point rounding errors. Switching to "fat point" coordinate checks, which test a small area around the point instead of the exact point, turned out to be an acceptable workaround.
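
    One possible shape of such a check, as a minimal sketch: instead of asking whether the area contains the exact point, ask whether it intersects a tiny square around it. FatPointSketch and containsFat are hypothetical names, and the absolute tolerance is an assumption of this example; the check actually used with the stripper may scale the tolerance differently.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

/** Toy "fat point" containment check (an illustration, not the stripper's actual code). */
final class FatPointSketch {
    /**
     * Returns true if the area overlaps the square of side 2 * tolerance
     * centred on (x, y), so points marginally outside the border still count.
     */
    static boolean containsFat(Area area, double x, double y, double tolerance) {
        return area.intersects(x - tolerance, y - tolerance, 2 * tolerance, 2 * tolerance);
    }

    public static void main(String[] args) {
        Area clip = new Area(new Rectangle2D.Double(0, 0, 100, 100));
        double x = 100.00001; // hypothetical origin pushed just past the right border by rounding
        double y = 50;

        System.out.println(clip.contains(x, y));            // exact point check
        System.out.println(containsFat(clip, x, y, 0.001)); // fat point check
        System.out.println(containsFat(clip, 200, y, 0.001)); // a point far outside the clip
    }
}
```

    The trade-off is that a character whose origin lies just outside the clip area is now kept as well, so the tolerance should stay small relative to glyph sizes.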
