How to determine artificial bold, artificial italic, and artificial outline styles of text using PDFBox

花落未央 2020-11-28 16:41

I am using PDFBox to validate a PDF document. One requirement is to check whether the following types of text are present in a PDF:

  • Artificial Bold style text
2 answers
  • 2020-11-28 16:47

    The general procedure and a PDFBox issue

    In theory one should start this by deriving a class from PDFTextStripper and overriding its method:

    /**
     * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
     * and just calls {@link #writeString(String)}.
     *
     * @param text The text to write to the stream.
     * @param textPositions The TextPositions belonging to the text.
     * @throws IOException If there is an error when writing the text.
     */
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        writeString(text);
    }
    

    Your override should then use the List<TextPosition> textPositions instead of the String text; each TextPosition essentially represents a single letter together with information on the graphics state that was active when that letter was drawn.

    Unfortunately, the textPositions list does not contain the correct contents in the current version, 1.8.3. E.g. for the line "This is normal text." from your PDF, the method writeString is called four times, once each for the strings "This", " is", " normal", and " text." Each time, however, the textPositions list contains the TextPosition instances for the letters of the last string, " text."

    This turned out to be already known as PDFBox issue PDFBOX-1804, which has meanwhile been resolved as fixed for versions 1.8.4 and 2.0.0.

    That being said, as soon as you have a PDFBox version in which this is fixed, you can check for some artificial styles as follows:

    Artificial italic text

    This text style is created like this in the page content:

    BT
    /F0 1 Tf
    24 0 5.10137 24 66 695.5877 Tm
    0 Tr
    [<03>]TJ
    ...
    

    The relevant part happens in setting the text matrix with Tm. The 5.10137 is the shear component of that matrix; relative to the vertical scale of 24 it corresponds to a slant of about 12° (tan 12° ≈ 0.2126).

    When you check a TextPosition textPosition as indicated above, you can query this value using

    textPosition.getTextPos().getValue(1, 0)
    

    If this value is significantly greater than 0.0, you have artificial italics. If it is significantly less than 0.0, you have artificial backwards italics.
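
    As a rough illustration of that test, here is a minimal Java sketch that works on the raw matrix numbers rather than on PDFBox objects. The class name ShearCheck, its method names, and the 1° threshold are assumptions made for this example and are not part of PDFBox. It also assumes upright, unflipped text (positive vertical scale, no rotation). It converts the shear component into a slant angle so that the threshold does not depend on the font size.

```java
final class ShearCheck {
    /** Assumed threshold: slants smaller than this count as upright text. */
    static final double MIN_ANGLE_DEGREES = 1.0;

    /**
     * Slant angle in degrees, computed from the shear component c and the
     * vertical scale d of a text matrix "a b c d e f Tm".
     * Assumes d > 0, i.e. text that is neither flipped nor rotated.
     */
    static double shearAngleDegrees(double c, double d) {
        return Math.toDegrees(Math.atan2(c, d));
    }

    /** Returns "italic", "backwards italic", or "upright". */
    static String classify(double c, double d) {
        double angle = shearAngleDegrees(c, d);
        if (angle > MIN_ANGLE_DEGREES) {
            return "italic";
        }
        if (angle < -MIN_ANGLE_DEGREES) {
            return "backwards italic";
        }
        return "upright";
    }

    public static void main(String[] args) {
        // Values taken from the Tm operands shown above: 24 0 5.10137 24 ...
        System.out.println(classify(5.10137, 24));
        // Values from an unsheared matrix: 24 0 0 24 ...
        System.out.println(classify(0, 24));
    }
}
```

    With a TextPosition at hand, c would be the getValue(1, 0) result from above; taking d from getValue(1, 1) of the same matrix is an assumption of this sketch.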

    Artificial bold or outline text

    These artificial styles print each letter twice, using different rendering modes; e.g. the capital 'T', in case of bold:

    0 0 0 1 k
    ...
    BT
    /F0 1 Tf 
    24 0 0 24 66.36 729.86 Tm 
    <03>Tj 
    4 M 0.72 w 
    0 0 Td 
    1 Tr 
    0 0 0 1 K
    <03>Tj
    ET
    

    (i.e. first drawing the letter in regular mode, filling the letter area, and then drawing it in outline mode, drawing a line along the letter border, both in black, CMYK 0, 0, 0, 1; this leaves the impression of a thicker letter.)

    and in case of outline:

    BT
    /F0 1 Tf
    24 0 0 24 66 661.75 Tm
    0 0 0 0 k
    <03>Tj
    /GS1 gs
    4 M 0.288 w 
    0 0 Td
    1 Tr
    0 0 0 1 K
    <03>Tj
    ET
    

    (i.e. first drawing the letter in regular mode white, CMYK 0, 0, 0, 0, filling the letter area, and then drawing it in outline mode, drawing a line along the letter border, in black, CMYK 0, 0, 0, 1; this leaves the impression of an outlined black on white letter.)
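
    To make the double drawing visible, here is a toy Java sketch that walks through such a content snippet and records the rendering mode and colors in effect at each Tj. It is not how PDFBox parses content streams: the class ToyContentStream, its naive tokenizer, and the "default" color label are all assumptions for illustration, and only the operators Tr, k, K, and Tj are interpreted.

```java
import java.util.ArrayList;
import java.util.List;

final class ToyContentStream {
    /** Returns one line per Tj, describing the state in which the glyph was drawn. */
    static List<String> run(String content) {
        List<String> draws = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        int renderingMode = 0;        // 0 = fill, 1 = stroke
        String fill = "default";      // non-stroking color, set by k
        String stroke = "default";    // stroking color, set by K

        // Naive tokenizer: separate "<03>Tj" into "<03>" and "Tj", then split on whitespace.
        for (String token : content.replace(">", "> ").trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            char first = token.charAt(0);
            boolean isOperand = Character.isDigit(first) || "-.</[".indexOf(first) >= 0;
            if (isOperand) {
                operands.add(token);
                continue;
            }
            switch (token) {
                case "Tr":
                    renderingMode = Integer.parseInt(operands.get(operands.size() - 1));
                    break;
                case "k":
                    fill = String.join(" ", operands);
                    break;
                case "K":
                    stroke = String.join(" ", operands);
                    break;
                case "Tj":
                    draws.add(operands.get(operands.size() - 1)
                            + " mode=" + renderingMode
                            + " fill=" + fill
                            + " stroke=" + stroke);
                    break;
                default:
                    break; // every other operator is ignored in this toy
            }
            operands.clear(); // operands belong to the operator just seen
        }
        return draws;
    }

    public static void main(String[] args) {
        String bold = "0 0 0 1 k BT /F0 1 Tf 24 0 0 24 66.36 729.86 Tm <03>Tj "
                + "4 M 0.72 w 0 0 Td 1 Tr 0 0 0 1 K <03>Tj ET";
        for (String draw : run(bold)) {
            System.out.println(draw);
        }
    }
}
```

    For the bold snippet this records two draws of the same glyph, one in mode 0 and one in mode 1; that pattern is what a detector has to look for.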

    Unfortunately the PDFBox PDFTextStripper does not keep track of the text rendering mode. Furthermore it explicitly drops duplicate character occurrences in approximately the same position. Thus, it is not up to the task of recognizing these artificial styles.

    If you really need to do so, you'd have to change TextPosition to also contain the rendering mode, PDFStreamEngine to add it to the generated TextPosition instances, and PDFTextStripper to not drop duplicate glyphs in processTextPosition.

    Corrections

    I wrote

    Unfortunately the PDFBox PDFTextStripper does not keep track of the text rendering mode.

    This is not entirely true: you can find the current rendering mode using getGraphicsState().getTextState().getRenderingMode(). This means that during processTextPosition you do have the rendering mode available and can store rendering mode (and color!) information for the given TextPosition somewhere, e.g. in some Map<TextPosition, ...>, for later use.

    Furthermore it explicitly drops duplicate character occurrences in approximately the same position.

    You can disable this by calling setSuppressDuplicateOverlappingText(false).

    With these two changes you should be able to make the required tests for checking for artificial bold and outline, too.

    The latter change might not even be necessary if you store and check for the styles early in processTextPosition.

    How to retrieve rendering mode and color

    As mentioned in Corrections, it is indeed possible to retrieve rendering mode and color information by collecting that information in a processTextPosition override.

    To this the OP commented that

    Always the stroking and non-stroking color is coming as Black

    This was a bit surprising at first, but after a look at PDFTextStripper.properties (from which the operators supported during text extraction are initialized), the reason became clear:

    # The following operators are not relevant to text extraction,
    # so we can silently ignore them.
    ...
    K
    k
    

    Thus color setting operators (especially those for CMYK colors as in the present document) are ignored in this context! Fortunately the implementations of these operators for the PageDrawer can be used in this context, too.

    So the following proof-of-concept shows how all required information can be retrieved.

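    // Imports assume the PDFBox 1.8.x package layout.
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    
    import org.apache.pdfbox.pdmodel.graphics.color.PDColorState;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.apache.pdfbox.util.TextPosition;
    
    public class TextWithStateStripperSimple extends PDFTextStripper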
    {
        public TextWithStateStripperSimple() throws IOException {
            super();
            setSuppressDuplicateOverlappingText(false);
            registerOperatorProcessor("K", new org.apache.pdfbox.util.operator.SetStrokingCMYKColor());
            registerOperatorProcessor("k", new org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor());
        }
    
        @Override
        protected void processTextPosition(TextPosition text)
        {
            renderingMode.put(text, getGraphicsState().getTextState().getRenderingMode());
            strokingColor.put(text, getGraphicsState().getStrokingColor());
            nonStrokingColor.put(text, getGraphicsState().getNonStrokingColor());
    
            super.processTextPosition(text);
        }
    
        Map<TextPosition, Integer> renderingMode = new HashMap<TextPosition, Integer>();
        Map<TextPosition, PDColorState> strokingColor = new HashMap<TextPosition, PDColorState>();
        Map<TextPosition, PDColorState> nonStrokingColor = new HashMap<TextPosition, PDColorState>();
    
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            writeString(text + '\n');
    
            for (TextPosition textPosition: textPositions)
            {
                StringBuilder textBuilder = new StringBuilder();
                textBuilder.append(textPosition.getCharacter())
                           .append(" - shear by ")
                           .append(textPosition.getTextPos().getValue(1, 0))
                           .append(" - ")
                           .append(textPosition.getX())
                           .append(" ")
                           .append(textPosition.getY())
                           .append(" - ")
                           .append(renderingMode.get(textPosition))
                           .append(" - ")
                           .append(toString(strokingColor.get(textPosition)))
                           .append(" - ")
                           .append(toString(nonStrokingColor.get(textPosition)))
                           .append('\n');
                writeString(textBuilder.toString());
            }
        }
    
        String toString(PDColorState colorState)
        {
            if (colorState == null)
                return "null";
            StringBuilder builder = new StringBuilder();
            for (float f: colorState.getColorSpaceValue())
            {
                builder.append(' ')
                       .append(f);
            }
    
            return builder.toString();
        }
    }
    

    Using this you get the period '.' in normal text as:

    . - shear by 0.0 - 256.5701 88.6875 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0
    

    In artificial bold text you get:

    . - shear by 0.0 - 378.86 122.140015 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0
    . - shear by 0.0 - 378.86002 122.140015 - 1 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0
    

    In artificial italics:

    . - shear by 5.10137 - 327.121 156.4123 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0
    

    And in artificial outline:

    . - shear by 0.0 - 357.25 190.25 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 0.0
    . - shear by 0.0 - 357.25 190.25 - 1 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 0.0
    

    So, there you are, all information required for recognition of those artificial styles. Now you merely have to analyze the data.

    BTW, have a look at the artificial bold case: The coordinates might not always be identical but instead merely very similar. Thus, some leniency is required for the test whether two text position objects describe the same position.
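
    As a starting point for that analysis, here is a minimal Java sketch of a pairwise check with such leniency. It is independent of PDFBox: the Glyph class is a hypothetical stand-in for the data collected per TextPosition, and the tolerance of 0.01 and the rule "fill color equals stroke color means bold, otherwise outline" are assumptions derived only from the sample outputs above.

```java
import java.util.Arrays;

final class OverprintStyles {
    /** Hypothetical record of what was collected for one TextPosition. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;
        final int renderingMode;
        final float[] stroke;
        final float[] fill;

        Glyph(String text, float x, float y, int renderingMode, float[] stroke, float[] fill) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.renderingMode = renderingMode;
            this.stroke = stroke;
            this.fill = fill;
        }
    }

    /** Assumed tolerance, in the units of the collected coordinates. */
    static final float POSITION_TOLERANCE = 0.01f;

    /** Same character at approximately the same position. */
    static boolean samePlace(Glyph a, Glyph b) {
        return a.text.equals(b.text)
                && Math.abs(a.x - b.x) <= POSITION_TOLERANCE
                && Math.abs(a.y - b.y) <= POSITION_TOLERANCE;
    }

    /** Classifies a glyph drawn in fill mode (0) and then again in stroke mode (1). */
    static String classify(Glyph first, Glyph second) {
        boolean overprinted = first.renderingMode == 0
                && second.renderingMode == 1
                && samePlace(first, second);
        if (!overprinted) {
            return "none";
        }
        // Assumes both colors are given in the same color space (CMYK here).
        return Arrays.equals(first.fill, second.stroke) ? "bold" : "outline";
    }

    public static void main(String[] args) {
        float[] black = {0f, 0f, 0f, 1f};
        float[] white = {0f, 0f, 0f, 0f};
        // Coordinates and colors copied from the sample outputs above.
        Glyph boldFill = new Glyph(".", 378.86f, 122.140015f, 0, black, black);
        Glyph boldStroke = new Glyph(".", 378.86002f, 122.140015f, 1, black, black);
        Glyph outlineFill = new Glyph(".", 357.25f, 190.25f, 0, black, white);
        Glyph outlineStroke = new Glyph(".", 357.25f, 190.25f, 1, black, white);
        System.out.println(classify(boldFill, boldStroke));
        System.out.println(classify(outlineFill, outlineStroke));
    }
}
```

    A real check would probably also compare the stroke line width with the font size and handle glyph pairs that are not adjacent in the list; both are left out here.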

  • 2020-11-28 16:53

    My solution for this problem was to create a new class that extends the PDFTextStripper class and overrides the function:

    getCharactersByArticle()

    note: PDFBox version 1.8.5

    CustomPDFTextStripper class

    public class CustomPDFTextStripper extends PDFTextStripper
    {
        public CustomPDFTextStripper() throws IOException {
            super();
        }
    
        // Make the inherited accessor public so callers can read the collected TextPositions.
        @Override
        public Vector<List<TextPosition>> getCharactersByArticle() {
            return charactersByArticle;
        }
    }
    

    This way I can parse the PDF document and then get the TextPosition objects from a custom extraction function:

    private void extractTextPosition() throws FileNotFoundException, IOException {
        PDFParser parser = new PDFParser(new FileInputStream(pdf));
        parser.parse();
        PDDocument document = parser.getPDDocument();
        try {
            StringWriter outString = new StringWriter();
            CustomPDFTextStripper stripper = new CustomPDFTextStripper();
            stripper.writeText(document, outString);
            // One list of TextPositions per article section
            for (List<TextPosition> tplist : stripper.getCharactersByArticle()) {
                for (TextPosition text : tplist) {
                    System.out.println(" String "
                            + "[x: " + text.getXDirAdj() + ", y: "
                            + text.getY() + ", height:" + text.getHeightDir()
                            + ", space: " + text.getWidthOfSpace() + ", width: "
                            + text.getWidthDirAdj() + ", yScale: " + text.getYScale() + "]"
                            + text.getCharacter());
                }
            }
        } finally {
            document.close(); // release the resources held by the parsed document
        }
    }
    

    The TextPosition objects carry a lot of information about the characters of the PDF document.

    OUTPUT:

    String [x: 168.24, y: 64.15997, height:6.061287, space: 8.9664, width: 3.4879303, yScale: 8.9664]J
    String [x: 171.69745, y: 64.15997, height:6.061287, space: 8.9664, width: 2.2416077, yScale: 8.9664]N
    String [x: 176.25777, y: 64.15997, height:6.0343876, space: 8.9664, width: 6.4737396, yScale: 8.9664]N
    String [x: 182.73778, y: 64.15997, height:4.214208, space: 8.9664, width: 3.981079, yScale: 8.9664]e .....
