I am using PDFBox to validate a PDF document. There is a requirement to check for the following types of text present in a PDF
In theory one should start this by deriving a class from PDFTextStripper and overriding its method:
    /**
     * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
     * and just calls {@link #writeString(String)}.
     *
     * @param text The text to write to the stream.
     * @param textPositions The TextPositions belonging to the text.
     * @throws IOException If there is an error when writing the text.
     */
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        writeString(text);
    }
Your override should then use List<TextPosition> textPositions instead of the String text; each TextPosition essentially represents a single letter together with information on the graphics state active when that letter was drawn.
Unfortunately the textPositions list does not contain the correct contents in the current version 1.8.3. E.g. for the line "This is normal text." from your PDF, the method writeString is called four times, once each for the strings "This", " is", " normal", and " text.", but each time the textPositions list contains the TextPosition instances for the letters of the last string, " text."
This actually proved to have already been recognized as PDFBox issue PDFBOX-1804 which meanwhile has been resolved as fixed for versions 1.8.4 and 2.0.0.
That being said, as soon as you have a PDFBox version in which this is fixed, you can check for some artificial styles as follows:
Artificial italic text is created like this in the page content:
BT
/F0 1 Tf
24 0 5.10137 24 66 695.5877 Tm
0 Tr
[<03>]TJ
...
The relevant part happens in setting the text matrix Tm. The 5.10137 is a factor by which the text is sheared.
When you check a TextPosition textPosition as indicated above, you can query this value using
textPosition.getTextPos().getValue(1, 0)
If this value is significantly greater than 0.0, you have artificial italics. If it is significantly less than 0.0, you have artificial backwards italics.
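The "significantly greater or less than 0.0" test can be sketched in plain Java as follows. This is only an illustration, not PDFBox API: the class ShearClassifier, its method names, and the tolerance are hypothetical. It assumes c is the matrix value at (1, 0) and d is the value at (1, 1), i.e. the vertical scale, and that d is positive. Dividing by d makes the threshold independent of the font size.

```java
/** Hypothetical helper, not part of PDFBox: classifies the shear of a text matrix. */
final class ShearClassifier {
    enum Slant { UPRIGHT, ITALIC, BACKWARD_ITALIC }

    /**
     * @param c         matrix value at (1, 0), the shear entry (5.10137 in the example above)
     * @param d         matrix value at (1, 1), the vertical scale (24 in the example above)
     * @param tolerance how far the ratio c/d may deviate from 0 and still count as upright
     */
    static Slant classify(float c, float d, float tolerance) {
        if (d == 0f) {
            throw new IllegalArgumentException("vertical scale must not be zero");
        }
        float ratio = c / d; // shear relative to the glyph height
        if (ratio > tolerance) {
            return Slant.ITALIC;
        }
        if (ratio < -tolerance) {
            return Slant.BACKWARD_ITALIC;
        }
        return Slant.UPRIGHT;
    }

    /** Slant angle in degrees implied by the shear, measured from the vertical. */
    static double slantAngleDegrees(float c, float d) {
        return Math.toDegrees(Math.atan(c / d));
    }
}
```

For the matrix above, classify(5.10137f, 24f, 0.01f) yields ITALIC, and the implied slant angle is about 12 degrees. The tolerance of 0.01 is an arbitrary choice for this sketch and would need tuning against real documents.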
Artificial bold and artificial outline text print each letter twice using different rendering modes; e.g. the capital 'T', in case of bold:
0 0 0 1 k
...
BT
/F0 1 Tf
24 0 0 24 66.36 729.86 Tm
<03>Tj
4 M 0.72 w
0 0 Td
1 Tr
0 0 0 1 K
<03>Tj
ET
(i.e. first drawing the letter in regular mode, filling the letter area, and then drawing it in outline mode, drawing a line along the letter border, both in black, CMYK 0, 0, 0, 1; this leaves the impression of a thicker letter.)
and in case of outline:
BT
/F0 1 Tf
24 0 0 24 66 661.75 Tm
0 0 0 0 k
<03>Tj
/GS1 gs
4 M 0.288 w
0 0 Td
1 Tr
0 0 0 1 K
<03>Tj
ET
(i.e. first drawing the letter in regular mode white, CMYK 0, 0, 0, 0, filling the letter area, and then drawing it in outline mode, drawing a line along the letter border, in black, CMYK 0, 0, 0, 1; this leaves the impression of an outlined black on white letter.)
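To make the role of the Tr operator in the two snippets above concrete, here is a deliberately naive sketch that reports which text rendering mode is in effect at each text-showing operator (Tj or TJ). It is not a content stream parser: the class RenderingModeTrace and its method are hypothetical, tokens are assumed to be separated by whitespace, and string operands are assumed to contain no spaces, which holds for the snippets above but not for PDF content in general.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, naive tracer for the text rendering mode; for illustration only. */
final class RenderingModeTrace {

    /** Returns the rendering mode in effect at each Tj/TJ operator, in order of appearance. */
    static List<Integer> modesAtShowOperators(String content) {
        List<Integer> modes = new ArrayList<Integer>();
        int currentMode = 0;    // the default text rendering mode is 0 (fill)
        String previous = null; // the token before the current one, i.e. the operand of Tr
        for (String token : content.trim().split("\\s+")) {
            if (token.equals("Tr") && previous != null) {
                currentMode = Integer.parseInt(previous);
            } else if (token.endsWith("Tj") || token.endsWith("TJ")) {
                // also matches operands glued to the operator, e.g. "<03>Tj"
                modes.add(currentMode);
            }
            previous = token;
        }
        return modes;
    }
}
```

Fed the artificial bold snippet above, this reports mode 0 for the first <03>Tj and mode 1 for the second one, which is exactly the fill-then-stroke pattern described.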
Unfortunately the PDFBox PDFTextStripper does not keep track of the text rendering mode. Furthermore it explicitly drops duplicate character occurrences in approximately the same position. Thus, it is not up to the task of recognizing these artificial styles.
If you really need to do so, you'd have to change TextPosition to also contain the rendering mode, PDFStreamEngine to add it to the generated TextPosition instances, and PDFTextStripper to not drop duplicate glyphs in processTextPosition.
I wrote:
"Unfortunately the PDFBox PDFTextStripper does not keep track of the text rendering mode."
This is not entirely true: you can find the current rendering mode using getGraphicsState().getTextState().getRenderingMode(). This means that during processTextPosition you do have the rendering mode available and can store rendering mode (and color!) information for the given TextPosition somewhere, e.g. in some Map<TextPosition, ...>, for later use.
"Furthermore it explicitly drops duplicate character occurrences in approximately the same position."
You can disable this by calling setSuppressDuplicateOverlappingText(false).
With these two changes you should be able to make the required tests for checking for artificial bold and outline, too.
The latter change might not even be necessary if you store and check for the styles early in processTextPosition.
As mentioned in Corrections, it is indeed possible to retrieve rendering mode and color information by collecting that information in a processTextPosition override.
To this the OP commented:
"Always the stroking and non-stroking color is coming as Black"
This was a bit surprising at first, but after looking at PDFTextStripper.properties (from which the operators supported during text extraction are initialized), the reason became clear:
# The following operators are not relevant to text extraction,
# so we can silently ignore them.
...
K
k
Thus color setting operators (especially those for CMYK colors as in the present document) are ignored in this context! Fortunately the implementations of these operators for the PageDrawer can be used in this context, too.
So the following proof-of-concept shows how all required information can be retrieved.
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// PDFBox 1.8.x package names
import org.apache.pdfbox.pdmodel.graphics.color.PDColorState;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class TextWithStateStripperSimple extends PDFTextStripper
{
    public TextWithStateStripperSimple() throws IOException {
        super();
        // keep glyphs drawn twice at the same position (artificial bold and outline)
        setSuppressDuplicateOverlappingText(false);
        // the CMYK color operators are ignored by default during text extraction
        registerOperatorProcessor("K", new org.apache.pdfbox.util.operator.SetStrokingCMYKColor());
        registerOperatorProcessor("k", new org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor());
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
        // remember the graphics state that is active while this glyph is drawn
        renderingMode.put(text, getGraphicsState().getTextState().getRenderingMode());
        strokingColor.put(text, getGraphicsState().getStrokingColor());
        nonStrokingColor.put(text, getGraphicsState().getNonStrokingColor());
        super.processTextPosition(text);
    }

    Map<TextPosition, Integer> renderingMode = new HashMap<TextPosition, Integer>();
    Map<TextPosition, PDColorState> strokingColor = new HashMap<TextPosition, PDColorState>();
    Map<TextPosition, PDColorState> nonStrokingColor = new HashMap<TextPosition, PDColorState>();

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        writeString(text + '\n');
        // one output line per glyph: shear, position, rendering mode, stroking and non-stroking color
        for (TextPosition textPosition: textPositions)
        {
            StringBuilder textBuilder = new StringBuilder();
            textBuilder.append(textPosition.getCharacter())
                       .append(" - shear by ")
                       .append(textPosition.getTextPos().getValue(1, 0))
                       .append(" - ")
                       .append(textPosition.getX())
                       .append(" ")
                       .append(textPosition.getY())
                       .append(" - ")
                       .append(renderingMode.get(textPosition))
                       .append(" - ")
                       .append(toString(strokingColor.get(textPosition)))
                       .append(" - ")
                       .append(toString(nonStrokingColor.get(textPosition)))
                       .append('\n');
            writeString(textBuilder.toString());
        }
    }

    // formats the color components separated by spaces
    String toString(PDColorState colorState)
    {
        if (colorState == null)
            return "null";
        StringBuilder builder = new StringBuilder();
        for (float f: colorState.getColorSpaceValue())
        {
            builder.append(' ')
                   .append(f);
        }
        return builder.toString();
    }
}
Using this you get the period '.' in normal text as:
. - shear by 0.0 - 256.5701 88.6875 - 0 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 1.0
In artificial bold text you get:
. - shear by 0.0 - 378.86 122.140015 - 0 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 1.0
. - shear by 0.0 - 378.86002 122.140015 - 1 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 1.0
In artificial italics:
. - shear by 5.10137 - 327.121 156.4123 - 0 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 1.0
And in artificial outline:
. - shear by 0.0 - 357.25 190.25 - 0 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 0.0
. - shear by 0.0 - 357.25 190.25 - 1 - 0.0 0.0 0.0 1.0 - 0.0 0.0 0.0 0.0
So, there you are, all information required for recognition of those artificial styles. Now you merely have to analyze the data.
BTW, have a look at the artificial bold case: The coordinates might not always be identical but instead merely very similar. Thus, some leniency is required for the test whether two text position objects describe the same position.
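The analysis of that data could start from something like the following sketch. It is not part of the proof-of-concept above: the class ArtificialStyleCheck, the Glyph holder, and the epsilon value are hypothetical, and the rule "same fill and stroke color means bold, different colors mean outline" is an assumption derived only from the two samples shown here. The glyph values would be filled from the TextPosition and the three maps collected in processTextPosition.

```java
import java.util.Arrays;

/** Hypothetical analysis helper; not part of PDFBox. */
final class ArtificialStyleCheck {
    enum Style { NONE, BOLD, OUTLINE }

    /** Plain holder for the per-glyph data printed above. */
    static final class Glyph {
        final String character;
        final float x;
        final float y;
        final int renderingMode;
        final float[] stroking;    // e.g. CMYK components of the stroking color
        final float[] nonStroking; // e.g. CMYK components of the fill color

        Glyph(String character, float x, float y, int renderingMode,
              float[] stroking, float[] nonStroking) {
            this.character = character;
            this.x = x;
            this.y = y;
            this.renderingMode = renderingMode;
            this.stroking = stroking;
            this.nonStroking = nonStroking;
        }
    }

    /** Lenient position comparison: coordinates may differ slightly between the two passes. */
    static boolean samePosition(Glyph a, Glyph b, float epsilon) {
        return Math.abs(a.x - b.x) <= epsilon && Math.abs(a.y - b.y) <= epsilon;
    }

    /** Classifies two consecutively drawn glyphs as one artificially styled letter, if they are. */
    static Style classifyPair(Glyph first, Glyph second, float epsilon) {
        boolean sameLetter = first.character.equals(second.character)
                && samePosition(first, second, epsilon);
        // mode 0 fills the glyph, mode 1 strokes its outline
        boolean fillThenStroke = first.renderingMode == 0 && second.renderingMode == 1;
        if (!sameLetter || !fillThenStroke) {
            return Style.NONE;
        }
        // assumption: stroke in the fill color thickens the letter, any other color outlines it
        return Arrays.equals(first.nonStroking, second.stroking) ? Style.BOLD : Style.OUTLINE;
    }
}
```

With the values printed above and an epsilon of 0.01, the two period glyphs at 378.86 / 378.86002 classify as BOLD and the two at 357.25 as OUTLINE. A production check would probably also compare the colors with some tolerance and look at the stroke width.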
My solution for this problem was to create a new class that extends the PDFTextStripper class and overrides the function getCharactersByArticle().
Note: PDFBox version 1.8.5
CustomPDFTextStripper class:
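import java.io.IOException;
import java.util.List;
import java.util.Vector;

// PDFBox 1.8.x package names
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class CustomPDFTextStripper extends PDFTextStripper
{
    public CustomPDFTextStripper() throws IOException {
        super();
    }

    // exposes the collected text positions to callers
    public Vector<List<TextPosition>> getCharactersByArticle(){
        return charactersByArticle;
    }
}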
This way I can parse the PDF document and then get the TextPosition objects from a custom extraction function:
// "pdf" is a field of the enclosing class holding the path of the PDF file
private void extractTextPosition() throws FileNotFoundException, IOException {
    PDFParser parser = new PDFParser(new FileInputStream(pdf));
    parser.parse();
    PDDocument document = parser.getPDDocument();
    try {
        StringWriter outString = new StringWriter();
        CustomPDFTextStripper stripper = new CustomPDFTextStripper();
        stripper.writeText(document, outString);
        Vector<List<TextPosition>> vectorlistoftps = stripper.getCharactersByArticle();
        for (int i = 0; i < vectorlistoftps.size(); i++) {
            List<TextPosition> tplist = vectorlistoftps.get(i);
            for (int j = 0; j < tplist.size(); j++) {
                TextPosition text = tplist.get(j);
                System.out.println(" String "
                        + "[x: " + text.getXDirAdj() + ", y: "
                        + text.getY() + ", height:" + text.getHeightDir()
                        + ", space: " + text.getWidthOfSpace() + ", width: "
                        + text.getWidthDirAdj() + ", yScale: " + text.getYScale() + "]"
                        + text.getCharacter());
            }
        }
    } finally {
        // release the resources held by the parsed document
        document.close();
    }
}
TextPosition objects contain a lot of information about the characters of the PDF document.
OUTPUT:
 String [x: 168.24, y: 64.15997, height:6.061287, space: 8.9664, width: 3.4879303, yScale: 8.9664]J
 String [x: 171.69745, y: 64.15997, height:6.061287, space: 8.9664, width: 2.2416077, yScale: 8.9664]N
 String [x: 176.25777, y: 64.15997, height:6.0343876, space: 8.9664, width: 6.4737396, yScale: 8.9664]N
 String [x: 182.73778, y: 64.15997, height:4.214208, space: 8.9664, width: 3.981079, yScale: 8.9664]e .....