I'm looking for the easiest way to implement a Java solution which is quite similar to the output of
pdftotext -layout FILE
on a Linux machine.
The problem with your approach of inserting spaces like this
final Float dist = chunk.distanceFromEndOf(lastChunk)/3;
for(int i = 0; i<Math.round(dist); i++) {
    sb.append(' ');
}
is that it assumes that the current position in the StringBuffer
exactly corresponds to the end of lastChunk
assuming a character width of 3 user space units. This need not be the case; in general, each addition of characters destroys any such former correspondence. E.g. these two lines have very different widths when using a proportional font:
ililili
MWMWMWM
while in a StringBuffer
they have the same length.
Thus, you have to look at where chunk
starts in relation to the left page border and add spaces to the buffer accordingly.
Furthermore, your code completely ignores free space at the start of lines.
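To illustrate why the buffer length is a poor proxy for the horizontal position, here is a minimal sketch. The glyph widths in it are made-up numbers chosen for illustration only (real values would come from the font in the PDF), and the class and method names are hypothetical:

```java
import java.util.Map;

class WidthVsLength {
    // Hypothetical glyph widths in user space units; NOT taken from a real font.
    static final Map<Character, Float> GLYPH_WIDTH = Map.of(
            'i', 2.2f, 'l', 2.2f, 'M', 8.3f, 'W', 9.4f);

    // Sums the assumed widths of all characters of the text.
    static float renderedWidth(String text) {
        float width = 0;
        for (char c : text.toCharArray()) {
            width += GLYPH_WIDTH.get(c);
        }
        return width;
    }

    public static void main(String[] args) {
        String narrow = "ililili";
        String wide = "MWMWMWM";
        // Both strings have the same number of characters ...
        System.out.println(narrow.length() + " chars, width " + renderedWidth(narrow));
        // ... but the second one ends much further to the right on the page.
        System.out.println(wide.length() + " chars, width " + renderedWidth(wide));
    }
}
```

After either string the buffer has grown by the same number of characters, so dividing the distance to the next chunk by a fixed width puts the following text at a different column depending on which letters came before.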
Your results should improve if you replace the original method getResultantText(TextChunkFilter)
with this code:
public String getResultantText(TextChunkFilter chunkFilter){
    if (DUMP_STATE) dumpState();
    List<TextChunk> filteredTextChunks = filterTextChunks(locationalResult, chunkFilter);
    Collections.sort(filteredTextChunks);
    int startOfLinePosition = 0;
    StringBuffer sb = new StringBuffer();
    TextChunk lastChunk = null;
    for (TextChunk chunk : filteredTextChunks) {
        if (lastChunk == null){
            insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, false);
            sb.append(chunk.text);
        } else {
            if (chunk.sameLine(lastChunk))
            {
                if (isChunkAtWordBoundary(chunk, lastChunk))
                {
                    insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, !startsWithSpace(chunk.text) && !endsWithSpace(lastChunk.text));
                }
                sb.append(chunk.text);
            } else {
                sb.append('\n');
                startOfLinePosition = sb.length();
                insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, false);
                sb.append(chunk.text);
            }
        }
        lastChunk = chunk;
    }
    return sb.toString();
}

void insertSpaces(StringBuffer sb, int startOfLinePosition, float chunkStart, boolean spaceRequired)
{
    int indexNow = sb.length() - startOfLinePosition;
    int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
    int spacesToInsert = indexToBe - indexNow;
    if (spacesToInsert < 1 && spaceRequired)
        spacesToInsert = 1;
    for (; spacesToInsert > 0; spacesToInsert--)
    {
        sb.append(' ');
    }
}

public float pageLeft = 0;
public float fixedCharWidth = 6;
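To make the arithmetic in insertSpaces concrete, here is a small stand-alone sketch that repeats the same calculation outside of the extraction strategy. The class name, the chunk texts and the x coordinates (36 and 120) are invented for illustration; only the padding logic mirrors the method above:

```java
class InsertSpacesDemo {
    static float pageLeft = 0;
    static float fixedCharWidth = 6;

    // Same calculation as insertSpaces above, applied to a StringBuilder.
    static void insertSpaces(StringBuilder sb, int startOfLinePosition, float chunkStart, boolean spaceRequired) {
        int indexNow = sb.length() - startOfLinePosition;                  // column we are at
        int indexToBe = (int) ((chunkStart - pageLeft) / fixedCharWidth);  // column the chunk should start at
        int spacesToInsert = indexToBe - indexNow;
        if (spacesToInsert < 1 && spaceRequired)
            spacesToInsert = 1;                                            // keep words apart even if we are "too far right"
        for (; spacesToInsert > 0; spacesToInsert--)
            sb.append(' ');
    }

    // Lays out two made-up chunks of one line at assumed x positions 36 and 120.
    static String layOut() {
        StringBuilder sb = new StringBuilder();
        insertSpaces(sb, 0, 36f, false);   // 36 / 6 = column 6
        sb.append("Name");
        insertSpaces(sb, 0, 120f, true);   // 120 / 6 = column 20
        sb.append("Value");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + layOut() + "]");
    }
}
```

The point of the design is that each chunk is positioned relative to the start of the line, not relative to the previous chunk, so rounding errors and proportional glyph widths do not accumulate along the line.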
pageLeft
is the coordinate of the left page border. The strategy does not know it and, therefore, must be told explicitly; in many cases, though, 0 is the correct value.
Alternatively one could use the minimum distParallelStart
value of all chunks. This would cut off the left margin but would not require you to inject the exact left page border value.
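A minimal sketch of that alternative might look like the following. The Chunk class is a hypothetical stand-in for the strategy's TextChunk that carries only the fields needed here, and estimatePageLeft is an invented helper name, not part of the library:

```java
import java.util.List;

class LeftMarginEstimate {
    // Hypothetical stand-in for TextChunk; only distParallelStart matters here.
    static final class Chunk {
        final float distParallelStart;
        final String text;

        Chunk(float distParallelStart, String text) {
            this.distParallelStart = distParallelStart;
            this.text = text;
        }
    }

    // Returns the smallest start coordinate of all chunks, or 0 if there are none.
    static float estimatePageLeft(List<Chunk> chunks) {
        if (chunks.isEmpty()) {
            return 0f;
        }
        float min = Float.MAX_VALUE;
        for (Chunk chunk : chunks) {
            min = Math.min(min, chunk.distParallelStart);
        }
        return min;
    }
}
```

In getResultantText one would presumably compute this once from filteredTextChunks before the loop and assign the result to pageLeft.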
fixedCharWidth
is the assumed character width. Depending on the writing in the PDF in question a different value might be more apropos. In your case a value of 3 seems to be better than my 6.
There is still a lot of room for improvement in this code. For example:
It assumes that there are no text chunks spanning multiple table columns. This assumption very often is correct, but I have seen weird PDFs in which the normal inter-word spacing has been implemented using separate text chunks at some offset but the inter-column spacing was represented by a single space character in a single chunk (spanning the end of one column and the start of the next)! The width of that space character has been manipulated by the word-spacing setting of the PDF graphics state.
It ignores different amounts of vertical space.
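As a rough idea for the second point, one could emit more than one line break when two consecutive lines are further apart than a normal line. This is only a sketch under assumptions: lineBreaksFor is an invented helper, baselineGap stands for the vertical distance between the two lines (however you obtain it from the chunks), and assumedLineHeight plays the same role for the vertical direction that fixedCharWidth plays for the horizontal one:

```java
class VerticalGapDemo {
    // Hypothetical helper: how many '\n' to append between two lines whose
    // baselines are baselineGap user space units apart, assuming a fixed line height.
    static int lineBreaksFor(float baselineGap, float assumedLineHeight) {
        int breaks = Math.round(Math.abs(baselineGap) / assumedLineHeight);
        return Math.max(1, breaks);   // a new line always needs at least one break
    }

    public static void main(String[] args) {
        System.out.println(lineBreaksFor(12f, 12f));
        System.out.println(lineBreaksFor(36f, 12f));
    }
}
```

In the else branch of getResultantText the single sb.append('\n') would then be replaced by a loop appending that many line breaks before startOfLinePosition is updated.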