iText reading PDF like pdftotext -layout?

臣服心动 2020-11-30 16:07

I'm looking for the easiest way to implement a Java solution whose output is quite similar to that of

pdftotext -layout FILE

on a Linux machine.

1 answer
  • 2020-11-30 16:36

    The problem with your approach inserting spaces like this

                final Float dist = chunk.distanceFromEndOf(lastChunk)/3;
                for(int i = 0; i<Math.round(dist); i++) {
                    sb.append(' ');
                }
    

    is that it assumes that the current position in the StringBuffer corresponds exactly to the end of lastChunk, given an assumed character width of 3 user space units. This need not be the case; in general, each addition of characters destroys any such earlier correspondence. E.g. these two lines have very different widths when set in a proportional font:

    ililili

    MWMWMWM

    while in a StringBuffer they occupy the same length.

    Thus, you have to look where chunk starts in relation to the left page border and add spaces to the buffer accordingly.

    Furthermore, your code completely ignores free space at the start of lines.
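    To illustrate the point without iText, here is a minimal, self-contained sketch of absolute positioning. PlacedText and AbsoluteColumnDemo are hypothetical names invented for this illustration; they only stand in for a chunk's text and its start coordinate.

```java
import java.util.List;

/** Hypothetical stand-in for a text chunk: its text and the x coordinate where it starts. */
final class PlacedText {
    final String text;
    final float startX;

    PlacedText(String text, float startX) {
        this.text = text;
        this.startX = startX;
    }
}

final class AbsoluteColumnDemo {
    /**
     * Builds one output line. Each chunk is padded to the column derived from
     * its own start coordinate, not from the end of the previous chunk.
     */
    static String layoutLine(List<PlacedText> chunks, float pageLeft, float charWidth) {
        StringBuilder sb = new StringBuilder();
        for (PlacedText chunk : chunks) {
            int targetColumn = (int) ((chunk.startX - pageLeft) / charWidth);
            // Pad up to the target column; if the text already ran past it, add nothing.
            while (sb.length() < targetColumn) {
                sb.append(' ');
            }
            sb.append(chunk.text);
        }
        return sb.toString();
    }
}
```

    With a character width of 6, a chunk starting at x = 60 is placed at column 10 whether the preceding text is "ililili" or "MWMWMWM", and a first chunk starting at x = 12 gets two leading spaces.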

    Your results should improve if you replace the original method getResultantText(TextChunkFilter) with this code:

    public String getResultantText(TextChunkFilter chunkFilter){
        if (DUMP_STATE) dumpState();
        
        List<TextChunk> filteredTextChunks = filterTextChunks(locationalResult, chunkFilter);
        Collections.sort(filteredTextChunks);
    
        int startOfLinePosition = 0;
        StringBuffer sb = new StringBuffer();
        TextChunk lastChunk = null;
        for (TextChunk chunk : filteredTextChunks) {
    
            if (lastChunk == null){
                insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, false);
                sb.append(chunk.text);
            } else {
                if (chunk.sameLine(lastChunk))
                {
                    if (isChunkAtWordBoundary(chunk, lastChunk))
                    {
                        insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, !startsWithSpace(chunk.text) && !endsWithSpace(lastChunk.text));
                    }
                    
                    sb.append(chunk.text);
                } else {
                    sb.append('\n');
                    startOfLinePosition = sb.length();
                    insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, false);
                    sb.append(chunk.text);
                }
            }
            lastChunk = chunk;
        }
    
        return sb.toString();       
    }
    
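    // Pads the current line with spaces so that the next chunk starts at the
    // column implied by its horizontal position on the page.
    void insertSpaces(StringBuffer sb, int startOfLinePosition, float chunkStart, boolean spaceRequired)
    {
        // Column the buffer has reached on the current line.
        int indexNow = sb.length() - startOfLinePosition;
        // Column at which the chunk should start, measured from the left page border.
        int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
        int spacesToInsert = indexToBe - indexNow;
        // Keep at least one separating space at a word boundary, even if the line has already run past the target column.
        if (spacesToInsert < 1 && spaceRequired)
            spacesToInsert = 1;
        for (; spacesToInsert > 0; spacesToInsert--)
        {
            sb.append(' ');
        }
    }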
    
    public float pageLeft = 0;
    public float fixedCharWidth = 6;
    

    pageLeft is the coordinate of the left page border. The strategy does not know it and, therefore, must be told explicitly; in many cases, though, 0 is the correct value.

    Alternatively one could use the minimum distParallelStart value of all chunks. This would cut off the left margin but would not require you to inject the exact left page border value.
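    A minimal sketch of that alternative, assuming the start coordinates have already been collected into a list. LeftMarginEstimator and minStart are hypothetical names; inside the strategy, the same loop would read distParallelStart from each filtered chunk and assign the result to pageLeft before building the text.

```java
import java.util.List;

final class LeftMarginEstimator {
    /** Returns the smallest start coordinate, or 0 if there are no chunks. */
    static float minStart(List<Float> chunkStarts) {
        if (chunkStarts.isEmpty()) {
            return 0f;
        }
        float min = Float.POSITIVE_INFINITY;
        for (float start : chunkStarts) {
            min = Math.min(min, start);
        }
        return min;
    }
}
```

    Using this value as pageLeft makes the leftmost chunk on the page start at column 0.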

    fixedCharWidth is the assumed character width. Depending on the writing in the PDF in question a different value might be more apropos. In your case a value of 3 seems to be better than my 6.
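    If guessing a value by hand is inconvenient, one possible heuristic (an assumption of this sketch, not something the code above does) is to average the rendered width per character over all chunks. CharWidthEstimator and its parameters are hypothetical; the widths would have to be derived from each chunk's start and end coordinates.

```java
final class CharWidthEstimator {
    /**
     * Average horizontal advance per character.
     * widths[i] is the rendered width of texts[i] in user space units;
     * fallback is returned when there are no characters at all.
     */
    static float averageCharWidth(String[] texts, float[] widths, float fallback) {
        float totalWidth = 0f;
        int totalChars = 0;
        for (int i = 0; i < texts.length; i++) {
            totalWidth += widths[i];
            totalChars += texts[i].length();
        }
        return totalChars == 0 ? fallback : totalWidth / totalChars;
    }
}
```

    An average only approximates a proportional font, so text made of unusually wide glyphs can still run past its target column; insertSpaces then simply adds no padding, or a single space at a word boundary.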

    There still is a lot of room for improvement in this code. E.g.

    • It assumes that there are no text chunks spanning multiple table columns. This assumption very often is correct, but I have seen weird PDFs in which the normal inter-word spacing has been implemented using separate text chunks at some offset but the inter-column spacing was represented by a single space character in a single chunk (spanning the end of one column and the start of the next)! The width of that space character has been manipulated by the word-spacing setting of the PDF graphics state.

    • It ignores different amounts of vertical space.
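    For the second point, a rough sketch of how vertical gaps could be turned into blank lines. This is an illustration under stated assumptions, not part of the code above: VerticalGapDemo is a hypothetical name, and the baseline coordinates and the nominal line height are assumed to be supplied by the caller.

```java
final class VerticalGapDemo {
    /** Number of blank lines to emit between two consecutive baselines. */
    static int blankLinesBetween(float previousBaselineY, float currentBaselineY, float lineHeight) {
        float gap = Math.abs(previousBaselineY - currentBaselineY);
        int lineSteps = Math.round(gap / lineHeight);
        // One step is the ordinary line break; anything beyond it becomes blank lines.
        return Math.max(0, lineSteps - 1);
    }

    /** Joins already laid out lines, adding blank lines for larger vertical gaps. */
    static String joinWithVerticalSpace(String[] lines, float[] baselinesY, float lineHeight) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                sb.append('\n');
                int blanks = blankLinesBetween(baselinesY[i - 1], baselinesY[i], lineHeight);
                for (int b = 0; b < blanks; b++) {
                    sb.append('\n');
                }
            }
            sb.append(lines[i]);
        }
        return sb.toString();
    }
}
```

    With a line height of 12, baselines at 700, 688 and 652 yield no blank line between the first two lines and two blank lines before the third.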
