How to read a PDF file with blank spaces (as is) line by line in C#.NET using iTextSharp


暖寄归人 2020-12-12 08:02

I am using iText (for .NET) to read PDF files. It reads the document, but wherever the text contains a run of whitespace it returns only a single space.

That makes it impossible to extract the data.

2 answers

有刺的猬 (original poster)

2020-12-12 09:01

You are using the LocationTextExtractionStrategy. As @Joris already answered, this strategy adds at most a single space character for a horizontal gap. You, on the other hand, want a number of spaces for each gap such that the result reflects the horizontal layout of the text line in the PDF.
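To make the difference concrete, here is a minimal sketch of the layout idea that does not depend on iText at all. It pads each piece of text with spaces until the output reaches the column that corresponds to the piece's horizontal position. The names `LayoutSketch`, `Render` and `charWidth` are made up for this illustration, and the positions are invented sample values, not data from a real PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class LayoutSketch
{
    // Hypothetical helper: renders one line of (x position, text) pairs,
    // padding with spaces so each piece starts near column x / charWidth.
    public static string Render(IEnumerable<KeyValuePair<float, string>> chunks, float charWidth)
    {
        var sb = new StringBuilder();
        foreach (var chunk in chunks)
        {
            int targetColumn = (int)(chunk.Key / charWidth);
            // Pad up to the target column; text already written is never removed.
            while (sb.Length < targetColumn)
                sb.Append(' ');
            sb.Append(chunk.Value);
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Invented sample positions, in the same units as charWidth.
        var line = new List<KeyValuePair<float, string>>
        {
            new KeyValuePair<float, string>(0f, "Item"),
            new KeyValuePair<float, string>(72f, "Qty"),
            new KeyValuePair<float, string>(144f, "Price"),
        };
        Console.WriteLine(Render(line, 6f));
    }
}
```

With a character width of 6, "Qty" starts at column 12 and "Price" at column 24, so the gaps in the output are proportional to the gaps between the positions instead of collapsing to one space.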

In this answer I once outlined how to build such a text extraction strategy. Because (a) that answer was for iText / Java and (b) the LocationTextExtractionStrategy has changed quite a bit since then, I don't consider the current question a duplicate, though.

A C# adaptation of the idea from that old answer to the current iTextSharp LocationTextExtractionStrategy, using reflection instead of class copying, would look like this:

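using System;
using System.Collections.Generic;
using System.Reflection;
using System.Text;
using iTextSharp.text.pdf.parser;

class LayoutTextExtractionStrategy : LocationTextExtractionStrategy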
{
    public LayoutTextExtractionStrategy(float fixedCharWidth)
    {
        this.fixedCharWidth = fixedCharWidth;
    }

    MethodInfo DumpStateMethod = typeof(LocationTextExtractionStrategy).GetMethod("DumpState", BindingFlags.NonPublic | BindingFlags.Instance);
    MethodInfo FilterTextChunksMethod = typeof(LocationTextExtractionStrategy).GetMethod("filterTextChunks", BindingFlags.NonPublic | BindingFlags.Instance);
    FieldInfo LocationalResultField = typeof(LocationTextExtractionStrategy).GetField("locationalResult", BindingFlags.NonPublic | BindingFlags.Instance);

    public override string GetResultantText(ITextChunkFilter chunkFilter)
    {
        if (DUMP_STATE)
        {
            //DumpState();
            DumpStateMethod.Invoke(this, null);
        }

        // List<TextChunk> filteredTextChunks = filterTextChunks(locationalResult, chunkFilter);
        object locationalResult = LocationalResultField.GetValue(this);
        List<TextChunk> filteredTextChunks = (List<TextChunk>)FilterTextChunksMethod.Invoke(this, new object[] { locationalResult, chunkFilter });
        filteredTextChunks.Sort();

        int startOfLinePosition = 0;
        StringBuilder sb = new StringBuilder();
        TextChunk lastChunk = null;
        foreach (TextChunk chunk in filteredTextChunks)
        {

            if (lastChunk == null)
            {
                InsertSpaces(sb, startOfLinePosition, chunk.Location.DistParallelStart, false);
                sb.Append(chunk.Text);
            }
            else
            {
                if (chunk.SameLine(lastChunk))
                {
                    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                    if (IsChunkAtWordBoundary(chunk, lastChunk)/* && !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text)*/)
                    {
                        //sb.Append(' ');
                        InsertSpaces(sb, startOfLinePosition, chunk.Location.DistParallelStart, !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text));
                    }

                    sb.Append(chunk.Text);
                }
                else
                {
                    sb.Append('\n');
                    startOfLinePosition = sb.Length;
                    InsertSpaces(sb, startOfLinePosition, chunk.Location.DistParallelStart, false);
                    sb.Append(chunk.Text);
                }
            }
            lastChunk = chunk;
        }

        return sb.ToString();
    }

    private bool StartsWithSpace(String str)
    {
        if (str.Length == 0) return false;
        return str[0] == ' ';
    }

    private bool EndsWithSpace(String str)
    {
        if (str.Length == 0) return false;
        return str[str.Length - 1] == ' ';
    }

    void InsertSpaces(StringBuilder sb, int startOfLinePosition, float chunkStart, bool spaceRequired)
    {
        int indexNow = sb.Length - startOfLinePosition;
        int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
        int spacesToInsert = indexToBe - indexNow;
        if (spacesToInsert < 1 && spaceRequired)
            spacesToInsert = 1;
        for (; spacesToInsert > 0; spacesToInsert--)
        {
            sb.Append(' ');
        }
    }

    public float pageLeft = 0;
    public float fixedCharWidth = 6;
}

As you can see, it requires a float constructor parameter, fixedCharWidth. This parameter is the width on the PDF page that one character in the result string should correspond to. It is given in PDF default user space units (such a unit is usually ¹/₇₂ in). For the catalog PDF that the above-mentioned question was about (very small font sizes), a value of 3 was appropriate; a value of 6 appears appropriate for most common PDFs, which use fonts at larger sizes.
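For completeness, a usage sketch might look like the following. It assumes iTextSharp 5.x and that the LayoutTextExtractionStrategy class above is compiled into the same project. The class name `LayoutExtractionDemo`, the fallback path `input.pdf` and the choice of 6 for fixedCharWidth are placeholders for illustration, so adjust them to your document.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static class LayoutExtractionDemo
{
    static void Main(string[] args)
    {
        // Placeholder path; pass the real file name as the first argument.
        string path = args.Length > 0 ? args[0] : "input.pdf";
        var reader = new PdfReader(path);
        try
        {
            // Page numbers in iTextSharp are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so chunks from earlier pages
                // do not leak into the current page's text.
                var strategy = new LayoutTextExtractionStrategy(6f);
                string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);

                // The strategy separates lines with '\n', so this reads line by line.
                foreach (string line in text.Split('\n'))
                    Console.WriteLine(line);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

If columns run together or drift apart in the output, try a smaller or larger fixedCharWidth. Viewing the result in a monospaced font makes the alignment easier to judge.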
