Question
I am using iText (for .NET) to read PDF files. It reads the document, but wherever there is a run of whitespace it returns only a single space.
That makes it impossible to extract data by taking substrings. I want to read the data line by line with the whitespace preserved, so that I know the actual position of each piece of text, because I want to write the data into a database.
The file is a bank statement; I want to load it into a database as part of designing a reconciliation system.
Here is a screenshot of the file.
The following is the code I am using:
For page As Integer = 1 To pdfReader.NumberOfPages
    ' Dim strategy As ITextExtractionStrategy = New SimpleTextExtractionStrategy()
    Dim Strategy As ITextExtractionStrategy = New iTextSharp.text.pdf.parser.LocationTextExtractionStrategy()
    Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy)
    currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.[Default], Encoding.UTF8, Encoding.[Default].GetBytes(currentText)))
    Dim delimiterChars As Char() = {ControlChars.Lf}
    Dim lines As String() = currentText.Split(delimiterChars)
    Dim Bnk_Name As Boolean = True
    Dim Br_Name As Boolean = False
    Dim Name_acc As Boolean = False
    Dim statment As Boolean = False
    Dim Curr As Boolean = False
    Dim Open As Boolean = False
    Dim BankName = ""
    Dim Branch = ""
    Dim AccountNo = ""
    Dim CompName = ""
    Dim Currency = ""
    Dim Statement_from = ""
    Dim Statement_to = ""
    Dim Opening_Balance = ""
    Dim Closing_Balance = ""
    Dim Narration As String = ""
    For Each line As String In lines
        line.Trim()
        'BANK NAME
        If Bnk_Name Then
            If line.Trim() <> "" Then
                BankName = line.Substring(0, 21)
                Bnk_Name = False
            Else
                Bnk_Name = False
            End If
        End If
But I want the text exactly as it appears, with its whitespace preserved, so that I can read values by position.
Answer 1:
(Without seeing your PDF, this explanation is the best I can come up with.)
Your document does not contain any spaces. That is to say, the content streams of your document do not contain space characters. Instead, the instructions that render the characters simply position them so that the gap appears where a space should be.
In that case, iText has to "guess" where the spaces are. It estimates this by inserting one space every time two characters are further apart than the width of the whitespace character of the font being used.
Possibly that's where this is going wrong.
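To illustrate the guessing described above, here is a minimal sketch of a gap-based heuristic. It is not iText's actual code: the Glyph type, the GapHeuristic class, and the exact threshold are assumptions made for illustration, and the library's real rule may differ in detail. The point it shows is that any gap wider than the threshold, however large, produces exactly one space.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a positioned piece of text; not an iText type.
public sealed class Glyph
{
    public string Text;
    public float Start; // x coordinate where the text begins
    public float End;   // x coordinate where the text ends

    public Glyph(string text, float start, float end)
    {
        Text = text;
        Start = start;
        End = end;
    }
}

public static class GapHeuristic
{
    // Joins the glyphs of one line, adding at most ONE space per gap
    // that is wider than spaceWidth (assumed threshold).
    public static string Join(IList<Glyph> glyphs, float spaceWidth)
    {
        var sb = new StringBuilder();
        Glyph last = null;
        foreach (Glyph g in glyphs)
        {
            if (last != null && g.Start - last.End > spaceWidth)
                sb.Append(' ');
            sb.Append(g.Text);
            last = g;
        }
        return sb.ToString();
    }
}
```

With two glyphs that sit 180 units apart, the result still contains a single space, which matches the behaviour described in the question.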
Equally important, however: you should never rely on text positions to extract data. That approach is simply too error-prone.
Try using regular expressions combined with a better ITextExtractionStrategy. There is an implementation of ITextExtractionStrategy that allows you to specify a Rectangle. If you do it that way, you can get the content from your document in a much more precise way.
Since you're dealing with bank statements, it should be easy to extract content by combining a rectangle-based search with regular expressions (e.g. looking for things that match bank account numbers).
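As a rough sketch of the regular-expression half of that suggestion, the following uses only System.Text.RegularExpressions on text that has already been extracted. The StatementPatterns class, the "Account No" label, the 8 to 20 digit length, and the amount format with thousands separators are all assumptions; a real statement will need its own patterns. The rectangle-based filtering is left out because it depends on the iText version in use.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StatementPatterns
{
    // Hypothetical label and digit count; adjust to the real statement layout.
    private static readonly Regex AccountNo =
        new Regex(@"Account\s+No\.?\s*:?\s*(\d{8,20})", RegexOptions.IgnoreCase);

    // Assumed amount format: thousands separators and two decimals, e.g. 1,234.56.
    // The lookarounds keep the pattern from matching part of a longer number.
    private static readonly Regex Amount =
        new Regex(@"(?<![\d,.])-?\d{1,3}(?:,\d{3})*\.\d{2}(?!\d)");

    // Returns the digits following the label, or null if the label is absent.
    public static string FindAccountNo(string text)
    {
        Match m = AccountNo.Match(text);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Returns every amount-like token in the text, in order of appearance.
    public static string[] FindAmounts(string text)
    {
        MatchCollection matches = Amount.Matches(text);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            result[i] = matches[i].Value;
        return result;
    }
}
```

Because the patterns key on labels and number shapes rather than on column offsets, they keep working when the amount of whitespace between fields changes.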
Answer 2:
You use the LocationTextExtractionStrategy. As @Joris already answered, this strategy adds at most a single space character for a horizontal gap. You, on the other hand, want a number of spaces for each gap such that the result represents the horizontal layout of the text line in the PDF.
In an earlier answer I once outlined how to build such a text extraction strategy. As (a) that answer was for iText / Java and (b) the LocationTextExtractionStrategy has changed quite a bit since then, I don't consider the current question a duplicate, though.
A C# adaptation of the idea from that old answer to the current iTextSharp LocationTextExtractionStrategy, using reflection instead of class copying, would look like this:
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Text;
using iTextSharp.text.pdf.parser;

class LayoutTextExtractionStrategy : LocationTextExtractionStrategy
{
    public LayoutTextExtractionStrategy(float fixedCharWidth)
    {
        this.fixedCharWidth = fixedCharWidth;
    }

    // Non-public members of the base class, accessed via reflection.
    MethodInfo DumpStateMethod = typeof(LocationTextExtractionStrategy).GetMethod("DumpState", BindingFlags.NonPublic | BindingFlags.Instance);
    MethodInfo FilterTextChunksMethod = typeof(LocationTextExtractionStrategy).GetMethod("filterTextChunks", BindingFlags.NonPublic | BindingFlags.Instance);
    FieldInfo LocationalResultField = typeof(LocationTextExtractionStrategy).GetField("locationalResult", BindingFlags.NonPublic | BindingFlags.Instance);

    public override string GetResultantText(ITextChunkFilter chunkFilter)
    {
        if (DUMP_STATE)
        {
            //DumpState();
            DumpStateMethod.Invoke(this, null);
        }

        // List<TextChunk> filteredTextChunks = filterTextChunks(locationalResult, chunkFilter);
        object locationalResult = LocationalResultField.GetValue(this);
        List<TextChunk> filteredTextChunks = (List<TextChunk>)FilterTextChunksMethod.Invoke(this, new object[] { locationalResult, chunkFilter });
        filteredTextChunks.Sort();

        int startOfLinePosition = 0;
        StringBuilder sb = new StringBuilder();
        TextChunk lastChunk = null;
        foreach (TextChunk chunk in filteredTextChunks)
        {
            if (lastChunk == null)
            {
                InsertSpaces(sb, startOfLinePosition, chunk.Location.DistParallelStart, false);
                sb.Append(chunk.Text);
            }
            else
            {
                if (chunk.SameLine(lastChunk))
                {
                    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                    if (IsChunkAtWordBoundary(chunk, lastChunk)/* && !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text)*/)
                    {
                        //sb.Append(' ');
                        InsertSpaces(sb, startOfLinePosition, chunk.Location.DistParallelStart, !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text));
                    }
                    sb.Append(chunk.Text);
                }
                else
                {
                    sb.Append('\n');
                    startOfLinePosition = sb.Length;
                    InsertSpaces(sb, startOfLinePosition, chunk.Location.DistParallelStart, false);
                    sb.Append(chunk.Text);
                }
            }
            lastChunk = chunk;
        }
        return sb.ToString();
    }

    private bool StartsWithSpace(String str)
    {
        if (str.Length == 0) return false;
        return str[0] == ' ';
    }

    private bool EndsWithSpace(String str)
    {
        if (str.Length == 0) return false;
        return str[str.Length - 1] == ' ';
    }

    // Pads the current line with spaces until it reaches the column that corresponds to chunkStart.
    void InsertSpaces(StringBuilder sb, int startOfLinePosition, float chunkStart, bool spaceRequired)
    {
        int indexNow = sb.Length - startOfLinePosition;
        int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
        int spacesToInsert = indexToBe - indexNow;
        if (spacesToInsert < 1 && spaceRequired)
            spacesToInsert = 1;
        for (; spacesToInsert > 0; spacesToInsert--)
        {
            sb.Append(' ');
        }
    }

    public float pageLeft = 0;
    public float fixedCharWidth = 6;
}
As you can see, it requires a float constructor parameter, fixedCharWidth. This parameter represents the width on the PDF page that one character in the result string should correspond to. It is given in PDF default user space units (such a unit usually is 1/72 in). In the case of the catalog PDF the above-mentioned question was about (very small font sizes), a value of 3 was appropriate; a value of 6 appears appropriate for most common PDFs, which use fonts at larger sizes.
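To make the role of fixedCharWidth concrete, here is a small worked sketch of the column arithmetic used in InsertSpaces, together with a bounds-checked way to cut a fixed-width field out of a resulting line. The LayoutColumns class and its method names are hypothetical helpers, not part of iTextSharp, and the sample values are made up.

```csharp
using System;

public static class LayoutColumns
{
    // Same arithmetic as InsertSpaces: page x coordinate -> character column.
    public static int ColumnOf(float x, float pageLeft, float fixedCharWidth)
    {
        return (int)((x - pageLeft) / fixedCharWidth);
    }

    // Like Substring, but tolerates lines shorter than the requested field
    // instead of throwing, and trims the padding around the value.
    public static string Slice(string line, int startColumn, int length)
    {
        if (startColumn < 0 || length <= 0 || startColumn >= line.Length)
            return string.Empty;
        int available = Math.Min(length, line.Length - startColumn);
        return line.Substring(startColumn, available).Trim();
    }
}
```

With pageLeft = 0 and fixedCharWidth = 6, text that starts 72 units from the left edge lands in column 12, so a field read with Slice(line, 12, ...) lines up across rows. The strategy itself would be passed to PdfTextExtractor.GetTextFromPage in place of the LocationTextExtractionStrategy used in the question.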
Source: https://stackoverflow.com/questions/46578822/how-to-read-pdf-file-with-blank-spaces-as-it-is-line-by-line-in-c-net-using-i