Question
iTextSharp 4.1.6 is the last version licensed under the LGPL, so it can be used for commercial purposes without paying license fees.
It might be useful, for others as well as for me, to know how to extract text with this version.
Does anyone have an idea?
Answer 1:
I had to hack this together manually, as I was in the same boat as you. Hopefully this will help. It's probably not perfect, but I was able to get the text I needed out of the document this way. fileName is a string variable/parameter holding the path to the PDF file.
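// Requires: using System.Text; using iTextSharp.text.pdf;
var reader = new PdfReader(fileName);
StringBuilder sb = new StringBuilder();
try
{
    // Page numbers are 1-based.
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        var cpage = reader.GetPageN(page);
        var content = cpage.Get(PdfName.CONTENTS);
        // Assumes /Contents is a single indirect reference to a stream.
        // If it is an array of streams (or missing), this cast throws.
        var ir = (PRIndirectReference)content;
        var value = reader.GetPdfObject(ir.Number);
        if (value.IsStream())
        {
            PRStream stream = (PRStream)value;
            // Decode (e.g. inflate) the raw content stream bytes.
            var streamBytes = PdfReader.GetStreamBytes(stream);
            var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(streamBytes));
            try
            {
                while (tokenizer.NextToken())
                {
                    // Keep only string literals, i.e. the operands of the
                    // text-showing operators; everything else is skipped.
                    if (tokenizer.TokenType == PRTokeniser.TK_STRING)
                    {
                        string str = tokenizer.StringValue;
                        sb.Append(str);
                    }
                }
            }
            finally
            {
                tokenizer.Close();
            }
        }
    }
}
finally
{
    reader.Close();
}
return sb.ToString();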
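The snippet above resolves the page's /Contents entry by hand, which only works when that entry is a single indirect reference. A possible variation, offered as a sketch rather than a tested drop-in, is to let the reader fetch the page content itself via PdfReader.GetPageContent, which as far as I know is available in 4.1.6 and also copes with /Contents being an array of streams. The class and method names (PdfTextSketch, ExtractText) are made up for this illustration, and the line break appended after each page is my own addition.

```csharp
using System.Text;
using iTextSharp.text.pdf;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class PdfTextSketch
{
    // Collects every string literal found in the content of each page.
    // The caller owns the reader and is responsible for closing it.
    public static string ExtractText(PdfReader reader)
    {
        var sb = new StringBuilder();
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            // Assumption: GetPageContent returns the decoded content bytes
            // of the page, whether /Contents is one stream or an array.
            byte[] pageBytes = reader.GetPageContent(page);
            var tokenizer = new PRTokeniser(pageBytes);
            try
            {
                while (tokenizer.NextToken())
                {
                    if (tokenizer.TokenType == PRTokeniser.TK_STRING)
                    {
                        sb.Append(tokenizer.StringValue);
                    }
                }
            }
            finally
            {
                tokenizer.Close();
            }
            // Separate pages with a line break (not in the original answer).
            sb.AppendLine();
        }
        return sb.ToString();
    }
}
```

Like the original, this concatenates string tokens without looking at the operators between them, so word spacing and reading order are not reconstructed, and text drawn with fonts that use custom encodings will likely come out garbled.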
Source: https://stackoverflow.com/questions/10143098/how-to-extract-text-with-itextsharp-4-1-6