Question
I'm trying to extract text from a multipage PDF document with iTextSharp. Almost all documents extract fine, but a couple fail with a "No data is available for encoding 10000" error. The only distinctive thing about the pages that fail is that they contain a button and form fields.
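    using ( var r = new PdfReader( pdfBytes ) )
    {
        var pageNumbersToSave = new List<int>();
        for (var i = 1; i <= r.NumberOfPages; i++)
        {
            try
            {
                // This call throws on the pages that contain the button and form fields.
                var s = PdfTextExtractor.GetTextFromPage( r, i, new SimpleTextExtractionStrategy() );
                // (the rest of this snippet is cut off in the source)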
I also tried using a PdfStamper to flatten the form elements, but that didn't change anything:
    byte[] flatBytes;
    using ( var r = new PdfReader( pdfBytes ) )
    {
        using (var ms = new MemoryStream())
        {
            using (var flattener = new PdfStamper(r, ms))
            {
                for ( var i = 1; i <= r.NumberOfPages; i++ )
                {
                    r.AcroFields.RemoveFieldsFromPage( i );
                }
                flattener.FormFlattening = true;
                flattener.Close();
            }
            flatBytes = ms.ToArray();
        }
    }
When testing with the stamper, I passed flatBytes rather than pdfBytes to the first snippet.
Full exception message: No data is available for encoding 10000. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.
Stack Trace:
at System.Text.Encoding.GetEncoding(Int32 codepage)
at System.Text.Encoding.GetEncoding(Int32 codepage, EncoderFallback encoderFallback, DecoderFallback decoderFallback)
at iTextSharp.text.xml.simpleparser.IanaEncodings.GetEncodingEncoding(String name)
at iTextSharp.text.pdf.PdfEncodings.ConvertToString(Byte[] bytes, String encoding)
at iTextSharp.text.pdf.DocumentFont.FillEncoding(PdfName encoding)
at iTextSharp.text.pdf.DocumentFont.DoType1TT()
at iTextSharp.text.pdf.DocumentFont.Init()
at iTextSharp.text.pdf.DocumentFont..ctor(PRIndirectReference refFont)
at iTextSharp.text.pdf.CMapAwareDocumentFont..ctor(PRIndirectReference refFont)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.GetFont(PRIndirectReference ind)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.SetTextFont.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.FormXObjectDoHandler.HandleXObject(PdfContentStreamProcessor processor, PdfStream stream, PdfIndirectReference refi, ICollection markedContentInfoStack)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.DisplayXObject(PdfName xobjectName)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.Do.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener, IDictionary`2 additionalContentOperators)
at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
at VerataParsers.ECWPdfExtractor.StripImagesFromPdf(Byte[] pdfBytes, Int32& adjustedPageCount) in C:\Users\Dell T5610\source\repos\SecureDirectMessaging\VerataParsers\ECWPdfExtractor.cs:line 85
Answer 1:
Fixed by adding the System.Text.Encoding.CodePages NuGet package and registering its code page provider before extracting any text:
var codePages = CodePagesEncodingProvider.Instance;
Encoding.RegisterProvider(codePages);
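To check the registration on its own, without involving a PDF, the following is a minimal sketch. It assumes the failure comes from the runtime not shipping legacy code pages by default (as on .NET Core and later), and that code page 10000 is the Mac Roman encoding a font on the failing pages asks for. The class name EncodingSetup and the method RegisterCodePages are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

public static class EncodingSetup
{
    // Hypothetical helper: call once at application startup,
    // before any iTextSharp text extraction runs.
    public static void RegisterCodePages()
    {
        // Makes legacy code pages (including 10000) available
        // to Encoding.GetEncoding.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    public static void Main()
    {
        RegisterCodePages();

        // Same lookup that fails at the top of the stack trace in the question.
        // On runtimes without the provider registered, this call throws.
        Encoding enc = Encoding.GetEncoding(10000);
        Console.WriteLine(enc.CodePage);
    }
}
```

Registering the provider more than once should be harmless, but doing it a single time at startup keeps the setup in one place.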
Source: https://stackoverflow.com/questions/62227368/itextsharp-5-5-13-1-no-data-is-available-for-encoding-10000-when-extracting-text