iTextSharp 5.5.13.1 no data is available for encoding 10000 when extracting text from PDF

Question

I'm trying to extract text from multipage PDF documents. Almost all of them extract fine, but a couple fail with the encoding 10000 error. The only thing unique to the pages that fail is that they contain a button and form fields.

            {
                var pageNumbersToSave = new List<int>();
                for (var i = 1; i <= r.NumberOfPages; i++)
                {
                    try
                    {
                        var s       = PdfTextExtractor.GetTextFromPage( r, i, new SimpleTextExtractionStrategy() );

I also tried using a PdfStamper to flatten the form elements, but that didn't change anything:

            byte[] flatBytes;
            using ( var r = new PdfReader( pdfBytes ) )
            {
                using (var ms = new MemoryStream())
                {
                    using (var flattener = new PdfStamper(r, ms))
                    {
                        for ( var i = 1; i <= r.NumberOfPages; i++ )
                        {
                            r.AcroFields.RemoveFieldsFromPage( i );
                        }
                        flattener.FormFlattening = true;
                        flattener.Close();
                    }
                    flatBytes = ms.ToArray();
                }
            }

When testing with the stamper, the extraction code at the top read from flatBytes rather than pdfBytes.

Full exception message: No data is available for encoding 10000. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.

Stack Trace:
   at System.Text.Encoding.GetEncoding(Int32 codepage)
   at System.Text.Encoding.GetEncoding(Int32 codepage, EncoderFallback encoderFallback, DecoderFallback decoderFallback)
   at iTextSharp.text.xml.simpleparser.IanaEncodings.GetEncodingEncoding(String name)
   at iTextSharp.text.pdf.PdfEncodings.ConvertToString(Byte[] bytes, String encoding)
   at iTextSharp.text.pdf.DocumentFont.FillEncoding(PdfName encoding)
   at iTextSharp.text.pdf.DocumentFont.DoType1TT()
   at iTextSharp.text.pdf.DocumentFont.Init()
   at iTextSharp.text.pdf.DocumentFont..ctor(PRIndirectReference refFont)
   at iTextSharp.text.pdf.CMapAwareDocumentFont..ctor(PRIndirectReference refFont)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.GetFont(PRIndirectReference ind)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.SetTextFont.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.FormXObjectDoHandler.HandleXObject(PdfContentStreamProcessor processor, PdfStream stream, PdfIndirectReference refi, ICollection markedContentInfoStack)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.DisplayXObject(PdfName xobjectName)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.Do.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
   at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener, IDictionary`2 additionalContentOperators)
   at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
   at VerataParsers.ECWPdfExtractor.StripImagesFromPdf(Byte[] pdfBytes, Int32& adjustedPageCount) in C:\Users\Dell T5610\source\repos\SecureDirectMessaging\VerataParsers\ECWPdfExtractor.cs:line 85

Answer 1:

Fixed by adding the System.Text.Encoding.CodePages NuGet package and registering its encoding provider before extracting text:

            var codePages = CodePagesEncodingProvider.Instance;
            Encoding.RegisterProvider(codePages);
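
For context, here is a minimal sketch of where the registration might go. On .NET Core, code-page encodings such as 10000 (Macintosh Roman) are not available until a provider is registered, and the stack trace shows iTextSharp requesting that code page while loading a font. `EncodingSetup` and `IsCodePageAvailable` are hypothetical names introduced for illustration; only `Encoding.RegisterProvider`, `CodePagesEncodingProvider.Instance`, and `Encoding.GetEncoding` come from the framework.

```csharp
using System;
using System.Text;

// Hypothetical helper class, not part of iTextSharp or the original answer.
public static class EncodingSetup
{
    // Registers the provider from the System.Text.Encoding.CodePages package.
    // Intended to be called once at application startup.
    public static void RegisterCodePages()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Returns true if the given code page can be resolved to an Encoding.
    public static bool IsCodePageAvailable(int codePage)
    {
        try
        {
            Encoding.GetEncoding(codePage);
            return true;
        }
        catch (NotSupportedException)
        {
            // Thrown when no registered provider knows this code page.
            return false;
        }
        catch (ArgumentException)
        {
            // Thrown for code page values that are invalid or out of range.
            return false;
        }
    }
}
```

Calling `EncodingSetup.RegisterCodePages()` once at startup, before the first `PdfTextExtractor.GetTextFromPage` call, should be enough; `IsCodePageAvailable(10000)` can then serve as a quick sanity check that the provider was picked up.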

Source: https://stackoverflow.com/questions/62227368/itextsharp-5-5-13-1-no-data-is-available-for-encoding-10000-when-extracting-text

Tags

itext

.net-core-3.1