Read localized PDF file using iTextSharp

Question

I am trying to read a PDF file using iTextSharp. The issue is that when the PDF is in a language other than English (Hindi or Arabic, for example), the extracted words are not correct.

I am wondering, should I install the Hindi or Arabic font on my system or do I need to do something with encoding?

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);

Edit:

Sample PDF as Image:

Extracted Text:

uxj ikfydk ifj"kn fuokZpd ukekoyh& 2011 i`"B la[;k % 1 1 1 1& & & & ftys dk uke ftys dk uke ftys dk uke ftys dk uke % % % % 0701-ò¶âã£ûæ– 2 2 2 2& & & & fudk fudk fudk fudk; ; ; ; dk uke dk uke dk uke dk uke % % % % 1-¢âî™ 3 3 3 3& & & & okMZ la okMZ la okMZ la okMZ la[ [ [ [; ; ; ;k o uke k o uke k o uke k o uke % % % % 1-¯â“¯â™®â£û¶âû §âîºâã®â£û¶âû Õô¯âû®â£û¶âû 4 4 4 4& & & & Hkkx la Hkkx la Hkkx la Hkkx la[ [ [ [; ; ; ;k k k k % % % %

Answer 1:

Do not apply any kind of Encoding conversion, because you do not know which encoding the PDF file uses.

Try the following. I think it will work.

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
text = text + currentText;

// do what you want with text
MessageBox.Show(text);
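For completeness, the snippet above can be wrapped into a self-contained helper that walks every page. This is only a sketch, assuming the iTextSharp 5.x API; the class name `PdfTextReader` and method name `ExtractText` are hypothetical and not part of the library.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextReader
{
    // Hypothetical helper: returns the text of every page exactly as
    // iTextSharp extracts it, with no Encoding conversion applied.
    public static string ExtractText(string path)
    {
        var text = new StringBuilder();
        var pdfReader = new PdfReader(path);
        try
        {
            // Page numbers in iTextSharp start at 1.
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so text does not accumulate across pages.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                text.AppendLine(currentText);
            }
        }
        finally
        {
            pdfReader.Close();
        }
        return text.ToString();
    }
}
```

A caller would then use something like `string text = PdfTextReader.ExtractText("sample.pdf");`, where the file name is a placeholder.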

If it still does not work, then you will have to install the specific font.
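To illustrate why the conversion line from the question is risky, here is a small sketch that repeats the same round trip on a Hindi string. It assumes the machine's `Encoding.Default` is a legacy code page that cannot represent Devanagari; `Encoding.ASCII` stands in for it so the result is the same on every machine. The class and method names are hypothetical. This shows only one possible failure mode and may not be the sole cause of the output in the question.

```csharp
using System;
using System.Text;

public static class EncodingRoundTripDemo
{
    // Mirrors the conversion line from the question, with the legacy
    // encoding passed in instead of Encoding.Default (an assumption
    // made so the demo behaves the same everywhere).
    public static string RoundTrip(string text, Encoding legacy)
    {
        // Lossy step: characters the legacy encoding cannot represent
        // are replaced before any conversion to UTF-8 happens.
        byte[] legacyBytes = legacy.GetBytes(text);
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        // The word "Hindi" written in Devanagari (5 UTF-16 code units).
        string hindi = "\u0939\u093F\u0902\u0926\u0940";
        string converted = RoundTrip(hindi, Encoding.ASCII);

        Console.WriteLine(converted == hindi); // False
        Console.WriteLine(converted);          // ?????
    }
}
```

The string returned by `GetTextFromPage` is already a .NET Unicode string, so a round trip like this can only lose information, never recover it.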

Source: https://stackoverflow.com/questions/10900838/read-localized-pdf-file-using-itextsharp

Tags

ASP.NET

itextsharp

hindi