Question
I want to extract text typed in Hindi from a PDF document. I've attached an image of the sample page I am dealing with.
I've tried using pdfminer to get the text from it, but the result is garbled (maybe due to the Hindi fonts).
Now I am thinking of splitting the page into three parts, splitting each part in two (separating the English and Hindi text), and then running OCR on each half to get the text. The only issue is that I don't know the font used for the Hindi text, so I might get garbled text again.
My questions are: is there a better way to deal with Hindi fonts? How can I find the font name?
Answer 1:
I tried the following on your PDF and it appears to extract a lot of the text. The layout may not be ideal, but I am not able to tell.
# Written for Python 3 with pdfminer.six; on Python 2, import StringIO from cStringIO instead.
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # collects the extracted text
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    with open(path, 'rb') as fp:
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        caching = True
        pagenos = set()  # an empty set means "all pages"
        for page in PDFPage.get_pages(fp, pagenos, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text


print(convert_pdf_to_txt("Electoral roll - Faizabad.pdf"))
The text is output as UTF-8, so make sure your console is capable of displaying it. For example:
भभग ससखखभककल मतदभतभ 11 1.रजजरभ आसशशकपपथममक ववददपलद रजजरप - सपमपनद779 420 359 0 779ननरभरचक नभमभरलल 2014 0S24उततर पददशवरधभन सभभ कदत कक ससखखभ ,नभम र आरकण सससनत:ललक सभभ कदत कक ससखखभ ,नभम र आरकण सससनत: 1 . पकनरलकण कभ वरररणपकनरलकण कभ ररर : 2014अहतभर कक नतथस: 01.01.2014पकनरलकण कभ सररप: ससककपत पकनरलकणपकभशन कक नतथस: 01.10.2013पकनरमकदण कक नतथस : 15.03.2014
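If the console cannot display UTF-8, one way around it is to write the extracted text to a file with an explicit encoding and open that file in an editor that has a Devanagari font. The following is only a minimal sketch of that idea: the helper names save_text_utf8 and load_text_utf8 are made up for illustration and are not part of pdfminer.

```python
import io


def save_text_utf8(text, out_path):
    """Write text to out_path as UTF-8; return the number of characters written."""
    # Naming the encoding explicitly avoids depending on the console or locale default.
    with io.open(out_path, "w", encoding="utf-8") as fh:
        return fh.write(text)


def load_text_utf8(path):
    """Read a UTF-8 encoded text file back into a string."""
    with io.open(path, "r", encoding="utf-8") as fh:
        return fh.read()
```

With the function from above, that would be something like save_text_utf8(convert_pdf_to_txt("Electoral roll - Faizabad.pdf"), "roll.txt"), where the output file name is just an example.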
To determine the list of fonts the PDF uses, you can simply load it into a PDF reader such as Adobe Reader or Foxit Reader and select Properties from the File menu. From there you should be able to select Fonts. When I tried this with Foxit Reader, it displayed the following fonts:
Mangal-Bold
Arial
Mangal
Arial Bold
Times-New-Roman-Bold
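The font names can probably also be read from Python instead of a PDF reader. The sketch below is an assumption on my part rather than something I ran against this PDF: it walks each page's resource dictionary with pdfminer and collects the BaseFont entries. The function names list_pdf_fonts and strip_subset_tag are hypothetical, and fonts that are only referenced from nested form XObjects would be missed.

```python
import re

# Embedded font subsets are named with a six-letter uppercase tag and "+",
# for example "ABCDEF+Mangal"; this pattern matches that tag.
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def strip_subset_tag(base_font):
    """Drop a leading subset tag such as 'ABCDEF+' from a font name, if present."""
    return _SUBSET_TAG.sub("", base_font)


def list_pdf_fonts(path):
    """Return the sorted BaseFont names found in the page resources of a PDF."""
    # Imported inside the function so strip_subset_tag works without pdfminer.
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdftypes import resolve1
    from pdfminer.psparser import literal_name

    names = set()
    with open(path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            resources = resolve1(page.resources) or {}
            fonts = resolve1(resources.get("Font")) or {}
            for ref in fonts.values():
                spec = resolve1(ref)  # font dictionary behind the reference
                base = spec.get("BaseFont") if isinstance(spec, dict) else None
                if base is not None:
                    names.add(strip_subset_tag(literal_name(base)))
    return sorted(names)
```

Calling list_pdf_fonts("Electoral roll - Faizabad.pdf") should then return a list comparable to the one Foxit Reader shows, though the exact spelling of the names may differ.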
Source: https://stackoverflow.com/questions/35917848/extracting-text-written-in-hindi-from-pdf-in-python