Extracting text written in Hindi from a PDF in Python [duplicate]

Submitted by 不羁岁月 on 2020-02-28 22:16:08

Question


I want to extract text typed in Hindi from a PDF document. I've attached an image of the sample page I am dealing with.

I've tried using pdfminer to get the text from it, but the output is garbled (maybe because of the Hindi fonts).

Now I am thinking of splitting the page into three parts, then splitting each part into two (separating the English and Hindi text), and running OCR on each half to get the text. The only issue is that I don't know the font used for the Hindi text, so I might get garbled text again.

My questions are: is there a better way to deal with Hindi fonts? How can I find the font name?


Answer 1:


I tried the following on your PDF and it appears to extract a lot of the text. I am guessing it might not be in the best layout, but I am not able to tell.

from io import BytesIO

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path):
    # Python 3 port: cStringIO no longer exists, so the encoded output is
    # collected in a BytesIO buffer and decoded at the end.
    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    with open(path, 'rb') as fp:
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        caching = True
        pagenos = set()  # an empty set means "process every page"

        for page in PDFPage.get_pages(fp, pagenos, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)

        # Decode the collected bytes back into a str
        text = retstr.getvalue().decode(codec)

    device.close()
    retstr.close()
    return text

print(convert_pdf_to_txt("Electoral roll - Faizabad.pdf"))

The output is UTF-8 encoded, so you must make sure your output console is capable of displaying it.

For example:

भभग ससखखभककल मतदभतभ 11 1.रजजरभ आसशशकपपथममक ववददपलद रजजरप - सपमपनद779 420 359 0 779ननरभरचक नभमभरलल 2014 0S24उततर पददशवरधभन सभभ कदत कक ससखखभ ,नभम र आरकण सससनत:ललक सभभ कदत कक ससखखभ ,नभम र आरकण सससनत: 1 . पकनरलकण कभ वरररणपकनरलकण कभ ररर : 2014अहतभर कक नतथस: 01.01.2014पकनरलकण कभ सररप: ससककपत पकनरलकणपकभशन कक नतथस: 01.10.2013पकनरमकदण कक नतथस : 15.03.2014
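If the console cannot display UTF-8, one workaround is to write the extracted text to a file instead and open it in an editor that handles Devanagari. The following is a minimal sketch, not part of the original answer: the helper names `save_text` and `force_utf8_stdout` are hypothetical, and `sys.stdout.reconfigure` is only available on Python 3.7 and later.

```python
import sys


def save_text(text, path):
    """Write extracted text to a UTF-8 encoded file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)


def force_utf8_stdout():
    """Switch stdout to UTF-8 where the stream supports it (Python 3.7+)."""
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")
```

For instance, `save_text(convert_pdf_to_txt("Electoral roll - Faizabad.pdf"), "out.txt")` would store the result without depending on the console encoding at all.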

To determine the list of fonts that it is using, you can simply load the PDF into a PDF reader such as Adobe Reader or Foxit Reader and select Properties from the File menu. From here you should be able to select Fonts. When I tried this with Foxit Reader it displayed the following fonts:

Mangal-Bold
Arial
Mangal
Arial Bold
Times-New-Roman-Bold
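If you would rather check the fonts from Python, a rough heuristic is to scan the raw file bytes for `/BaseFont` entries. This is only a sketch under a strong assumption: it works when the font dictionaries are stored uncompressed, and it will find nothing if the PDF keeps them inside compressed object streams, in which case the viewer approach above is more reliable. The function name `find_base_fonts` is hypothetical.

```python
import re

# Matches "/BaseFont /Name" entries in raw, uncompressed PDF bytes.
BASE_FONT_RE = re.compile(rb"/BaseFont\s*/([^\s/<>\[\](){}%]+)")
# Embedded font subsets are named with six uppercase letters and a "+".
SUBSET_PREFIX_RE = re.compile(r"^[A-Z]{6}\+")


def find_base_fonts(data):
    """Return the sorted, de-duplicated BaseFont names found in PDF bytes."""
    names = set()
    for match in BASE_FONT_RE.finditer(data):
        name = match.group(1).decode("latin-1")
        # Drop the subset prefix so "ABCDEF+Mangal" is reported as "Mangal"
        names.add(SUBSET_PREFIX_RE.sub("", name))
    return sorted(names)
```

It could be called as `find_base_fonts(open("Electoral roll - Faizabad.pdf", "rb").read())`. Whether it reports the same five fonts as Foxit Reader depends on how the file was written.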


Source: https://stackoverflow.com/questions/35917848/extracting-text-written-in-hindi-from-pdf-in-python
