Extracting text written in Hindi from a PDF in Python [duplicate]

Question

This question already has answers here:

Unable to copy exact hindi content from pdf (1 answer)

Read PDF using itextsharp where PDF language is non-English (2 answers)

Parsing a pdf(Devanagari script) using PDFminer gives incorrect output [duplicate] (1 answer)

Closed 2 years ago.

I want to extract text typed in Hindi from a PDF document. I've attached an image of the sample page I am dealing with.

I've tried using pdfminer to get the text from it, but the text comes out garbled (maybe due to the Hindi fonts).

Now I am thinking of splitting the page into three parts, then splitting each part in two (separating the English and Hindi text), and running OCR on each half to get the text. The only issue is that I don't know the font used for the Hindi text, so I might get garbled text again.

My questions are: Is there a better way to deal with Hindi fonts? How can I find the font name?

Answer 1:

I tried the following on your PDF and it appears to extract a lot of the text. I am guessing it might not be in the best layout, but I am not able to tell.

from io import StringIO  # Python 3; the original import was cStringIO (Python 2 only)

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    """Return the text of every page of the PDF at `path` as one string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()  # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    with open(path, 'rb') as fp:
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        caching = True
        pagenos = set()  # an empty set means "process all pages"

        for page in PDFPage.get_pages(fp, pagenos, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)

        text = retstr.getvalue()

    device.close()
    retstr.close()
    return text


print(convert_pdf_to_txt("Electoral roll - Faizabad.pdf"))

The output is UTF-8 text, so you must make sure your output console is capable of displaying it.

For example:

भभग ससखखभककल मतदभतभ 11 1.रजजरभ आसशशकपपथममक ववददपलद रजजरप - सपमपनद779 420 359 0 779ननरभरचक नभमभरलल 2014 0S24उततर पददशवरधभन सभभ कदत कक ससखखभ ,नभम र आरकण सससनत:ललक सभभ कदत कक ससखखभ ,नभम र आरकण सससनत: 1 . पकनरलकण कभ वरररणपकनरलकण कभ ररर : 2014अहतभर कक नतथस: 01.01.2014पकनरलकण कभ सररप: ससककपत पकनरलकणपकभशन कक नतथस: 01.10.2013पकनरमकदण कक नतथस : 15.03.2014
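If your console cannot display Devanagari, one workaround is to write the extracted text to a UTF-8 file and open it in an editor that has a Devanagari font. The following is a minimal sketch, assuming Python 3; the helper names `write_utf8` and `make_stdout_utf8` and the file name `output.txt` are made up for illustration and are not part of pdfminer.

```python
import sys


def write_utf8(text, path):
    """Write text to `path` as UTF-8, independent of the console encoding."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)


def make_stdout_utf8():
    """Switch standard output to UTF-8 where the stream supports it (Python 3.7+)."""
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")


# Hypothetical usage with the first word of the sample output above,
# written as escapes so the source file itself stays ASCII.
write_utf8("\u092d\u092d\u0917", "output.txt")
```

Writing to a file only sidesteps the display problem. It does not change which characters the extractor produced.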

To determine the list of fonts that it is using, you can simply load the PDF into a PDF reader such as Adobe Reader or Foxit Reader and select Properties from the File menu. From here you should be able to select Fonts. When I tried this with Foxit Reader it displayed the following fonts:

Mangal-Bold
Arial
Mangal
Arial Bold
Times-New-Roman-Bold
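The font names can probably also be read from Python instead of a PDF reader. The sketch below assumes pdfminer.six is installed and that each page lists its fonts in its own resource dictionary; fonts that are only referenced from nested objects (for example form XObjects) would be missed. The function names `list_fonts` and `strip_subset_prefix` are made up for illustration.

```python
import re

# Embedded subset fonts are named like "ABCDEF+Mangal":
# six uppercase letters followed by a plus sign.
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def strip_subset_prefix(base_font):
    """Drop the subset tag from a PDF BaseFont name, if present."""
    return _SUBSET_PREFIX.sub("", base_font)


def list_fonts(path):
    """Return the sorted font names found in the page resources of a PDF."""
    # Imported here so strip_subset_prefix() stays usable without pdfminer.
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdftypes import resolve1
    from pdfminer.psparser import literal_name

    names = set()
    with open(path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            # The Font entry maps resource names to font dictionaries.
            fonts = resolve1(page.resources.get("Font")) or {}
            for ref in fonts.values():
                spec = resolve1(ref)
                base_font = resolve1(spec.get("BaseFont"))
                if base_font is not None:
                    names.add(strip_subset_prefix(literal_name(base_font)))
    return sorted(names)
```

Calling `list_fonts("Electoral roll - Faizabad.pdf")` should then return names comparable to the list above, although the exact spelling may differ from what a PDF reader shows.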

Source: https://stackoverflow.com/questions/35917848/extracting-text-written-in-hindi-from-pdf-in-python

Tags

python

pdf

ocr

hindi

pdfminer