How to read a PDF file line by line using pyPdf?

闹比i 2020-12-05 03:04

I have some code that reads from a PDF file. Is there a way to read the file line by line (rather than page by page) using pyPdf with Python 2.6 on Windows?

Here is the code for

3 answers
  • 2020-12-05 03:12
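    import pyPdf

    def getPDFContent(path):
        content = ""
        # Open in binary mode; the with block closes the file afterwards.
        with open(path, "rb") as p:
            pdf = pyPdf.PdfFileReader(p)
            # Use the real page count instead of a hard-coded 10.
            for i in range(pdf.getNumPages()):
                content += pdf.getPage(i).extractText() + "\n"
        # This collapses all whitespace, including the newlines added above.
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content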
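
    Note that the final `" ".join(... .split())` step above collapses every run of whitespace, newlines included, so the returned string is effectively a single line. Below is a minimal sketch of an alternative clean-up that keeps the line breaks. `normalize_lines` is a hypothetical helper name, not part of pyPdf. It works on any text that has already been extracted and is written so that it also runs on Python 3.

```python
def normalize_lines(text):
    """Return the non-empty lines of text with inner whitespace collapsed."""
    lines = []
    # Replace non-breaking spaces first, then split on line boundaries.
    for raw_line in text.replace(u"\xa0", " ").splitlines():
        cleaned = " ".join(raw_line.split())  # collapse spaces within one line
        if cleaned:                           # skip lines that end up empty
            lines.append(cleaned)
    return lines

# Sample text standing in for the output of extractText() (an assumption).
sample = u"first\xa0 line  \n\n   second   line\n"
for line in normalize_lines(sample):
    print(line)
```

    Whether the extracted text contains useful line breaks at all depends on the PDF, so this only helps when `extractText()` returns newlines in the first place.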
    
  • 2020-12-05 03:29

    It looks like you have a large chunk of text that you want to process line by line.

    You can use the StringIO class to wrap that content as a seekable file-like object:

    >>> import StringIO
    >>> content = 'big\nugly\ncontents\nof\nmultiple\npdf files'
    >>> buf = StringIO.StringIO(content)
    >>> buf.readline()
    'big\n'
    >>> buf.readline()
    'ugly\n'
    >>> buf.readline()
    'contents\n'
    >>> buf.readline()
    'of\n'
    >>> buf.readline()
    'multiple\n'
    >>> buf.readline()
    'pdf files'
    >>> buf.seek(0)
    >>> buf.readline()
    'big\n'
    

    In your case, do:

    from StringIO import StringIO
    
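    # Read each line of the PDF.
    # Note: getPDFContent must keep its newlines for this to yield more than one line.
    pdfContent = StringIO(getPDFContent("test.pdf").encode("ascii", "ignore"))
    for line in pdfContent:
        doSomething(line.strip())  # doSomething is a placeholder for your own handler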
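
    On Python 3 the `StringIO` module no longer exists. The class lives in the standard `io` module and takes `str` directly, so the `.encode("ascii", "ignore")` step is not needed there. Here is a minimal sketch of the same loop, where `iter_stripped_lines` is a hypothetical helper name and the sample string stands in for real extracted text:

```python
import io

def iter_stripped_lines(content):
    """Yield each line of content with surrounding whitespace removed."""
    buf = io.StringIO(content)  # wrap the text as a file-like object
    for line in buf:            # file-like objects iterate line by line
        yield line.strip()

# Sample text standing in for extracted PDF content (an assumption).
sample = "big\nugly\ncontents\nof\nmultiple\npdf files"
for line in iter_stripped_lines(sample):
    print(line)
```

    If the text is already in memory, `content.splitlines()` gives the same lines without the wrapper. `StringIO` is mainly useful when other code expects a file-like object.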
    
  • 2020-12-05 03:31

    Using yield and PdfFileReader.pages can simplify things:

    from pyPdf import PdfFileReader
    
    def get_pdf_content_lines(pdf_file_path):
        # Open in binary mode; text mode can corrupt PDF data on Windows.
        with open(pdf_file_path, 'rb') as f:
            pdf_reader = PdfFileReader(f)
            for page in pdf_reader.pages:
                for line in page.extractText().splitlines():
                    yield line
    
    for line in get_pdf_content_lines('/path/to/file.pdf'):
        print line
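
    The generator pattern above does not depend on pyPdf itself: anything that exposes pages with an `extractText()` method can be walked the same way. Here is a minimal sketch using a stand-in page class, so it runs without a PDF and also reports which page each line came from. `FakePage` and `iter_lines` are hypothetical names, not part of pyPdf.

```python
class FakePage(object):
    """Stand-in for a pyPdf page object; holds pre-extracted text."""
    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text

def iter_lines(pages):
    """Yield (page_number, line) pairs, walking one page at a time."""
    for page_number, page in enumerate(pages, 1):
        for line in page.extractText().splitlines():
            yield page_number, line

# Two fake pages standing in for pdf_reader.pages (an assumption).
pages = [FakePage("title\nintro"), FakePage("body\nend")]
for page_number, line in iter_lines(pages):
    print("%d: %s" % (page_number, line))
```

    Because it is a generator, text is only extracted from a page when the loop reaches it, which keeps memory use low for long documents.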
    

    In addition, some people may arrive here after googling "python get pdf content text" (that is how I got here), so here is how to get the whole text:

    from pyPdf import PdfFileReader
    
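    def get_pdf_content(pdf_file_path):
        # Open in binary mode; text mode can corrupt PDF data on Windows.
        with open(pdf_file_path, 'rb') as f:
            pdf_reader = PdfFileReader(f)
            content = "\n".join(page.extractText().strip() for page in pdf_reader.pages)
            # Collapse all whitespace, newlines included, into single spaces.
            content = ' '.join(content.split())
            return content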
    
    
    print get_pdf_content('/path/to/file.pdf')
    