Question
Using the pyPdf Python module, how can I read the following PDF file? http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf
# -*- coding: utf-8 -*-
from pyPdf import PdfFileWriter, PdfFileReader
import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("/home/tom/Desktop/Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
The code above prints only binary data.
Also, how can I print the contents using the code below?
import sys
import pyPdf
from pyPdf import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(file("/home/tom/Desktop/Hindi_Book.pdf", "rb"))
# print the title of document1.pdf
print "title = %s" % (input1.getDocumentInfo().title)
Answer 1:
Note that most of the "text" in the PDF document you refer to isn't real text at all: it's mostly images. The actual text seems to be extracted correctly when I try it (although I must admit that, apart from some snippets on the front page and the page numbers, I can't read it ;-)).
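A quick way to check this for any given file is to count how many characters each page yields. The sketch below assumes the current `pypdf` package (`PdfReader`, `page.extract_text()`) rather than the legacy `pyPdf` used in the question; `chars_per_page` and `report` are hypothetical helper names chosen for illustration. Pages that report zero characters are probably scanned images, which would need OCR rather than text extraction.

```python
def chars_per_page(reader):
    """Return the number of non-whitespace characters extracted from each page.

    `reader` is assumed to behave like pypdf.PdfReader: it exposes `.pages`,
    and each page has an `extract_text()` method.
    """
    counts = []
    for page in reader.pages:
        text = page.extract_text() or ""  # treat a missing result as empty
        counts.append(len("".join(text.split())))
    return counts


def report(path):
    """Print a per-page summary for the PDF at `path` (hypothetical helper)."""
    # Assumption: pypdf is installed (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(path)
    for number, count in enumerate(chars_per_page(reader), start=1):
        print(f"page {number}: {count} characters of extractable text")
```

Calling `report("/home/tom/Desktop/Hindi_Book.pdf")` would then list, page by page, how much real text there is to extract.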
As for the second question: I'm not sure what you're asking there.
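Back on the first question, one thing that may be worth checking is the final `.encode("ascii", "xmlcharrefreplace")` call: it replaces every non-ASCII character with a numeric character reference, so correctly extracted Devanagari text would still print as a run of `&#...;` codes. The small standard-library illustration below uses a sample word of my own (not taken from the PDF) and Python 3 syntax.

```python
# Sample Devanagari word, written with escapes so the code points are explicit.
sample = "\u0939\u093f\u0902\u0926\u0940"

# Every non-ASCII character becomes "&#<decimal code point>;".
as_refs = sample.encode("ascii", "xmlcharrefreplace")
print(as_refs)

# Encoding to UTF-8 instead keeps the original characters recoverable.
as_utf8 = sample.encode("utf-8")
print(as_utf8.decode("utf-8") == sample)
```

If the terminal or output file handles UTF-8, writing the text with that encoding is likely to be more readable than the character references.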
Answer 2:
If you want to write specific text from the PDF file, you can use extractText() as shown below:
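from path_to_folder import main_path as my_text  # placeholder: your folder path
import os
from PyPDF2 import PdfFileReader  # legacy API, removed in PyPDF2 3.0

my_pdf_path = os.path.join(my_text, "my_pdf.pdf")
with open(my_pdf_path, "rb") as pdf_file, \
        open(os.path.join(my_text, "out_put.txt"), "w") as out_text:
    pdf_read = PdfFileReader(pdf_file)
    # The document info or its title may be missing, so fall back to ""
    info = pdf_read.getDocumentInfo()
    out_text.write((info.title if info else "") or "")
    for pages in range(pdf_read.getNumPages()):
        # Extract the text of each page and append it to the output file
        text = pdf_read.getPage(pages).extractText()
        out_text.write(text)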
In the example above I just extracted the text from each page and wrote it to a text file. You can choose anything. If you need to save specific pages as a PDF, you can use the code below:
from pyPdf import PdfFileWriter, PdfFileReader
import os

main_path = "/home/tom/Desktop/"
output_file = PdfFileWriter()
input_file = PdfFileReader(file("/home/tom/Desktop/Hindi_Book.pdf", "rb"))
# Copy every page; narrow the range to keep only specific pages
for page_number in range(input_file.getNumPages()):
    output_file.addPage(input_file.getPage(page_number))
new_file = os.path.join(main_path, "Out_folder/new_pdf.pdf")
out_fil1 = open(new_file, "wb")
output_file.write(out_fil1)
# Close the output stream (the writer object is not a file)
out_fil1.close()
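The loop above copies every page. To keep only specific pages, the same idea can be restricted to a list of page indices. This is a sketch under assumptions: it uses the current `pypdf` names (`PdfReader`, `PdfWriter`, `add_page`) instead of the legacy `pyPdf` ones, and `copy_pages`, `save_pages`, and their parameters are hypothetical helpers, not part of either library.

```python
def copy_pages(reader, writer, page_indices):
    """Add the pages at the given zero-based indices from reader to writer.

    Indices outside the document are skipped. Returns the indices copied.
    """
    total = len(reader.pages)
    copied = []
    for index in page_indices:
        if 0 <= index < total:
            writer.add_page(reader.pages[index])
            copied.append(index)
    return copied


def save_pages(src_path, dst_path, page_indices):
    """Write the selected pages of src_path to a new PDF at dst_path."""
    # Assumption: pypdf is installed (pip install pypdf).
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    copied = copy_pages(reader, writer, page_indices)
    with open(dst_path, "wb") as out_file:
        writer.write(out_file)
    return copied
```

For example, `save_pages("Hindi_Book.pdf", "first_two.pdf", [0, 1])` would keep only the first two pages.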
The link you provided doesn't work, so I couldn't look at the file, sorry.
Source: https://stackoverflow.com/questions/3854963/pypdf-python-tool