How to search keywords in 400+ PDF files? [duplicate]

Posted by 三世轮回 on 2019-12-25 03:54:47

Question


I have 400 or more PDF files that together form a single text, like a book split up page by page. I need to be able to search programmatically for some keywords across the whole text.

So my first question is: is it better to search page by page, or to join all the PDFs into one big file first and then perform the search?

The second one is: what is the best way to do this? Is there already a good program or library for it?

By the way, I can only use PHP and Python.


Answer 1:


Use pyPdf, as described here.

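import glob

import pyPdf

def getPDFContent(path):
    content = ""
    # Open the file in binary mode; the with-block closes it afterwards
    with open(path, "rb") as fh:
        # Load PDF into pyPdf
        pdf = pyPdf.PdfFileReader(fh)
        # Iterate pages
        for i in range(pdf.getNumPages()):
            # Extract text from page and add to content
            content += pdf.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces, then collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

# Placeholder values: adjust the pattern and the keyword to your own files
filelist = glob.glob("*.pdf")
keyword = "example"

for f in filelist:
    # Print each file name with True/False for whether the keyword occurs
    print("%s: %s" % (f, keyword in getPDFContent(f)))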

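It is faster and much simpler to search the files one by one: there is no merge step, and you can simply loop over all the files and run the same code on each of them. Since every file is one page, a match also tells you which page the keyword is on.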
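pyPdf is no longer maintained and only runs on Python 2; its successor is published as pypdf. The following is a minimal sketch of the same file-by-file approach for Python 3, assuming pypdf is installed. The helper names (normalize, extract_text, search_texts, search_folder) are made up for this illustration, and matching is a plain case-insensitive substring test, which is one possible choice rather than the only one.

```python
from pathlib import Path


def normalize(text):
    """Collapse whitespace (including non-breaking spaces) and lowercase."""
    return " ".join(text.replace("\xa0", " ").split()).lower()


def extract_text(path):
    """Return the text of every page of one PDF, joined by newlines."""
    # Imported here so the search helpers below also work without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    # extract_text() may yield nothing for scanned (image-only) pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def search_texts(texts, keywords):
    """Map each keyword to the sorted names of the documents containing it.

    `texts` is a dict of {name: raw text}. Matching is a case-insensitive
    substring test on whitespace-normalized text.
    """
    hits = {kw: [] for kw in keywords}
    for name in sorted(texts):
        haystack = normalize(texts[name])
        for kw in keywords:
            if normalize(kw) in haystack:
                hits[kw].append(name)
    return hits


def search_folder(folder, keywords):
    """Extract every *.pdf in `folder` once, then search all keywords."""
    texts = {p.name: extract_text(p) for p in Path(folder).glob("*.pdf")}
    return search_texts(texts, keywords)
```

Calling search_folder("pages", ["first keyword", "second keyword"]), with a placeholder folder name and placeholder keywords, would return a dict that maps each keyword to the list of PDF files containing it. Each file is extracted only once, no matter how many keywords are searched.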



Source: https://stackoverflow.com/questions/25089033/how-to-search-keywords-in-400-pdf-files
