Question
This is just a speculative idea for a client who has a lot of PDF files.
Algolia say in their FAQs that to search PDF files you first need to extract the text from the file. How would you go about this?
The way I envisage a system working would be:
- Client uploads PDF via CMS
- CMS calls some service / program to extract the text
- Algolia indexes the extracted text, and it's somehow linked to the original PDF
It would need to be an automated system, as the client shouldn't have to trigger indexing manually. It would be built in PHP, probably Laravel, running on Ubuntu.
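For illustration, something like this is what I have in mind on the Laravel side. It's only a sketch: `PdfController` and the queued `IndexPdf` job are hypothetical names, not existing code.

```php
<?php
// Hypothetical Laravel controller: store the uploaded PDF, then hand
// extraction + indexing off to a queued job so the flow stays automatic
// and the upload request returns quickly.

namespace App\Http\Controllers;

use App\Jobs\IndexPdf; // hypothetical queued job doing steps 2 and 3
use Illuminate\Http\Request;

class PdfController extends Controller
{
    public function store(Request $request)
    {
        $request->validate(['pdf' => 'required|file|mimes:pdf']);

        // Save the PDF to the public disk; store() returns its path.
        $path = $request->file('pdf')->store('pdfs', 'public');

        // Dispatch the extraction/indexing job onto the queue.
        IndexPdf::dispatch($path);

        return response()->json(['path' => $path], 201);
    }
}
```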
What software / service could do the text extraction from the PDFs and is any magic needed to 'link' this with the PDF file?
I'm also happy to have suggestions on other search services which may handle this.
Answer 1:
Fortunately, text extraction from PDFs is a subject that has been covered many times. On the command line, you could use pdftotext (available on Linux and macOS), or in your code a library such as Apache Tika (for which you can find a PHP wrapper).
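For example, pdftotext writes the extracted text to stdout when you pass `-` as the output file, so a small PHP wrapper could look like this (a sketch; it assumes poppler-utils is installed, e.g. `apt-get install poppler-utils` on Ubuntu):

```php
<?php
// Extract a PDF's text by shelling out to the pdftotext CLI.
// Passing "-" as the output file makes pdftotext print to stdout.

function extractPdfText(string $pdfPath): string
{
    $output = shell_exec(sprintf('pdftotext %s -', escapeshellarg($pdfPath)));

    // shell_exec() returns null/false on failure or when there is no output.
    if (!is_string($output)) {
        throw new RuntimeException("pdftotext failed for {$pdfPath}");
    }

    return $output;
}

// Usage:
// $text = extractPdfText('/path/to/document.pdf');
```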
To avoid having too much noise in your records, I'd recommend splitting the text and creating one record per paragraph. You can then use Algolia's distinct feature to deduplicate the results.
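A sketch of what that could look like with the official algolia/algoliasearch-client-php package (v2+ API); the index name, `pdf_id` attribute, and file paths below are placeholders:

```php
<?php
// One Algolia record per paragraph, all sharing the same pdf_id so
// Algolia's distinct feature can deduplicate hits from the same file.

require 'vendor/autoload.php';

use Algolia\AlgoliaSearch\SearchClient;

$client = SearchClient::create('YourAppID', 'YourAdminAPIKey');
$index  = $client->initIndex('pdfs');

// Deduplicate hits sharing the same pdf_id, keeping the best-ranked one.
$index->setSettings([
    'attributeForDistinct' => 'pdf_id',
    'distinct'             => true,
]);

// pdftotext separates paragraphs with blank lines, so splitting on
// blank lines is a simple heuristic for "one record per paragraph".
$text       = (string) shell_exec('pdftotext report-2016.pdf -');
$paragraphs = preg_split('/\n\s*\n/', trim($text));

$records = [];
foreach ($paragraphs as $i => $paragraph) {
    $records[] = [
        'objectID' => "report-2016.pdf-{$i}",
        'pdf_id'   => 'report-2016.pdf',               // shared distinct key
        'url'      => '/storage/pdfs/report-2016.pdf', // link to the original
        'content'  => $paragraph,
    ];
}

$index->saveObjects($records);
```

Storing the `url` on every record is all the "linking" that's needed: each search hit then carries the address of the PDF it came from.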
You should already have the links to your files somewhere; just store them in your records, and in your front-end you'll easily be able to create links to them using, for instance, autocomplete.js or instantsearch.js.
Answer 2:
For anyone still looking for a solution, I put together a GitHub repository that does exactly that: https://github.com/PDFTron/pdftron-document-search.
The text extraction happens client-side as the user uploads the document, using React + Firebase + Algolia.
You can check out a quick video walking you through the sample app: https://youtu.be/IQATnzHTp7Q.
Let me know if you have any questions.
Source: https://stackoverflow.com/questions/38640877/searching-extracting-text-pdf-files-with-algolia