Indexing Word Documents and PDFs with Sphinx

醉话见心 2020-12-14 11:43

I have a website where users upload documents in .doc and .pdf format. I use Sphinx to run full-text searches on my MySQL database. What is the best way to

3 Answers
  • 2020-12-14 12:37

    I use pdftotext and antiword for this. Both dump the contents of the PDFs and Word documents as plain text, which I store in the database. From there it's easy to index with Sphinx.
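A minimal sketch of that extraction step, assuming `pdftotext` and `antiword` are installed and on the PATH. The function names `extraction_command` and `extract_text` are hypothetical, and storing the returned text in a MySQL text column is left to whatever database layer the site already uses.

```python
import subprocess
from pathlib import Path


def extraction_command(path):
    """Return the command line that dumps a document's text to stdout."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # "-" as the output file makes pdftotext write to stdout.
        return ["pdftotext", "-enc", "UTF-8", str(path), "-"]
    if suffix == ".doc":
        # antiword prints the extracted text to stdout by default.
        return ["antiword", str(path)]
    raise ValueError(f"unsupported file type: {suffix!r}")


def extract_text(path):
    """Run the matching extractor and return the document text as a string."""
    result = subprocess.run(
        extraction_command(path), capture_output=True, check=True
    )
    # Assumption: the tool emits UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("upload.pdf")` at upload time and saving the result next to the file's row keeps the Sphinx source a plain SQL query.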

  • 2020-12-14 12:38

    Has anyone used Apache Tika to index other types of documents, much like the Solr plugin does?

    Some links:

    1. pdftotext ships in the poppler or poppler-utils package on Linux
    2. antiword seems to handle only the old .doc format, not the newer .docx
  • 2020-12-14 12:40

    Unfortunately, Sphinx can't index those file types directly. You'll need to import the textual contents either into a database or into an XML format that Sphinx can understand (its xmlpipe2 data source).
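For the XML route, here is a rough sketch of producing an xmlpipe2 stream from already-extracted text. The field names `title` and `content` and the helper `build_docset` are assumptions made for illustration; document ids must be unique positive integers.

```python
import re
from xml.sax.saxutils import escape

# Control characters that are not allowed in XML 1.0, such as the form
# feed that PDF extractors often emit between pages.
_INVALID_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean(text):
    """Replace invalid control characters and escape &, < and >."""
    return escape(_INVALID_XML.sub(" ", text))


def build_docset(documents):
    """Build an xmlpipe2 docset from (id, title, content) tuples."""
    parts = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        '<sphinx:field name="title"/>',
        '<sphinx:field name="content"/>',
        "</sphinx:schema>",
    ]
    for doc_id, title, content in documents:
        parts.append(f'<sphinx:document id="{int(doc_id)}">')
        parts.append(f"<title>{clean(title)}</title>")
        parts.append(f"<content>{clean(content)}</content>")
        parts.append("</sphinx:document>")
    parts.append("</sphinx:docset>")
    return "\n".join(parts) + "\n"
```

A script that prints `build_docset(...)` to stdout can then be named as the `xmlpipe_command` of a source with `type = xmlpipe2` in sphinx.conf.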
