Using ElasticSearch and/or Solr as a datastore for MS Office and PDF documents

一生所求 2020-12-23 10:31

I'm currently designing a full text search system where users perform text queries against MS Office and PDF documents, and the result will return a list of documents that

5 answers
  • 2020-12-23 10:58

    A bit late to the party but this may help someone :)

    I had a similar problem and some research led me to fscrawler. Description:

    This crawler helps to index binary documents such as PDF, Open Office, MS Office.

    Main features:

    • Local file system (or a mounted drive) crawling: indexes new files, updates existing ones, and removes old ones.
    • Remote file system crawling over SSH.
    • REST interface to let you "upload" your binary documents to Elasticsearch.
  • 2020-12-23 11:04

    Regarding Solr:

    If the docs only need to be returned on metadata searches, Solr has a BinaryField field type, to which you can send binary data encoded as base64. Keep in mind that people generally recommend against this, because it inflates your index (higher RAM requirements, lower performance). Where possible, a setup that stores the files externally and keeps only the file path in Solr may be a better choice.
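
    To make the trade-off concrete, here is a minimal sketch comparing the two document shapes. The field names `file_bin`, `path_s` and `size_l`, and the path `/mnt/docs/report.pdf`, are assumptions for illustration, not anything Solr defines. Base64 turns every 3 bytes into 4 characters, so the embedded variant is roughly a third larger than the file itself.

```python
import base64
import os


def base64_size(raw_len: int) -> int:
    """Length of the base64 text for raw_len bytes (4 chars per 3-byte group, padded)."""
    return 4 * ((raw_len + 2) // 3)


def binary_field_doc(doc_id: str, payload: bytes) -> dict:
    # Option 1: the file travels inside the Solr document, base64 encoded.
    # "file_bin" is a hypothetical field that would be declared as a BinaryField.
    return {"id": doc_id, "file_bin": base64.b64encode(payload).decode("ascii")}


def path_reference_doc(doc_id: str, path: str, payload: bytes) -> dict:
    # Option 2: the file stays on external storage; Solr only holds its location.
    # "path_s" and "size_l" are hypothetical field names.
    return {"id": doc_id, "path_s": path, "size_l": len(payload)}


payload = os.urandom(300_000)  # random bytes standing in for a 300 kB PDF
embedded = binary_field_doc("doc1", payload)
referenced = path_reference_doc("doc1", "/mnt/docs/report.pdf", payload)

print(len(embedded["file_bin"]))  # length of the base64 text
print(base64_size(len(payload)))  # the same number, computed from the formula
```

    The path-reference document stays a few dozen bytes no matter how large the file is, which is the reason for the recommendation above.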

    If you want Solr to automatically index the text inside the PDF/DOC files, that's possible with the ExtractingRequestHandler: https://wiki.apache.org/solr/ExtractingRequestHandler
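
    A rough sketch of what such a request could look like, using only the Python standard library. It assumes the handler is enabled at `/update/extract`, that Solr listens on `localhost:8983`, and that the core is called `docs`; all three are assumptions to adapt to your setup. The file bytes go in the request body and `literal.id` supplies the document's unique key.

```python
import urllib.parse
import urllib.request


def extract_url(base_url: str, core: str, doc_id: str, commit: bool = True) -> str:
    """Build the /update/extract URL; literal.id sets the document's unique key."""
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    query = urllib.parse.urlencode(params)
    return f"{base_url.rstrip('/')}/{core}/update/extract?{query}"


def build_extract_request(url: str, payload: bytes, content_type: str) -> urllib.request.Request:
    # The raw file bytes are the request body; Solr passes them to Tika for parsing.
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": content_type}, method="POST"
    )


# "docs" is a hypothetical core name; the payload is placeholder bytes, not a real PDF.
url = extract_url("http://localhost:8983/solr", "docs", "doc1")
request = build_extract_request(url, b"%PDF-1.4 placeholder", "application/pdf")
print(url)

# Sending is left commented out because it needs a running Solr instance:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```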

  • 2020-12-23 11:08

    I would try the Elasticsearch attachment plugin. Details can be found here:

    https://www.elastic.co/guide/en/elasticsearch/plugins/2.2/mapper-attachments.html

    https://github.com/elasticsearch/elasticsearch-mapper-attachments

    It's built on top of Apache Tika:

    http://tika.apache.org/1.7/formats.html

    Attachment Type

    The attachment type allows indexing different "attachment" type fields (encoded as base64), for example, Microsoft Office formats, open document formats, ePub, HTML, and so on (the full list is on the Tika page linked above).

    The attachment type is provided as a plugin extension. The plugin is a simple zip file that can be downloaded and placed under $ES_HOME/plugins location. It will be automatically detected and the attachment type will be added.
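
    As a hedged illustration of the 2.x plugin linked above, the sketch below builds the two JSON bodies involved: a mapping that declares a field of type `attachment`, and a document carrying the file as base64 in that field. The type name `document` and field name `file` are made up for the example, and later Elasticsearch versions handle attachments differently, so treat this as a sketch for that plugin version only.

```python
import base64
import json


def attachment_mapping(type_name: str, field: str) -> dict:
    # Mapping body that declares `field` as an attachment (mapper-attachments plugin).
    return {type_name: {"properties": {field: {"type": "attachment"}}}}


def attachment_document(field: str, payload: bytes) -> dict:
    # The plugin expects the file content as a base64 string in the attachment field.
    return {field: base64.b64encode(payload).decode("ascii")}


# "document" and "file" are hypothetical names; b"hello" stands in for real file bytes.
mapping = attachment_mapping("document", "file")
doc = attachment_document("file", b"hello")

print(json.dumps(mapping))
print(json.dumps(doc))
```

    The first body would be sent when creating the mapping and the second when indexing a document; the plugin then extracts the text with Tika at index time.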

    Supported Document Formats

    • HyperText Markup Language
    • XML and derived formats
    • Microsoft Office document formats
    • OpenDocument Format
    • iWorks document formats
    • Portable Document Format
    • Electronic Publication Format
    • Rich Text Format
    • Compression and packaging formats
    • Text formats
    • Feed and Syndication formats
    • Help formats
    • Audio formats
    • Image formats
    • Video formats
    • Java class files and archives
    • Source code
    • Mail formats
    • CAD formats
    • Font formats
    • Scientific formats
    • Executable programs and libraries
    • Crypto formats
  • 2020-12-23 11:11

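    Elasticsearch does store documents in the _source field. That field holds the original JSON you indexed, so a binary file (.pdf or .doc, for instance) has to be sent base64 encoded. It can be used as a NoSQL datastore (similar to MongoDB).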

  • 2020-12-23 11:18

    Both Solr and Elasticsearch will index the content of the document. Solr has that built-in, Elasticsearch needs a plugin. Easy either way and both use Tika under the covers.

    Neither of them will store the document itself. You can try making them do it, but they are not designed for it and you will suffer.

    Additionally, neither Solr nor Elasticsearch is currently recommended as primary storage. They can do it, but durability is not as mission critical for them as it is for, say, a filesystem implementation.

    So, I would recommend having the files somewhere else and using Solr/Elasticsearch for searching only. That's where they shine.
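
    One possible shape for that split, as a minimal sketch: the file is copied into a plain directory under its SHA-256 digest, and only the extracted text plus a pointer to the stored file is prepared for the search engine. The helper names, the field names (`id`, `title`, `content`, `path`) and the content-addressed layout are assumptions for illustration; the text extraction itself (Tika, for instance) is left out and passed in as a string.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def store_file(source: Path, store_dir: Path) -> Path:
    """Copy the file into store_dir under its SHA-256 digest and return the new path."""
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    target = store_dir / f"{digest}{source.suffix}"
    if not target.exists():  # identical content is stored only once
        shutil.copy2(source, target)
    return target


def index_document(stored: Path, extracted_text: str, title: str) -> dict:
    # Only searchable text and a pointer go to the search engine, never the file bytes.
    return {"id": stored.stem, "title": title, "content": extracted_text, "path": str(stored)}


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    store_dir = tmp_path / "store"
    store_dir.mkdir()
    source = tmp_path / "report.pdf"
    source.write_bytes(b"%PDF-1.4 placeholder bytes")  # not a real PDF

    stored = store_file(source, store_dir)
    doc = index_document(stored, "placeholder extracted text", "Report")
    stored_exists = stored.exists()
    print(doc["id"], doc["path"])
```

    A search hit then returns `path`, and the application serves the file from storage itself.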
