Question
My first post to Stack Overflow so be gentle please! I am about to start a new Ruby on Rails (3.1) project for a client. One of their requirements is that there is a search engine, which will be indexing roughly 2,000 documents which are a mixture of PDF, Word, Excel and HTML.
I had hoped to use either thinking-sphinx or Texticle (most popular at https://www.ruby-toolbox.com/categories/rails_search.html) but as I understand it:
- Texticle requires PostgreSQL. I'm on MySQL.
- thinking-sphinx doesn't index files on the file system.
- even if I saved my attachments into the database, thinking-sphinx still wouldn't work as it requires plain text (according to http://groups.google.com/group/thinking-sphinx/browse_thread/thread/69cdc1c8e1c096ff)
So I'm left with two options:
- Pick a different search tool
- Try to extract plain-text versions of the attachments into the database for thinking-sphinx to read
Which approach do you recommend?
If it's a different search tool, which one? My requirements are pretty basic so I'd really like one that's very easy to set up and has lots of documentation, examples and tutorials!
If it's extracting, can you recommend extractors for common file types such as PDF, Word, Excel and HTML?
Thanks everyone. Really appreciate your help.
Answer 1:
I have not done binary file indexing before, but apparently Solr has support for it: see "Indexing files with SPHINX/ultrasphinx" and the Solr wiki page at http://wiki.apache.org/solr/ExtractingRequestHandler
There are quite a few gems available for Solr, and Sunspot seems to be a popular one (http://outoftime.github.com/sunspot/). Sunspot does not seem to have built-in support for Solr Cell, but there is some work going into it (https://github.com/tomasc/sunspot_cell).
There are probably better options out there, but this should give you a good starting point.
Answer 2:
Just to update this. The approach I've decided to go with is:
Try to extract plain-text versions of the attachments into the database for thinking-sphinx to read
Specifically, I'll be doing the following:
- Using thinking-sphinx
- Using the subexec gem to call Tika from the command line
It looks as if it will be as simple as calling java -jar tika-app-0.10.jar -t [file], but I'll post my experiences if it turns out to be more complicated!
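For anyone wanting to try the same thing, the Tika call could be wrapped roughly like this. It is only a sketch under a few assumptions: it uses Ruby's standard-library Open3 rather than the subexec gem, the names tika_command and extract_text and the TIKA_JAR environment variable are made up for illustration, and it assumes java is on the PATH and the jar is the tika-app-0.10.jar mentioned above.

```ruby
require "open3"

# Assumption: the jar location comes from a hypothetical TIKA_JAR variable,
# falling back to the file name used in the answer.
TIKA_JAR = ENV.fetch("TIKA_JAR", "tika-app-0.10.jar")

# Builds the argument list for: java -jar tika-app-0.10.jar -t [file]
# Passing an array (not one string) avoids shell quoting problems
# with file names that contain spaces.
def tika_command(path, jar: TIKA_JAR)
  ["java", "-jar", jar, "-t", path.to_s]
end

# Runs Tika and returns the extracted plain text, or nil if the
# command fails or java cannot be found.
# The runner argument exists only so the call can be stubbed in tests.
def extract_text(path, jar: TIKA_JAR, runner: Open3.method(:capture3))
  stdout, _stderr, status = runner.call(*tika_command(path, jar: jar))
  status.success? ? stdout.strip : nil
rescue Errno::ENOENT
  nil
end
```

The returned string could then be saved to a text column on the attachment record, so that thinking-sphinx can index it like any other field.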
Source: https://stackoverflow.com/questions/7739193/searching-attachments-from-a-rails-app-word-pdf-excel-etc