Searching attachments from a Rails app (Word, PDF, Excel etc)

Submitted by 半世苍凉 on 2019-12-22 08:51:57

Question


My first post to Stack Overflow so be gentle please! I am about to start a new Ruby on Rails (3.1) project for a client. One of their requirements is that there is a search engine, which will be indexing roughly 2,000 documents which are a mixture of PDF, Word, Excel and HTML.

I had hoped to use either thinking-sphinx or Texticle (most popular at https://www.ruby-toolbox.com/categories/rails_search.html) but as I understand it:

  • Texticle requires PostgreSQL. I'm on MySQL.
  • thinking-sphinx doesn't index files on the file system.
  • even if I saved my attachments into the database, thinking-sphinx still wouldn't work as it requires plain text (according to http://groups.google.com/group/thinking-sphinx/browse_thread/thread/69cdc1c8e1c096ff)

So I'm left with two options:

  1. Pick a different search tool
  2. Try to extract plain-text versions of the attachments into the database for thinking-sphinx to read

Which approach do you recommend?

If it's a different search tool, which one? My requirements are pretty basic so I'd really like one that's very easy to set up and has lots of documentation, examples and tutorials!

If it's extracting, can you recommend extractors for common file types such as PDF, Word, Excel and HTML?

Thanks everyone. Really appreciate your help.


Answer 1:


I have not done binary file indexing before, but Solr apparently supports it; see "Indexing files with SPHINX/ultrasphinx" and http://wiki.apache.org/solr/ExtractingRequestHandler for details. There are quite a few gems available for Solr, and Sunspot seems to be a popular one: http://outoftime.github.com/sunspot/ Sunspot does not appear to have built-in support for Solr Cell, but there is some work going into it at https://github.com/tomasc/sunspot_cell There are probably better options out there, but this should give you a good starting point.




Answer 2:


Just to update this. The approach I've decided to go with is:

Try to extract plain-text versions of the attachments into the database for thinking-sphinx to read

Specifically, I'll be doing the following:

  • Using thinking-sphinx
  • Using the subexec gem to call ...
  • ... Tika from the command line

It looks as if it will be as simple as calling java -jar tika-app-0.10.jar -t [file] but I'll post my experiences if it turns out to be more complicated!
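For illustration, here is a minimal sketch of what that command-line call might look like from Ruby. It uses Open3 from the standard library rather than the subexec gem so that it has no gem dependencies, and it assumes a Ruby version with keyword arguments (2.0 or later). The names TIKA_JAR, tika_command and extract_text are hypothetical, and the jar path is assumed to be resolvable from the working directory.

```ruby
require "open3"

# Hypothetical location of the Tika jar; adjust to wherever it is installed.
TIKA_JAR = "tika-app-0.10.jar"

# Build the argument list for `java -jar tika-app-0.10.jar -t FILE`.
# Passing an array (not a single string) avoids shell quoting problems
# with file names that contain spaces.
def tika_command(path, jar: TIKA_JAR)
  ["java", "-jar", jar, "-t", path.to_s]
end

# Run Tika and return the extracted plain text.
# Raises if the java process exits with a non-zero status.
def extract_text(path, jar: TIKA_JAR)
  stdout, stderr, status = Open3.capture3(*tika_command(path, jar: jar))
  raise "Tika failed for #{path}: #{stderr}" unless status.success?
  stdout
end
```

The string returned by extract_text could then be saved to a text column on the attachment's model, which is the kind of plain-text field thinking-sphinx can index. Starting a JVM per file is slow, so for roughly 2,000 documents it is probably best run as a background or batch task rather than during a web request.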



Source: https://stackoverflow.com/questions/7739193/searching-attachments-from-a-rails-app-word-pdf-excel-etc
