How do we create a simple search engine using Lucene, Solr or Nutch?

孤城傲影 2021-02-15 11:49

Our company has thousands of PDF documents. How do we create a simple search engine using Lucene, Solr or Nutch? We'll provide a basic Java/JSP web page where people can type

10 Answers
  • 2021-02-15 12:20

    I think you want a system to manage your PDF files. Try DSpace, a digital library system whose search is based on Lucene: www.dspace.org.

  • 2021-02-15 12:25

    Answering such a broad question in this forum will be tough. I'd recommend you check out the book Lucene in Action, which covers the basics of indexing and searching in a quite readable fashion.

    Given your application, it sounds like Nutch and Solr probably will not be necessary. Since all of your documents are available locally, Nutch probably won't be helpful. Solr may help you manage a cluster of searchers if you have a high query load, but Lucene is highly performant, and handles large document sets in a very scalable manner.
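To make "indexing and searching" concrete, here is a minimal sketch in plain Java of the inverted-index idea that engines like Lucene are built around. It is not Lucene's API: the `TinyIndex` class, its method names, and the sample file names are all hypothetical, and a real engine adds analysis, ranking, and on-disk storage on top of this.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy in-memory inverted index (illustration only, not Lucene's API). */
class TinyIndex {
    // term -> sorted set of ids of the documents that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Indexing: record the document id under every term found in its text. */
    void add(String docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Searching: ids of documents that contain every query term (AND semantics). */
    Set<String> search(String query) {
        Set<String> result = null;
        for (String term : tokenize(query)) {
            Set<String> docs = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        // Hypothetical documents; in practice the text would come from a PDF extractor.
        index.add("manual.pdf", "Installing the pump and valve assembly");
        index.add("report.pdf", "Quarterly pump sales report");
        System.out.println(index.search("pump"));        // [manual.pdf, report.pdf]
        System.out.println(index.search("pump report")); // [report.pdf]
    }
}
```

The lookup cost depends on the number of query terms and matching documents rather than on scanning every file, which is roughly why a single well-built index can serve thousands of local documents without a crawler or a search cluster.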

    The one area that might consume a lot of your effort is the use of PDF. It's possible to index PDF documents, and there are Lucene contributions to facilitate the extraction of raw text from PDFs, but depending on the document, the quality of results can vary. Often, the context of a keyword in a PDF document is unclear because of formatting instructions, and that can make it hard to do proximity searches or show the context of a hit.
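As a rough illustration of the proximity and hit-context problem, the plain-Java sketch below compares clean text with text in which layout residue has been interleaved. Everything here is an assumption for demonstration: the `HitContext` class, its method names, and the sample strings (including the page-header residue) are hypothetical, and this is not how Lucene implements proximity queries or highlighting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy proximity check and hit-context snippet (illustration only). */
class HitContext {
    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** True if some occurrence of a and some occurrence of b are at most maxGap token positions apart. */
    static boolean near(String text, String a, String b, int maxGap) {
        List<String> tokens = tokenize(text);
        String ta = a.toLowerCase(Locale.ROOT);
        String tb = b.toLowerCase(Locale.ROOT);
        int lastA = -1;
        int lastB = -1;
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).equals(ta)) lastA = i;
            if (tokens.get(i).equals(tb)) lastB = i;
            // Compare the most recent position of each term after every token.
            if (lastA >= 0 && lastB >= 0 && Math.abs(lastA - lastB) <= maxGap) {
                return true;
            }
        }
        return false;
    }

    /** Up to radius tokens on each side of the first occurrence of term, or "" if absent. */
    static String snippet(String text, String term, int radius) {
        List<String> tokens = tokenize(text);
        int hit = tokens.indexOf(term.toLowerCase(Locale.ROOT));
        if (hit < 0) {
            return "";
        }
        int from = Math.max(0, hit - radius);
        int to = Math.min(tokens.size(), hit + radius + 1);
        return String.join(" ", tokens.subList(from, to));
    }

    public static void main(String[] args) {
        String clean = "The pump valve must be closed.";
        // Hypothetical extraction result with a page header spliced into the sentence.
        String noisy = "The pump Page 12 of 40 ACME Confidential valve must be closed.";
        System.out.println(near(clean, "pump", "valve", 2)); // true
        System.out.println(near(noisy, "pump", "valve", 2)); // false
        System.out.println(snippet(noisy, "valve", 2));      // acme confidential valve must be
    }
}
```

The same two words sit next to each other in the clean text but seven positions apart in the noisy one, so a proximity search misses the hit and the snippet shows residue instead of the sentence. Checking a sample of extracted text before indexing is a cheap way to find out how much cleanup your PDFs need.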

  • 2021-02-15 12:26

    Having the (imho) distinct advantage of being on a Mac, I use SearchLight on a somewhat older G5. It provides a nice web interface to Spotlight, Mac OS's built-in indexing service.

  • 2021-02-15 12:28

    Nutch plus Lucene, with Nutch's PDF plugin enabled, is your solution. Enabling the plugin lets Nutch parse PDFs.

    Lucene will index the crawled and parsed data, and Nutch has a servlet that gives you a search interface.

    We use the same setup on our internal LANs.

  • 2021-02-15 12:34

    If you have a Linux server, you could use Beagle to index the documents and then just use the search functionality that comes with it. It has an (experimental) web search interface, and it can be hooked into the Firefox search box as well.

    It automatically indexes files as they're added, and I suspect you'll find it much more efficient to enhance or fix Beagle than to write your own search interface to Lucene.

  • 2021-02-15 12:35

    I have had good luck with Lucene, but it is not click, install and search; it does require a bit of work.
    If you need something that you can download, install, and be searching with inside 10 minutes, look at the free OmniFind Yahoo! Edition (http://omnifind.ibm.yahoo.net/). It uses Lucene but is packaged so that it is configured and ready to run upon install, which makes it a much easier way to try Lucene.
