How to search for text fragments in a database

旧时难觅i 2021-02-06 10:51

Are there any open source or commercial tools available that allow for text fragment indexing of database contents and can be queried from Java?

Background of the questi

10 answers
  • 2021-02-06 11:41

    I'd use Apache Solr. The indexing strategy is entirely tunable (see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters), can incrementally read directly from your database to populate the index (see DataImportHandler in the same wiki), and can be queried from basically any language that speaks HTTP and XML or something like JSON.
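
    As a rough illustration of the "any language that speaks HTTP" point, here is a minimal Java sketch that builds a query URL for Solr's standard /select handler and fetches the response as a string. The host, the core name "parts", and the field name "name" are assumptions made up for this example, not something from the question.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the core and field names used below are assumptions.
class SolrQueryUrl {

    // Builds a URL for the /select handler, asking for a JSON response.
    static String build(String baseUrl, String core, String query) {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return baseUrl + "/" + core + "/select?q=" + q + "&wt=json";
    }

    // Sends a plain HTTP GET and returns the raw response body.
    // Requires a running Solr instance, so it is not called from main.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) {
        // Trailing-wildcard query against an assumed "name" field.
        System.out.println(build("http://localhost:8983/solr", "parts", "name:sledge*"));
    }
}
```

    The response body can then be handed to whatever JSON parser the application already uses.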

  • 2021-02-06 11:43

    What about using one of the tools proposed in the other answers (Lucene, etc.) for full-text indexing, and falling back to a LIKE search when nothing is found? That is, run LIKE only after the full-text indexed search has returned zero results.
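
    The fallback idea can be sketched in Java roughly as follows. Both search functions are stand-ins: the in-memory tokenSearch and likeScan helpers are hypothetical and only mimic a token-based index and a LIKE '%term%' scan. A real version would call the index and the database instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: try the fast token-based search first, scan only on a miss.
class FallbackSearch {

    static List<String> search(String term,
                               Function<String, List<String>> fullText,
                               Function<String, List<String>> likeScan) {
        List<String> hits = fullText.apply(term);
        // Pay for the slow substring scan only when the index found nothing.
        return hits.isEmpty() ? likeScan.apply(term) : hits;
    }

    // Stand-in for a full-text index: matches whole whitespace-separated tokens.
    static List<String> tokenSearch(List<String> rows, String term) {
        String t = term.toLowerCase();
        return rows.stream()
                .filter(r -> Arrays.asList(r.toLowerCase().split("\\s+")).contains(t))
                .collect(Collectors.toList());
    }

    // Stand-in for LIKE '%term%': matches any substring.
    static List<String> likeScan(List<String> rows, String term) {
        String t = term.toLowerCase();
        return rows.stream()
                .filter(r -> r.toLowerCase().contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> rows = List.of("sledgehammer 5kg", "claw hammer");
        // "ledge" is not a whole token, so the substring scan is used.
        System.out.println(search("ledge", t -> tokenSearch(rows, t), t -> likeScan(rows, t)));
    }
}
```

    Note the trade-off: a query that matches at least one whole token never reaches the substring scan, so partial matches for that same term are not returned.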

  • 2021-02-06 11:44

    I'm pretty sure MySQL offers a full-text option, and it's probably also possible to use Lucene.

    For related comments, see the question "Best efficient way to make a fulltext search in MySQL".

  • 2021-02-06 11:46

    What you're trying to do is unlikely to ever be all that much faster than LIKE '%searchterm%' without a great deal of custom code. The equivalent of LIKE 'searchterm%' ought to be trivial though.

    You could do what you're asking by building an index of all possible partial words that aren't covered by the trailing wild-card, but this would result in an unbelievably large index size, and it would be unusually slow for updates. Long tokens would result in Bad Things™.

    May I ask why you need this? Re: Spotlight... You do realize that Spotlight doesn't do this, right? It's token-based just like every other full-text indexer. Usually query expansion is the appropriate method of getting inexact matches if that's your goal.
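
    To make the cost of the "index of all possible partial words" concrete, here is a small Java sketch, assuming that idea is implemented by indexing every suffix of every token. A trailing-wildcard lookup over the suffixes then finds any fragment. The class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory suffix index: each token of length n adds n entries.
class SuffixIndex {
    private final TreeMap<String, Set<Integer>> index = new TreeMap<>();

    // Index every suffix of the token under the given row id.
    void add(int rowId, String token) {
        for (int i = 0; i < token.length(); i++) {
            index.computeIfAbsent(token.substring(i), k -> new TreeSet<>()).add(rowId);
        }
    }

    // A fragment occurs in a token exactly when it is a prefix of one of its suffixes,
    // so a prefix range scan over the sorted keys answers an infix query.
    Set<Integer> containing(String fragment) {
        Set<Integer> result = new TreeSet<>();
        String upperBound = fragment + Character.MAX_VALUE;
        for (Set<Integer> ids : index.subMap(fragment, upperBound).values()) {
            result.addAll(ids);
        }
        return result;
    }

    // Number of distinct suffixes stored.
    int entryCount() {
        return index.size();
    }

    public static void main(String[] args) {
        SuffixIndex idx = new SuffixIndex();
        idx.add(1, "sledgehammer");
        idx.add(2, "hammer");
        System.out.println(idx.containing("edge"));
        System.out.println(idx.entryCount());
    }
}
```

    A token of length n contributes n suffixes totalling n(n+1)/2 characters, which is where the index size and the update cost come from, and why long tokens hurt most.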

    Edit:

    I had a project exactly like this at one point; part-numbers for all kinds of stuff. We finally settled on searchterm* in Xapian, but I believe Lucene also has the equivalent. You won't find a good solution that handles wild-card searches on either side of the token, but a trailing wild-card is usually more than good enough for what you want, and I suspect you'll find that users adapt to your system fairly quickly if they have any control over cleaning up the data.

    Combine it with query expansion (or even limited token expansion) and you should be pretty well set. Query expansion would convert a query for "sledgehammer" into "sledgehammer* OR (sledge* hammer*)" or something similar. Not every query will work, but people are already pretty well trained to try related queries when something doesn't work, and as long as at least one or two obvious queries come up with the results they expect, you should be OK.

    Your best bet is still to clean up the data and organize it better. You'd be surprised how easy this ends up being if you version everything and implement an egalitarian edit policy. Maybe let people add keywords to an entry and be sure to index those, but put limits on how many can be set. Too many and you may actually degrade the search results.
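
    The "sledgehammer" expansion mentioned above could be sketched in Java like this. It assumes a known vocabulary of words is available to decide where a term may be split. The QueryExpander class is hypothetical, and the output is just a query string in the style shown in the answer, not the syntax of any particular engine.

```java
import java.util.Set;

// Hypothetical sketch of simple query expansion using a known vocabulary.
class QueryExpander {

    // Always emits the trailing-wildcard form, and adds one alternative
    // for every way the term splits into two vocabulary words.
    static String expand(String term, Set<String> vocabulary) {
        StringBuilder query = new StringBuilder(term).append('*');
        for (int i = 1; i < term.length(); i++) {
            String left = term.substring(0, i);
            String right = term.substring(i);
            if (vocabulary.contains(left) && vocabulary.contains(right)) {
                query.append(" OR (").append(left).append("* ").append(right).append("*)");
            }
        }
        return query.toString();
    }

    public static void main(String[] args) {
        Set<String> vocabulary = Set.of("sledge", "hammer");
        System.out.println(expand("sledgehammer", vocabulary));
    }
}
```

    The vocabulary could be drawn from the tokens already in the index, so that expansions are only generated for splits that can actually match something.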
