How to search for text fragments in a database

旧时难觅i 2021-02-06 10:51

Are there any open source or commercial tools available that allow for text fragment indexing of database contents and can be queried from Java?

Background of the question:

10 Answers
  • 2021-02-06 11:24

    This may not be what you want to hear, because I presume you are trying to solve this with SQL code, but Lucene would be my first choice. You can also build up fairly clever ranking and boosting techniques with additional tools. Lucene is written in Java so it should give you exactly the interface you need.
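
    To give a feel for the API, here is a minimal sketch of indexing and querying with plain Lucene (written against roughly Lucene 8.x; the field name and sample text are made up, and an in-memory directory stands in for real storage):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            Directory dir = new ByteBuffersDirectory(); // in-memory; use FSDirectory for real data

            // Index one database row as a Lucene document.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new TextField("title", "Heavy duty sledgehammer", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Query it back.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("title", analyzer).parse("sledgehammer");
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("title"));
                }
            }
        }
    }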

    If you were a Microsoft shop, the majority of what you're looking for is built into SQL Server, and wildcards can be enabled which will give you the ability to do partial word matches.
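
    If you do end up on SQL Server, a hedged sketch of the JDBC side (connection string, table and column names are placeholders; it assumes a full text index already exists on the column, and note that CONTAINS supports trailing wildcards like "sledge*" but not leading ones):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class SqlServerFullText {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:sqlserver://localhost;databaseName=shop;user=sa;password=secret");
                 PreparedStatement ps = conn.prepareStatement(
                     "SELECT title FROM products WHERE CONTAINS(title, ?)")) {
                ps.setString(1, "\"sledge*\""); // prefix match; a leading wildcard is not supported
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("title"));
                    }
                }
            }
        }
    }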

    In Lucene and Lucene.Net, you can use wildcard matches if you like. However, wildcards are not supported as the first character of a search term by default. If you want the ability to use leading wildcards, you'll probably need to implement some sort of trie-based index on your own, since filtering the term set down to something reasonable is an expensive operation for the kind of index most commonly needed in full text search applications, where suffix stemming is generally more valuable.

    You can apparently alter the QueryParser instance in Lucene to override this rule by calling setAllowLeadingWildcard(true).
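
    A sketch of that (package names from the lucene-queryparser module; the field name is made up):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;

    public class LeadingWildcard {
        public static void main(String[] args) throws Exception {
            QueryParser parser = new QueryParser("title", new StandardAnalyzer());
            // Without this, parsing "*ledge*" throws a ParseException.
            parser.setAllowLeadingWildcard(true);
            Query query = parser.parse("*ledge*");
            System.out.println(query); // title:*ledge* -- legal now, but it scans the term dictionary
        }
    }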

    I'm fairly sure that wildcard-on-both-ends-of-a-word searches are inherently inefficient. Skip lists are sometimes used to improve performance on such searches with plaintext, but I think you're more likely to find an implementation like that in something like grep than a generalized text indexing tool.

    There are other solutions for the problem you describe, where one word may occur spelled as two, or vice versa. Fuzzy queries are supported in Lucene, for example. Orthographic and morphological variants can be handled either by providing a filter that offers suggestions based on some sort of Bayesian mechanism, or by indexing tricks, namely taking a corpus of frequent variants and stuffing the index with those terms. I've even seen knowledge from structured data stuffed into the full text engine (e.g. adding the city name and the word "hotel" to records from the hotel table, to make it more likely that "Paris Hotels" will include a record for the pension-house Caisse des Dépôts). While not exactly a trivial problem, it's manageable without destroying the advantages of word-based searches.
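
    For instance, a fuzzy query in Lucene is a one-liner (field and term are illustrative):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.Query;

    public class FuzzyDemo {
        public static void main(String[] args) {
            // Edit distance 2 covers e.g. "sledg" (one deletion) and "schledge" (two insertions).
            Query q = new FuzzyQuery(new Term("title", "sledge"), 2);
            System.out.println(q); // title:sledge~2, the same query the ~ operator builds
        }
    }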

  • 2021-02-06 11:26

    A "real" full text index using parts of a word would be many times bigger than the source text, and while the search may be faster, any update or insert processing would be horribly slow.

    Your only hope is that there is some sort of pattern to the "mistakes" being made. You could apply a set of "AI"-type rules to the incoming text and produce a canonical form of the text, which you could then apply a full text index to. An example of such a rule could be to split a word ending in "hammer" into two words, s/(\w+)(hammer)/\1 \2/g, or to change "sledg", "sled" and "schledge" to "sledge". You would need to apply the same set of rules to the query text. In this way a product described as "sledgehammer" could be matched by a search for "sledg hammer".
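
    A sketch of such a rule set in Java, with the two example rules above (the rules themselves are obviously toy ones):

    import java.util.regex.Pattern;

    public class Canonicalizer {
        // Rule 1: split a compound ending in "hammer" into two words.
        private static final Pattern HAMMER = Pattern.compile("(\\w+)(hammer)");
        // Rule 2: map frequent misspellings to the canonical "sledge".
        private static final Pattern SLEDGE = Pattern.compile("\\b(sledg|sled|schledge)\\b");

        public static String canonicalize(String text) {
            String s = text.toLowerCase();
            s = HAMMER.matcher(s).replaceAll("$1 $2");
            s = SLEDGE.matcher(s).replaceAll("sledge");
            return s;
        }

        public static void main(String[] args) {
            // Both the indexed text and the query pass through the same rules,
            // so the two spellings meet at the same canonical form.
            System.out.println(canonicalize("sledgehammer")); // sledge hammer
            System.out.println(canonicalize("sledg hammer")); // sledge hammer
        }
    }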

  • 2021-02-06 11:27

    I haven't had this specific requirement myself, but my experience tells me Lucene can do the trick, though perhaps not standalone. I'd definitely use it through Solr as described by Michael Della Bitta in another answer. The link he gave was spot on - read it for more background.

    Briefly, Solr lets you define custom FieldTypes. These consist of an index-time Analyzer and a query-time Analyzer. Analyzers figure out what to do with the text, and each consists of a Tokenizer and zero to many TokenFilters. The Tokenizer splits your text into chunks and then each TokenFilter can add, subtract, or modify tokens.

    The field can thus end up indexing something quite different from the original text, including multiple tokens if necessary. So what you want is a multiple-token copy of your original text, which you query by sending Lucene something like "my_ngram_field:sledge". No wildcards involved :-)

    Then you follow a model similar to the prefix searching offered up in the example schema.xml file:

    <fieldType name="prefix_token" class="solr.TextField" positionIncrementGap="1">
        <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
    </fieldType>
    

    The EdgeNGramFilterFactory is how they implement prefix matching for search box autocomplete. It takes the tokens coming from the previous stages (single whitespace-delimited words transformed into lower case) and fans them out into every substring on the leading edge. sledgehammer = s,sl,sle,sled,sledg,sledge,sledgeh, etc.

    You need to follow this pattern, but replace the EdgeNGramFilterFactory with one of your own that does all NGrams in the field. The default org.apache.solr.analysis.NGramFilterFactory is a good start, but it does letter transpositions for spell checking. You could copy it and strip that out - it's a pretty simple class to implement.
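
    If your Lucene/Solr version is recent enough, the stock NGramTokenFilter already emits every interior gram between a min and max size; a quick way to see the fan-out (the 4-argument constructor appeared around Lucene 7.4, earlier versions take only the stream and the two sizes):

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.ngram.NGramTokenFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class NGramFanOut {
        public static void main(String[] args) throws Exception {
            WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
            tokenizer.setReader(new StringReader("sledgehammer"));
            // Unlike EdgeNGram, this emits interior grams too: sle, led, edg, dge, ...
            TokenStream stream = new NGramTokenFilter(tokenizer, 3, 5, false);
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.print(term + " ");
            }
            stream.end();
            stream.close();
        }
    }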

    Once you have your own FieldType (call it ngram_text) using your own MyNGramFilterFactory, just create your original field and the ngram field like so:

        <field name="title" type="text" indexed="true" stored="true"/>
        <field name="title_ngrams" type="ngram_text" indexed="true" stored="false"/>
    

    Then tell it to copy the original field into the fancy one:

    <copyField source="title" dest="title_ngrams"/>
    

    Alright, now when you search for "title_ngrams:sledge" you should get a list of documents that contain it. Then in your field list for the query, you just tell it to retrieve the field called title rather than the field title_ngrams.
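
    From Java, that query could look like this with SolrJ (the URL and core name are placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class NGramSearch {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/products").build()) {
                SolrQuery q = new SolrQuery("title_ngrams:sledge");
                q.setFields("title"); // match on the ngram field, display the original
                QueryResponse rsp = solr.query(q);
                for (SolrDocument doc : rsp.getResults()) {
                    System.out.println(doc.getFieldValue("title"));
                }
            }
        }
    }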

    That should be enough of a nudge to allow you to fit things together and tune it to astonishing performance levels rather easily. At an old job we had a database with over ten million products with large HTML descriptions and managed to get Lucene to do both the standard query and the spellcheck in under 200ms on a mid-sized server handling several dozen simultaneous queries. When you have a lot of users, caching kicks in and makes it scream!

    Oh, and incremental (though not real-time) indexing is a cinch. It can even do it under high loads since it creates and optimizes the new index in the background and autowarms it before swapping it in. Very slick.

    Good luck!

  • 2021-02-06 11:28

    The exact answer to your question is right here. Whether it will perform sufficiently well for the size of your data is another question.

  • 2021-02-06 11:31

    Shingle search could do the trick.

    http://en.wikipedia.org/wiki/W-shingling

    For example, if you use 3-character shingles, you can split "Roisonic" into "roi", "son", "ic " and store all three values, associating them with the original entry. When searching for "oison", you would search for "ois", "iso", "son": first you fuzzy-match all entries by shingles (finding the one with "son"), and then you can refine the search using exact string matching.

    Note that a 3-character shingle requires the fragment in the query to be at least 5 characters long, a 4-character shingle requires a 7-character query, and so on: the query must be long enough to fully cover at least one stored shingle however it is aligned, i.e. at least 2w - 1 characters for w-character shingles.
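
    A sketch of both shingling schemes in Java, following the example above (non-overlapping, padded shingles for the stored text; overlapping ones for the query so that at least one aligns with a stored shingle):

    import java.util.ArrayList;
    import java.util.List;

    public class Shingles {
        // Non-overlapping w-shingles of the stored text, space-padded like "ic ".
        static List<String> storedShingles(String text, int w) {
            List<String> out = new ArrayList<>();
            String s = text.toLowerCase();
            for (int i = 0; i < s.length(); i += w) {
                out.add(String.format("%-" + w + "s", s.substring(i, Math.min(i + w, s.length()))));
            }
            return out;
        }

        // Overlapping w-shingles of the query fragment.
        static List<String> queryShingles(String query, int w) {
            List<String> out = new ArrayList<>();
            String s = query.toLowerCase();
            for (int i = 0; i + w <= s.length(); i++) {
                out.add(s.substring(i, i + w));
            }
            return out;
        }

        public static void main(String[] args) {
            System.out.println(storedShingles("Roisonic", 3)); // [roi, son, ic ]
            System.out.println(queryShingles("oison", 3));     // [ois, iso, son]
        }
    }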

  • 2021-02-06 11:39

    If your table is MyISAM, you can use MySQL's full text search capabilities: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
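
    From Java that is an ordinary JDBC query (connection details, table and column names are placeholders, and note this matches whole words, not arbitrary substrings):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class MySqlFullText {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/shop", "user", "password");
                 PreparedStatement ps = conn.prepareStatement(
                     // Requires a FULLTEXT index, e.g. ALTER TABLE products ADD FULLTEXT(description);
                     "SELECT id, description FROM products " +
                     "WHERE MATCH(description) AGAINST(? IN NATURAL LANGUAGE MODE)")) {
                ps.setString(1, "sledgehammer");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("id") + ": " + rs.getString("description"));
                    }
                }
            }
        }
    }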

    If not, the "industry standard" is http://www.sphinxsearch.com/

    Some ideas on what to do if you are using InnoDB: http://www.mysqlperformanceblog.com/2009/09/10/what-to-do-with-mysql-full-text-search-while-migrating-to-innodb/

    Also, a good presentation that introduces Sphinx and explains its architecture and usage: http://www.scribd.com/doc/2670976/Sphinx-High-Performance-Full-Text-Search-for-MySQL-Presentation

    Update
    Having read your clarification to the question -- Sphinx can do substring matches. You need to set "enable-star" and create an infix index with the appropriate min_infix_length (1 will give you all possible substrings, but obviously the higher you set it, the smaller your index will be, and the faster your searches). See http://sphinxsearch.com/docs/current.html for details.
