How to search for text fragments in a database

后端 未结 10 1287
旧时难觅i
旧时难觅i 2021-02-06 10:51

Are there any open source or commercial tools available that allow for text fragment indexing of database contents and can be queried from Java?

Background of the questi

10条回答
  •  醉话见心
    2021-02-06 11:27

    I haven't had this specific requirement myself, but my experience tells me Lucene can do the trick, though perhaps not standalone. I'd definitely use it through Solr as described by Michael Della Bitta in the first answer. The link he gave was spot on - read it for more background.

    Briefly, Solr lets you define custom FieldTypes. These consist of an index-time Analyzer and a query-time Analyzer. Analyzers figure out what to do with the text, and each consists of a Tokenizer and zero to many TokenFilters. The Tokenizer splits your text into chunks and then each TokenFilter can add, subtract, or modify tokens.

    The field can thus end up indexing something quite different from the original text, including multiple tokens if necessary. So what you want is a multiple-token copy of your original text, which you query by sending Lucene something like "my_ngram_field:sledge". No wildcards involved :-)

    Then you follow a model similar to the prefix searching offered up in the solrconfig.xml file:

    
        
            
            
            
        
        
            
            
        
    
    

    The EdgeNGramFilterFactory is how they implement prefix matching for search box autocomplete. It takes the tokens coming from the previous stages (single whitespace-delimited words transformed into lower case) and fans them out into every substring on the leading edge. sledgehammer = s,sl,sle,sled,sledg,sledge,sledgeh, etc.

    You need to follow this pattern, but replace the EdgeNGramFilterFactory with your own which does all NGrams in the field. The default org.apache.solr.analysis.NGramFilterFactory is a good start, but it does letter transpositions for spell checking. You could copy it and strip that out - it's a pretty simple class to implement.

    Once you have your own FieldType (call it ngram_text) using your own MyNGramFilterFactory, just create your original field and the ngram field like so:

        
        
    

    Then tell it to copy the original field into the fancy one:

    
    

    Alright, now when you search "title_ngrams:sledge" you should get a list of documents that contain this. Then in your field list for the query you just tell it to retrieve the field called title rather than the field title_ngrams.

    That should be enough of a nudge to allow you to fit things together and tune it to astonishing performance levels rather easily. At an old job we had a database with over ten million products with large HTML descriptions and managed to get Lucene to do both the standard query and the spellcheck in under 200ms on a mid-sized server handling several dozen simultaneous queries. When you have a lot of users, caching kicks in and makes it scream!

    Oh, and incremental (though not real-time) indexing is a cinch. It can even do it under high loads since it creates and optimizes the new index in the background and autowarms it before swapping it in. Very slick.

    Good luck!

提交回复
热议问题