Lucene: Multi-word phrases as search terms

我寻月下人不归 2020-12-03 15:46

I'm trying to make a searchable phone/local business directory using Apache Lucene.

I have fields for street name, business name, phone number etc. The problem tha

4 Answers
  • 2020-12-03 16:02

    If you want an exact match on the whole street name, you could index the "Street" field as NOT_ANALYZED. The value is then stored as a single term, so the stop word "the" is not filtered out.

    doc.add(new Field("Street", "the crescent", Field.Store.YES, Field.Index.NOT_ANALYZED));
    
  • 2020-12-03 16:06

    There is no need for a custom analyzer here. Hibernate Search uses StandardAnalyzer by default, which splits the field value into separate word tokens. Setting analyze to Analyze.NO indexes the whole value as a single term instead, so a multi-word phrase can be matched as one unit:

    @Column(name="skill")
    @Field(index=Index.YES, analyze=Analyze.NO, store=Store.NO)
    @Analyzer(definition="SkillsAnalyzer")
    private String skill;
    
  • 2020-12-03 16:07

    I found that my attempt to generate a query without using a QueryParser was not working, so I stopped trying to create my own queries and used a QueryParser instead. All of the recommendations that I saw online said that you should use the same Analyzer in the QueryParser that you use during indexing, so I used a StandardAnalyzer to build the QueryParser.

    This works for this example because the StandardAnalyzer removes the word "the" from the street "the crescent" during indexing, so "the" is not in the index and cannot be searched for. Parsing the query with the same analyzer removes "the" from the query as well.

    However, if we search for "Grove Road", the out-of-the-box behaviour is a problem: the query returns every result containing either "Grove" OR "Road". This is easily fixed by setting the QueryParser's default operator to AND instead of OR.

    In the end, the correct solution was the following:

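    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.SimpleFSDirectory;
    import org.apache.lucene.util.Version;

    int numberOfHits = 200;
    String locationOfDirectory = "C:\\dir\\index";
    TopScoreDocCollector collector = TopScoreDocCollector.create(numberOfHits, true);
    Directory directory = new SimpleFSDirectory(new File(locationOfDirectory));
    IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));

    // Use the same analyzer as at index time.
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

    //WildcardQuery q = new WildcardQuery(new Term("Street", "the crescent"));
    QueryParser qp = new QueryParser(Version.LUCENE_35, "Street", analyzer);
    // Require every term to match (the default is OR).
    qp.setDefaultOperator(QueryParser.Operator.AND);

    Query q = qp.parse("grove road");

    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;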
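
    To make the OR versus AND difference concrete, here is a minimal sketch in plain Java. It is a toy inverted index, not Lucene code, and the class name, method names and sample streets are all made up for illustration.

    ```java
    import java.util.HashMap;
    import java.util.List;
    import java.util.Locale;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeSet;

    // Toy inverted index showing OR versus AND semantics; this is NOT Lucene code.
    class DefaultOperatorDemo {
        // Maps each lowercase term to the ids of the documents that contain it.
        static Map<String, Set<Integer>> buildIndex(List<String> streets) {
            Map<String, Set<Integer>> index = new HashMap<>();
            for (int docId = 0; docId < streets.size(); docId++) {
                for (String term : streets.get(docId).toLowerCase(Locale.ROOT).split("\\s+")) {
                    index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
                }
            }
            return index;
        }

        // requireAll = true models the AND operator; false models OR.
        static Set<Integer> search(Map<String, Set<Integer>> index, String query, boolean requireAll) {
            Set<Integer> result = null;
            for (String term : query.toLowerCase(Locale.ROOT).split("\\s+")) {
                Set<Integer> postings = index.getOrDefault(term, Set.of());
                if (result == null) {
                    result = new TreeSet<>(postings);   // first term seeds the result
                } else if (requireAll) {
                    result.retainAll(postings);         // intersection
                } else {
                    result.addAll(postings);            // union
                }
            }
            return result == null ? new TreeSet<>() : result;
        }

        public static void main(String[] args) {
            // Hypothetical sample data.
            Map<String, Set<Integer>> index =
                    buildIndex(List.of("Grove Road", "Grove Street", "Station Road"));
            System.out.println("OR:  " + search(index, "grove road", false));
            System.out.println("AND: " + search(index, "grove road", true));
        }
    }
    ```

    With this sample data the OR search returns documents 0, 1 and 2, while the AND search returns only document 0 ("Grove Road").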
    
  • 2020-12-03 16:24

    The reason you don't get your documents back is that you index with StandardAnalyzer, which converts tokens to lowercase and removes stop words. So the only term that gets indexed for your example is 'crescent'. However, wildcard queries are not analyzed, so 'the' is included as a mandatory part of the query. The same goes for phrase queries in your scenario.
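
    The mismatch can be sketched in plain Java. This is a rough model of the behaviour described above, not Lucene's implementation: the analyze method and the abbreviated stop-word list are assumptions made for illustration.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;
    import java.util.Set;

    // Toy model of index-time analysis; this is NOT Lucene code.
    class AnalysisMismatchDemo {
        // Hypothetical, abbreviated stop-word list; the real one is longer.
        static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the");

        // Rough stand-in for StandardAnalyzer: tokenize, lowercase, drop stop words.
        static List<String> analyze(String text) {
            List<String> tokens = new ArrayList<>();
            for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
                String token = raw.toLowerCase(Locale.ROOT);
                if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                    tokens.add(token);
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            List<String> indexed = analyze("The Crescent");
            System.out.println("indexed terms: " + indexed);

            // An unanalyzed query term is compared with the index as-is.
            System.out.println("raw term found: " + indexed.contains("the crescent"));

            // Running the query text through the same analysis yields matching terms.
            System.out.println("analyzed query found: "
                    + indexed.containsAll(analyze("the crescent")));
        }
    }
    ```

    Here the only indexed term is "crescent", so the raw term "the crescent" is not found, while the analyzed query is.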

    KeywordAnalyzer is probably not very suitable for your use case, because it takes the whole field content as a single token. You can use SimpleAnalyzer for the street field: it splits the input on all non-letter characters and then converts the tokens to lowercase. You can also consider using WhitespaceAnalyzer with LowerCaseFilter. You need to try different options and work out what works best for your data and users.
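
    To compare the three options, here is a small sketch in plain Java that mimics the tokenization described above. It does not call Lucene; the method names and the sample value "12 Grove-Road" are hypothetical.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    // Toy tokenizers that mimic the behaviour described above; this is NOT Lucene code.
    class TokenizerComparison {
        // KeywordAnalyzer-like: the whole value is one token.
        static List<String> keywordLike(String text) {
            return List.of(text);
        }

        // SimpleAnalyzer-like: split at non-letters, then lowercase.
        static List<String> simpleLike(String text) {
            return splitAndLowercase(text, "[^\\p{L}]+");
        }

        // WhitespaceAnalyzer plus LowerCaseFilter: split at whitespace, then lowercase.
        static List<String> whitespaceLowercaseLike(String text) {
            return splitAndLowercase(text, "\\s+");
        }

        private static List<String> splitAndLowercase(String text, String delimiterRegex) {
            List<String> tokens = new ArrayList<>();
            for (String raw : text.split(delimiterRegex)) {
                if (!raw.isEmpty()) {   // split can yield a leading empty string
                    tokens.add(raw.toLowerCase(Locale.ROOT));
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            String street = "12 Grove-Road";   // hypothetical sample value
            System.out.println(keywordLike(street));
            System.out.println(simpleLike(street));
            System.out.println(whitespaceLowercaseLike(street));
        }
    }
    ```

    For this value the keyword-like version yields one token "12 Grove-Road", the simple-like version yields "grove" and "road" (the house number is dropped), and the whitespace version yields "12" and "grove-road".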

    Also, you can use different analyzers per field (e.g. with PerFieldAnalyzerWrapper) if changing the analyzer for that field breaks other searches.
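
    The per-field idea can be sketched as a lookup from field name to analysis function. This is plain Java and only an illustration of the concept, not PerFieldAnalyzerWrapper itself; the field name "BusinessName", the stop-word list and all method names are assumptions.

    ```java
    import java.util.Arrays;
    import java.util.List;
    import java.util.Locale;
    import java.util.Map;
    import java.util.Set;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    // Toy per-field analyzer lookup; this is NOT Lucene's PerFieldAnalyzerWrapper.
    class PerFieldAnalysisDemo {
        // Hypothetical, abbreviated stop-word list.
        static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the");

        // Tokenize and lowercase, keeping stop words.
        static List<String> lowercaseTokens(String text) {
            return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
                    .filter(t -> !t.isEmpty())
                    .collect(Collectors.toList());
        }

        // Tokenize, lowercase and drop stop words (the default in this sketch).
        static List<String> withoutStopWords(String text) {
            return lowercaseTokens(text).stream()
                    .filter(t -> !STOP_WORDS.contains(t))
                    .collect(Collectors.toList());
        }

        // Only the "Street" field overrides the default analysis.
        static final Map<String, Function<String, List<String>>> PER_FIELD =
                Map.of("Street", PerFieldAnalysisDemo::lowercaseTokens);

        static List<String> analyze(String field, String text) {
            Function<String, List<String>> defaultAnalyzer = PerFieldAnalysisDemo::withoutStopWords;
            return PER_FIELD.getOrDefault(field, defaultAnalyzer).apply(text);
        }

        public static void main(String[] args) {
            System.out.println(analyze("Street", "The Crescent"));
            System.out.println(analyze("BusinessName", "The Crescent Cafe"));
        }
    }
    ```

    In this sketch "the" survives in the Street field but is still removed from every other field, so searches on those fields behave as before.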
