How to do case insensitive sorting of Norwegian characters (Æ, Ø, and Å) using Hibernate Lucene Search?

Question

Æ, Ø, and Å are the last three letters of the Norwegian alphabet:

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Æ Ø Å

When we sort with Hibernate Search (Lucene), Å is grouped with A, Ø with O, and Æ with A, which is wrong. For example:

Current results:

Aaalu, Åaalu, Baalu, Zaalu,

Expected results:

Aaalu, Baalu, Zaalu, Åaalu,

This is the code currently in use:

@AnalyzerDef(name = "myOwnAnalyzer",
tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class),
filters = {
    @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
    @TokenFilterDef(factory = LowerCaseFilterFactory.class),
    @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
        @Parameter(name = "pattern", value = "('-&\\.,\\(\\))"),
        @Parameter(name = "replacement", value = " "),
        @Parameter(name = "replace", value = "all")
    }),
    @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
        @Parameter(name = "pattern", value = "([^0-9\\p{L} ])"),
        @Parameter(name = "replacement", value = ""),
        @Parameter(name = "replace", value = "all")
    }),
    @TokenFilterDef(factory = TrimFilterFactory.class)
}
)
public class KikaPaya implements Serializable {

@Fields({ @Field(index = Index.YES, store = Store.YES), @Field(name = "KikaPayaName_for_sort", index = Index.YES, analyzer = @Analyzer(definition = "myOwnAnalyzer")) })
@Column(name = "NAME", length = 100)
private String name;

Main:

  FullTextEntityManager ftem = Search.getFullTextEntityManager(factory.createEntityManager());
  QueryBuilder qb = ftem.getSearchFactory().buildQueryBuilder().forEntity( KikaPaya.class ).get();
  org.apache.lucene.search.Query query = qb.all().getQuery(); 
  FullTextQuery fullTextQuery = ftem.createFullTextQuery(query, KikaPaya.class);
  fullTextQuery.setSort(new Sort(new SortField("KikaPayaName_for_sort", SortField.STRING, true)));
  fullTextQuery.setFirstResult(0).setMaxResults(150);
  int size = fullTextQuery.getResultSize();
  List<KikaPaya> result = fullTextQuery.getResultList();
  for (KikaPaya user : result) {
    logger.info("KikaPaya Name:" + user.getName());
  }
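
For context, the grouping reported above is consistent with the ASCIIFoldingFilterFactory in this analyzer: folding replaces Å with A, Ø with O, and Æ with AE before the sort key is indexed, so "Åaalu" and "Aaalu" end up with the same key. The following is a minimal plain-Java sketch of that effect. `foldKey` and its hand-written mappings are hypothetical stand-ins for illustration only; this is not Lucene's actual filter.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

final class FoldingSortSketch {

    // Hypothetical stand-in for ASCII folding plus lower-casing.
    // Only the three Norwegian letters from the question are handled.
    static String foldKey(String s) {
        return s.replace("\u00C6", "AE").replace("\u00E6", "ae")   // Æ, æ
                .replace("\u00D8", "O").replace("\u00F8", "o")     // Ø, ø
                .replace("\u00C5", "A").replace("\u00E5", "a")     // Å, å
                .toLowerCase(Locale.ROOT);
    }

    // Sorts a copy of the input by the folded key, as a sort on the folded field would.
    static List<String> sortByFoldedKey(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(Comparator.comparing(FoldingSortSketch::foldKey));
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        names.add("Zaalu");
        names.add("Baalu");
        names.add("Aaalu");
        names.add("\u00C5aalu"); // Åaalu
        // After folding, Åaalu and Aaalu share the key "aaalu", so they sort together.
        System.out.println(sortByFoldedKey(names));
    }
}
```

Under this assumption, removing the folding filter alone would not give the Norwegian order either, because the remaining comparison is by code point rather than by alphabet position.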

These are the Hibernate and Lucene versions in use (which I cannot change):

<hibernate.version>4.2.8.Final</hibernate.version>
<hibernate.search.version>4.3.0.Final</hibernate.search.version>

<dependency>
    <groupId>org.hibernate</groupId>
    <artifactId>hibernate-entitymanager</artifactId>
    <version>4.2.8.Final</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>3.6.2</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers</artifactId>
    <version>3.6.2</version>
</dependency>

Could anyone suggest a way to get the correct results?

Answer 1:

I must admit this is not a common requirement. As far as I can see, there is a Lucene module which uses ICU for locale-dependent sorting.

See the lucene-icu artifact and especially the ICUCollationKeyFilter and ICUCollationKeyAnalyzer (the analyzer is a KeywordTokenizer with the filter). You will need to create the factory necessary to use it with Hibernate Search but it should be quite easy.

Can't really promise it will work but it's probably your best bet.
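
To illustrate the idea behind collation-based sorting without any Lucene or ICU dependency, here is a minimal sketch using the JDK's `java.text.RuleBasedCollator`. It is not the ICU filter itself. The rule string is a simplified assumption (a to z, then æ, ø, å) and not the complete Norwegian tailoring, and the class and method names are hypothetical.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NorwegianCollationSketch {

    // Builds a collator from simplified, assumed rules: a..z first, then æ < ø < å.
    // Upper and lower case differ only at the tertiary level ("a, A").
    static Collator norwegianLikeCollator() {
        StringBuilder rules = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            rules.append("< ").append(c).append(", ")
                 .append(Character.toUpperCase(c)).append(' ');
        }
        rules.append("< \u00E6, \u00C6 ");  // æ, Æ
        rules.append("< \u00F8, \u00D8 ");  // ø, Ø
        rules.append("< \u00E5, \u00C5");   // å, Å
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules.toString());
            // SECONDARY strength ignores tertiary (case) differences.
            collator.setStrength(Collator.SECONDARY);
            return collator;
        } catch (ParseException e) {
            throw new IllegalStateException("Invalid collation rules", e);
        }
    }

    // Returns a sorted copy of the input, ordered by the collator above.
    static List<String> sortNames(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(norwegianLikeCollator());
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sortNames(
                Arrays.asList("Zaalu", "\u00C5aalu", "baalu", "Aaalu")));
    }
}
```

A collation key filter applies the same principle at index time: the sort field stores a key derived from the collator instead of a folded string, so a plain term sort on that field follows the alphabet order. Note that a full Norwegian tailoring may also treat "aa" as equivalent to "å", which this sketch deliberately leaves out.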

Source: https://stackoverflow.com/questions/39264308/how-to-do-case-insensitive-sorting-of-norwegian-characters-%c3%86-%c3%98-and-%c3%85-using-h

Tags

java

hibernate

lucene

hibernate-search