Full-text search stemming not returning consistent results in different languages

Posted by ぐ巨炮叔叔 on 2019-12-13 19:57:20

Question


I have a SQL Server 2016 database with full-text indexes defined on 4 columns, each configured for a different language: Dutch, English, German & French. I used the wizard to set up the full-text index.

I am using CONTAINSTABLE with FORMSOF(INFLECTIONAL, ...). For each language, I would expect a query on either the word stem or any verb form to return both rows of the example table. This works in English & German, partially in French, and not at all in Dutch.

I am using a very basic example with verb forms of 'running' in every language so I'm thinking something might not be configured correctly.
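
One way to rule out a configuration problem is to check which language each indexed column was actually assigned. The following is a minimal sketch using the catalog views sys.fulltext_index_columns, sys.columns and sys.fulltext_languages; the dbo schema for the SearchResult table is an assumption.

```sql
-- List every full-text indexed column of the table with its configured language.
-- Assumption: the example table lives in the dbo schema.
SELECT c.name        AS column_name,
       fic.language_id,
       fl.name       AS language_name
FROM sys.fulltext_index_columns AS fic
JOIN sys.columns AS c
    ON c.object_id = fic.object_id
   AND c.column_id = fic.column_id
JOIN sys.fulltext_languages AS fl
    ON fl.lcid = fic.language_id
WHERE fic.object_id = OBJECT_ID(N'dbo.SearchResult');
```

If KeyWordsNL shows LCID 1043 here, the index itself is configured as intended and the problem lies elsewhere.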

Example table

+----+-------------+--------------+-----------------+----------------+
| ID | KeyWordsNL  |  KeyWordsEN  |   KeyWordsDE    |   KeyWordsFR   |
+----+-------------+--------------+-----------------+----------------+
|  1 | ik loop     | i run        | ich laufe       | je cours       |
|  2 | ik ga lopen | i am running | ich gehe laufen | je vais courir |
+----+-------------+--------------+-----------------+----------------+

English queries

CONTAINSTABLE (SearchResult, KeyWordsEN, 'FORMSOF(INFLECTIONAL, "run")')
CONTAINSTABLE (SearchResult, KeyWordsEN, 'FORMSOF(INFLECTIONAL, "running")')

returns 1 & 2 for each query
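
The fragments above only show the CONTAINSTABLE call. As a sketch of the full query around one of them, assuming ID is the unique key column of the full-text index and the table is in the dbo schema:

```sql
-- CONTAINSTABLE returns a table with [KEY] and [RANK] columns;
-- join [KEY] back to the full-text key column to get the matching rows.
-- Assumptions: dbo schema, ID is the full-text key column.
SELECT sr.ID,
       sr.KeyWordsEN,
       ft.[RANK]
FROM dbo.SearchResult AS sr
INNER JOIN CONTAINSTABLE(dbo.SearchResult, KeyWordsEN,
                         'FORMSOF(INFLECTIONAL, "run")') AS ft
    ON ft.[KEY] = sr.ID
ORDER BY ft.[RANK] DESC;
```

The queries for the other languages follow the same shape, with only the column name and search term changed.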

German queries

CONTAINSTABLE (SearchResult, KeyWordsDE, 'FORMSOF(INFLECTIONAL, "laufe")')
CONTAINSTABLE (SearchResult, KeyWordsDE, 'FORMSOF(INFLECTIONAL, "laufen")')

returns 1 & 2 for each query

French queries

CONTAINSTABLE (SearchResult, KeyWordsFR, 'FORMSOF(INFLECTIONAL, "cours")')
CONTAINSTABLE (SearchResult, KeyWordsFR, 'FORMSOF(INFLECTIONAL, "courir")')

The first query (cours) returns only record 1; the second query (courir) returns records 1 & 2.

Dutch queries

CONTAINSTABLE (SearchResult, KeyWordsNL, 'FORMSOF(INFLECTIONAL, "loop")')
CONTAINSTABLE (SearchResult, KeyWordsNL, 'FORMSOF(INFLECTIONAL, "lopen")')

The first query (loop) returns only record 1, and the second query (lopen) returns only record 2.

Edit: Further testing ...

It is possible to test how full-text search parses the input query by using sys.dm_fts_parser. This makes it clear that no stemming is happening for Dutch. I tested this on different machines.

Getting the language LCID:

select * from sys.fulltext_languages where name in ('Dutch','English','German','French')

select * from sys.dm_fts_parser('FORMSOF(INFLECTIONAL, "koe")', 1043, 0, 0)

select * from sys.dm_fts_parser('FORMSOF(INFLECTIONAL, "cow")', 1033, 0, 0)

The Dutch query results in "koe", while the English query results in "cow's", "cowed", "cowing", "cows", "cows", "cow".

The same happens for every word I try: Dutch returns no extra forms of any word, while English typically returns 5-10 word forms.
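
To compare all four languages side by side, the parser calls can be combined into one result set. This is a sketch using the verbs from the example table; the language_name labels are added for readability, and the LCIDs are the standard ones (1043 Dutch, 1033 English, 1031 German, 1036 French).

```sql
-- expansion_type = 2 marks terms produced by inflectional expansion;
-- a language without a working stemmer returns only the original term (type 0).
SELECT 'Dutch' AS language_name, display_term, expansion_type
FROM sys.dm_fts_parser('FORMSOF(INFLECTIONAL, "lopen")', 1043, 0, 0)
UNION ALL
SELECT 'English', display_term, expansion_type
FROM sys.dm_fts_parser('FORMSOF(INFLECTIONAL, "running")', 1033, 0, 0)
UNION ALL
SELECT 'German', display_term, expansion_type
FROM sys.dm_fts_parser('FORMSOF(INFLECTIONAL, "laufen")', 1031, 0, 0)
UNION ALL
SELECT 'French', display_term, expansion_type
FROM sys.dm_fts_parser('FORMSOF(INFLECTIONAL, "courir")', 1036, 0, 0);
```

Counting rows per language_name gives a quick indication of which languages expand the term at all.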


Answer 1:


I found that there is simply no specific stemming library for Dutch (and other languages). It is not clearly stated, but this article explains how to revert the word breaker and stemmer to previous versions, and it appears that the word breaker and stemmer actually use the same DLL.

The following query shows that Dutch (LCID 1043) uses the default neutral-language word breaker/stemmer, which explains the poor results.

EXEC sp_help_fulltext_system_components 'wordbreaker';

To get the LCID per language:

SELECT * FROM sys.fulltext_languages; 


Source: https://stackoverflow.com/questions/48299202/full-text-search-stemming-not-returning-consistent-results-in-different-language
