I am looking for a stemming algorithm for the Slovenian language that I can use with Sphinx search.
What I'm trying to achieve is, for example, when searching for 'jabolka'
I'm not sure if this will do what you want, but I came across this reference to a tool called spelldump in the Sphinx documentation:
spelldump is one of the helper tools within the Sphinx package.
It is used to extract the contents of a dictionary file that uses ispell or MySpell format, which can help build word lists for wordforms - all of the possible forms are pre-built for you.
http://sphinxsearch.com/docs/current.html#ref-spelldump
It requires "a dictionary file that uses ispell or MySpell" - I found a reference to a Slovenian ispell dictionary file, which might be suitable.
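To make the target format concrete, here is a minimal sketch of what a wordforms list looks like once you have the forms. It assumes the plain "inflected > base" line format that Sphinx wordforms files use; the helper name `wordforms_lines` and the sample word forms are my own illustration and are not spelldump output.

```python
# Illustrative only: the word forms below are hand-picked examples,
# not output from spelldump or from a real Slovenian dictionary.
def wordforms_lines(forms_by_base):
    """Return 'inflected > base' lines, one per inflected form."""
    lines = []
    for base, forms in sorted(forms_by_base.items()):
        for form in sorted(set(forms)):
            if form != base:  # a word does not need to map to itself
                lines.append(f"{form} > {base}")
    return lines


# Hypothetical sample data for the noun "jabolko" (apple).
sample = {"jabolko": ["jabolka", "jabolku", "jabolkom"]}
lines = wordforms_lines(sample)

# Save this text to the file that the index's `wordforms` setting points to.
print("\n".join(lines))
```

With a list like this in place, a search for any of the listed forms should be normalized to the same base word at both index and query time.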
Good luck!
I managed to compile a Slovenian stemmer with the following steps:
stem_ISO_8859_2.sbl
stem_Unicode.sbl
(You have to find the UTF character codes for Slovenian special characters such as ČŠŽĆ.)
Edit both .txt files in the /libstemmer folder and add an entry for Slovenian:
slovene UTF_8,ISO_8859_2 slovene,sl,slv
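If you would rather script this edit than do it by hand, the following is a small sketch that appends the entry only when no line for that algorithm is present yet. The helper name `add_module_entry` is made up for this example, and it assumes the two files are named modules.txt and modules_utf8.txt (the names used by the mkmodules.pl commands) and that it is run from the Snowball root.

```python
from pathlib import Path

# Entry quoted from the step above.
ENTRY = "slovene UTF_8,ISO_8859_2 slovene,sl,slv"


def add_module_entry(path, entry=ENTRY):
    """Append entry to a modules file unless that algorithm is already listed.

    Returns True if the file was changed, False otherwise.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    name = entry.split()[0]  # first column is the algorithm name
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return False
    if text and not text.endswith("\n"):
        text += "\n"
    path.write_text(text + entry + "\n", encoding="utf-8")
    return True


if __name__ == "__main__":
    # Assumed location: the libstemmer folder under the current directory.
    for filename in ("modules.txt", "modules_utf8.txt"):
        target = Path("libstemmer") / filename
        if target.is_file():
            changed = add_module_entry(target)
            print(f"{target}: {'added' if changed else 'already present'}")
        else:
            print(f"{target}: not found")
```

Running it twice is harmless, because the second run finds the existing line and leaves the file alone.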
Go to the /libstemmer folder and run:
./mkmodules.pl modules.h src_c modules.txt ../mkinc.mak
./mkmodules.pl modules_utf8.h src_c modules_utf8.txt ../mkinc_utf8.mak
This will generate files needed for compiling later.
make
(from the root of the unpacked files)
If there were no errors during compilation, you should have a /src_c folder containing the code for the Slovenian stemmer (next to the others):
stem_UTF_8_slovene.c
stem_ISO_8859_2_slovene.c
...
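A quick way to confirm this before moving on to Sphinx is to check for the two generated files named above. This is only a sketch: the function name `missing_stemmer_sources` is hypothetical, and it assumes it is run from the root of the unpacked Snowball files so that src_c is a direct subfolder.

```python
from pathlib import Path

# The two generated sources mentioned above.
EXPECTED = ("stem_UTF_8_slovene.c", "stem_ISO_8859_2_slovene.c")


def missing_stemmer_sources(src_dir="src_c", expected=EXPECTED):
    """Return the expected file names that are not present in src_dir."""
    src = Path(src_dir)
    return [name for name in expected if not (src / name).is_file()]


if __name__ == "__main__":
    missing = missing_stemmer_sources()
    if missing:
        print("Missing generated files:", ", ".join(missing))
    else:
        print("Slovene stemmer sources found in src_c")
```

If anything is reported missing, it is worth rerunning the mkmodules.pl and make steps before copying files into Sphinx.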
Unpack the latest Sphinx and copy all files from your Snowball project to the Sphinx /libstemmer_c folder (excluding libstemmer.o and GNUmakefile).
Compile Sphinx:
touch NEWS README AUTHORS ChangeLog
autoreconf --force --install
./configure --with-libstemmer
make
make install
If all went well, you should have a working Slovene stemmer for Sphinx. You just have to enable it in your Sphinx index configuration (on my Debian it is in /usr/local/etc/sphinx.conf):
charset_type = utf-8
morphology = libstemmer_slovene
Hope this helps someone. I had no prior experience with autoconf, so it took me a while to figure this out.
This Slovene stemmer is not officially released on http://snowball.tartarus.org, but in my tests it works well enough for my project.