I need sample code to help me detect Farsi-language web pages with the Apache Tika toolkit.
LanguageIdentifier identifier = new LanguageIdentifier("فارسی");
Tika doesn't ship with a language profile for the Farsi language yet. As of version 1.0, 27 languages are supported out of the box:
languages=be,ca,da,de,eo,et,el,en,es,fi,fr,gl,hu,is,it,lt,nl,no,pl,pt,ro,ru,sk,sl,sv,th,uk
In your example the input is misdetected as lt (Lithuanian) with a distance of 0.41, which is above the certainty threshold of 0.022. See the source code for more information on the inner workings of LanguageIdentifier.
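To make the distance and threshold figures more concrete, here is a rough sketch of how an n-gram profile comparison can work. It is not Tika's code: the class name, the helper names, and the choice of a Euclidean distance over normalized trigram frequencies are my own assumptions for illustration. Only the 0.022 threshold is taken from the numbers quoted above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: a simplified stand-in for an n-gram language profile comparison.
public class NgramDistanceSketch {
    // Threshold quoted in the answer; the constant name is hypothetical.
    static final double CERTAINTY_THRESHOLD = 0.022;

    // Count overlapping character trigrams in the given text.
    static Map<String, Integer> trigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Euclidean distance between the normalized trigram frequencies of two profiles.
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        double totalA = Math.max(1, a.values().stream().mapToInt(Integer::intValue).sum());
        double totalB = Math.max(1, b.values().stream().mapToInt(Integer::intValue).sum());
        Set<String> all = new HashSet<>(a.keySet());
        all.addAll(b.keySet());
        double sumOfSquares = 0.0;
        for (String ngram : all) {
            double diff = a.getOrDefault(ngram, 0) / totalA - b.getOrDefault(ngram, 0) / totalB;
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    // A match only counts as certain when the distance is below the threshold.
    static boolean isReasonablyCertain(double distance) {
        return distance < CERTAINTY_THRESHOLD;
    }

    public static void main(String[] args) {
        Map<String, Integer> input = trigramCounts("فارسی");
        Map<String, Integer> profile = trigramCounts("lietuvių kalba");
        double d = distance(input, profile);
        System.out.println("distance = " + d + ", certain = " + isReasonablyCertain(d));
    }
}
```

With no matching profile available, the closest profile is still reported, but its distance stays far above the threshold, which is why the result should not be trusted.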
The Farsi language (Persian, ISO 639-1 2-letter code fa) is not recognized by default. If you want Tika to recognize another language, you have to create a language profile first.
For this the following steps are necessary:
1. Find a text corpus for your language. I found the Hamshahri Collection, which should be sufficient. Download the corpus or parts of it and create a plain text file out of the XML.
2. Create an n-gram file for the language identifier. This can be done using TikaCLI:
    java -jar tika-app-1.0.jar --create-profile=fa -eUTF-8 fa-corpus.txt
   This will create a file called fa.ngp, which contains the n-grams.
3. Configure Tika so that it recognizes the new language. Either do this programmatically using LanguageIdentifier.initProfiles() or put a property file with the name tika.language.override.properties into the classpath. Make sure the n-gram file is in the classpath as well.
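For the property file route, a small sketch of generating the override file might look like the following. The languages key is the one shown in the supported-languages line above. Everything else is an assumption on my part: the class and method names are hypothetical, and I keep the default codes in the list in case the override file replaces the built-in list rather than extending it. Please check the LanguageIdentifier source of your Tika version for the exact keys and for where the .ngp file is looked up.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.util.Properties;

// Illustrative only: builds the content of tika.language.override.properties.
public class OverridePropertiesSketch {
    // Append the new language code to the existing comma-separated list.
    static Properties buildOverride(String defaultLanguages, String newCode) {
        Properties props = new Properties();
        props.setProperty("languages", defaultLanguages + "," + newCode);
        return props;
    }

    public static void main(String[] args) throws IOException {
        String defaults = "be,ca,da,de,eo,et,el,en,es,fi,fr,gl,hu,is,it,lt,"
                + "nl,no,pl,pt,ro,ru,sk,sl,sv,th,uk";
        Properties props = buildOverride(defaults, "fa");
        // Print the file content; save it as tika.language.override.properties.
        StringWriter out = new StringWriter();
        props.store(out, null);
        System.out.println(out);
    }
}
```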
If you now run Tika, it should correctly detect your language.
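For the first step above, turning the XML corpus into plain text, a minimal sketch using only the JDK's DOM parser could look like this. I have not checked the actual layout of the Hamshahri XML, so the sketch simply collects every text node regardless of element name. The class and method names are hypothetical, and for a large corpus a streaming parser would be the better choice.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Illustrative only: flattens an XML document into whitespace-normalized plain text.
public class CorpusTextSketch {
    // Parse the XML string and return all text content, separated by single spaces.
    static String xmlToPlainText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder text = new StringBuilder();
            collectText(doc.getDocumentElement(), text);
            return text.toString().replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Walk the tree depth-first, appending text and CDATA nodes followed by a space
    // so that words from neighbouring elements do not run together.
    static void collectText(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue()).append(' ');
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    public static void main(String[] args) {
        String xml = "<docs><doc><title>سلام</title><text>دنیا</text></doc></docs>";
        System.out.println(xmlToPlainText(xml));
    }
}
```

The resulting text can be written to a UTF-8 file such as fa-corpus.txt and passed to the --create-profile command shown in the steps.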
Update: Detailed the steps necessary to create a language profile.