I have a table in PostgreSQL where one column contains text. I need a library or tool that can identify the language of each text, for testing purposes.
There is no need for
Try these:
This blog post shares some tests to compare the 2 libraries (along with a 3rd - the Language Identification module of Apache Tika, which really is a complete toolkit for Text Analysis).
The problem with language detection is that it will never be fully precise. My browser quite often misidentifies the language, and its detection was built by Google, who probably put a lot of great minds to that task.
However, here are some points to consider:
I am not sure what Perl's Lingua::Identify module is really using, but most often these tasks are handled by Naive Bayesian models, as somebody pointed out in another answer. Bayesian models use probability to classify into a number of categories; in your case these would be the different languages. These probabilities are both conditional probabilities, i.e. how often a certain feature appears for each category, and independent (prior) probabilities, i.e. how often each category appears in total.
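To make the role of the two kinds of probability concrete, here is the usual Naive Bayes decision rule written out. The symbols are my own notation, not anything taken from Lingua::Identify: $c$ is a candidate language, $f_1, \dots, f_n$ are the features observed in the text, $P(c)$ is the prior and $P(f_i \mid c)$ is the conditional probability of a feature given the language.

```latex
\hat{c}
  = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(f_i \mid c)
  = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log P(f_i \mid c) \right]
```

The prior $P(c)$ enters every score once, so for short texts (small $n$) a badly skewed prior can outweigh what the features say.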
Because both kinds of information are used, you are very likely to get low prediction quality when the priors are wrong. I guess Lingua::Identify has mostly been trained on a corpus of online documents, so the highest prior will most likely be English. This means that Lingua::Identify will most likely classify your documents as English unless it has strong reasons to believe otherwise (in your case it most likely does have strong reasons, because you say your documents are misclassified as Italian, French and Spanish).
This means you should try to re-train your model, if possible. There might be some methods within Lingua::Identify to help you with this. If not, I would suggest you write your own Naive Bayes classifier (it's actually quite simple).
Once you have a Naive Bayes classifier, you have to decide on a set of features. Letter frequencies are usually very characteristic of each language, so they are a good first guess. Just try to train your classifier on these frequencies first. Naive Bayes classifiers are used in spam filters, so you can train yours like one of those: run it on a sample set, and whenever you get a misclassification, update the classifier with the correct classification. After a while it will be wrong less and less often.
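As a rough illustration of that training loop, here is a minimal sketch in Python (not Perl, and not how Lingua::Identify works internally). The class name `LetterNaiveBayes`, its methods, and the add-one smoothing are my own assumptions, chosen only to keep the example short.

```python
import math
from collections import Counter, defaultdict


class LetterNaiveBayes:
    """Multinomial Naive Bayes over single-letter counts (illustrative sketch)."""

    def __init__(self):
        self.letter_counts = defaultdict(Counter)  # language -> letter -> count
        self.doc_counts = Counter()                # language -> number of training texts
        self.alphabet = set()                      # every letter seen during training

    @staticmethod
    def features(text):
        # Keep letters only, lower-cased; digits, spaces and punctuation are dropped.
        return [ch for ch in text.lower() if ch.isalpha()]

    def update(self, text, language):
        """Add one labelled text to the model."""
        letters = self.features(text)
        self.letter_counts[language].update(letters)
        self.doc_counts[language] += 1
        self.alphabet.update(letters)

    def classify(self, text):
        """Return the language with the highest log posterior, or None if untrained."""
        letters = self.features(text)
        total_docs = sum(self.doc_counts.values())
        best_lang, best_score = None, -math.inf
        for lang, counts in self.letter_counts.items():
            # Log prior: share of training texts written in this language.
            score = math.log(self.doc_counts[lang] / total_docs)
            # Log likelihood with add-one smoothing; the extra +1 in the
            # denominator reserves probability mass for never-seen letters.
            denom = sum(counts.values()) + len(self.alphabet) + 1
            for ch in letters:
                score += math.log((counts[ch] + 1) / denom)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang

    def train_on_mistakes(self, labelled_texts):
        """Spam-filter style pass: update the model only when it guesses wrong."""
        mistakes = 0
        for text, language in labelled_texts:
            if self.classify(text) != language:
                self.update(text, language)
                mistakes += 1
        return mistakes


if __name__ == "__main__":
    samples = [
        ("the quick brown fox jumps over the lazy dog", "en"),
        ("zażółć gęślą jaźń", "pl"),
    ]
    model = LetterNaiveBayes()
    # Repeat passes over the sample set until nothing is misclassified.
    while model.train_on_mistakes(samples):
        pass
    print(model.classify("żółć"))
```

With real data you would feed it a few hundred labelled rows per language rather than two sentences; the loop stays the same.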
If single-letter frequencies do not give you good enough results, you could try using n-grams instead (but be aware of the combinatorial explosion this introduces). I would not suggest ever trying anything beyond 3-grams. If this still does not give you good results, try manually identifying frequent words that are unique to each language and add those to your feature set. I am sure that once you start experimenting with this you will get more ideas for features to try out.
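For the n-gram variant, a feature extractor could look roughly like the sketch below. The function names `char_ngrams` and `max_possible_ngrams` are hypothetical helpers of mine; the second one only shows how fast the feature space grows with n.

```python
from collections import Counter


def char_ngrams(text, max_n=3):
    """Count all character n-grams of length 1..max_n in the text."""
    # Lower-case and collapse runs of whitespace into single spaces, so that
    # word boundaries become part of the n-grams in a consistent way.
    normalized = " ".join(text.lower().split())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(normalized) - n + 1):
            counts[normalized[i:i + n]] += 1
    return counts


def max_possible_ngrams(alphabet_size, n):
    """Upper bound on distinct n-grams over an alphabet of the given size."""
    return alphabet_size ** n


if __name__ == "__main__":
    print(char_ngrams("abc"))
    for n in (1, 2, 3, 4):
        print(n, max_possible_ngrams(26, n))
```

The counts returned by `char_ngrams` can be used as features in place of single letters. Going from 3-grams to 4-grams over 26 letters raises the upper bound from 17,576 to 456,976 features, which is the explosion mentioned above.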
Another nice thing about the Bayesian classifier approach is that you can always add new information when more documents come in that do not match the training data. In that case you can just manually classify a few of the new documents and, as with a spam filter, the classifier will adapt to the changing environment.
I found a library called TextCat, which is available under the LGPL. I can't say what the quality of its identification is, but it has an online demo form, so maybe you can throw some text at it before deciding if it's worth downloading.
It's also written in Perl, so if you do want to use it, the approach in filiprem's answer would be a good starting point.
You can use PL/Perl (CREATE FUNCTION langof(text) LANGUAGE plperlu AS ...) with the Lingua::Identify CPAN module.
Perl script:
#!/usr/bin/perl
use strict;
use warnings;
use Lingua::Identify qw(langof);

local $/;                    # slurp mode: read all input in one go
my $text = <>;               # warning: loads the whole file into memory
my $lang = langof($text);    # scalar context gives the most probable language
print "$lang\n";
And the function:
create or replace function langof( text ) returns varchar(2)
immutable returns null on null input
language plperlu as $perlcode$
use Lingua::Identify qw(langof);
return langof( shift );
$perlcode$;
Works for me:
filip@filip=# select langof('Pójdź, kiń-że tę chmurność w głąb flaszy');
langof
--------
pl
(1 row)
Time: 1.801 ms
The PL/Perl language library (plperl.dll) comes preinstalled with the latest Windows installer of Postgres.
But to use PL/Perl, you need the Perl interpreter itself. Specifically, Perl 5.14 (at the time of this writing). The most common installer is ActiveState, but it's not free; a free one comes from StrawberryPerl. Make sure you have PERL514.DLL in place.
After installing Perl, log in to your Postgres database and try to run
CREATE LANGUAGE plperlu;
If quality is your concern, you have some options: You can improve Lingua::Identify yourself (it's open source) or you could try another library. I found this one, which is commercial but looks promising.
Naive Bayes classifiers are very good at language identification. You can find implementations in all the major languages, or you can implement one yourself; it's not extremely hard. The Wikipedia entry is interesting too: https://en.wikipedia.org/wiki/Naive_Bayes_classifier.
There is also a language detection web service which provides both free and premium services at http://detectlanguage.com
It has Ruby and PHP clients, but it can be accessed from any language with a simple web request. Output is in JSON.