I have a table in PostgreSQL with a text column. I need a library or tool that can identify the language of each text, for testing purposes.
There is no need for
The problem with language detection is that it will never be fully precise. My browser quite often misidentifies the language, and it was built by Google, who probably put a lot of great minds on that task.
However, here are some points to consider:
I am not sure what Perl's Lingua::Identify module really uses, but most often these tasks are handled by Naive Bayes models, as somebody pointed out in another answer. Bayesian models use probability to classify into a number of categories; in your case these would be the different languages. Two kinds of probability are involved: conditional (dependent) probabilities, i.e. how often a certain feature appears within each category, and prior (independent) probabilities, i.e. how often each category appears overall.
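Written out, the usual Naive Bayes decision rule looks roughly like this. The symbols are my own notation, not something taken from Lingua::Identify: C is the set of candidate languages, f_1 ... f_n are the features observed in the text, P(c) is the prior of language c, and P(f_i | c) is the conditional probability of feature f_i in that language.

```latex
\hat{c} \;=\; \arg\max_{c \in C} \; P(c) \prod_{i=1}^{n} P(f_i \mid c)
```

The prior P(c) multiplies every candidate's score, which is why a badly chosen prior can outweigh the evidence from the features.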
Because both kinds of information are used, you are very likely to get low prediction quality when the priors are wrong. I guess Lingua::Identify has mostly been trained on a corpus of online documents, so the highest prior will most likely be English. This means that Lingua::Identify will most likely classify your documents as English unless it has strong reasons to believe otherwise (in your case it most likely does have strong reasons, because you say your documents are misclassified as Italian, French and Spanish).
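To make the effect of the priors concrete, here is a small sketch. The numbers are made up for illustration and the `posterior` helper is hypothetical; nothing here reflects how Lingua::Identify is actually implemented.

```python
def posterior(priors, likelihoods):
    """Combine priors and likelihoods into normalized posteriors P(language | text)."""
    unnormalized = {lang: priors[lang] * likelihoods[lang] for lang in priors}
    total = sum(unnormalized.values())
    return {lang: score / total for lang, score in unnormalized.items()}


# Hypothetical likelihoods: the text fits Italian somewhat better than English.
likelihoods = {"english": 0.010, "italian": 0.015}

# A prior heavily skewed towards English versus a uniform prior.
skewed = posterior({"english": 0.9, "italian": 0.1}, likelihoods)
uniform = posterior({"english": 0.5, "italian": 0.5}, likelihoods)

best_skewed = max(skewed, key=skewed.get)     # the prior outweighs the evidence
best_uniform = max(uniform, key=uniform.get)  # the evidence decides
print(best_skewed, best_uniform)
```

With the skewed prior the text comes out as English even though the features favour Italian; with the uniform prior the features win.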
This means you should try to re-train your model, if possible. There might be some methods within Lingua::Identify to help you with this. If not, I would suggest you write your own Naive Bayes classifier (it's quite simple actually).
Once you have a Naive Bayes classifier, you have to decide on a set of features. Letter frequencies are usually very characteristic of each language, so they are a good first guess. Just try to train your classifier on these frequencies first. Naive Bayes classifiers are used in spam filters, so you can train yours like one of those: run it on a sample set, and whenever you get a misclassification, update the classifier with the correct classification. After a while it will make fewer and fewer mistakes.
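A minimal sketch of such a classifier, assuming single-letter counts as features, could look like this. The class name `LetterNaiveBayes`, its methods, and the Laplace smoothing constant `alpha` are my own choices for illustration, not anything from Lingua::Identify.

```python
import math
from collections import Counter, defaultdict


class LetterNaiveBayes:
    """Multinomial Naive Bayes over single-letter counts (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # Laplace smoothing constant (assumption)
        self.letter_counts = defaultdict(Counter)   # language -> letter -> count
        self.doc_counts = Counter()                 # language -> number of training texts
        self.alphabet = set()                       # all letters seen during training

    @staticmethod
    def features(text):
        # Keep alphabetic characters only, lowercased.
        return [ch for ch in text.lower() if ch.isalpha()]

    def train(self, text, language):
        letters = self.features(text)
        self.letter_counts[language].update(letters)
        self.doc_counts[language] += 1
        self.alphabet.update(letters)

    def classify(self, text):
        """Return the most probable language, or None if nothing was trained yet."""
        letters = self.features(text)
        total_docs = sum(self.doc_counts.values())
        best_language, best_score = None, -math.inf
        for language, counts in self.letter_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[language] / total_docs)
            denom = sum(counts.values()) + self.alpha * len(self.alphabet)
            for ch in letters:
                score += math.log((counts[ch] + self.alpha) / denom)
            if score > best_score:
                best_language, best_score = language, score
        return best_language

    def correct(self, text, true_language):
        """Spam-filter style feedback: retrain only when the guess was wrong."""
        if self.classify(text) != true_language:
            self.train(text, true_language)


clf = LetterNaiveBayes()
clf.train("the quick brown fox jumps over the lazy dog", "english")
clf.train("la volpe veloce salta sopra il cane pigro", "italian")
guess = clf.classify("il gatto dorme sul divano")
```

Working in log space avoids numeric underflow on long texts, and the smoothing keeps a letter that never occurred in one language from forcing that language's probability to zero. With only two short training sentences the guess is not reliable; the point is the training and feedback loop, not the accuracy.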
If single-letter frequencies do not give you good enough results, you could try using n-grams instead (but be aware of the combinatorial explosion this introduces). I would not suggest ever trying anything beyond 3-grams. If this still does not give you good results, try manually identifying frequent words that are unique to each language and add those to your feature set. I am sure that once you start experimenting with this, you will get more ideas for features to try out.
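Extracting character n-grams is only a few lines. The following is a rough sketch; the `char_ngrams` helper and the choice to mark word boundaries with a space are assumptions of mine.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, using single spaces as word-boundary markers."""
    # Lowercase, keep letters, and turn everything else into whitespace.
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    # Collapse runs of whitespace and pad both ends with one space.
    padded = " " + " ".join(cleaned.split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


trigrams = char_ngrams("The thing", 3)

# Upper bound on distinct n-grams for a 27-symbol alphabet (a-z plus space):
# this is the combinatorial explosion that makes n > 3 unattractive.
possible = {n: 27 ** n for n in (1, 2, 3, 4)}
```

The resulting counts can be fed to a Naive Bayes classifier in place of single-letter counts. The feature space grows from 27 symbols to 729 bigrams and 19683 trigrams, so more training text is needed to estimate the probabilities reliably.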
Another nice thing about the Bayesian classifier approach is that you can always add new information when more documents come in that do not match the training data. In that case you can just correct the classification of a few of the new documents and, similar to a spam filter, the classifier will adapt to the changing environment.