Language detection with data in PostgreSQL

我在风中等你 2020-12-31 15:17

I have a table in PostgreSQL with a text column. I need a library or tool that can identify the language of each text, for testing purposes.

There is no need for

6 Answers
  • 2020-12-31 15:42

    Try these:

    • http://code.google.com/p/language-detection/ (Java)
    • http://code.google.com/p/chromium-compact-language-detector/ (C++/Python)

    This blog post shares some tests comparing the two libraries (along with a third, the Language Identification module of Apache Tika, which is really a complete toolkit for text analysis).

  • 2020-12-31 15:44

    The problem with language detection is that it will never be fully precise. My browser quite often misidentifies the language, and its detector was built by Google, which probably put a lot of great minds on the task.

    However, here are some points to consider:

    I am not sure what Perl's Lingua::Identify module really uses, but most often these tasks are handled by naive Bayesian models, as somebody pointed out in another answer. Bayesian models use probabilities to classify into a number of categories; in your case these would be the different languages. These probabilities involve both dependent (conditional) probabilities, i.e. how often a certain feature appears for each category, and independent (prior) probabilities, i.e. how often each category appears in total.
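
    In symbols, the standard naive Bayes decision rule picks the language that maximizes the prior times the product of the feature likelihoods (this is the general rule, not necessarily exactly what Lingua::Identify implements):

    \hat{L} = \arg\max_{L} \; P(L) \prod_{i} P(f_i \mid L)

    where the f_i are the observed features (letters, n-grams, words) and P(L) is the prior probability of the language.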

    Because both of these are used, you are very likely to get low prediction quality when the priors are wrong. I guess Lingua::Identify has mostly been trained on a corpus of online documents, so the highest prior will most likely be English. This means that Lingua::Identify will most likely classify your documents as English unless it has strong reasons to believe otherwise (in your case it most likely does have such reasons, because you say your documents are misclassified as Italian, French and Spanish).

    This means you should try to re-train your model if possible. There might be methods within Lingua::Identify to help you with this. If not, I would suggest you write your own naive Bayes classifier (it's actually quite simple).

    Once you have a naive Bayes classifier, you have to decide on a set of features. The frequencies of letters are usually very characteristic of each language, so they would be a first guess; try training your classifier on those frequencies first. Naive Bayes classifiers are used in spam filters, so you can train yours like one of those: have it run on a sample set, and whenever you get a misclassification, update the classifier with the correct classification. After a while it will get less and less wrong.

    In case single-letter frequencies do not give you good enough results, you could try n-grams instead (but be aware of the combinatorial explosion this introduces); I would not suggest going beyond 3-grams. If that still does not give you good results, try manually identifying unique, frequent words in each language and adding those to your feature set. I am sure that once you start experimenting you will get more ideas for features to try out.
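
    A minimal sketch of such a classifier in Perl, using character trigrams as features; the two inline sample strings are hypothetical stand-ins, not a real training corpus, and the add-one smoothing is deliberately crude:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical training samples - replace with real per-language corpora.
    my %samples = (
        en => 'the quick brown fox jumps over the lazy dog',
        de => 'der schnelle braune fuchs springt ueber den faulen hund',
    );

    # Count character trigrams in a string.
    sub trigrams {
        my ($text) = @_;
        my %count;
        $text = lc $text;
        $count{ substr($text, $_, 3) }++ for 0 .. length($text) - 3;
        return \%count;
    }

    # "Training": per-language trigram counts plus totals for smoothing.
    my (%model, %total);
    for my $lang (keys %samples) {
        $model{$lang} = trigrams( $samples{$lang} );
        $total{$lang} += $_ for values %{ $model{$lang} };
    }

    # Classification: sum log P(trigram | language) with crude add-one
    # smoothing; uniform priors are assumed, so they drop out of the argmax.
    sub classify {
        my ($text) = @_;
        my $grams = trigrams($text);
        my ($best, $best_score);
        for my $lang (keys %model) {
            my $score = 0;
            for my $g (keys %$grams) {
                my $seen = $model{$lang}{$g} // 0;
                $score += $grams->{$g} * log( ($seen + 1) / ($total{$lang} + 1) );
            }
            ($best, $best_score) = ($lang, $score)
                if !defined $best_score || $score > $best_score;
        }
        return $best;
    }

    print classify('the lazy dog sleeps'), "\n";    # expected: en

    To update on a misclassification as described above, append the misclassified text to the correct language's sample and recompute that language's counts.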

    Another nice thing about the Bayesian-classifier approach is that you can always add new information when documents come in that do not match the training data. In that case you can just reclassify a few of the new documents and, as with a spam filter, the classifier will adapt to the changing environment.

  • 2020-12-31 15:51

    I found a library called TextCat, which is available under the LGPL. I can't say what the quality of its identification is, but it has an online demo form, so maybe you can throw some text at it before deciding whether it's worth downloading.

    It's also written in Perl, so if you do want to use it, the approach in filiprem's answer would be a good starting point.

  • 2020-12-31 15:53

    You can use PL/Perl (CREATE FUNCTION langof(text) ... LANGUAGE plperlu AS ...) with the Lingua::Identify CPAN module.

    Perl script:

    #!/usr/bin/perl
    use Lingua::Identify qw(langof);
    undef $/;             # slurp mode: read the whole input at once
    my $textstring = <>;  ## warning - slurps whole file to memory
    my $lang = langof( $textstring );  # gives the most probable language
    print "$lang\n";
    

    And the function:

    create or replace function langof( text ) returns varchar(2)
    immutable returns null on null input
    language plperlu as $perlcode$
        use Lingua::Identify qw(langof);
        return langof( shift );
    $perlcode$;
    

    Works for me:

    filip@filip=# select langof('Pójdź, kiń-że tę chmurność w głąb flaszy');
     langof
    --------
     pl
    (1 row)
    
    Time: 1.801 ms
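
    With the function in place you can tag the whole table in one query (documents and body are hypothetical table and column names standing in for yours):

    select id, langof(body) as lang from documents;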
    

    PL/Perl on Windows

    The PL/Perl language library (plperl.dll) comes preinstalled with the latest Windows installer of Postgres.

    But to use PL/Perl you need the Perl interpreter itself, specifically Perl 5.14 (at the time of this writing). The most common installer is ActiveState's, but it's not free; a free one comes from Strawberry Perl. Make sure you have PERL514.DLL in place.

    After installing Perl, log in to your Postgres database and try to run

    CREATE LANGUAGE plperlu;
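
    If the command succeeds, the language shows up in the pg_language system catalog:

    select lanname from pg_language;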
    

    Language identification library

    If quality is your concern, you have some options: you can improve Lingua::Identify yourself (it's open source), or you can try another library. I found this one, which is commercial but looks promising.

  • 2020-12-31 15:53

    Naive Bayes classifiers are very good at language identification. You can find implementations in all the major languages, or you can implement one yourself; it's not extremely hard. The Wikipedia entry is interesting too: https://en.wikipedia.org/wiki/Naive_Bayes_classifier.

  • 2020-12-31 15:53

    There is also a language detection web service that provides both free and premium plans at http://detectlanguage.com

    It has Ruby and PHP clients, but it can be accessed from any language via a simple web request. Output is in JSON.
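
    A minimal sketch of such a request in Perl with LWP::UserAgent; the endpoint URL, the q field and the key field are assumptions taken from the service's documentation at the time of writing, so check the current API reference:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;

    # POST the text to the detection endpoint (endpoint and parameter
    # names are assumptions - verify against the current docs).
    my $res = $ua->post(
        'https://ws.detectlanguage.com/0.2/detect',
        {
            q   => 'Pójdź, kiń-że tę chmurność w głąb flaszy',
            key => 'YOUR_API_KEY',    # placeholder
        },
    );

    print $res->decoded_content, "\n";    # JSON with the detected language(s)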
