How to determine the (natural) language of a document?

前端 未结 11 1575
情话喂你
情话喂你 2020-12-24 07:16

I have a set of documents in two languages: English and German. There is no usable meta information about these documents, a program can look at the content only. Based on t

11条回答
  •  一生所求
    2020-12-24 07:59

    Language detection is not very difficult conceptually. Please look at my reply to a related question and other replies to the same question.

    In case you want to take a shot at writing it yourself, you should be able to write a naive detector in half a day. We use something similar to the following algorithm at work and it works surprisingly well. Also read the python implementation tutorial in the post I linked.

    Steps:

    1. Take two corpora for the two languages and extract character level bigrams, trigrams and whitespace-delimited tokens (words). Keep a track of their frequencies. This step builds your "Language Model" for both languages.

    2. Given a piece of text, identify the char bigrams, trigrams and whitespace-delimited tokens and their corresponding "relative frequencies" for each corpus. If a particular "feature" (char bigram/trigram or token) is missing from your model, treat its "raw count" as 1 and use it to calculate its "relative frequency".

    3. The product of the relative frequencies for a particular language gives the "score" for the language. This is a very-naive approximation of the probability that the sentence belongs to that language.

    4. The higher scoring language wins.

    Note 1: We treat the "raw count" as 1 for features that do not occur in our language model. This is because, in reality, that feature would have a very small value but since we have a finite corpus, we may not have encountered it yet. If you take it's count to be zero, then your entire product would also be zero. To avoid this, we assume that it's occurence is 1 in our corpus. This is called add-one smoothing. There are other advance smoothing techniques.

    Note 2: Since you will be multiplying large number of fractions, you can easily run to zero. To avoid this, you can work in a logarithmic space and use this equation to calculate your score.

                    a X b =  exp(log(a)+log(b))
    

    Note 3: The algorithm I described is a "very-naive" version of the "Naive Bayes Algorithm".

提交回复
热议问题