How to determine the (natural) language of a document?

情话喂你 2020-12-24 07:16

I have a set of documents in two languages: English and German. There is no usable meta information about these documents; a program can look at the content only. Based on t

11 answers
  •  孤城傲影
    2020-12-24 08:07

    The stop-word approach is quick for two languages, and it can be made quicker by heavily weighting words that do not occur in the other language, for example "das" in German and "the" in English. Using such "exclusive words" would also help extend the approach robustly to a larger group of languages.
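
The weighted stop-word idea described above could be sketched roughly as follows. This is only a minimal illustration, not a tuned classifier: the word lists, the weights (`EXCLUSIVE_WEIGHT`, `SHARED_WEIGHT`), and the function names are assumptions made up for this example, and a real list of stop words would be much longer.

```python
import re
from typing import Dict, Optional

# Hypothetical weights: words found in only one language's list count more.
EXCLUSIVE_WEIGHT = 3.0
SHARED_WEIGHT = 1.0

# Small illustrative stop-word lists (assumed, far from complete).
STOP_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it", "was", "for", "with", "not"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "zu", "mit", "von", "in", "was"},
}


def build_weights(stop_words: Dict[str, set]) -> Dict[str, Dict[str, float]]:
    """Give each stop word a weight, higher if no other language lists it."""
    weights = {}
    for lang, words in stop_words.items():
        # Union of every other language's stop words.
        others = set()
        for other_lang, other_words in stop_words.items():
            if other_lang != lang:
                others |= other_words
        weights[lang] = {
            word: SHARED_WEIGHT if word in others else EXCLUSIVE_WEIGHT
            for word in words
        }
    return weights


WEIGHTS = build_weights(STOP_WORDS)


def guess_language(text: str) -> Optional[str]:
    """Return the best-scoring language code, or None if there is no clear winner."""
    # Runs of letters only (Unicode-aware), lowercased.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(table.get(token, 0.0) for token in tokens)
        for lang, table in WEIGHTS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_lang, best_score = ranked[0]
    # No stop words matched, or the top two languages are tied.
    if best_score == 0.0 or (len(ranked) > 1 and ranked[1][1] == best_score):
        return None
    return best_lang


if __name__ == "__main__":
    print(guess_language("The cat is in the house and it was not for sale"))
    print(guess_language("Das Haus ist nicht groß und der Hund war mit uns"))
```

Because the weights are derived from the lists themselves, adding a third language to `STOP_WORDS` would automatically demote any word that is no longer exclusive, which is one way the "exclusive words" idea might scale to more languages.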
