fastText models detecting Norwegian text as Danish [closed]

Question

I am using fastText (version 0.9.1) to detect the language of a text (see this).

Norwegian text is being detected as Danish when using this model.

!curl "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin" > lid.bin

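# In the 0.9.x releases the Python module is named `fasttext` (lowercase)
import fasttext

# Load the pretrained language-identification model downloaded above
language_detector = fasttext.load_model('lid.bin')
# Ask for the 3 most likely languages
language_detector.predict('Hei Jeg viser til hyggelig sam', k=3)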

Output:

(('__label__da', '__label__no', '__label__hu'),
array([9.16624188e-01, 8.25065151e-02, 2.37607688e-04]))
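
To make the result easier to read, the labels and probabilities can be paired up and the `__label__` prefix dropped. This is only a minimal sketch: `parse_predictions` is a hypothetical helper and not part of fastText, and the values are copied from the output above so that it runs without the model file.

```python
def parse_predictions(labels, probs, prefix="__label__"):
    """Pair each label with its probability, dropping the label prefix."""
    return [
        (label[len(prefix):] if label.startswith(prefix) else label, float(p))
        for label, p in zip(labels, probs)
    ]

# Values copied from the output above. With a loaded model they would come from:
#   labels, probs = language_detector.predict(text, k=3)
labels = ('__label__da', '__label__no', '__label__hu')
probs = [9.16624188e-01, 8.25065151e-02, 2.37607688e-04]

for code, p in parse_predictions(labels, probs):
    print(f"{code}: {p:.4f}")
```

Here Danish (`da`) ranks first by a wide margin, Norwegian (`no`) is second, and Hungarian (`hu`) is negligible.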

Any help?

Answer 1:

Norwegian and Danish appear to be difficult to tell apart (see this).

The pretrained fastText model is not particularly well suited to separating these two languages.

You can try polyglot, a Python library dedicated to multilingual NLP.

from polyglot.detect import Detector

detector = Detector('Hei Jeg viser til hyggelig sam')
print(detector)

Output:

Prediction is reliable: True
Language 1: name: Norwegian   code: no       confidence:  96.0 read bytes:  1189
Language 2: name: un          code: un       confidence:   0.0 read bytes:     0
Language 3: name: un          code: un       confidence:   0.0 read bytes:     0
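
If you want to keep fastText and only consult a second detector in doubtful cases, one possible approach is to check whether its two top-ranked labels are the Danish/Norwegian pair. The sketch below is an assumption on my part and not a fastText feature: `is_confusable_pair` and the choice of group are hypothetical, and the labels are copied from the question's output so that it runs without the model.

```python
def is_confusable_pair(labels, group=frozenset({"__label__da", "__label__no"})):
    """Return True when the two top-ranked labels both belong to `group`."""
    top_two = list(labels)[:2]
    return len(top_two) == 2 and all(label in group for label in top_two)

# Labels from the question's output, ordered from most to least likely
labels = ('__label__da', '__label__no', '__label__hu')

if is_confusable_pair(labels):
    # Assumed policy: hand the text to a second detector in this case
    print("Top two guesses are Danish and Norwegian; check with a second detector.")
```

This only flags the ambiguous case. Which detector to trust when the two disagree is still a judgment call.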

One note: if you install polyglot, be careful with its dependencies (read this and this).

Source: https://stackoverflow.com/questions/64769198/fasttext-models-detecting-norwegian-text-as-danish

Tags

fasttext

language-detection