Question
I am using fastText (v0.9.1) to detect the language of a text (see this).
Norwegian text is being detected as Danish when using this model.
!curl "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin" > lid.bin
import fasttext  # fastText 0.9.x from PyPI is imported as lowercase `fasttext`
language_detector = fasttext.load_model('lid.bin')
language_detector.predict('Hei Jeg viser til hyggelig sam', k=3)
Output:
(('__label__da', '__label__no', '__label__hu'),
array([9.16624188e-01, 8.25065151e-02, 2.37607688e-04]))
Any help?
Answer 1:
Distinguishing Norwegian from Danish is difficult because the two written languages are very similar (see this).
Judging by your output, the fastText language-identification model does not handle this pair well.
You can try polyglot instead, a Python library dedicated to multilingual NLP.
from polyglot.detect import Detector
detector = Detector('Hei Jeg viser til hyggelig sam')
print(detector)
Output:
Prediction is reliable: True
Language 1: name: Norwegian code: no confidence: 96.0 read bytes: 1189
Language 2: name: un code: un confidence: 0.0 read bytes: 0
Language 3: name: un code: un confidence: 0.0 read bytes: 0
One note: if you install polyglot, be careful with its dependencies (read this and this).
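If you would rather keep fastText as the first pass, one option is to flag predictions where the top guess and a close runner-up both belong to a group of similar languages, and send only those texts to a second detector such as polyglot. The snippet below is a minimal sketch of that idea, not part of fastText. The helper names, the `CONFUSABLE` set, and the `min_rival` threshold are all assumptions chosen for illustration. It only post-processes the `(labels, probabilities)` pair that `predict(text, k=3)` returns, so it runs without the model file.

```python
# Hypothetical helpers, not part of fastText: they only post-process the
# (labels, probabilities) pair returned by model.predict(text, k=...).

# Assumption: these language codes are the ones worth treating as confusable.
CONFUSABLE = {"da", "no", "nn"}


def strip_label(label, prefix="__label__"):
    """Turn '__label__da' into 'da'."""
    return label[len(prefix):] if label.startswith(prefix) else label


def needs_second_opinion(labels, probs, group=CONFUSABLE, min_rival=0.05):
    """Return True when the top label and a close rival are both in `group`.

    `min_rival` is an arbitrary threshold chosen for illustration.
    """
    codes = [strip_label(label) for label in labels]
    if not codes or codes[0] not in group:
        return False
    # Look for another language from the same group among the remaining guesses.
    return any(
        code in group and float(prob) >= min_rival
        for code, prob in zip(codes[1:], probs[1:])
    )


# Values copied from the output in the question.
labels = ("__label__da", "__label__no", "__label__hu")
probs = [9.16624188e-01, 8.25065151e-02, 2.37607688e-04]
print(needs_second_opinion(labels, probs))  # True
```

For the sentence in the question the check returns True, because Danish is the top guess and Norwegian follows with a probability of about 0.08. That text would be handed to the second detector. The threshold would need tuning on your own data.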
Source: https://stackoverflow.com/questions/64769198/fasttext-models-detecting-norwegian-text-as-danish