fastText models detecting Norwegian text as Danish [closed]

Submitted by 给你一囗甜甜゛ on 2021-01-29 06:50:08

Question


I am using fastText (v0.9.1) to detect the language of a text (see this).

Norwegian text is being detected as Danish when using this model.

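!curl "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin" > lid.bin

import fasttext  # in the 0.9.x PyPI package the module name is lowercase

# Load the pretrained language identification model
language_detector = fasttext.load_model('lid.bin')
# Ask for the three most likely languages
language_detector.predict('Hei Jeg viser til hyggelig sam', k=3)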

Output:

(('__label__da', '__label__no', '__label__hu'),
array([9.16624188e-01, 8.25065151e-02, 2.37607688e-04]))
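For reading this output: predict returns a tuple of labels and a matching array of probabilities, and each label carries the `__label__` prefix. A minimal sketch for pairing them up is below; the helper name `readable_predictions` is hypothetical and not part of fastText, and the values are copied from the output above rather than recomputed.

```python
LABEL_PREFIX = "__label__"  # prefix that appears on every label in the output


def readable_predictions(prediction):
    """Turn a (labels, probabilities) pair into [(language_code, probability), ...].

    Hypothetical helper for illustration; it only reshapes what predict returned.
    """
    labels, probabilities = prediction
    pairs = []
    for label, probability in zip(labels, probabilities):
        # Strip the prefix when present, otherwise keep the label unchanged
        code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        pairs.append((code, float(probability)))
    return pairs


# Values copied from the output shown in the question
prediction = (
    ("__label__da", "__label__no", "__label__hu"),
    [9.16624188e-01, 8.25065151e-02, 2.37607688e-04],
)

for code, probability in readable_predictions(prediction):
    print(f"{code}: {probability:.4f}")
```

Read this way, Danish gets about 92% and Norwegian about 8%, so the correct language is the runner-up rather than missing from the candidates.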

Any help?


Answer 1:


It seems that distinguishing Norwegian from Danish is difficult (see this).

fastText is not particularly suitable for this task.

You can try polyglot, a Python library dedicated to multilingual NLP.

from polyglot.detect import Detector

detector = Detector('Hei Jeg viser til hyggelig sam')
print(detector)

Output:

Prediction is reliable: True
Language 1: name: Norwegian   code: no       confidence:  96.0 read bytes:  1189
Language 2: name: un          code: un       confidence:   0.0 read bytes:     0
Language 3: name: un          code: un       confidence:   0.0 read bytes:     0

A small note: if you install polyglot, be careful with its dependencies (read this and this).



Source: https://stackoverflow.com/questions/64769198/fasttext-models-detecting-norwegian-text-as-danish
