Adding language profile to Apache Tika

Submitted by 喜夏-厌秋 on 2019-12-10 13:05:40

Question


Could anybody who has managed to do this please explain how? :-)

Do I need to get n-gram files for the language I want to add?

Is it a matter of creating tika.language.override.properties, adding the extra language codes, and putting a lang-code.ngp n-gram file on the classpath? If so, where do I get that file, and why doesn't Tika support more languages if that is all it takes?

Language detection currently supports these languages:

da,de,et,el,en,es,fi,fr,hu,is,it,lt,nl,no,pl,pt,ru,sv,th

and Tika uses the traditional n-gram notation, one n-gram and its count per line:

er_ 132232
_de 103517
en_ 82666
et_ 80661
for 65286
_fo 57945
de_ 51382
der 44049
at_ 41915
det 41381
_og 40344
_at 39482
ing 38707
den 36795
og_ 36577
_me 34924
nde 34528
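
For illustration, here is a minimal sketch of how counts in this notation could be produced from a text sample. It is not Tika's or Nutch's own tool. It assumes trigrams only, lowercased letters, and "_" standing for any run of non-letter characters, which is what the listing above suggests. The function names are made up for this example.

```python
from collections import Counter


def count_trigrams(text):
    """Count character trigrams, with '_' marking word boundaries."""
    # ASSUMPTION: letters are lowercased and every run of non-letters
    # collapses into a single '_' (inferred from the sample listing).
    chars = ["_"]
    for ch in text.lower():
        if ch.isalpha():
            chars.append(ch)
        elif chars[-1] != "_":
            chars.append("_")
    if chars[-1] != "_":
        chars.append("_")
    normalised = "".join(chars)
    return Counter(normalised[i:i + 3] for i in range(len(normalised) - 2))


def format_profile(counts):
    """Render counts as 'ngram count' lines, most frequent first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(f"{gram} {n}" for gram, n in ordered)


print(format_profile(count_trigrams("der der")))
```

A real profile would need a large, representative corpus for the language; a single sentence only demonstrates the format.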

This language-detection application currently supports the languages below, but its n-gram files are in a rather different format:

af bg cs de en fa fr he hr id ja ko ml ne no pl ro sk sq sw te tl uk vi zh-tw ar bn da el es fi gu hi hu it kn mk mr nl pa pt ru so sv ta th tr ur zh-cn

They are in JSON notation:

{"freq":{"D":9246,"E":2445,"F":2510,"G":3299,"A":6930,"B":3706,"C":2451,"L":2519,"M":3951,"N":3334,"O":2514,"H" ....
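
To compare the two notations, the "freq" object can be flattened into "ngram count" lines. The sketch below assumes that "freq" maps n-gram strings to counts, as the fragment above suggests, and that a space inside a key marks a word boundary. Both are assumptions, and the sample string is hypothetical apart from the "D" entry.

```python
import json


def freq_to_lines(profile_json, length=3, boundary=None):
    """Turn the "freq" map of a JSON profile into 'ngram count' lines."""
    freq = json.loads(profile_json)["freq"]
    lines = []
    for gram, count in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])):
        if len(gram) != length:
            continue  # keep only n-grams of the requested length
        if boundary is not None:
            # ASSUMPTION: spaces in the JSON keys mark word boundaries.
            gram = gram.replace(" ", boundary)
        lines.append(f"{gram} {count}")
    return lines


# Hypothetical sample for illustration, not a real profile.
sample = '{"freq": {"D": 9246, "der": 40, " de": 55, "en": 12}}'
print("\n".join(freq_to_lines(sample, boundary="_")))
```

The uppercase keys in the fragment suggest these profiles are case-sensitive and were built with different normalisation, so converted counts are a starting point for inspection rather than a drop-in replacement.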

Answer 1:


It looks like, as of TIKA-490, it should be possible to add new language profiles. TIKA-546 seems to indicate that it isn't yet as easy as it might be, so in the meantime you'll need to start with Nutch's NGramProfile tool and tweak its output.

I'd suggest you try using the Nutch tool to generate the files, then look at the comments on TIKA-490 for details on how to use them.
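
Once an .ngp file exists, the override file mentioned in the question has to list the new code. The sketch below builds such a line. It assumes the file uses a comma-separated "languages" key, as Tika's bundled tika.language.properties appears to, and that the override replaces the list rather than extending it. Both points should be checked against the TIKA-490 comments. The code "cs" is only an example.

```python
def override_properties(existing_codes, new_codes):
    """Build the text of an override file listing old and new language codes."""
    merged = list(existing_codes)
    for code in new_codes:
        if code not in merged:
            merged.append(code)
    # ASSUMPTION: the key is named "languages" and holds a comma-separated list.
    return "languages=" + ",".join(merged) + "\n"


# The codes the question lists as currently supported.
supported = "da,de,et,el,en,es,fi,fr,hu,is,it,lt,nl,no,pl,pt,ru,sv,th".split(",")
print(override_properties(supported, ["cs"]), end="")
```

The resulting text would be saved as tika.language.override.properties on the classpath, next to a matching cs.ngp file.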



Source: https://stackoverflow.com/questions/6227565/adding-language-profile-to-apache-tika
