language-detection

Python langdetect: choose between one language or the other only

我只是一个虾纸丫 submitted on 2019-12-07 01:04:33
Question: I'm using langdetect to determine the language of a set of strings that I know are in either English or French. Sometimes langdetect reports Romanian for a string I know is French. How can I make langdetect choose only between English and French, rather than among all supported languages? Thanks! Answer 1: Option 1: One option is to use the langid package instead. You can then restrict the candidate languages with a single method call: import langid langid.set_languages(['fr', 'en']) # ISO 639-1

Python langdetect: choose between one language or the other only

坚强是说给别人听的谎言 submitted on 2019-12-05 05:45:07
I'm using langdetect to determine the language of a set of strings that I know are in either English or French. Sometimes langdetect reports Romanian for a string I know is French. How can I make langdetect choose only between English and French, rather than among all supported languages? Thanks!
Option 1: One option is to use the langid package instead. You can then restrict the candidate languages with a single method call:
    import langid
    langid.set_languages(['fr', 'en'])  # ISO 639-1 codes
    lang, score = langid.classify('This is a french or english text')
    print(lang)  # en
Option 2: If you
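If you stay with a detector that reports ranked candidates instead of a single label, one way to force an English-or-French answer is to drop every candidate outside the allowed set and keep the best of what remains. The following is a minimal sketch of that idea, not the answer's own Option 2 (which is cut off above). The helper name `pick_allowed` and the sample scores are hypothetical. With langdetect, the (code, probability) pairs could be built from `detect_langs(text)`, whose results expose `.lang` and `.prob`.

```python
def pick_allowed(candidates, allowed, default=None):
    """Return the highest-probability language code that is in `allowed`.

    `candidates` is an iterable of (language_code, probability) pairs.
    `default` is returned when no candidate is in the allowed set.
    """
    best = None
    for lang, prob in candidates:
        if lang in allowed and (best is None or prob > best[1]):
            best = (lang, prob)
    return best[0] if best else default


# Made-up scores for illustration: the detector prefers Romanian,
# but only English and French are acceptable answers.
candidates = [("ro", 0.57), ("fr", 0.43)]
print(pick_allowed(candidates, {"en", "fr"}))
```

Passing a `default` (for example the majority language of the data set) covers the case where the detector returns no allowed candidate at all.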

Recognizing text as Simplified vs. Traditional Chinese

感情迁移 submitted on 2019-12-04 08:42:15
Question: Given a block of text that's known to be Chinese and encoded in UTF-8, is there a way to determine whether it's Simplified or Traditional? Answer 1: I don't know if this will work, but I'd try using iconv to see whether the text converts cleanly to the target charset, comparing the results of the same conversion with //TRANSLIT and with //IGNORE. If the two results match, the conversion hasn't encountered any characters that fail to translate, so you should have a match. $test1 = iconv("UTF-8",

Language detection for very short text [closed]

∥☆過路亽.° submitted on 2019-12-03 03:58:59
Question: As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance. Closed 7 years ago. I'm creating an application for detecting the language of short texts that average fewer than 100 characters and contain slang (e.g

Recognizing text as Simplified vs. Traditional Chinese

强颜欢笑 submitted on 2019-12-03 01:35:23
Given a block of text that's known to be Chinese and encoded in UTF-8, is there a way to determine whether it's Simplified or Traditional?
I don't know if this will work, but I'd try using iconv to see whether the text converts cleanly to the target charset, comparing the results of the same conversion with //TRANSLIT and with //IGNORE. If the two results match, the conversion hasn't encountered any characters that fail to translate, so you should have a match.
    $test1 = iconv("UTF-8", "big5//TRANSLIT", $text);
    $test2 = iconv("UTF-8", "big5//IGNORE", $text);
    if ($test1 == $test2) { echo
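The same idea can be tried in Python with nothing but the standard codecs: text that encodes cleanly as Big5 but not as GB2312 is probably Traditional, and the reverse suggests Simplified. This is only a sketch in the spirit of the iconv answer, not a port of its truncated PHP. The second check against GB2312 and the function names are assumptions added here, and text made up solely of characters shared by both charsets cannot be decided this way.

```python
def can_encode(text, codec):
    """Return True if every character of `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def guess_chinese_variant(text):
    """Guess 'traditional' or 'simplified' from legacy charset coverage."""
    fits_big5 = can_encode(text, "big5")      # charset built for Traditional
    fits_gb2312 = can_encode(text, "gb2312")  # charset built for Simplified
    if fits_big5 and not fits_gb2312:
        return "traditional"
    if fits_gb2312 and not fits_big5:
        return "simplified"
    # Either both charsets cover the text or neither does.
    return "undetermined"


print(guess_chinese_variant("漢語"))
print(guess_chinese_variant("汉语"))
```

Real text often mixes in punctuation, Latin letters, or rare characters, so stripping non-Han characters first or counting failures per charset may be more robust than this all-or-nothing test.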

Language detection for very short text [closed]

為{幸葍}努か submitted on 2019-12-02 18:17:18
I'm creating an application for detecting the language of short texts that average fewer than 100 characters and contain slang (e.g., tweets, user queries, SMS). All the libraries I tested work well for normal web pages but not for very short text. The library giving the best results so far is Chrome's Language Detection (CLD) library, which I had to build as a shared library. CLD fails when the text is made up of very short words. After looking at the CLD source code, I see that it uses 4-grams, so that could be the reason. The approach I'm thinking of right now to improve the accuracy is:
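The suspicion about 4-grams is easy to illustrate: if n-grams are taken inside each word, words shorter than four characters contribute no features at all. The sketch below shows only that effect. It assumes per-word extraction without padding, which may not match what CLD actually does, and `word_ngrams` is a hypothetical helper rather than part of any library.

```python
def word_ngrams(text, n=4):
    """Collect character n-grams from each whitespace-separated word.

    Words shorter than `n` yield nothing, because no padding is applied.
    """
    grams = []
    for word in text.lower().split():
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams


# Slang made of very short words produces no 4-grams to classify on.
print(word_ngrams("lol u ok"))
# Longer words produce several overlapping 4-grams each.
print(word_ngrams("language detection"))
```

Under this assumption, a message such as "lol u ok" gives the classifier nothing to work with, which is consistent with the failures described for texts made of very short words.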

How can I detect a user's input language using Ruby without using an online service?

元气小坏坏 submitted on 2019-12-01 10:43:30
I'm looking for a library or technique to detect the input language of blocks of text provided by users. Online lookups (like Google Translate) won't work for this task, as I'm writing an app that must run offline. Thanks.
Here are two more n-gram-based gems you might want to try. They work offline. https://github.com/echen/unsupervised-language-identification, optimized for separating English from other languages (has a live demo). https://github.com/feedbackmine/language_detector, less specialized, will detect more languages. Some languages may need some extra training; I found it to be

Language detection with data in PostgreSQL

南楼画角 submitted on 2019-11-30 09:17:07
I have a table in PostgreSQL where one column is text. I need a library or tool that can identify the language of each text, for testing purposes. It doesn't need to be PostgreSQL code, because I'm having problems installing languages there; any language that can connect to the database, retrieve the texts, and identify them is welcome. I used Lingua::Identify, suggested in the answers, directly in a Perl script. It worked, but the results are not precise. The texts I want to identify come from the web and most are in Portuguese, but Lingua::Identify is classifying many of them as French, Italian and Spanish

How to detect language

二次信任 submitted on 2019-11-30 03:50:57
Are there any good open-source engines for detecting what language a text is in, perhaps with a probability metric? One that I can run locally and that doesn't query Google or Bing? I'd like to detect the language of each page in about 15 million pages of OCR'ed text. Not all documents will contain languages that use the Latin alphabet.
Depending on what you're doing, you might want to check out the Python Natural Language Toolkit (NLTK), which has some support for Bayesian learning algorithms. In general, the letter and word frequencies would probably be the fastest evaluation,
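As a rough illustration of the word-frequency idea, a text can be scored by how many of its words appear in a small list of very common words per language. This is a toy sketch, not NLTK's method and not a production detector: the stopword lists are tiny hand-picked assumptions, `guess_language` is a hypothetical helper, and splitting on whitespace only makes sense for languages that separate words with spaces.

```python
from collections import Counter

# Tiny hand-picked lists for illustration only; real lists are much longer.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to", "in"},
    "fr": {"le", "la", "et", "les", "des", "est"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    words = text.lower().split()
    scores = Counter()
    for lang, stopwords in STOPWORDS.items():
        scores[lang] = sum(1 for word in words if word in stopwords)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top_lang, top_score = ranked[0]
    # No stopword found, or a tie between the two best languages.
    if top_score == 0 or (len(ranked) > 1 and ranked[1][1] == top_score):
        return None
    return top_lang


print(guess_language("the cat is in the garden"))
print(guess_language("le chat est dans le jardin"))
```

Dividing each score by the number of words would give a crude confidence value, which is one way to approximate the probability metric asked for in the question.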

Browser language detection [duplicate]

北战南征 submitted on 2019-11-29 14:46:22
Question: This question already has answers here: JavaScript for detecting browser language preference [duplicate] (26 answers) Closed 3 years ago. I need to detect the browser language in my Angular2 app. Based on this language, I need to send a request (to the backend's REST API) with the localization and the IDs of the variables I need to translate. I then receive a response with the translated variables. So the app workflow is to detect the browser language, for example en-US, after that I am going to