Question
I am writing a script to detect words from language B inside text written in language A. The two languages are very similar and may share instances of the same words.
The code is here if you are interested in what I have so far: https://github.com/arashsa/language-detection.git
I will explain my method here: I create a list of bigrams from language B and a list of bigrams from language A (a small corpus for language B, a large corpus for language A). Then I remove all bigrams common to both lists. Then I go through the text in language A and, using the remaining bigrams, detect matches and store them in a file. A rough sketch of the pipeline is below. However, this method finds many words that are common to both languages, and it also finds strange bigrams, like the names of two countries adjacent to each other, and other anomalies.
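Roughly, the pipeline looks like this (the corpora and tokenizer below are toy placeholders standing in for my real data, not the code in the linked repository):

```python
import re

def tokenize(text):
    """Lowercase word tokens; a stand-in for whatever tokenizer is used."""
    return re.findall(r"\w+", text.lower(), re.UNICODE)

def bigrams(tokens):
    """Set of adjacent token pairs."""
    return set(zip(tokens, tokens[1:]))

# Placeholder corpora; substitute the real language-A and language-B data.
corpus_a = "the quick brown fox jumps over the lazy dog"
corpus_b = "den raske brune reven hopper over den late hunden"

bigrams_a = bigrams(tokenize(corpus_a))
bigrams_b = bigrams(tokenize(corpus_b))

# Drop bigrams attested in both corpora; keep those unique to language B.
unique_b = bigrams_b - bigrams_a

def find_foreign_bigrams(text):
    """Return bigrams from the text that occur only in language B."""
    tokens = tokenize(text)
    return [pair for pair in zip(tokens, tokens[1:]) if pair in unique_b]

print(find_foreign_bigrams("the fox saw den raske reven over there"))
# [('den', 'raske')]
```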
Does anyone have suggestions, reading material, or NLP methods that I might use?
Answer 1:
If your method is returning words present in both languages, and you only want words that exist in one language, you might want to first create a list of unigrams (single words) for language A and for language B, and remove every word that appears in both. A rough sketch follows. Then, if you like, you can proceed with the bigram analysis.
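A rough sketch of that unigram pre-filter, with toy corpora standing in for your real data:

```python
import re

def tokenize(text):
    """Lowercase word tokens; swap in your own tokenizer."""
    return re.findall(r"\w+", text.lower(), re.UNICODE)

# Placeholder corpora; substitute your language-A and language-B data.
corpus_a = "the quick brown fox jumps over the lazy dog"
corpus_b = "den raske brune reven hopper over den late hunden"

vocab_a = set(tokenize(corpus_a))
vocab_b = set(tokenize(corpus_b))

# Words attested only in language B; the shared vocabulary drops out
# (here, "over" appears in both corpora and is excluded).
unique_b = vocab_b - vocab_a

text = "the fox hopper over the reven"
foreign = [w for w in tokenize(text) if w in unique_b]
print(foreign)  # ['hopper', 'reven']
```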
That said, there are some good tools in Python for language identification. I've found langid
to be one of the best. It comes pre-trained with language classifiers for over 90 languages, and is fairly easy to train for additional languages if you like; see its docs for details. There is also guess-language, but it doesn't perform as well in my estimation. Depending on how localized the bits of foreign language are, you could try chunking your texts at an appropriate level of granularity and running those chunks through (e.g.) langid's classifier.
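For example, something along these lines; this assumes langid.py is installed (`pip install langid`), and the window size and the two language codes are arbitrary stand-ins for your languages A and B:

```python
import langid

# Optionally restrict the classifier to the two languages in question;
# "en" and "no" here are placeholders for languages A and B.
langid.set_languages(["en", "no"])

def classify_chunks(text, window=8):
    """Slide over the text in fixed-size word windows and label each one."""
    words = text.split()
    for i in range(0, len(words), window):
        chunk = " ".join(words[i:i + window])
        lang, score = langid.classify(chunk)
        yield lang, score, chunk

text = ("This is mostly English text men noen norske ord sniker seg inn "
        "before it switches back to plain English again")
for lang, score, chunk in classify_chunks(text):
    print(lang, score, chunk)
```

Smaller windows localize the foreign words more precisely but give the classifier less evidence per chunk, so some tuning of the granularity is to be expected.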
Source: https://stackoverflow.com/questions/27191457/detecting-foreign-words