I have a set of documents in two languages: English and German. There is no usable meta information about these documents; a program can look at the content only. Based on the content alone, how can a program determine which language each document is written in?
English and German use the same set of letters except for ä, ö, ü and ß (eszett). You can look for those letters to determine the language.
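A minimal sketch of this letter check (the caveat being that absence of these letters does not prove English, since short German texts or "ae"/"oe"/"ue"/"ss" transliterations may contain none of them):

```python
# German-specific letters, upper and lower case.
GERMAN_LETTERS = set("äöüßÄÖÜ")

def looks_german(text):
    """Return True if the text contains a German-specific letter."""
    return any(ch in GERMAN_LETTERS for ch in text)

print(looks_german("Die Straße ist schön"))   # True
print(looks_german("The road is beautiful"))  # False
```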
You can also look at the paper "Comparing two language identification schemes" by Grefenstette. It looks at letter trigrams and short words. Common trigrams for German: en_, er_, _de. Common trigrams for English: _th, the, he_.
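A toy sketch of the trigram idea: build a trigram profile per language from sample text (using '_' to mark word boundaries), then score a document against each profile. The one-sentence training samples here are illustrative stand-ins; real profiles need substantial training text per language.

```python
from collections import Counter

def trigrams(text):
    """Count letter trigrams, with '_' marking word boundaries."""
    grams = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"
        for i in range(len(padded) - 2):
            grams[padded[i:i + 3]] += 1
    return grams

# Tiny illustrative training samples (real profiles need much more text).
en_profile = trigrams("the quick brown fox jumps over the lazy dog the end")
de_profile = trigrams("der schnelle braune fuchs springt über den faulen hund")

def score(text, profile):
    """Sum profile counts of the text's trigrams; higher means more similar."""
    return sum(profile[g] * n for g, n in trigrams(text).items())

def guess(text):
    return "en" if score(text, en_profile) >= score(text, de_profile) else "de"
```

A production version would use larger profiles and a proper similarity measure (Grefenstette compares rank-order statistics rather than raw counts), but the structure is the same.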
There’s also Bob Carpenter’s How does LingPipe Perform Language ID?
I believe the standard procedure is to measure the quality of a proposed algorithm with test data (i.e. with a corpus). Define the percentage of correct classifications that you would like the algorithm to achieve, then run it over a number of documents which you have manually classified.
As for the specific algorithm: using a list of stop words sounds fine. Another approach that has been reported to work is to use a Bayesian filter, e.g. SpamBayes. Rather than training it on ham and spam, train it on English and German: use a portion of your corpus for training, then test it on the complete data.
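The SpamBayes idea, with languages in place of ham/spam, can be sketched as a word-level naive Bayes classifier. The one-line training texts below are illustrative placeholders; a real classifier would be trained on a sizable portion of your corpus.

```python
import math
from collections import Counter

def train(texts):
    """Return a log-probability function with Laplace smoothing."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    vocab = len(counts)
    return lambda w: math.log((counts[w] + 1) / (total + vocab))

# Illustrative training data; substitute documents from your own corpus.
en_logprob = train(["the quick brown fox", "she is reading a book"])
de_logprob = train(["der schnelle braune fuchs", "sie liest ein buch"])

def classify(text):
    """Pick the language whose model gives the text higher likelihood."""
    words = text.lower().split()
    en = sum(en_logprob(w) for w in words)
    de = sum(de_logprob(w) for w in words)
    return "en" if en > de else "de"
```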
Isn't the problem several orders of magnitude easier if you've only got two languages (English and German) to choose from? In this case your approach of a list of stop words might be good enough.
Obviously you'd need to consider a rewrite if you added more languages to your list.
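For the two-language case, the stop-word approach can be as simple as counting hits from each language's list. These short word lists are illustrative; fuller lists per language work better.

```python
# Small illustrative stop-word lists; extend these for real use.
EN_STOP = {"the", "and", "of", "to", "is", "in", "that", "it"}
DE_STOP = {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"}

def detect_by_stopwords(text):
    """Count stop-word hits per language; ties are reported as unknown."""
    words = set(text.lower().split())
    en_hits = len(words & EN_STOP)
    de_hits = len(words & DE_STOP)
    if en_hits == de_hits:
        return "unknown"
    return "en" if en_hits > de_hits else "de"
```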
You can use the Google Language Detection API (part of the Google AJAX Language API, which has since been deprecated).
Here is a little program that uses it:
    import json
    import urllib.parse
    import urllib.request

    baseUrl = "http://ajax.googleapis.com/ajax/services/language/detect"

    def detect(text):
        """Return the language code of a natural language text."""
        # The API accepts limited input, so only send the first 3000 characters.
        params = urllib.parse.urlencode({'v': '1.0', 'q': text[:3000]})
        resp = json.load(urllib.request.urlopen(baseUrl + "?" + params))
        return resp['responseData']['language']

    def test():
        print("Type some text to detect its language:")
        while True:
            text = input('#> ')
            print(detect(text))

    if __name__ == '__main__':
        try:
            test()
        except KeyboardInterrupt:
            print()
Other useful references:
Google Announces APIs (and demo): http://googleblog.blogspot.com/2008/03/new-google-ajax-language-api-tools-for.html
Python wrapper: http://code.activestate.com/recipes/576890-python-wrapper-for-google-ajax-language-api/
Another python script: http://www.halotis.com/2009/09/15/google-translate-api-python-script/
RFC 1766 defines the language tags used for these codes
Get the current language codes from: http://www.iana.org/assignments/language-subtag-registry
Try measuring the occurrences of each letter in the text. Compute the letter frequencies (and perhaps their distributions) from known English and German texts. Having obtained these data, you can reason about which language the frequency distribution of your text belongs to.
You could use Bayesian inference to determine the closest language (with a certain error probability), or perhaps other statistical methods suited to such tasks.
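As a simpler statistical baseline than full Bayesian inference, the frequency comparison can be done with a squared-error distance to reference distributions. The reference frequencies below are rough illustrative values, not measured corpus statistics.

```python
from collections import Counter

# Rough, illustrative letter frequencies; replace with values measured
# from a real English and German corpus.
EN_FREQ = {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "i": 0.070,
           "n": 0.067, "s": 0.063, "h": 0.061, "r": 0.060}
DE_FREQ = {"e": 0.174, "n": 0.098, "i": 0.076, "s": 0.073, "r": 0.070,
           "a": 0.065, "t": 0.062, "d": 0.051, "h": 0.048}

def letter_freqs(text):
    """Relative frequency of each letter in the text."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {c: n / total for c, n in counts.items()} if total else {}

def distance(freqs, ref):
    """Sum of squared differences over the reference letters."""
    return sum((freqs.get(c, 0.0) - p) ** 2 for c, p in ref.items())

def guess_language(text):
    f = letter_freqs(text)
    return "en" if distance(f, EN_FREQ) < distance(f, DE_FREQ) else "de"
```

This works best on longer documents, where the observed frequencies converge toward the language's true distribution.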