I am working a word based game. My word database contains around 10,000 english words (sorted alphabetically). I am planning to have 5 difficulty levels in the game. Level 1
depending on the type of game the definition of "difficult" will change. If your game involves typing quickly (ztype-style...), "difficult" will have a different meaning than in a game where you need to define a word's meaning.
That said, Scrabble has a way to measure how "difficult" a word is which is also quite easy algoritmically.
Also you may look into defining "difficult" in terms of your game. You could beta test your game and classify words according to how "difficult" players find them in the context of your own game.
I'm not understanding how frequency is being used... if you were to scan a newspaper, I'm sure you would see the word "thoroughly" mentioned much more frequently than the word "bop" or "moo" but that doesn't mean it's an easier word; on the contrary 'thoroughly' is one of the most disgustingly absurd spelling anomalies that gives grade school children nightmares...
Try explaining to a sane human being learning english as a second language the subtle difference between slaughter and laughter.
I agree that frequency of use is the most likely metric; there are studies supporting a high correlation between word frequency and difficulty (correct responses on tests, etc.). Check out the English Lexicon Project at http://elexicon.wustl.edu/ for some 70k(?) frequency-rated words.
Word length is a good indicator , for word frequency , you would need data as an algorithm can obviously not determine it by itself. You could also use some sort of scoring like the scrabble game does : each letter has a value and the final value would be the sum of the values. It would be imo easier to find frequency data about each letter in your language .
I would guess that the grade at wich the word is introduced into normal students vocabulary is a measure of difficulty. Next would be how many standard rule violations it has. Meaning your words that have spellings or pronunciations that seem to violate the normal set off rules. Finally.. the meaning.. can be a tough concept. .. for example ... try explaining abstract to someone who's never heard the word.
Difficulty is a pretty amorphus concept. If you've no clear idea of what you want, perhaps you could take a look at the Porter Stemming Algorithm (see for example the original paper). That contains a more advanced idea of 'length' by defining words as being of the form [C](VC){m}[V]
; C means a block of consonants and V a block of vowels and this definition says a word is an optional C followed by m VC blocks and finally an optional V. The m value is this advanced 'length'.