I have a string of characters, and I'm looking for the arrangement of those characters that is as pronounceable as possible.
For example, if I have
(For completeness, here's my original pure Python solution that inspired me to try machine learning.)
I agree a reliable solution would require a sophisticated model of the English language, but maybe we can come up with a simple heuristic that's tolerably bad.
I can think of two basic rules satisfied by most pronounceable words:
1. contain a vowel sound
2. no more than two consonant sounds in succession
As a regular expression this can be written c?c?(v+cc?)*v*
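To get a feel for what this pattern accepts, here is a minimal sketch that applies it to strings made only of the symbols c and v. The names SKELETON and follows_rules are made up for this illustration and are not part of the solution itself.

```python
import re

# The abstract pattern from the text: c = one consonant sound, v = one vowel sound.
SKELETON = re.compile(r"c?c?(v+cc?)*v*")

def follows_rules(skeleton):
    """Return True if a string of 'c'/'v' symbols fits the whole pattern."""
    return SKELETON.fullmatch(skeleton) is not None

for s in ["cvc", "ccvccv", "cccv", "vcccv", "cc"]:
    print(s, follows_rules(s))
```

Both strings with three consonants in a row are rejected. Note that "cc" is accepted even though it contains no vowel, so the pattern on its own enforces rule 2 but not strictly rule 1.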
Now a simplistic attempt to identify sounds from spelling:
vowels = "a e i o u y".split()
consonants = "b bl br c ch cr chr cl ck d dr f fl g gl gr h j k l ll m n p ph pl pr q r s sc sch sh sl sp st t th thr tr v w wr x y z".split()
Then it's possible to express the rules with regular expressions:
v = "({0})".format("|".join(vowels))
c = "({0})".format("|".join(consonants))
import re
pattern = re.compile("^{1}?{1}?({0}+{1}{1}?)*{0}*$".format(v, c))
def test(w):
    return re.search(pattern, w)
def predict(words):
    return ["word" if test(w) else "scrambled" for w in words]
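As a quick sanity check, the pieces above can be put together into one self-contained script and tried on a few strings. This is only a sketch: the sample inputs are made up for illustration and are not taken from the test set, and the function is renamed looks_pronounceable here.

```python
import re

vowels = "a e i o u y".split()
consonants = ("b bl br c ch cr chr cl ck d dr f fl g gl gr h j k l ll m n p ph "
              "pl pr q r s sc sch sh sl sp st t th thr tr v w wr x y z").split()

# One alternation group per sound class, as in the answer.
v = "({0})".format("|".join(vowels))
c = "({0})".format("|".join(consonants))
pattern = re.compile("^{1}?{1}?({0}+{1}{1}?)*{0}*$".format(v, c))

def looks_pronounceable(w):
    # The anchored pattern must cover the whole string.
    return pattern.search(w) is not None

def predict(words):
    return ["word" if looks_pronounceable(w) else "scrambled" for w in words]

# Hypothetical sample inputs, chosen only for illustration.
samples = ["hello", "string", "xkcd", "tsrngi"]
print(list(zip(samples, predict(samples))))
```

The heuristic is deliberately crude: plenty of scrambles still fit the pattern (for example "lpehl" passes as l, p, e, h, l), which is consistent with the low recall on the scrambled class.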
This scores about 74% (average F1) on the word/scrambled test set:
             precision    recall  f1-score   support
  scrambled       0.90      0.57      0.70     52403
       word       0.69      0.93      0.79     52940
avg / total       0.79      0.75      0.74    105343
A tweaked version scored 80%.