Arranging letters in the most pronounceable way?

前端 未结 2 633
误落风尘
误落风尘 2021-02-05 13:22

I have a string with some characters, and I\'m looking for the organization of those characters such that it\'s the most pronounceable possible.

For example, if I have

相关标签:
2条回答
  • 2021-02-05 14:03

    Start by solving a simpler problem: is a given word pronounceable?

    Machine learning 'supervised learning' could be effective here. Train a binary classifier on a training set of dictionary words and scrambled words (assume the scrambled words are all unpronounceable). For features, I suggest counting bigrams and trigrams. My reasoning: unpronounceable trigrams such as 'tns' and 'srh' are rare in dictionary words, even though the individual letters are each common.

    The idea is that the trained algorithm will learn to classify words with any rare trigrams as unpronounceable, and words with only common trigrams as pronounceable.


    Here's an implementation with scikit-learn http://scikit-learn.org/

    import random
    def scramble(s):
        return "".join(random.sample(s, len(s)))
    
    words = [w.strip() for w in open('/usr/share/dict/words') if w == w.lower()]
    scrambled = [scramble(w) for w in words]
    
    X = words+scrambled
    y = ['word']*len(words) + ['unpronounceable']*len(scrambled)
    
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    
    text_clf = Pipeline([
        ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
        ('clf', MultinomialNB())
        ])
    
    text_clf = text_clf.fit(X_train, y_train)
    predicted = text_clf.predict(X_test)
    
    from sklearn import metrics
    print(metrics.classification_report(y_test, predicted))
    

    It scores 92% accuracy. Given pronounceability is subjective anyway, this might be as good as it gets.

                     precision    recall  f1-score   support
    
          scrambled       0.93      0.91      0.92     52409
               word       0.92      0.93      0.93     52934
    
        avg / total       0.92      0.92      0.92    105343
    

    It agrees with your examples:

    >>> text_clf.predict("scaroly crasoly oascrly yrlcsoa".split())
    ['word', 'word', 'unpronounceable', 'unpronounceable']
    

    For the curious, here are 10 scrambled words it classifies pronounceable:

    • moro garapm ocenfir onerixoatteme arckinbo raetomoporyo bheral accrene cchmanie suroatipsheq

    And finally 10 dictionary words misclassified as unpronouncable:

    • ilch tohubohu usnea halfpaced pyrostilpnite lynnhaven cruel enure moldproof piecemeal
    0 讨论(0)
  • 2021-02-05 14:03

    (For completeness, here's my original pure Python solution that inspired me to try machine learning.)

    I agree a reliable solution would require a sophisticated model of the English language, but maybe we can come up with a simple heuristic that's tolerably bad.

    I can think of two basic rules satisfied by most pronouncable words:

    1. contain a vowel sound
    2. no more than two consonant sounds in succession
    

    As a regular expression this can be written c?c?(v+cc?)*v*

    Now a simplistic attempt to identify sounds from spelling:

    vowels = "a e i o u y".split()
    consonants = "b bl br c ch cr chr cl ck d dr f fl g gl gr h j k l ll m n p ph pl pr q r s sc sch sh sl sp st t th thr tr v w wr x y z".split()
    

    Then it's possible to the rules with regular expressions:

    v = "({0})".format("|".join(vowels))
    c = "({0})".format("|".join(consonants))
    
    import re
    pattern = re.compile("^{1}?{1}?({0}+{1}{1}?)*{0}*$".format(v, c))
    def test(w):
        return re.search(pattern, w)
    
    def predict(words):
        return ["word" if test(w) else "scrambled" for w in words]
    

    This scores about 74% on the word/scrambled test set.

                 precision    recall  f1-score   support
    
      scrambled       0.90      0.57      0.70     52403
           word       0.69      0.93      0.79     52940
    
    avg / total       0.79      0.75      0.74    105343
    

    A tweaked version scored 80%.

    0 讨论(0)
提交回复
热议问题