Programmatically determine whether to describe an object with “a” or “an”?

耶瑟儿~ 2020-12-01 21:13

I have a database of nouns (e.g. "house", "exclamation point", "apple") that I need to output and describe in my application. It's hard to put together a natural-soundi

8 Answers
  • 2020-12-01 21:57

    What you want is to determine the appropriate indefinite article. Lingua::EN::Inflect is a Perl module that does a great job of this. I've extracted the relevant code and pasted it below. It's just a series of cases and some regular expressions, so it shouldn't be difficult to port to PHP. A friend ported it to Python here if anyone is interested.

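    # 2. INDEFINITE ARTICLES
    #
    # NOTE: enclose(), ud_match(), $persistent_count, $PL_count_one AND
    # @A_a_user_defined ARE DEFINED ELSEWHERE IN Lingua::EN::Inflect AND
    # ARE NOT PART OF THIS EXCERPT.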
    
    # THIS PATTERN MATCHES STRINGS OF CAPITALS STARTING WITH A "VOWEL-SOUND"
    # CONSONANT FOLLOWED BY ANOTHER CONSONANT, AND WHICH ARE NOT LIKELY
    # TO BE REAL WORDS (OH, ALL RIGHT THEN, IT'S JUST MAGIC!)
    
    my $A_abbrev = q{
    (?! FJO | [HLMNS]Y.  | RY[EO] | SQU
      | ( F[LR]? | [HL] | MN? | N | RH? | S[CHKLMNPTVW]? | X(YL)?) [AEIOU])
    [FHLMNRSX][A-Z]
    };
    
    # THIS PATTERN CODES THE BEGINNINGS OF ALL ENGLISH WORDS BEGINNING WITH A
    # 'y' FOLLOWED BY A CONSONANT. ANY OTHER Y-CONSONANT PREFIX THEREFORE
    # IMPLIES AN ABBREVIATION.
    
    my $A_y_cons = 'y(b[lor]|cl[ea]|fere|gg|p[ios]|rou|tt)';
    
    # EXCEPTIONS TO EXCEPTIONS
    
    my $A_explicit_an = enclose join '|',
    (
        "euler",
        "hour(?!i)", "heir", "honest", "hono",
    );
    
    my $A_ordinal_an = enclose join '|',
    (
        "[aefhilmnorsx]-?th",
    );
    
    my $A_ordinal_a = enclose join '|',
    (
        "[bcdgjkpqtuvwyz]-?th",
    );
    
    sub A {
        my ($str, $count) = @_;
        my ($pre, $word, $post) = ( $str =~ m/\A(\s*)(?:an?\s+)?(.+?)(\s*)\Z/i );
        return $str unless $word;
        my $result = _indef_article($word,$count);
        return $pre.$result.$post;
    }
    
    sub AN { goto &A }
    
    sub _indef_article {
        my ( $word, $count ) = @_;
    
        $count = $persistent_count
            if !defined($count) && defined($persistent_count);
    
        return "$count $word"
            if defined $count && $count!~/^($PL_count_one)$/io;
    
        # HANDLE USER-DEFINED VARIANTS
    
        my $value;
        return "$value $word"
            if defined($value = ud_match($word, @A_a_user_defined));
    
        # HANDLE ORDINAL FORMS
    
        $word =~ /^($A_ordinal_a)/i         and return "a $word";
        $word =~ /^($A_ordinal_an)/i        and return "an $word";
    
        # HANDLE SPECIAL CASES
    
        $word =~ /^($A_explicit_an)/i       and return "an $word";
        $word =~ /^[aefhilmnorsx]$/i        and return "an $word";
        $word =~ /^[bcdgjkpqtuvwyz]$/i      and return "a $word";
    
    
        # HANDLE ABBREVIATIONS
    
        $word =~ /^($A_abbrev)/ox           and return "an $word";
        $word =~ /^[aefhilmnorsx][.-]/i     and return "an $word";
        $word =~ /^[a-z][.-]/i              and return "a $word";
    
        # HANDLE CONSONANTS
    
        $word =~ /^[^aeiouy]/i              and return "a $word";
    
        # HANDLE SPECIAL VOWEL-FORMS
    
        $word =~ /^e[uw]/i                  and return "a $word";
        $word =~ /^onc?e\b/i                and return "a $word";
        $word =~ /^uni([^nmd]|mo)/i         and return "a $word";
        $word =~ /^ut[th]/i                 and return "an $word";
        $word =~ /^u[bcfhjkqrst][aeiou]/i   and return "a $word";
    
        # HANDLE SPECIAL CAPITALS
    
        $word =~ /^U[NK][AIEO]?/            and return "a $word";
    
        # HANDLE VOWELS
    
        $word =~ /^[aeiou]/i                and return "an $word";
    
        # HANDLE y... (BEFORE CERTAIN CONSONANTS IMPLIES (UNNATURALIZED) "i.." SOUND)
    
        $word =~ /^($A_y_cons)/io           and return "an $word";
    
        # OTHERWISE, GUESS "a"
        return "a $word";
    }
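
For readers porting the rules, here is a minimal Python sketch of the cascade in `_indef_article`. It is not the Python port mentioned above, and it makes some assumptions: the count handling and user-defined variants are left out, and the names `RULES`, `A_ABBREV` and `indefinite_article` are made up for this illustration. The patterns themselves are copied from the Perl excerpt.

```python
import re

# Abbreviation pattern copied from the Perl excerpt; it is case-sensitive
# and written in verbose mode, like the original /x pattern.
A_ABBREV = re.compile(r"""
    (?! FJO | [HLMNS]Y. | RY[EO] | SQU
      | ( F[LR]? | [HL] | MN? | N | RH? | S[CHKLMNPTVW]? | X(YL)? ) [AEIOU] )
    [FHLMNRSX][A-Z]
""", re.VERBOSE)

I = re.IGNORECASE

# (compiled pattern, article) pairs, tried in the same order as the Perl code.
RULES = [
    (re.compile(r"[bcdgjkpqtuvwyz]-?th", I), "a"),       # ordinal forms
    (re.compile(r"[aefhilmnorsx]-?th", I), "an"),
    (re.compile(r"euler|hour(?!i)|heir|honest|hono", I), "an"),  # special cases
    (re.compile(r"[aefhilmnorsx]$", I), "an"),           # single letters
    (re.compile(r"[bcdgjkpqtuvwyz]$", I), "a"),
    (A_ABBREV, "an"),                                    # abbreviations
    (re.compile(r"[aefhilmnorsx][.-]", I), "an"),
    (re.compile(r"[a-z][.-]", I), "a"),
    (re.compile(r"[^aeiouy]", I), "a"),                  # consonants
    (re.compile(r"e[uw]", I), "a"),                      # special vowel forms
    (re.compile(r"onc?e\b", I), "a"),
    (re.compile(r"uni([^nmd]|mo)", I), "a"),
    (re.compile(r"ut[th]", I), "an"),
    (re.compile(r"u[bcfhjkqrst][aeiou]", I), "a"),
    (re.compile(r"U[NK][AIEO]?"), "a"),                  # special capitals
    (re.compile(r"[aeiou]", I), "an"),                   # plain vowels
    (re.compile(r"y(b[lor]|cl[ea]|fere|gg|p[ios]|rou|tt)", I), "an"),
]


def indefinite_article(word: str) -> str:
    """Return the word prefixed with "a" or "an", using the first matching rule."""
    for pattern, article in RULES:
        # match() anchors at the start of the word, like ^ in the Perl patterns.
        if pattern.match(word):
            return f"{article} {word}"
    return f"a {word}"  # otherwise, guess "a"


if __name__ == "__main__":
    for noun in ["house", "exclamation point", "apple", "hour", "unicorn", "FBI"]:
        print(indefinite_article(noun))
```

Keeping the rules in an ordered list makes the precedence explicit, which matters because earlier rules (for example `hour`) must win over the later generic consonant rule.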
    
  • 2020-12-01 22:02

    The problem with rule-based systems is that they handle edge cases poorly and are complicated. If you can base your decisions on actual data, you'll do better. In this answer I describe how you might use Wikipedia to build a lookup dictionary, and link to a (very simple) JavaScript implementation using such a dictionary.

    A prefix-dictionary will deal fairly well with acronyms and numbers, though with some effort you could probably do better.
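
To make the prefix-dictionary idea concrete, here is a small Python sketch of a longest-prefix lookup. It is not the JavaScript implementation mentioned above. The table `PREFIX_ARTICLES` is a tiny, hand-written stand-in (a real table would be derived from corpus data such as Wikipedia text), and the function name `article_for` is an assumption for this example.

```python
# Hypothetical, hand-written prefix table for illustration only.
# The empty prefix is the global default; longer prefixes override shorter ones.
PREFIX_ARTICLES = {
    "": "a",
    "a": "an", "e": "an", "i": "an", "o": "an", "u": "an",
    "eu": "a", "one": "a", "uni": "a",
    "hour": "an", "heir": "an", "honest": "an",
    "8": "an", "11": "an", "18": "an",
}


def article_for(word: str) -> str:
    """Return "a" or "an" for a word using the longest matching prefix."""
    key = word.lower()
    # Try the whole word first, then successively shorter prefixes.
    for end in range(len(key), -1, -1):
        article = PREFIX_ARTICLES.get(key[:end])
        if article is not None:
            return article
    return "a"  # fallback in case the table has no empty-prefix entry


if __name__ == "__main__":
    for noun in ["house", "apple", "hour", "unicorn", "18-wheeler"]:
        print(article_for(noun), noun)
```

Because the lookup always prefers the longest known prefix, exceptions such as "hour" or "uni" only need their own entries, and everything else falls back to the shorter, more general prefixes.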
