Separate word lists for nouns, verbs, adjectives, etc.

眼角桃花 2021-01-30 02:54

Word lists usually come as a single file that contains everything, but are there separately downloadable noun lists, verb lists, adjective lists, etc.?

I need them specifically for English.

5 Answers
  •  醉酒成梦
    2021-01-30 03:08

    As others have suggested, the WordNet database files are a great source for parts of speech. That said, the examples used to extract the words aren't entirely correct. Each line is actually a "synonym set" consisting of multiple synonyms and their definition. Around 30% of words appear only as synonyms, so simply extracting the first word misses a large amount of data.

    The line format is pretty simple to parse (search.c, function parse_synset), but if all you're interested in are the words, the relevant part of the line is formatted as:

    NNNNNNNN NN a NN word N [word N ...]
    

    These correspond to:

    • Byte offset within file (8 character integer)
    • File number (2 character integer)
    • Part of speech (1 character)
    • Number of words (2 characters, hex encoded)
    • N occurrences of...
      • Word with spaces replaced with underscores, optional comment in parentheses
      • Word lexical ID (a unique occurrence ID)
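
    A quick way to sanity-check a line against this layout is a single regex capture. The following sketch is mine, not part of the original answer, and the variable names are illustrative:

    # Capture the synset header fields described above
    if ($line =~ /^(\d{8}) (\d{2}) ([nvasr]) ([0-9a-f]{2}) /) {
        my ($offset, $file_num, $pos, $count_hex) = ($1, $2, $3, $4);
        my $word_count = hex($count_hex); # word count is hex encoded
    }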

    For example, from data.adj:

    00004614 00 s 02 cut 0 shortened 0 001 & 00004412 a 0000 | with parts removed; "the drastically cut film"
    
    • Byte offset within the file is 4614
    • File number is 0
    • Part of speech is s, corresponding to adjective (wnutil.c, function getpos); the full set of codes is sketched just after this list
    • Number of words is 2
      • First word is cut with lexical ID 0
      • Second word is shortened with lexical ID 0
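
    For reference, the part-of-speech field can be n, v, a, s, or r. A minimal Perl lookup for these codes (the hash name is my own; the mapping follows getpos, which treats s, the adjective satellite code, as an adjective):

    # WordNet part-of-speech codes in the data.* files
    my %pos_name = (
        n => 'noun',
        v => 'verb',
        a => 'adjective',
        s => 'adjective', # adjective satellite, grouped with adjectives
        r => 'adverb',
    );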

    A short Perl script to simply dump the words from the data.* files:

    #!/usr/bin/perl
    
    while (my $line = <>) {
        # If no 8-digit byte offset is present, skip this line
        if ( $line !~ /^[0-9]{8}\s/ ) { next; }
        chomp($line);
    
        my @tokens = split(/ /, $line);
        shift(@tokens); # Byte offset
        shift(@tokens); # File number
        shift(@tokens); # Part of speech
    
        my $word_count = hex(shift(@tokens));
        foreach ( 1 .. $word_count ) {
            my $word = shift(@tokens);
        $word =~ tr/_/ /;    # Restore spaces in multi-word entries
        $word =~ s/\(.*\)//; # Strip the optional parenthesized comment
            print $word, "\n";
    
            shift(@tokens); # Lexical ID
        }
    }
    

    A gist of the above script can be found here.
    A more robust parser which stays true to the original source can be found here.

    Both scripts are used in a similar fashion: ./wordnet_parser.pl DATA_FILE.
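
    Since WordNet ships one data.* file per part of speech, running the script over each file yields the separate lists the question asks for. For example (assuming the standard WordNet file names, and piping through sort -u because the same word can appear in many synsets):

    ./wordnet_parser.pl data.noun | sort -u > nouns.txt
    ./wordnet_parser.pl data.verb | sort -u > verbs.txt
    ./wordnet_parser.pl data.adj  | sort -u > adjectives.txt
    ./wordnet_parser.pl data.adv  | sort -u > adverbs.txt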
