Separate word lists for nouns, verbs, adjectives, etc.

眼角桃花 2021-01-30 02:54

Word lists usually come as a single file that contains everything, but are there separately downloadable noun lists, verb lists, adjective lists, etc.?

I need them specifically for English.

5 Answers
  •  醉酒成梦
    2021-01-30 03:08

    As others have suggested, the WordNet database files are a great source for parts of speech. That said, the examples used to extract the words aren't entirely correct. Each line is actually a "synonym set" consisting of multiple synonyms and their definition. Around 30% of words appear only as synonyms, so simply extracting the first word misses a large amount of data.

    The line format is pretty simple to parse (search.c, function parse_synset), but if all you're interested in are the words, the relevant part of the line is formatted as:

    NNNNNNNN NN a NN word N [word N ...]
    

    These correspond to:

    • Byte offset within file (8 character integer)
    • File number (2 character integer)
    • Part of speech (1 character)
    • Number of words (2 characters, hex encoded)
    • N occurrences of...
      • Word with spaces replaced with underscores, optional comment in parentheses
      • Word lexical ID (a unique occurrence ID)
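
    A quick way to sanity-check a line against this layout is a single regex capture. The following sketch is mine, not part of the original answer, and the variable names are illustrative:

    # Capture the synset header fields described above
    if ($line =~ /^(\d{8}) (\d{2}) ([nvasr]) ([0-9a-f]{2}) /) {
        my ($offset, $file_num, $pos, $count_hex) = ($1, $2, $3, $4);
        my $word_count = hex($count_hex); # word count is hex encoded
    }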

    For example, from data.adj:

    00004614 00 s 02 cut 0 shortened 0 001 & 00004412 a 0000 | with parts removed; "the drastically cut film"
    
    • Byte offset within the file is 4614
    • File number is 0
    • Part of speech is s, corresponding to adjective (wnutil.c, function getpos); the full set of codes is sketched just after this list
    • Number of words is 2
      • First word is cut with lexical ID 0
      • Second word is shortened with lexical ID 0
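
    For reference, the part-of-speech field can be n, v, a, s, or r. A minimal Perl lookup for these codes (the hash name is my own; the mapping follows getpos, which treats s, the adjective satellite code, as an adjective):

    # WordNet part-of-speech codes in the data.* files
    my %pos_name = (
        n => 'noun',
        v => 'verb',
        a => 'adjective',
        s => 'adjective', # adjective satellite, grouped with adjectives
        r => 'adverb',
    );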

    A short Perl script to simply dump the words from the data.* files:

    #!/usr/bin/perl
    
    while (my $line = <>) {
        # If no 8-digit byte offset is present, skip this line
        if ( $line !~ /^[0-9]{8}\s/ ) { next; }
        chomp($line);
    
        my @tokens = split(/ /, $line);
        shift(@tokens); # Byte offset
        shift(@tokens); # File number
        shift(@tokens); # Part of speech
    
        my $word_count = hex(shift(@tokens));
        foreach ( 1 .. $word_count ) {
            my $word = shift(@tokens);
        $word =~ tr/_/ /;    # Restore spaces in multi-word entries
        $word =~ s/\(.*\)//; # Strip the optional parenthesized comment
            print $word, "\n";
    
            shift(@tokens); # Lexical ID
        }
    }
    

    A gist of the above script can be found here.
    A more robust parser which stays true to the original source can be found here.

    Both scripts are used in a similar fashion: ./wordnet_parser.pl DATA_FILE.
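
    Since WordNet ships one data.* file per part of speech, running the script over each file yields the separate lists the question asks for. For example (assuming the standard WordNet file names, and piping through sort -u because the same word can appear in many synsets):

    ./wordnet_parser.pl data.noun | sort -u > nouns.txt
    ./wordnet_parser.pl data.verb | sort -u > verbs.txt
    ./wordnet_parser.pl data.adj  | sort -u > adjectives.txt
    ./wordnet_parser.pl data.adv  | sort -u > adverbs.txt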
