Usually word lists come as a single file that contains everything, but are there separately downloadable noun, verb, and adjective lists, etc.?
I need them specifically for English.
As others have suggested, the WordNet database files are a great source for parts of speech. That said, the examples used to extract the words aren't entirely correct. Each line is actually a "synonym set" (synset) consisting of multiple synonyms and their definition. Around 30% of words appear only as synonyms, so simply extracting the first word misses a large amount of data.
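If you want to verify that yourself, here's a minimal sketch (the script name count_missed.pl is mine; it tokenizes lines the same way as the dump script further down, using the line format explained below) that counts how many distinct words never appear as a synset's first word:

#!/usr/bin/perl
use strict;
use warnings;

my (%all, %first);
while (my $line = <>) {
    # Data lines start with an 8-digit byte offset
    next if $line !~ /^[0-9]{8}\s/;
    chomp($line);

    my @tokens = split(/ /, $line);
    splice(@tokens, 0, 3);           # Byte offset, file number, part of speech
    my $word_count = hex(shift(@tokens));
    foreach my $i ( 1 .. $word_count ) {
        my $word = shift(@tokens);
        $word =~ s/\(.*\)//;         # Strip syntactic markers such as "(a)"
        $all{$word} = 1;
        $first{$word} = 1 if $i == 1;
        shift(@tokens);              # Lexical ID
    }
}

my $missed = grep { !$first{$_} } keys %all;
printf "%d of %d words only ever appear as non-first synonyms\n",
    $missed, scalar(keys %all);

Run it as ./count_missed.pl data.adj (or any other data.* file).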
The line format is pretty simple to parse (search.c, function parse_synset), but if all you're interested in are the words, the relevant part of the line is formatted as:
NNNNNNNN NN a NN word N [word N ...]
These correspond to:

- NNNNNNNN: the synset's byte offset within the file (8 decimal digits)
- NN: the file number (2 decimal digits)
- a: the part of speech (one character)
- NN: the number of words in the synset (2 hexadecimal digits)
- word N: a word followed by its lexical ID (1 hexadecimal digit), repeated once per word
For example, from data.adj:
00004614 00 s 02 cut 0 shortened 0 001 & 00004412 a 0000 | with parts removed; "the drastically cut film"
Breaking this down:

- s, corresponding to adjective (wnutil.c, function getpos)
- cut with lexical ID 0
- shortened with lexical ID 0

A short Perl script to simply dump the words from the data.* files:
#!/usr/bin/perl
while (my $line = <>) {
    # If no 8-digit byte offset is present, skip this line
    if ( $line !~ /^[0-9]{8}\s/ ) { next; }
    chomp($line);

    my @tokens = split(/ /, $line);
    shift(@tokens); # Byte offset
    shift(@tokens); # File number
    shift(@tokens); # Part of speech

    my $word_count = hex(shift(@tokens));
    foreach ( 1 .. $word_count ) {
        my $word = shift(@tokens);
        $word =~ tr/_/ /;     # Multi-word entries use underscores
        $word =~ s/\(.*\)//;  # Strip syntactic markers such as "(a)"
        print $word, "\n";
        shift(@tokens); # Lexical ID
    }
}
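Fed the example line from data.adj above, the script prints each synonym on its own line:

cut
shortened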
A gist of the above script can be found here.
A more robust parser which stays true to the original source can be found here.
Both scripts are used in a similar fashion: ./wordnet_parser.pl DATA_FILE.
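Since WordNet ships a separate data.* file per part of speech, this gives you exactly the per-category lists asked about. For example, assuming the standard dict/ layout of a WordNet installation (paths may differ on your system):

./wordnet_parser.pl dict/data.noun | sort -u > nouns.txt
./wordnet_parser.pl dict/data.verb | sort -u > verbs.txt
./wordnet_parser.pl dict/data.adj  | sort -u > adjectives.txt
./wordnet_parser.pl dict/data.adv  | sort -u > adverbs.txt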