I'm preparing some table names for an ORM, and I want to turn plural table names into single entity names. My only problem is finding an algorithm that does it reliably. He
See also this answer, which recommends using Morpha (or studying the algorithm behind it).
If you know that the words that you want to lemmatize are plural nouns, then you can tag them with NNS to get a more accurate output.
Input example:
$ cat test.txt
Types_NNS
Pies_NNS
Trees_NNS
Buses_NNS
Radii_NNS
Communities_NNS
Sheep_NNS
Fish_NNS
Output example:
$ cat test.txt | ./morpha -c
Type
Pie
Tree
Bus
Radius
Community
Sheep
Fish
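If the table names start out as a plain list, the tagged input shown above can be generated rather than typed by hand. This is only a sketch: tag_plural_nouns is a hypothetical helper name, and the only thing taken from the example is the word_NNS line format. It does not call morpha itself.

```python
def tag_plural_nouns(words, tag="NNS"):
    """Return one 'word_TAG' string per word, matching the tagged input format."""
    return [f"{word}_{tag}" for word in words]


# Hypothetical list of table names to prepare for the lemmatizer.
table_names = ["Types", "Pies", "Communities"]
print("\n".join(tag_plural_nouns(table_names)))
```

The printed lines could be written to a file such as test.txt and piped to morpha as in the example.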
Those are all general rules (and good ones) but English is not a language for the faint of heart :-).
My own preference would be to have a transformation engine along with a set of transformations (surprisingly enough) for doing the actual work. You would run through the transformations (from specific to general) and, when a match was found, apply the transformation to the word and stop.
Regular expressions would be an ideal approach to this due to their expressiveness. An example rule set:
1. If the word is "fish", return "fish".
2. If the word is "sheep", return "sheep".
3. If the word is "radii", return "radius".
4. If the word ends in "ii", replace that "ii" with "us" (octopii, virii).
5. If the word ends with -ies, replace the ending with -y.
6. If the word ends with -es, remove it.
7. Otherwise, just remove any trailing -s.
Note the requirement to keep this transformation set up to date. For example, let's say someone adds the table name types. This would currently be captured by rule #6 and you would get the singular value typ, which is obviously wrong.
The solution is to insert a new rule somewhere before #6, something like:
3.5: If the word is "types", return "type".
for a very specific transformation, or perhaps somewhere later if it can be made more general.
In other words, you'll basically need to keep this transformation table updated as you find all those wondrous exceptions that English has spawned over the centuries.
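As a rough illustration of the rule-driven approach described above, here is a minimal Python sketch. The names RULES, RULES_WITH_TYPES and singularize are made up for this example, and the regular expressions are just one possible reading of the seven rules, not a complete or authoritative rule set.

```python
import re

# Ordered from most specific to most general; the first matching rule wins.
# Each entry is (pattern, replacement), mirroring rules 1-7 from the answer.
RULES = [
    (r"^fish$", "fish"),
    (r"^sheep$", "sheep"),
    (r"^radii$", "radius"),
    (r"ii$", "us"),
    (r"ies$", "y"),
    (r"es$", ""),
    (r"s$", ""),
]


def singularize(word, rules=RULES):
    """Apply the first rule whose pattern matches, then stop."""
    lowered = word.lower()
    for pattern, replacement in rules:
        if re.search(pattern, lowered):
            return re.sub(pattern, replacement, lowered)
    return lowered  # no rule matched: assume the word is already singular


# "Rule 3.5": an exception inserted ahead of the general -es rule.
RULES_WITH_TYPES = RULES[:3] + [(r"^types$", "type")] + RULES[3:]

print(singularize("communities"))              # community
print(singularize("buses"))                    # bus
print(singularize("types"))                    # typ (the problem case)
print(singularize("types", RULES_WITH_TYPES))  # type
```

Keeping the rules as data rather than as branching code means a newly discovered exception is a one-line addition, and its position in the list decides its priority.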
The other possibility is to not waste your time with general rules at all.
Since the names of the tables will be relatively limited, just create another table (or some sort of data structure) called singulars which maps all the relevant plural table names (employees, customers) to singular object names (employee, customer).
Then every time a table is added to your schema, ensure you add an entry to the singulars "table" so you can singularize it.
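The lookup approach could be as small as a dictionary. This sketch assumes an in-memory mapping; SINGULARS and singular_for are hypothetical names, and the two entries are the examples from the text. It fails loudly on an unmapped table so that a missing entry is noticed when the table is added rather than later.

```python
# Hypothetical mapping of plural table names to singular entity names.
SINGULARS = {
    "employees": "employee",
    "customers": "customer",
}


def singular_for(table_name):
    """Return the entity name for a table, raising KeyError if it is unmapped."""
    try:
        return SINGULARS[table_name.lower()]
    except KeyError:
        raise KeyError(
            f"no singular registered for table {table_name!r}; add it to SINGULARS"
        ) from None


print(singular_for("employees"))  # employee
```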
Consider the Python package "inflect":
"Correctly generate plurals, singular nouns, ordinals, indefinite articles; convert numbers to words"
https://pypi.python.org/pypi/inflect
Maybe take a look at the source code of something like the Rails Inflector.
I'm going to try this MorphAdorner: http://morphadorner.northwestern.edu/morphadorner/download/ (Java). It's a collection of different types of NLP processing tools, and you can test them through online examples. For your problem (that is also my problem) there's the Pluralizer tool: http://morphadorner.northwestern.edu/morphadorner/pluralizer/example/
The problem is that it's based on the general rules, but English has (figuratively) a billion exceptions... What do you do with words like "fish", or "geese"?
Also, the rules are for how to turn singular nouns to plurals. The reverse mapping isn't necessarily possible (consider "freebies").