Can you programmatically detect pluralizations of English words, and derive the singular form?

Submitted by 我是研究僧i on 2019-11-29 00:21:34

It really depends on what you mean by 'programmatically'. Part of English works on easy-to-understand rules, and part doesn't; which part a word falls into has mainly to do with frequency. For a brief overview, you can read Pinker's "Words and Rules", but do yourself a favor and don't take the whole generative theory of linguistics entirely to heart. There's a lot more empiricism in this problem than that school of thought really brings to the pursuit.

By the way, stemming or lemmatization is the term you're looking for. A lot of English can be lemmatized statistically. One of the most effective lemmatizers that works off statistical rules bootstrapped with frequency-based exceptions is the Morpha Lemmatizer. Give it a shot if your project requires this kind of normalization of strings that represent specific English terms.

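There are even more naive approaches that accomplish a lot when it comes to normalizing related terms. Take a look at the Porter Stemmer, which is effective enough to cluster together most related terms in English. Keep in mind that a stemmer returns a stem, which is not necessarily a real word, so it normalizes plurals rather than producing the dictionary singular.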
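To make the stemming idea concrete, here is a minimal Python sketch of just step 1a of the Porter algorithm, the step that deals with plural suffixes. The function name `porter_step_1a` is my own, and this is only one step of the full stemmer, not a replacement for a real implementation.

```python
def porter_step_1a(word: str) -> str:
    """Apply only step 1a of the Porter stemming algorithm (plural suffixes)."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS, e.g. caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # IES -> I, e.g. ponies -> poni
    if word.endswith("ss"):
        return word       # SS -> SS, e.g. caress stays caress
    if word.endswith("s"):
        return word[:-1]  # S -> (nothing), e.g. cats -> cat
    return word


for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

Note that "ponies" comes out as "poni", not "pony". That is fine for clustering related terms together, but it is not the singular form.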

Going from singular to plural, English is actually pretty regular compared to some other European languages I have a passing familiarity with. In German, for example, working out the plural form is really complicated (e.g. Land -> Länder). I think English has roughly 20-30 exceptions, and the rest follow a fairly simple rule set:

  • consonant + -y -> -ies (family -> families, but day -> days)
  • -us -> -i for some Latin-derived words (cactus -> cacti, but bus -> buses)
  • -s -> -ses (loss -> losses)
  • otherwise add -s
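A rule set like this could be sketched in Python roughly as follows. The `IRREGULAR` table is a tiny hypothetical sample, and I have made two assumptions of my own: -y only becomes -ies after a consonant, and -us -> -i is handled per word through the exception table, since "bus" does not become "bi". This is far from a complete pluralizer.

```python
VOWELS = "aeiou"

# Hypothetical, deliberately tiny exception table; a real one would be longer.
IRREGULAR = {
    "child": "children",
    "man": "men",
    "mouse": "mice",
    "cactus": "cacti",
}


def pluralize(word: str) -> str:
    """Return a guessed plural for a lowercase English noun."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    # consonant + y -> ies (family -> families), but day -> days
    if word.endswith("y") and len(word) > 1 and word[-2] not in VOWELS:
        return word[:-1] + "ies"
    # sibilant endings take -es (loss -> losses, box -> boxes)
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    return word + "s"


for w in ["family", "day", "loss", "cat", "cactus", "bus"]:
    print(w, "->", pluralize(w))
```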

That being said, going from plural to singular is much harder, because the reverse rules are ambiguous. For example:

  • pies: is it py or pie?
  • ski: is it singular or plural for 'skus'?
  • molasses: is it singular or plural for 'molasse' or 'molass'?

So it can be done, but you're going to need a much larger list of exceptions, and you're going to have to store a lot of false positives (i.e. things that appear plural but aren't).
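As a rough sketch of what that looks like in Python: the reverse rules are wrapped in two lookup tables, one for false positives and one for exceptions. Both tables here are small hypothetical samples, and the function name `singularize` is my own.

```python
# Hypothetical sample tables; real ones would be much larger.
NOT_PLURAL = {"molasses", "bus", "news", "species"}  # look plural but aren't
IRREGULAR_SINGULAR = {"pies": "pie", "cacti": "cactus", "children": "child"}


def singularize(word: str) -> str:
    """Return a guessed singular for a lowercase English noun."""
    if word in NOT_PLURAL:
        return word
    if word in IRREGULAR_SINGULAR:
        return IRREGULAR_SINGULAR[word]
    # reverse of consonant + y -> ies (families -> family)
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"
    # reverse of -s -> -ses (losses -> loss)
    if word.endswith("sses"):
        return word[:-2]
    # reverse of "otherwise add -s", leaving words like "loss" alone
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


for w in ["families", "pies", "losses", "molasses", "ski", "cats"]:
    print(w, "->", singularize(w))
```

Without the "pies" entry the -ies rule would return "py", and without the "molasses" entry the -sses rule would return "molass". The -i ending is only handled through the exception table, so "ski" passes through untouched. Even so, the final rule turns "houses" into "house" but "buses" into "buse", which is exactly why the tables keep growing.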

Is "axes" the plural of "ax" or of "axis"? Even a human cannot tell without context.

You can take a look at Inflector.net - my port of Rails' inflection class.

No - English isn't a language which sticks to many rules.

I think your best bet is either:

  • use a dictionary of common words and their plurals (or group them by their plural rule, eg: group words where you just add an S, words where you add ES, words where you drop a Y and add IES...)
  • rethink your application
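The dictionary option might look something like this in Python. `PAIRS` is a hypothetical sample word list, and the helper names are my own. The point of the sketch is that the reverse lookup returns a set of candidates rather than a single answer, so cases like "axes" stay visibly ambiguous instead of being guessed.

```python
from collections import defaultdict

# Hypothetical sample dictionary of (singular, plural) pairs.
PAIRS = [
    ("ax", "axes"),
    ("axis", "axes"),
    ("pie", "pies"),
    ("family", "families"),
    ("bus", "buses"),
    ("cactus", "cacti"),
]


def build_reverse_index(pairs):
    """Map each plural form to the set of singulars that produce it."""
    index = defaultdict(set)
    for singular, plural in pairs:
        index[plural].add(singular)
    return index


INDEX = build_reverse_index(PAIRS)


def singular_candidates(word: str) -> set:
    """Return every known singular for a plural; empty set if unknown."""
    return set(INDEX.get(word, ()))


print(singular_candidates("axes"))    # two candidates: ax and axis
print(singular_candidates("pies"))    # one candidate: pie
print(singular_candidates("marius"))  # not a known plural: empty set
```

A caller that gets more than one candidate back has to fall back on context, and a caller that gets none knows the word is either singular already or missing from the dictionary.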

It is not possible, as nickf has already said. It would be simple for the classes of words you have described, but what about all the words that end with s naturally? My name, Marius, for example, is not the plural of Mariu. Same with bus, I guess. Pluralization of words in English is effectively a one-way function (like a hash function): different singulars can produce the same plural, and some words only look plural, so you usually need the rest of the sentence or paragraph for context to go back.
