I have a database of sentences written entirely in capital letters. The database is technical, containing medical terms, and I want to normalize it so that the capitalization is correct.
Search for work on truecasing: http://en.wikipedia.org/wiki/Truecasing
It would be really easy to generate your own data set if you have access to similar medical data with normal capitalization: uppercase everything and use the mapping back to the original text to train and test your algorithm.
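A minimal sketch of how such training pairs could be built (the corpus and the train/test split are just placeholders for whatever cased medical text you have):

import random

def make_truecasing_pairs(sentences, test_fraction=0.1):
    """Turn normally capitalized sentences into (ALL CAPS, original) pairs."""
    pairs = [(s.upper(), s) for s in sentences]
    random.shuffle(pairs)
    split = int(len(pairs) * (1 - test_fraction))
    return pairs[:split], pairs[split:]  # training pairs, test pairs

# hypothetical usage with a small list of normally capitalized sentences
corpus = ["Clonazepam is the generic equivalent of Klonopin.",
          "Take 0.5mg twice daily."]
train_pairs, test_pairs = make_truecasing_pairs(corpus)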
The easiest way to do this is to use a spelling-correction algorithm based on n-grams. You can use, for example, the LingPipe SpellChecker. You can also find source code for predicting spaces in words, and the same approach can be applied to predicting case.
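If you don't want to pull in LingPipe, the core idea can be sketched in a few lines of Python: learn the most frequent surface form of each word from a normally cased corpus and apply it to the all-caps text. This is a word-frequency (unigram) simplification of the n-gram approach, not LingPipe's actual API:

from collections import Counter, defaultdict

def train_case_model(cased_sentences):
    """Count how often each lowercased word appears in each cased form."""
    forms = defaultdict(Counter)
    for sent in cased_sentences:
        for word in sent.split():
            forms[word.lower()][word] += 1
    return forms

def truecase_unigram(all_caps_sentence, forms):
    """Replace each word with its most frequent cased form, if known."""
    out = []
    for word in all_caps_sentence.split():
        candidates = forms.get(word.lower())
        out.append(candidates.most_common(1)[0][0] if candidates else word.lower())
    if out:
        out[0] = out[0][0].upper() + out[0][1:]  # capitalize the sentence start
    return " ".join(out)

# hypothetical usage
model = train_case_model(["Clonazepam is the generic equivalent of Klonopin."])
print(truecase_unigram("CLONAZEPAM IS THE GENERIC EQUIVALENT OF KLONOPIN.", model))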
One way could be to infer capitalization from POS-tagging, for example using the Python Natural Language Toolkit (NLTK):
import re
import nltk

# requires the NLTK tokenizer and tagger models, e.g.:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

def truecase(text):
    # apply POS-tagging to the lowercased tokens
    tagged_sent = nltk.pos_tag([word.lower() for word in nltk.word_tokenize(text)])
    # infer capitalization from POS-tags: capitalize nouns
    normalized_sent = [w.capitalize() if t in ["NN", "NNS"] else w for (w, t) in tagged_sent]
    # capitalize the first word in the sentence
    normalized_sent[0] = normalized_sent[0].capitalize()
    # use a regular expression to re-attach punctuation
    pretty_string = re.sub(r" (?=[\.,'!?:;])", "", ' '.join(normalized_sent))
    return pretty_string
This will not be perfect, especially because I don't know what your data exactly looks like, but maybe you get the idea:
>>> text = "Clonazepam Has Been Approved As An Anticonvulsant To Be Manufactured In 0.5mg, 1mg And 2mg Tablets. It Is The Generic Equivalent Of Roche Laboratories' Klonopin."
>>> truecase(text)
"Clonazepam has been approved as an anticonvulsant to be manufactured in 0.5mg, 1mg and 2mg Tablets. It is the generic Equivalent of Roche Laboratories' Klonopin."