stemming | 易学教程

NLTK-based stemming and lemmatization

Question: I am trying to preprocess a string with a lemmatizer and then remove the punctuation and digits, using the code below. I get no error, but the text is not preprocessed properly: only the stop words are removed, the lemmatizing does not work, and the punctuation and digits remain.

    from nltk.stem import WordNetLemmatizer
    import string
    import nltk
    tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
    lemmatizer = WordNetLemmatizer()
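One way to approach the cleanup described in this question is to strip punctuation and digits first and only then lemmatize token by token. The sketch below is a minimal illustration, not the asker's code: the function name `preprocess` and its `lemmatize` parameter are hypothetical, and the default leaves words unchanged so the example runs without NLTK or its WordNet data.

```python
import string

# Translation table that deletes every ASCII punctuation character and digit.
_STRIP_TABLE = str.maketrans("", "", string.punctuation + string.digits)


def preprocess(text, lemmatize=lambda word: word):
    """Strip punctuation and digits, lowercase, then lemmatize word by word.

    `lemmatize` is a hypothetical hook: any callable that maps a word to
    its lemma. The default returns each word unchanged.
    """
    cleaned = text.translate(_STRIP_TABLE).lower()
    # Lemmatizers work on single tokens, so split before applying one.
    return [lemmatize(token) for token in cleaned.split()]


tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
print(preprocess(tweets))
```

With NLTK and its WordNet corpus installed, `WordNetLemmatizer().lemmatize` could be passed as the `lemmatize` argument. Note that it treats every word as a noun unless a part-of-speech tag is supplied, which is one common reason lemmatizing appears to do nothing.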

Python Stemming words in a File

Question: I want to stem the words in a file. Stemming works fine in the terminal, but not when I apply it to a text file.

Terminal code:

    print PorterStemmer().stem_word('complications')

Function code:

    def stemming_text_1():
        with open('test.txt', 'r') as f:
            text = f.read()
            print text
            singles = []
            stemmer = PorterStemmer() #problem from HERE
            for plural in text:
                singles.append(stemmer.stem(plural))
            print singles

Input test.txt:

    126211 crashes bookmarks runs error logged debug core bookmarks
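A likely cause of the behaviour in this question is that `f.read()` returns one string, and looping over a string yields single characters rather than words. The Python 3 sketch below shows the difference. `crude_stem` is a toy stand-in written for this example, not the Porter algorithm, so that the code runs without NLTK.

```python
def crude_stem(word):
    """Toy stand-in for a real stemmer: strips a trailing 'es' or 's'."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem_text(text, stem=crude_stem):
    # str.split() yields whole words; iterating over `text` directly
    # would yield single characters, which no stemmer can shorten.
    return [stem(word) for word in text.split()]


sample = "126211 crashes bookmarks runs error logged debug core bookmarks"
print(stem_text(sample))
print(list(sample)[:5])  # what `for plural in text` actually iterates over
```

With NLTK installed, `PorterStemmer().stem` could be passed as the `stem` argument instead of the toy function.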

converting stemmed word to the root word in R

Question: Hi, I have a list of words which have been stemmed using the "tm" package in R. Can I get back the root word somehow after this step? Thanks in advance.

    Ex: activiti --> activity

Answer 1: You can use the stemCompletion() function to achieve this, but you may need to trim the stems first. Consider the following:

    library(tm)
    library(qdap) # provides the stemmer() function
    active.text = "there are plenty of funny activities"
    active.corp = Corpus(VectorSource(active.text))
    (st.text = tolower(stemmer(active.text,warn=F)))
    # this is what the columns of your Term Document Matrix are going to look like
    [1]
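The idea behind completing a stem, and the reason trimming can matter, can be sketched in Python. This is a simplified analogue written for illustration, not tm's implementation: the function `complete_stems`, the shortest-match rule, and the small word list are all assumptions.

```python
def complete_stems(stems, dictionary):
    """Map each stem to the shortest dictionary word that starts with it.

    Stems with no match are returned unchanged. This is a simplified,
    hypothetical analogue of stem completion, not tm's implementation.
    """
    completed = {}
    for stem in stems:
        candidates = [word for word in dictionary if word.startswith(stem)]
        completed[stem] = min(candidates, key=len) if candidates else stem
    return completed


dictionary = ["there", "are", "plenty", "of", "funny", "activities", "activity"]

# The untrimmed stem only prefix-matches the plural form.
print(complete_stems(["activiti"], dictionary))
# Trimming the final "i" lets the stem match "activity" as well.
print(complete_stems(["activit"], dictionary))
```

Because "activiti" is not a prefix of "activity", prefix matching can only reach "activities" until the trailing character is trimmed.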

Does stemming harm precision in text classification?

Question: I have read that stemming harms precision but improves recall in text classification. How does that happen? When you stem, you increase the number of matches between the query and the sample documents, right?

Answer 1: It's always the same: if you raise recall, you're generalising, and because of that you lose precision. Stemming merges words together. On the one hand, words which ought to be merged together (such as "adhere" and "adhesion") may remain distinct after stemming; on the other,
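The trade-off can be made concrete with a small worked example in Python. Everything here is made up for illustration: the three documents, the relevance labels, and the `STEMS` table, which stands in for a real stemmer by conflating "university", "universities" and "universe" into one stem.

```python
# Toy stem table standing in for a real stemmer (an assumption for illustration).
STEMS = {"university": "univers", "universities": "univers", "universe": "univers"}


def stem(word):
    return STEMS.get(word, word)


DOCS = {
    "d1": "university admissions",
    "d2": "universities ranking",
    "d3": "universe expansion",
}
RELEVANT = {"d1", "d2"}  # documents a user searching "university" wants


def search(query, normalize):
    """Return ids of documents containing the query after normalization."""
    target = normalize(query)
    return {
        doc_id
        for doc_id, text in DOCS.items()
        if target in {normalize(word) for word in text.split()}
    }


def precision_recall(retrieved):
    hits = len(retrieved & RELEVANT)
    precision = hits / len(retrieved) if retrieved else 0.0
    return precision, hits / len(RELEVANT)


exact = precision_recall(search("university", lambda word: word))
stemmed = precision_recall(search("university", stem))
print("exact:  ", exact)
print("stemmed:", stemmed)
```

Exact matching retrieves only d1, so precision is 1.0 and recall is 0.5. With stemming, all three documents match: recall rises to 1.0 because d2 is found, while precision falls to 2/3 because the unrelated d3 is pulled in too.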

SnowballStemmer for Russian words list

Question: I know how to apply SnowballStemmer to a single word (in my case, a Russian one), like this:

    from nltk.stem.snowball import SnowballStemmer
    stemmer = SnowballStemmer("russian")
    stemmer.stem("Василий")
    'Васил'

How can I do the same if I have a list of words like ['Василий', 'Геннадий', 'Виталий']? My approach using a for loop does not seem to work:

    l=[stemmer.stem(word) for word in l]

Answer 1: Your variable l is not pre-defined, causing the name error. See my last two
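The point about the undefined variable can be sketched as follows: the list has to exist before the comprehension reads from it. The helper `stem_all` and the stand-in `drop_last_two` are hypothetical names introduced for this example so that it runs without NLTK. `drop_last_two` is not a stemmer; it only cuts the last two characters.

```python
def stem_all(words, stem):
    """Apply the callable `stem` to every word and return a new list."""
    return [stem(word) for word in words]


# The list must be defined before the comprehension refers to it;
# otherwise Python raises NameError.
names = ["Василий", "Геннадий", "Виталий"]


def drop_last_two(word):
    # Stand-in for stemmer.stem so the sketch runs without NLTK installed.
    return word[:-2]


print(stem_all(names, drop_last_two))
```

With NLTK installed, `SnowballStemmer("russian").stem` could be passed in place of `drop_last_two`.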

How to split a text into two meaningful words in R

This is the text in my dataframe df, which has a text column called 'problem_note_text':

    SSCIssue: Note Dispenser Failureperformed checks / dispensor failure / asked the stores to take the note dispensor out and set it back / still error message says front door is open / hence CE attn reqContact details - Olivia taber 01159063390 / 7am-11pm

    df$problem_note_text <- tolower(df$problem_note_text)
    df$problem_note_text <- tm::removeNumbers(df$problem_note_text)
    df$problem_note_text<- str_replace_all(df$problem_note_text, " ", "") # replace double spaces with single space
    df$problem_note_text = str

Exact word search in Solr

Question: I have a question which closely relates to this question. In my schema I have a field:

    <field name="text" type="textgen" indexed="true" stored="true" required="true"/>

This gives an exact match, i.e. stemming is disabled:

    eat = eat

Is it possible, while configured as textgen, to search for other variants of the word, e.g.

    eat = eat, eats, eating

eat~0 will give similar-sounding words such as meat, beat, etc., but this is not what I want. I'm starting to think that the only way to achieve this is to add

Snowball Stemmer only stems last word

Question: I want to stem the documents in a Corpus of plain text documents using the tm package in R. When I apply the SnowballStemmer function to all documents of the corpus, only the last word of each document is stemmed.

    library(tm)
    library(Snowball)
    library(RWeka)
    library(rJava)
    path <- c("C:/path/to/diretory")
    corp <- Corpus(DirSource(path), readerControl = list(reader = readPlain, language = "en_US", load = TRUE))
    tm_map(corp,SnowballStemmer) #stemDocument has the same problem

I think it is related to the way the documents are read into the corpus. To illustrate this with some simple examples:

    >