stemming

Need an explanation of Solr's language stemmers

余生长醉 submitted on 2019-12-12 03:12:29
Question: I'm using Nutch with Solr to develop a search engine for Arabic texts. I need to apply a stemmer to my Arabic texts, and while searching for a Solr stemmer I found that it provides these two filters:

<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>

I tried them but did not understand what they do, so can anyone help me with some examples? And do these two do this: العملات stemmed to عملة, and البسَاتِين ، بساتينكم stemmed to
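The question is cut off above, but to get a feel for what an Arabic stemmer does, a rough stand-in can be tried outside Solr. The sketch below uses NLTK's ISRIStemmer, which is a root-based algorithm and not the same as the lighter ArabicStemmer behind solr.ArabicStemFilterFactory, so its outputs will not match Solr's exactly; it is only meant to illustrate the idea of reducing Arabic word forms to a common base.

# A rough, illustrative stand-in: NLTK's ISRI Arabic stemmer.
# NOTE: this is a root-based stemmer, not the same algorithm as Lucene's
# lighter ArabicStemFilter, so outputs will differ from what Solr produces.
from nltk.stem.isri import ISRIStemmer

stemmer = ISRIStemmer()
for word in ["العملات", "البساتين"]:
    print(word, "->", stemmer.stem(word))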

What is the correct use of stemDocument?

微笑、不失礼 submitted on 2019-12-11 11:58:45
Question: I have already read this and this question, but I still don't understand the use of stemDocument in tm_map. Let's follow this example:

q17 <- VCorpus(VectorSource(x = c("poder", "pode")),
               readerControl = list(language = "pt", load = TRUE))
lapply(q17, content)
$`character(0)`
[1] "poder"
$`character(0)`
[1] "pode"

If I use:

> stemDocument("poder", language = "portuguese")
[1] "pod"
> stemDocument("pode", language = "portuguese")
[1] "pod"

it does work! But if I use:

> q17 <- tm_map(q17,
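The question is truncated above, but the stemmer underneath is the Snowball Portuguese stemmer that the SnowballC package exposes to tm. As a cross-check outside R, a minimal Python sketch with NLTK's SnowballStemmer, which wraps the same Snowball algorithm, should reduce both verb forms to the same stem:

from nltk.stem.snowball import SnowballStemmer

# Snowball's Portuguese rules, the same algorithm stemDocument() delegates to
# via SnowballC, so both verb forms should collapse to one stem.
stemmer = SnowballStemmer("portuguese")
for word in ["poder", "pode"]:
    print(word, "->", stemmer.stem(word))   # expected: "pod" for both, per the R output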

Search stem and exact words in Lucene 4.4.0

落花浮王杯 submitted on 2019-12-11 05:59:12
Question: I've stored a Lucene document with a single TextField containing words without stems. I need to implement a search program that allows users to search for both stemmed and exact words, but if I've stored words without stemming, a stemmed search cannot be done. Is there a way to search both exact words and/or stemmed words in documents without storing two fields? Thanks in advance.

Answer 1: Indexing two separate fields seems like the right approach to me. Stemmed and unstemmed text require different analysis
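The answer is cut off above; its point is that exact and stemmed matching need different analysis chains, which in Lucene normally means two fields built from the same text. The toy Python sketch below is not Lucene code and invents its own tiny "index"; it only illustrates keeping a raw token index and a stemmed token index side by side and querying whichever one the user asks for.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
text = "the glasses were left on the table"

# Two "fields" built from the same text: raw tokens and stemmed tokens.
raw_index = set(text.split())
stemmed_index = {stemmer.stem(tok) for tok in text.split()}

def matches(query, exact=True):
    # Exact search looks at the raw field; stemmed search stems the query
    # first and looks at the stemmed field.
    return query in raw_index if exact else stemmer.stem(query) in stemmed_index

print(matches("glasses", exact=True))    # True: the exact form is present
print(matches("glass", exact=False))     # True: "glass" and "glasses" share a stem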

How to correctly configure solr stemming

淺唱寂寞╮ submitted on 2019-12-11 03:39:31
Question: I have configured a field in Solr as follows. When I search for the word "Conditioner", I was hoping to also find documents that contain "Conditioning". But based on Solr's Analysis screen, the PorterStemFilter is cutting the word "Conditioning" down to "Condit" at index time. Hence, at search time, when I query for "Conditioner", it is stemmed to "Condition" and so does not match "Conditioning". How do I configure stemming so that both "Conditioner" and "Conditioning" stem to "condition"? <fieldType name=
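The fieldType definition is cut off above, but the mismatch itself can be reproduced outside Solr, since solr.PorterStemFilterFactory implements the same Porter algorithm family that NLTK ships. The sketch below only demonstrates the behaviour described in the question; on the Solr side the usual direction is to switch both the index-time and query-time analyzers to a less aggressive stemmer (for example solr.KStemFilterFactory) and verify the resulting stems in the Analysis screen.

from nltk.stem import PorterStemmer

# Reproducing the mismatch described in the question with the Porter algorithm
# (the same family of rules behind solr.PorterStemFilterFactory).
stemmer = PorterStemmer()
for word in ["conditioner", "conditioning", "condition"]:
    print(word, "->", stemmer.stem(word))
# Typically "conditioner" stems to "condition", while "conditioning" (and even
# "condition" itself) stem to "condit", so index- and query-time tokens never meet.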

R tm stemCompletion generates NA value

时间秒杀一切 submitted on 2019-12-11 02:16:14
Question: When I try to apply stemCompletion to a corpus, the function generates NA values. This is my code:

my.corpus <- tm_map(my.corpus, removePunctuation)
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))

(one result of this is: [[2584]] zoning plan)

The next step is stemming the corpus:

my.corpus <- tm_map(my.corpus, stemDocument, language = "english")
my.corpus <- tm_map(my.corpus, stemCompletion, dictionary = my.corpus_copy, type = "first")

but the result is this: [[2584]] NA plant
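A common cause of this is that stemCompletion needs a dictionary of complete, unstemmed words, and any stem that finds no match in that dictionary comes back as NA. The mechanism itself is simple; a minimal Python sketch of "first-match" stem completion, using a small hypothetical dictionary rather than the asker's corpus, looks like this:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical dictionary of complete words, taken from the *unstemmed* corpus.
dictionary = ["zoning", "plans", "planning", "plant"]

# Map each stem back to the first dictionary word that produces it (type = "first").
completion = {}
for word in dictionary:
    completion.setdefault(stemmer.stem(word), word)

for stem in ["zone", "plan", "xyz"]:
    # Stems with no dictionary match have nothing to complete to (tm returns NA).
    print(stem, "->", completion.get(stem, None))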

Remove punctuation but keep hyphenated phrases in R text cleaning

给你一囗甜甜゛ submitted on 2019-12-10 22:14:10
Question: Is there an effective way to remove punctuation from text while keeping hyphenated expressions, such as "accident-prone"? I used the following function to clean my text:

clean.text = function(x) {
  # remove rt
  x = gsub("rt ", "", x)
  # remove at
  x = gsub("@\\w+", "", x)
  x = gsub("[[:punct:]]", "", x)
  x = gsub("[[:digit:]]", "", x)
  # remove http
  x = gsub("http\\w+", "", x)
  x = gsub("[ |\t]{2,}", "", x)
  x = gsub("^ ", "", x)
  x = gsub(" $", "", x)
  x = str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")
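The function above is truncated, but its last line already points at the usual answer: instead of deleting the whole punctuation class, replace only characters that are neither alphanumeric, whitespace, an apostrophe nor a hyphen. A rough Python equivalent of that idea (not the asker's R code) would be:

import re

def clean_text(text):
    # Drop everything except letters, digits, whitespace, apostrophes and hyphens,
    # so hyphenated expressions like "accident-prone" survive the cleaning.
    text = re.sub(r"[^\w\s'-]", " ", text)
    # Collapse the extra spaces the replacement leaves behind.
    return re.sub(r"\s{2,}", " ", text).strip()

print(clean_text("He is accident-prone!!! (very much so); isn't he?"))
# -> "He is accident-prone very much so isn't he"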

Can I perform stemming using regular expressions?

扶醉桌前 submitted on 2019-12-10 15:17:09
Question: How can I get my regular expression to match against just one condition exactly? For example, I have the following regular expression:

(\w+)(?=ly|es|s|y)

Matching the expression against the word "glasses" returns: glasse
The correct match should be: glass (the match should be on 'es' rather than 's' as in the match above)
The expression should cater for any kind of word, such as: films, lovely, glasses, glass
Currently the regular expression is matching the above words as:
film - correct
lovel -
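The question is cut off above, but the core problem is that the lookahead lets (\w+) consume as many characters as possible, leaving only the shortest suffix to match. A small Python sketch shows how anchoring a longest-first suffix alternation at the end of the word gets closer, and also why regex-only stemming still over-strips:

import re

# Strip one suffix from the end of the word; listing "es" before "s" makes the
# alternation try the longer suffix first.
suffix_re = re.compile(r"(ly|es|s|y)$")

for word in ["films", "lovely", "glasses", "glass"]:
    print(word, "->", suffix_re.sub("", word))
# films -> film, lovely -> love, glasses -> glass,
# but glass -> glas, which is why a real stemmer (Porter/Snowball) is preferable.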

nltk: How to prevent stemming of proper nouns

浪尽此生 submitted on 2019-12-10 14:23:16
Question: I am trying to write a keyword extraction program using the Stanford POS tagger and NER. For keyword extraction, I am only interested in proper nouns. Here is the basic approach:
Clean up the data by removing anything but alphabetic characters
Remove stopwords
Stem each word
Determine the POS tag of each word
If the POS tag is a noun, feed it to the NER
The NER will then determine whether the word is a person, organization or location

Sample code:

docText="'Jack Frost works for Boeing Company. He manages 5
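The sample code and the answers are cut off above. One common way to keep proper nouns out of the stemmer, sketched here with NLTK rather than the Stanford tools the asker uses, is to POS-tag first and only stem tokens whose tag is not NNP/NNPS; the sentence below is a shortened, made-up stand-in for the truncated docText.

# requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
text = "Jack Frost works for Boeing Company and manages engineers"

tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)

# Stem everything except proper nouns (NNP / NNPS), so names like
# "Boeing" are normally passed through untouched instead of being mangled.
processed = [tok if tag in ("NNP", "NNPS") else stemmer.stem(tok)
             for tok, tag in tagged]
print(processed)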

Python stemming (with pandas dataframe)

血红的双手。 submitted on 2019-12-09 06:22:05
Question: I created a DataFrame with sentences to be stemmed. I would like to use a SnowballStemmer to obtain higher accuracy with my classification algorithm. How can I achieve this?

import pandas as pd
from nltk.stem.snowball import SnowballStemmer

# Use English stemmer.
stemmer = SnowballStemmer("english")

# Sentences to be stemmed.
data = ["programers program with programing languages",
        "my code is working so there must be a bug in the optimizer"]

# Create the pandas DataFrame.
df = pd.DataFrame
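The DataFrame construction is truncated above. Assuming a single text column (the column name "text" below is my own choice, not from the question), one common pattern is to apply the stemmer word by word with DataFrame.apply:

import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
data = ["programers program with programing languages",
        "my code is working so there must be a bug in the optimizer"]
df = pd.DataFrame(data, columns=["text"])   # column name is an assumption

# Stem each sentence token by token and keep the result in a new column.
df["stemmed"] = df["text"].apply(
    lambda s: " ".join(stemmer.stem(w) for w in s.split()))
print(df)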

NLTK-based stemming and lemmatization

折月煮酒 submitted on 2019-12-08 19:26:27
I am trying to preprocess a string using a lemmatizer and then remove the punctuation and digits. I am using the code below to do this. I am not getting any errors, but the text is not preprocessed appropriately: only the stop words are removed, the lemmatizing does not work, and the punctuation and digits also remain.

from nltk.stem import WordNetLemmatizer
import string
import nltk

tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
lemmatizer = WordNetLemmatizer()
tweets = lemmatizer.lemmatize(tweets)
data = []
stop_words = set(nltk.corpus.stopwords.words('english'))
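The code above is truncated, but the visible part already contains the main problem: WordNetLemmatizer.lemmatize() expects a single word, so calling it on the whole tweet string returns the string essentially unchanged, and nothing ever strips punctuation or digits. A minimal reworking of the same idea, tokenizing first and cleaning each token, might look like this (a sketch, not the asker's full intended pipeline):

# requires: nltk.download('punkt'); nltk.download('wordnet'); nltk.download('stopwords')
import re
import nltk
from nltk.stem import WordNetLemmatizer

tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
lemmatizer = WordNetLemmatizer()
stop_words = set(nltk.corpus.stopwords.words('english'))

cleaned = []
for token in nltk.word_tokenize(tweets):
    # Strip digits and punctuation from the token, then lemmatize what is left.
    token = re.sub(r"[^A-Za-z]", "", token).lower()
    if token and token not in stop_words:
        cleaned.append(lemmatizer.lemmatize(token))

print(cleaned)
# e.g. ['beautiful', 'day', 'working', 'exercise', 'text']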