stemming

Need an explanation of Solr's language stemmers

余生长醉 submitted on 2019-12-12 03:12:29
Question: I'm using Nutch with Solr to develop a search engine for Arabic texts. I need to apply a stemmer to my Arabic texts, and while searching for a Solr stemmer I found that it provides these two filters:

<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>

I tried them but did not understand what they do, so can anyone help me with some examples? And do these two do this: العملات stemmed to عملة, and البسَاتِين ، بساتينكم stemmed to
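The question is cut off above, but to get a feel for what an Arabic stemmer does, a rough stand-in can be tried outside Solr. The sketch below uses NLTK's ISRIStemmer, which is a root-based algorithm and not the same as the lighter ArabicStemmer behind solr.ArabicStemFilterFactory, so its outputs will not match Solr's exactly; it is only meant to illustrate the idea of reducing Arabic word forms to a common base.

# A rough, illustrative stand-in: NLTK's ISRI Arabic stemmer.
# NOTE: this is a root-based stemmer, not the same algorithm as Lucene's
# lighter ArabicStemFilter, so outputs will differ from what Solr produces.
from nltk.stem.isri import ISRIStemmer

stemmer = ISRIStemmer()
for word in ["العملات", "البساتين"]:
    print(word, "->", stemmer.stem(word))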

What is the correct use of stemDocument?

微笑、不失礼 submitted on 2019-12-11 11:58:45
Question: I have already read this and this question, but I still don't understand the use of stemDocument in tm_map. Let's follow this example:

q17 <- VCorpus(VectorSource(x = c("poder", "pode")),
               readerControl = list(language = "pt", load = TRUE))
lapply(q17, content)
$`character(0)`
[1] "poder"
$`character(0)`
[1] "pode"

If I use:

> stemDocument("poder", language = "portuguese")
[1] "pod"
> stemDocument("pode", language = "portuguese")
[1] "pod"

it does work! But if I use:

> q17 <- tm_map(q17,
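The question is truncated above, but the stemmer underneath is the Snowball Portuguese stemmer that the SnowballC package exposes to tm. As a cross-check outside R, a minimal Python sketch with NLTK's SnowballStemmer, which wraps the same Snowball algorithm, should reduce both verb forms to the same stem:

from nltk.stem.snowball import SnowballStemmer

# Snowball's Portuguese rules, the same algorithm stemDocument() delegates to
# via SnowballC, so both verb forms should collapse to one stem.
stemmer = SnowballStemmer("portuguese")
for word in ["poder", "pode"]:
    print(word, "->", stemmer.stem(word))   # expected: "pod" for both, per the R output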

Search stem and exact words in Lucene 4.4.0

落花浮王杯 submitted on 2019-12-11 05:59:12
Question: I've stored a Lucene document with a single TextField containing words without stems. I need to implement a search program that allows users to search for both stemmed and exact words, but if I've stored words without stemming, a stemmed search cannot be done. Is there a way to search both exact words and/or stemmed words in documents without storing two fields? Thanks in advance.

Answer 1: Indexing two separate fields seems like the right approach to me. Stemmed and unstemmed text require different analysis
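The answer is cut off above; its point is that exact and stemmed matching need different analysis chains, which in Lucene normally means two fields built from the same text. The toy Python sketch below is not Lucene code and invents its own tiny "index"; it only illustrates keeping a raw token index and a stemmed token index side by side and querying whichever one the user asks for.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
text = "the glasses were left on the table"

# Two "fields" built from the same text: raw tokens and stemmed tokens.
raw_index = set(text.split())
stemmed_index = {stemmer.stem(tok) for tok in text.split()}

def matches(query, exact=True):
    # Exact search looks at the raw field; stemmed search stems the query
    # first and looks at the stemmed field.
    return query in raw_index if exact else stemmer.stem(query) in stemmed_index

print(matches("glasses", exact=True))    # True: the exact form is present
print(matches("glass", exact=False))     # True: "glass" and "glasses" share a stem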

How to correctly configure solr stemming

淺唱寂寞╮ submitted on 2019-12-11 03:39:31
Question: I have configured a field in Solr as follows. When I search for the word "Conditioner", I was hoping to also find documents that contain "Conditioning". But based on Solr's Analysis screen, the PorterStemFilter is cutting the word "Conditioning" down to "Condit" at index time. Hence, at search time, when I query for "Conditioner", it is stemmed to "Condition" and so does not match "Conditioning". How do I configure stemming so that both "Conditioner" and "Conditioning" stem to "condition"? <fieldType name=
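The fieldType definition is cut off above, but the mismatch itself can be reproduced outside Solr, since solr.PorterStemFilterFactory implements the same Porter algorithm family that NLTK ships. The sketch below only demonstrates the behaviour described in the question; on the Solr side the usual direction is to switch both the index-time and query-time analyzers to a less aggressive stemmer (for example solr.KStemFilterFactory) and verify the resulting stems in the Analysis screen.

from nltk.stem import PorterStemmer

# Reproducing the mismatch described in the question with the Porter algorithm
# (the same family of rules behind solr.PorterStemFilterFactory).
stemmer = PorterStemmer()
for word in ["conditioner", "conditioning", "condition"]:
    print(word, "->", stemmer.stem(word))
# Typically "conditioner" stems to "condition", while "conditioning" (and even
# "condition" itself) stem to "condit", so index- and query-time tokens never meet.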

R tm stemCompletion generates NA value

时间秒杀一切 submitted on 2019-12-11 02:16:14
Question: When I try to apply stemCompletion to a corpus, the function generates NA values. This is my code:

my.corpus <- tm_map(my.corpus, removePunctuation)
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))

(one result of this is: [[2584]] zoning plan)

The next step is stemming the corpus:

my.corpus <- tm_map(my.corpus, stemDocument, language = "english")
my.corpus <- tm_map(my.corpus, stemCompletion, dictionary = my.corpus_copy, type = "first")

but the result is this: [[2584]] NA plant
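A common cause of this is that stemCompletion needs a dictionary of complete, unstemmed words, and any stem that finds no match in that dictionary comes back as NA. The mechanism itself is simple; a minimal Python sketch of "first-match" stem completion, using a small hypothetical dictionary rather than the asker's corpus, looks like this:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical dictionary of complete words, taken from the *unstemmed* corpus.
dictionary = ["zoning", "plans", "planning", "plant"]

# Map each stem back to the first dictionary word that produces it (type = "first").
completion = {}
for word in dictionary:
    completion.setdefault(stemmer.stem(word), word)

for stem in ["zone", "plan", "xyz"]:
    # Stems with no dictionary match have nothing to complete to (tm returns NA).
    print(stem, "->", completion.get(stem, None))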

Remove punctuation but keep hyphenated phrases in R text cleaning

给你一囗甜甜゛ submitted on 2019-12-10 22:14:10
Question: Is there an effective way to remove punctuation from text while keeping hyphenated expressions, such as "accident-prone"? I used the following function to clean my text:

clean.text = function(x) {
  # remove rt
  x = gsub("rt ", "", x)
  # remove at
  x = gsub("@\\w+", "", x)
  x = gsub("[[:punct:]]", "", x)
  x = gsub("[[:digit:]]", "", x)
  # remove http
  x = gsub("http\\w+", "", x)
  x = gsub("[ |\t]{2,}", "", x)
  x = gsub("^ ", "", x)
  x = gsub(" $", "", x)
  x = str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")
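The function above is truncated, but its last line already points at the usual answer: instead of deleting the whole punctuation class, replace only characters that are neither alphanumeric, whitespace, an apostrophe nor a hyphen. A rough Python equivalent of that idea (not the asker's R code) would be:

import re

def clean_text(text):
    # Drop everything except letters, digits, whitespace, apostrophes and hyphens,
    # so hyphenated expressions like "accident-prone" survive the cleaning.
    text = re.sub(r"[^\w\s'-]", " ", text)
    # Collapse the extra spaces the replacement leaves behind.
    return re.sub(r"\s{2,}", " ", text).strip()

print(clean_text("He is accident-prone!!! (very much so); isn't he?"))
# -> "He is accident-prone very much so isn't he"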

Can I perform stemming using regular expressions?

扶醉桌前 submitted on 2019-12-10 15:17:09
Question: How can I get my regular expression to match against just one condition exactly? For example, I have the following regular expression:

(\w+)(?=ly|es|s|y)

Matching the expression against the word "glasses" returns: glasse
The correct match should be: glass (the match should be on 'es' rather than 's' as in the match above)
The expression should cater for any kind of word, such as: films, lovely, glasses, glass
Currently the regular expression is matching the above words as:
film - correct
lovel -
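The question is cut off above, but the core problem is that the lookahead lets (\w+) consume as many characters as possible, leaving only the shortest suffix to match. A small Python sketch shows how anchoring a longest-first suffix alternation at the end of the word gets closer, and also why regex-only stemming still over-strips:

import re

# Strip one suffix from the end of the word; listing "es" before "s" makes the
# alternation try the longer suffix first.
suffix_re = re.compile(r"(ly|es|s|y)$")

for word in ["films", "lovely", "glasses", "glass"]:
    print(word, "->", suffix_re.sub("", word))
# films -> film, lovely -> love, glasses -> glass,
# but glass -> glas, which is why a real stemmer (Porter/Snowball) is preferable.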

nltk: How to prevent stemming of proper nouns

浪尽此生 submitted on 2019-12-10 14:23:16
Question: I am trying to write a keyword extraction program using the Stanford POS tagger and NER. For keyword extraction, I am only interested in proper nouns. Here is the basic approach:
Clean up the data by removing anything but alphabetic characters
Remove stopwords
Stem each word
Determine the POS tag of each word
If the POS tag is a noun, feed it to the NER
The NER will then determine whether the word is a person, organization or location

Sample code:

docText="'Jack Frost works for Boeing Company. He manages 5
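The sample code and the answers are cut off above. One common way to keep proper nouns out of the stemmer, sketched here with NLTK rather than the Stanford tools the asker uses, is to POS-tag first and only stem tokens whose tag is not NNP/NNPS; the sentence below is a shortened, made-up stand-in for the truncated docText.

# requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
text = "Jack Frost works for Boeing Company and manages engineers"

tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)

# Stem everything except proper nouns (NNP / NNPS), so names like
# "Boeing" are normally passed through untouched instead of being mangled.
processed = [tok if tag in ("NNP", "NNPS") else stemmer.stem(tok)
             for tok, tag in tagged]
print(processed)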

Python stemming (with pandas dataframe)

血红的双手。 submitted on 2019-12-09 06:22:05
Question: I created a DataFrame with sentences to be stemmed. I would like to use a SnowballStemmer to obtain higher accuracy with my classification algorithm. How can I achieve this?

import pandas as pd
from nltk.stem.snowball import SnowballStemmer

# Use English stemmer.
stemmer = SnowballStemmer("english")

# Sentences to be stemmed.
data = ["programers program with programing languages",
        "my code is working so there must be a bug in the optimizer"]

# Create the pandas DataFrame.
df = pd.DataFrame
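The DataFrame construction is truncated above. Assuming a single text column (the column name "text" below is my own choice, not from the question), one common pattern is to apply the stemmer word by word with DataFrame.apply:

import pandas as pd
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
data = ["programers program with programing languages",
        "my code is working so there must be a bug in the optimizer"]
df = pd.DataFrame(data, columns=["text"])   # column name is an assumption

# Stem each sentence token by token and keep the result in a new column.
df["stemmed"] = df["text"].apply(
    lambda s: " ".join(stemmer.stem(w) for w in s.split()))
print(df)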

NLTK-based stemming and lemmatization

折月煮酒 submitted on 2019-12-08 19:26:27
I am trying to preprocess a string using a lemmatizer and then remove the punctuation and digits. I am using the code below to do this. I am not getting any errors, but the text is not preprocessed appropriately: only the stop words are removed, the lemmatizing does not work, and the punctuation and digits also remain.

from nltk.stem import WordNetLemmatizer
import string
import nltk

tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
lemmatizer = WordNetLemmatizer()
tweets = lemmatizer.lemmatize(tweets)
data = []
stop_words = set(nltk.corpus.stopwords.words('english'))
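The code above is truncated, but the visible part already contains the main problem: WordNetLemmatizer.lemmatize() expects a single word, so calling it on the whole tweet string returns the string essentially unchanged, and nothing ever strips punctuation or digits. A minimal reworking of the same idea, tokenizing first and cleaning each token, might look like this (a sketch, not the asker's full intended pipeline):

# requires: nltk.download('punkt'); nltk.download('wordnet'); nltk.download('stopwords')
import re
import nltk
from nltk.stem import WordNetLemmatizer

tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
lemmatizer = WordNetLemmatizer()
stop_words = set(nltk.corpus.stopwords.words('english'))

cleaned = []
for token in nltk.word_tokenize(tweets):
    # Strip digits and punctuation from the token, then lemmatize what is left.
    token = re.sub(r"[^A-Za-z]", "", token).lower()
    if token and token not in stop_words:
        cleaned.append(lemmatizer.lemmatize(token))

print(cleaned)
# e.g. ['beautiful', 'day', 'working', 'exercise', 'text']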