stop-words

What is the default list of stopwords used in Lucene's StopFilter?

大兔子大兔子 submitted on 2019-12-17 09:33:25
Question: Lucene has a default StopFilter (http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/core/StopFilter.html). Does anyone know which words are in the list?

Answer 1: The default stop word set in StandardAnalyzer and EnglishAnalyzer comes from StopAnalyzer.ENGLISH_STOP_WORDS_SET, and the words are: "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these"
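As a quick illustration, the words listed in the answer can be used to filter a token stream in plain Python. Note the answer excerpt above is truncated, so the set below contains only the visible portion of Lucene's default list:

```python
# Stop words as listed in the answer above (the excerpt is truncated,
# so this is only the visible portion of Lucene's full default set).
LUCENE_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
    "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
    "such", "that", "the", "their", "then", "there", "these",
}

def remove_stop_words(tokens):
    """Drop any token that appears in the stop word set."""
    return [t for t in tokens if t.lower() not in LUCENE_STOP_WORDS]

print(remove_stop_words(["the", "quick", "brown", "fox"]))
# -> ['quick', 'brown', 'fox']
```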

User Warning: Your stop_words may be inconsistent with your preprocessing

不问归期 submitted on 2019-12-13 15:23:26
Question: I am following this document clustering tutorial. As input I provide a txt file, which can be downloaded here. It is three other txt files combined, separated with \n. After creating the tf-idf matrix I received this warning: "UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid',
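The warning means the preprocessing (here, stemming) runs on the corpus but not on the stop list, so stemmed tokens no longer match the unstemmed stop words. A minimal, library-free sketch of the mismatch; the toy_stem function is a hypothetical stand-in for a real stemmer such as Snowball:

```python
stop_words = {"above", "because", "always"}  # unstemmed stop list

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strip a trailing 'e' or 's',
    # the way Snowball turns "above" into "abov".
    return word[:-1] if word.endswith(("e", "s")) else word

tokens = [toy_stem(w) for w in "always above because of rain".split()]
# Stemming ran on the tokens but not on the stop list, so none of the
# stemmed forms match it and the stop words leak through the filter.
survivors = [t for t in tokens if t not in stop_words]
print(survivors)  # -> ['alway', 'abov', 'becaus', 'of', 'rain']
```

Applying the same stemmer to the stop list (or tokenizing the stop words exactly the way the corpus is tokenized) makes the two consistent and silences the warning.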

How to remove quotes after removing stopwords from nltk?

走远了吗. submitted on 2019-12-13 02:17:24
Question: I captured headers from newspapers and removed stopwords from the headers, but after removing the stopwords each word appears with single quotes around it, and I don't want those quotes. I tried the code below:

from nltk.corpus import stopwords
blog_posts = []
stop = stopwords.words('english') + ['.', ',', '--', '\'s', '?', ')', '(', ':', '\'', '\'re', '"', '-', '}', '{', u'—', 'a', 'able', 'about', 'above', 'according', 'accordingly', 'across', 'actually', 'after', 'afterwards', 'again', 'against', 'all',
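The single quotes are almost certainly not in the words at all: they come from printing a Python list, whose repr wraps each string in quotes. Joining the filtered tokens back into a string removes them. A small sketch; the stop set here is a short stand-in for the nltk list used in the question:

```python
# Short stand-in for stopwords.words('english') plus punctuation.
stop = {"is", "a", "the", "in"}

header = "The rabbit is sleeping in a garden"
filtered = [w for w in header.split() if w.lower() not in stop]

print(filtered)            # -> ['rabbit', 'sleeping', 'garden']  (quotes are the list repr)
print(" ".join(filtered))  # -> rabbit sleeping garden            (no quotes)
```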

Solr stop words replaced with _ symbol

点点圈 submitted on 2019-12-12 23:28:38
Question: I have a problem with Solr stopwords in my autosuggest: all stopwords are replaced by the _ symbol. For example, the field "deal_title" contains the text "the simple text in". When I search for the word "simple", Solr shows me the result "_ simple text _", but I expect "simple text". Could someone explain why it works this way and how to fix it? Here is part of my schema.xml:

<fieldType class="solr.TextField" name="text_auto"> <analyzer type="index"> <charFilter class="solr.HTMLStripCharFilterFactory
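The schema excerpt above is cut off, but this symptom is characteristic of ShingleFilterFactory, which by default inserts "_" as a filler token at the positions StopFilterFactory left empty. A hypothetical analyzer chain showing the usual fix, setting fillerToken to an empty string (the exact filters and file names are assumptions, since the original schema is truncated):

```xml
<fieldType class="solr.TextField" name="text_auto">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- fillerToken defaults to "_"; an empty value keeps the underscores
         out of the generated shingles and thus out of the suggestions -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" fillerToken=""/>
  </analyzer>
</fieldType>
```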

Solr stopwords showing up in facet search results

为君一笑 submitted on 2019-12-12 17:33:19
Question: I am currently testing facet searches on a text field in my Solr schema and notice that I am getting a significant number of results that are in my stopwords.txt file. My schema currently uses the default configuration for the text data type, and I was under the impression that stopwords are not indexed when the "solr.StopFilterFactory" filter is in use. I am hoping that someone can shed some light on this and either a) help me understand why stopwords don't apply to facets and how to

Where to find an exhaustive list of stop words?

China☆狼群 submitted on 2019-12-12 15:03:05
Question: Where can I find an exhaustive list of stop words? The one I have is quite short and seems inapplicable to scientific texts. I am creating lexical chains to extract key topics from scientific papers. The problem is that words like "based", "regarding", etc. should also be treated as stop words, since they do not carry much meaning.

Answer 1: You can also easily add to existing stop word lists. E.g. use the one in the NLTK toolkit: from nltk.corpus import stopwords and then add whatever you
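Extending a base list with domain-specific words, as the answer suggests, is a one-liner with sets. The sketch below uses a tiny inline base list in place of nltk's stopwords.words('english') so it runs without NLTK installed:

```python
# Tiny stand-in for nltk.corpus.stopwords.words('english').
base_stop_words = {"a", "an", "the", "is", "of", "and"}

# Domain-specific additions from the question: words that carry
# little meaning in scientific prose.
custom_stop_words = base_stop_words | {"based", "regarding"}

print(sorted(custom_stop_words))
```

With NLTK available, the same union works on set(stopwords.words('english')).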

How to remove list of words from strings

两盒软妹~` submitted on 2019-12-12 10:40:02
Question: What I would like to do (in Clojure): I have a vector of words that need to be removed:

(def forbidden-words [":)" "the" "." "," " " ...many more...])

... and a vector of strings:

(def strings ["the movie list" "this.is.a.string" "haha :)" ...many more...])

Each forbidden word should be removed from each string, so the result in this case would be ["movie list" "thisisastring" "haha"]. How do I do this?

Answer 1: (def forbidden-words [":)" "the" "." ","]) (def strings ["the
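The Clojure answer above is cut off; the same idea (delete every forbidden substring, then tidy the leftover whitespace) can be sketched in Python:

```python
import re

forbidden_words = [":)", "the", ".", ","]
strings = ["the movie list", "this.is.a.string", "haha :)"]

def remove_forbidden(s):
    # Delete each forbidden substring, then collapse leftover whitespace.
    for word in forbidden_words:
        s = s.replace(word, "")
    return re.sub(r"\s+", " ", s).strip()

print([remove_forbidden(s) for s in strings])
# -> ['movie list', 'thisisastring', 'haha']
```

Note that plain substring replacement also deletes "the" inside longer words (e.g. "theater" would become "ater"); for real text, matching whole words with a regex word boundary is safer.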

pyspark: how to configure StopWordsRemover with French language on Spark 1.6.3

送分小仙女□ submitted on 2019-12-11 09:49:36
Question: I would like to know how to configure StopWordsRemover with the French language in Spark 1.6.3. I'm currently using pyspark. Thanks for your help. Best regards,

Answer 1: Take a look at the nltk package. I use it for Portuguese words:

from pyspark.ml.feature import StopWordsRemover
import nltk
nltk.download("stopwords")
...
stopwordList = nltk.corpus.stopwords.words('portuguese')
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="stopWordsRem", stopWords=stopwordList)

Hope it

R tm removeWords stopwords is not removing stopwords

▼魔方 西西 submitted on 2019-12-11 04:09:16
Question: I'm using the R tm package and find that almost none of the tm_map functions that remove elements of text are working for me. By 'working' I mean, for example, I'll run:

d <- tm_map(d, removeWords, stopwords('english'))

but then when I run:

ddtm <- DocumentTermMatrix(d, control = list(weighting = weightTfIdf, minWordLength = 2))
findFreqTerms(ddtm, 10)

I still get:

[1] the this

...etc., and a bunch of other stopwords. I see no error indicating something has gone wrong. Does anyone know what

How to match certain words between two strings (in MATLAB)?

社会主义新天地 submitted on 2019-12-11 02:08:03
Question: In the following two strings, the words 'rabbit' and 'tree' match:

str1 = ('rabbit is eating grass near a tree');
str2 = ('rabbit is sleeping under tree');

Suppose cmp is a variable declared to compare both. I want the result as cmp = 2, or something else that shows that two words match. How do I do this?

Answer 1: "Crazy" bsxfun approach, which might be similar to intersect, but not tested - Function - function out = cell2_matchind(split1,split2) c1 = char(split1)-'0'; c2 = char(split2)-
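The MATLAB answer above is truncated; in Python the same comparison is a set intersection. A plain intersection of these two sentences also contains 'is', so the sketch below first filters a small stop word set (an assumption, since the question counts only 'rabbit' and 'tree' as matches):

```python
stop_words = {"is", "a", "the"}  # assumed filter; the question ignores 'is'

str1 = "rabbit is eating grass near a tree"
str2 = "rabbit is sleeping under tree"

common = (set(str1.split()) - stop_words) & (set(str2.split()) - stop_words)
cmp_count = len(common)

print(common)     # -> {'rabbit', 'tree'} (set order may vary)
print(cmp_count)  # -> 2
```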