stop-words

Elasticsearch: index a field with keyword tokenizer but without stopwords

拈花ヽ惹草 submitted on 2019-12-11 01:42:49
Question: I am looking for a way to search company names with keyword tokenizing but without stopwords. For example: the indexed company name is "Hansel und Gretel Gmbh." Here "und" and "Gmbh" are stop words for the company name. If the search term is "Hansel Gretel", that document should be found; if the search term is "Hansel", then no document should be found. And if the search term is "hansel gmbh", then no document should be found either. I have tried to combine the keyword tokenizer with stopwords in …
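A minimal sketch of one way to get this behaviour, assuming Elasticsearch's built-in fingerprint analyzer (not mentioned in the excerpt): it tokenizes, lowercases, strips the configured stopwords, then sorts, de-duplicates, and joins the surviving tokens back into a single token, so the whole cleaned name has to match. Index, field, and analyzer names are illustrative:

import json

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "company_name": {
                    "type": "fingerprint",         # tokenize, lowercase, stop, sort, join
                    "stopwords": ["und", "gmbh"]   # company-name stop words from the question
                }
            }
        }
    },
    "mappings": {
        # ES 7+ mapping shape; older versions nest this under a type name.
        "properties": {
            "name": {"type": "text", "analyzer": "company_name"}
        }
    }
}
print(json.dumps(index_body, indent=2))

With this, "Hansel und Gretel Gmbh." is indexed as the single token "gretel hansel"; a match query for "Hansel Gretel" (analyzed the same way) produces the same token and matches, while "Hansel" or "hansel gmbh" both reduce to "hansel" and do not.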

Stopword removal with pandas

◇◆丶佛笑我妖孽 submitted on 2019-12-11 00:59:53
Question: I would like to remove stopwords from a column of a data frame. Inside the column there is text which needs to be split. For example, my data frame looks like this:

ID  Text
1   eat launch with me
2   go outside have fun

I want to apply stopword removal to the text column, so the text has to be split first. I tried this:

for item in cached_stop_words:
    if item in df_from_each_file[['text']]:
        print(item)
        df_from_each_file['text'] = df_from_each_file['text'].replace(item, '')

So my output should be like this:

ID  Text
1 …
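A minimal sketch of the usual fix (DataFrame and column names taken from the question, the stopword set invented for illustration): split each cell, drop the stopwords, and re-join, instead of calling replace once per stopword on the whole column:

import pandas as pd

cached_stop_words = {"with", "me", "have"}  # hypothetical stopword set

df = pd.DataFrame({"ID": [1, 2],
                   "Text": ["eat launch with me", "go outside have fun"]})

def drop_stopwords(text):
    # Split on whitespace, keep non-stopwords, re-join into one string.
    return " ".join(w for w in text.split() if w.lower() not in cached_stop_words)

df["Text"] = df["Text"].apply(drop_stopwords)
print(df)
# row 1 -> "eat launch", row 2 -> "go outside fun"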

How to remove stop words from a large collection of files more efficiently?

馋奶兔 submitted on 2019-12-08 09:08:02
Question: I have 200,000 files to process, extracting tokens from each one. The total size of all the files is 1.5 GB. The code for extracting tokens from each file works well; overall execution time is 10 minutes. After that, I tried to remove stopwords, and performance went down badly: it now takes 25 to 30 minutes. I'm using the stop words from the website here. There are around 571 stop words. The general procedure is to extract each stop word from a text file at once and compare it with each …
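A minimal sketch of the standard speed-up, assuming plain-text input (file and directory names are illustrative): load the 571 stop words once into a set, whose O(1) membership test replaces scanning a 571-element list for every token:

from pathlib import Path

# Load the stop-word list once, up front, into a set.
with open("stopwords.txt", encoding="utf-8") as f:
    stop_words = {line.strip().lower() for line in f if line.strip()}

def tokens_without_stopwords(path):
    # One pass per file: split into tokens and keep only non-stopwords.
    text = path.read_text(encoding="utf-8", errors="ignore")
    return [t for t in text.lower().split() if t not in stop_words]

all_tokens = {p.name: tokens_without_stopwords(p)
              for p in Path("corpus").glob("*.txt")}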

Filter out common words for search query

拈花ヽ惹草 submitted on 2019-12-06 05:58:39
Question: Are there any easy ways to implement filtering a user's input (possibly a question) by extracting the meaningful data in the query? I basically want to filter out any noise words so I can send a "clean" query to Google's search API.

Answer 1: Um, won't Google do this for you? Send all those dirty, filthy words to Google and let them clean them up for you.

Answer 2: Jeff talked about "stop words" in one of the previous Stack Overflow podcasts. You might try searching for that phrase on Google. The Wikipedia page seems to have some overview and pointers to options: http://en.wikipedia.org/wiki/Stop_words You can try …
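A minimal sketch of the do-it-yourself route, with an illustrative (not authoritative) noise-word list:

# Tiny hypothetical stop list; a real one would come from a published set.
NOISE_WORDS = {"a", "an", "the", "is", "are", "how", "do", "i", "what", "why"}

def clean_query(query):
    # Keep only the words that are not in the noise list.
    kept = [w for w in query.split() if w.lower() not in NOISE_WORDS]
    return " ".join(kept)

print(clean_query("How do I filter stop words from a query"))
# -> "filter stop words from query"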

Using Shingles and Stop words with Elasticsearch and Lucene 4.4

南楼画角 submitted on 2019-12-06 00:22:38
Question: In the index I'm building, I'm interested in running a query, then (using facets) returning the shingles of that query. Here's the analyzer I'm using on the text:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingleAnalyzer": {
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "custom_stop", "custom_shingle", "custom_stemmer"]
        }
      },
      "filter": {
        "custom_stemmer": { "type": "stemmer", "name": "english" },
        "custom_stop": { "type": "stop", "stopwords": "_english_" },
        "custom_shingle" …
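One common pitfall with this filter chain (an assumption about the problem, since the excerpt is cut off): the stop filter leaves position gaps, and the shingle filter fills them with "_" placeholder tokens, which then pollute the facet results. A sketch of the usual workaround in Elasticsearch versions that support it, suppressing the filler token:

import json

custom_shingle = {
    "type": "shingle",
    "min_shingle_size": 2,   # illustrative sizes; the original values are cut off
    "max_shingle_size": 3,
    "filler_token": ""       # replace the default "_" filler with nothing
}
print(json.dumps(custom_shingle, indent=2))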

MySQL full-text stopwords problem

五迷三道 submitted on 2019-12-05 18:33:19
Question: I have a database named "products" and a FULLTEXT index on the columns title and description. All of my products are lubricants (oils), and there are two types of them, industrial and auto-moto, at a rate of 55%-45%. If I search for auto-moto oils, it returns no results, because the keyword "auto-moto" is present in more than half of the rows (and "oils" in all of them), so MySQL treats them as stopwords. I am using PHP. How can I make the query give back the right results?

Answer: The answer is IN BOOLEAN MODE. If you use boolean mode, then MySQL will ignore …
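A minimal sketch of such a query (table and column names taken from the question, wrapped in a Python string purely for illustration): in boolean mode, MySQL skips the natural-language 50% threshold, so over-frequent words are no longer dropped:

# Boolean-mode full-text query; "+" requires each quoted term to be present.
query = """
    SELECT id, title, description
    FROM products
    WHERE MATCH (title, description)
          AGAINST ('+"auto-moto" +"oil"' IN BOOLEAN MODE)
"""
print(query)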

How to get a list of StopWords used in my FullText Catalog?

断了今生、忘了曾经 submitted on 2019-12-05 03:44:36
Question: Is there a way to get the stopword list that my SQL Server 2008 FullText Catalog is using, and to use it in my C# code-behind? I want to use it in an ASP.NET page that I use to search terms and highlight them. The search page and the highlighting are already working fine, but I want to improve the highlighting: I don't want to highlight a word that is on my stopword list.

Answer (Sem Vanmeenen): In SQL Server Management Studio, if you open the properties of the full-text index, you can see which stoplist it uses. See here. You can then use the system views sys.fulltext_stoplists and sys.fulltext_stopwords to get …
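A minimal sketch of reading that list from the server (shown in Python with pyodbc purely for illustration; the asker's C# version would run the same query through ADO.NET; the connection string and stoplist name are assumptions):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=MyDb;Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.execute(
    """
    SELECT w.stopword
    FROM sys.fulltext_stopwords AS w
    JOIN sys.fulltext_stoplists AS l
      ON w.stoplist_id = l.stoplist_id
    WHERE l.name = ?
    """,
    "myStoplist",  # hypothetical stoplist name seen in the index properties
)
stop_words = {row.stopword for row in cursor.fetchall()}
print(stop_words)

Note that these views only cover user-defined stoplists; the built-in system stoplist is exposed through sys.fulltext_system_stopwords instead.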

Apache Lucene doesn't filter stop words despite the usage of StopAnalyzer and StopFilter

≡放荡痞女 submitted on 2019-12-04 16:02:15
Question: I have a module based on Apache Lucene 5.5/6.0 which retrieves keywords. Everything is working fine except for one thing: Lucene doesn't filter stop words. I tried to enable stop word filtering with two different approaches.

Approach #1:

tokenStream = new StopFilter(
        new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))),
        EnglishAnalyzer.getDefaultStopSet());
tokenStream.reset();

Approach #2:

tokenStream = new StopFilter(
        new ClassicFilter(new LowerCaseFilter(stdToken)),
        StopAnalyzer.ENGLISH_STOP_WORDS_SET);
tokenStream.reset();

The full code is available here: https:/ …