stop-words

Full Text Search: Noise words are being searched for

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-04 10:21:20
I have a SQL Server 2008 database with Full Text Search indexes. I have defined the stopword 'al' in the stoplist. However, when I search for any phrase containing the keyword 'al', the word 'al' is still used in ranking. This might be related to the fact that I am breaking up search terms and reconstructing them; I am then searching across multiple fields and ranking the results: http://pastebin.com/fdce11ff . This breaks a search for 'al hamra' into ("*al*" ~ "*hamra*") OR ("*al*" OR "*hamra*") for the Full Text Search. Imagine this scenario: Name: Al Hamra, Author: Jack Brown,
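
A minimal sketch of how such a ranked query could be issued from Java via JDBC, assuming a hypothetical dbo.Places table with a full-text indexed Name column (none of these names come from the question, and the search condition is only a prefix-term approximation of the pattern above, not the original pastebin code):

// Hedged sketch: ranked full-text hits via CONTAINSTABLE from JDBC.
// dbo.Places, Name and Id are hypothetical names, not the original schema.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FullTextRankDemo {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://localhost;databaseName=Library;integratedSecurity=true";
        // NEAR (~) between the two prefix terms, OR'ed with the loose form.
        String condition = "(\"al*\" NEAR \"hamra*\") OR (\"al*\" OR \"hamra*\")";
        String sql = "SELECT p.Id, p.Name, ft.[RANK] AS FtRank "
                   + "FROM dbo.Places AS p "
                   + "JOIN CONTAINSTABLE(dbo.Places, Name, ?) AS ft ON p.Id = ft.[KEY] "
                   + "ORDER BY ft.[RANK] DESC";
        try (Connection con = DriverManager.getConnection(url);
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, condition);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("FtRank") + "  " + rs.getString("Name"));
                }
            }
        }
    }
}

Prefix terms such as "al*" are generally matched as patterns rather than looked up against the stoplist, which may be why the stoplisted 'al' still influences the rank; dropping stoplisted terms in the application before building the condition is the simplest way to keep them out of the ranking.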

Stopwords and MySQL boolean fulltext

Submitted by 送分小仙女□ on 2019-12-04 10:07:23
I'm using MySQL's built-in boolean fulltext features to search a dataset (MATCH ... AGAINST syntax). I'm running into a problem where keywords that are in MySQL's default stopword list do not return any results, for example "before", "between", etc. There is (I think) no way to disable MySQL's stopwords at runtime, and because I am hosting my website on a shared server (DreamHost), I don't have the option of recompiling MySQL with stopwords disabled. I'm wondering if anyone has any suggestions on ways around the above problem? (Without upgrading to a VPS or dedicated system) Thanks in
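
One workaround that needs no server configuration (a sketch only; the articles/body names are hypothetical and the stopword set shown is just a small excerpt of MySQL's default list): check each query term against the stopword list in application code and fall back to LIKE for the terms that MATCH ... AGAINST would ignore.

// Hedged sketch: route stopword terms to LIKE, everything else to boolean MATCH...AGAINST.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Set;

public class StopwordFallbackSearch {
    // A few entries from MySQL's default stopword list; the real list has several hundred words.
    private static final Set<String> MYSQL_STOPWORDS =
            Set.of("before", "between", "about", "would", "there", "when");

    public static void search(Connection con, String term) throws Exception {
        boolean stopword = MYSQL_STOPWORDS.contains(term.toLowerCase());
        String sql = stopword
                ? "SELECT id, title FROM articles WHERE body LIKE ?"
                : "SELECT id, title FROM articles WHERE MATCH(body) AGAINST (? IN BOOLEAN MODE)";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, stopword ? "%" + term + "%" : "+" + term);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + "  " + rs.getString("title"));
                }
            }
        }
    }
}

The LIKE branch cannot use the fulltext index, so it only scales to modest tables; the other common routes are maintaining your own keyword index table or moving the search to an external engine.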

How to add custom stop words using Lucene in Java

Submitted by  ̄綄美尐妖づ on 2019-12-04 10:06:37
I am using Lucene to remove English stop words, but my requirement is to remove both the English stop words and my own custom stop words. Below is my code to remove English stop words using Lucene. My sample code: public class Stopwords_remove { public String removeStopWords(String string) throws IOException { StandardAnalyzer ana = new StandardAnalyzer(Version.LUCENE_30); TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string)); StringBuilder sb = new StringBuilder(); tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, ana.STOP_WORDS_SET); CharTermAttribute token =
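
A sketch of one way to do this, written against the Lucene 4.x API that the later questions in this list use (in 3.x the CharArraySet import path differs); the custom words "foo" and "bar" are placeholders: copy the default English set into a mutable CharArraySet, add your own entries, and pass that single set to StopFilter.

// Hedged sketch: English defaults plus custom stop words in one StopFilter.
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class CustomStopwordsRemover {
    public String removeStopWords(String text) throws IOException {
        // Start from the standard English stop words and add custom ones.
        CharArraySet stopSet = CharArraySet.copy(Version.LUCENE_43, StandardAnalyzer.STOP_WORDS_SET);
        stopSet.add("foo");   // placeholder custom stop word
        stopSet.add("bar");   // placeholder custom stop word

        TokenStream ts = new StandardTokenizer(Version.LUCENE_43, new StringReader(text));
        ts = new StopFilter(Version.LUCENE_43, ts, stopSet);

        StringBuilder sb = new StringBuilder();
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();                          // required before incrementToken() in 4.x
        while (ts.incrementToken()) {
            sb.append(term.toString()).append(' ');
        }
        ts.end();
        ts.close();
        return sb.toString().trim();
    }
}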

Using Shingles and Stop words with Elasticsearch and Lucene 4.4

Submitted by 坚强是说给别人听的谎言 on 2019-12-04 05:23:04
In the index I'm building, I'm interested in running a query, then (using facets) returning the shingles of that query. Here's the analyzer I'm using on the text: { "settings": { "analysis": { "analyzer": { "shingleAnalyzer": { "tokenizer": "standard", "filter": [ "standard", "lowercase", "custom_stop", "custom_shingle", "custom_stemmer" ] } }, "filter": { "custom_stemmer" : { "type": "stemmer", "name": "english" }, "custom_stop": { "type": "stop", "stopwords": "_english_" }, "custom_shingle": { "type": "shingle", "min_shingle_size": "2", "max_shingle_size": "3" } } } } } The major issue is
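
For context, the known wrinkle with this kind of chain is that removing stop words before shingling leaves filler tokens (underscores by default) inside the shingles. A rough plain-Lucene sketch of an equivalent chain (an approximation, not the Elasticsearch configuration itself; the sample text is made up) that simply skips shingles containing a filler:

// Hedged sketch: lowercase -> stop filter -> shingles, approximating the analyzer above in plain Lucene.
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ShingleDemo {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new StandardTokenizer(Version.LUCENE_44,
                new StringReader("Please do not notify the author"));
        ts = new LowerCaseFilter(Version.LUCENE_44, ts);
        ts = new StopFilter(Version.LUCENE_44, ts, EnglishAnalyzer.getDefaultStopSet());
        ShingleFilter shingles = new ShingleFilter(ts, 2, 3);   // min/max shingle size as in the settings above
        CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
        shingles.reset();
        while (shingles.incrementToken()) {
            String shingle = term.toString();
            if (!shingle.contains("_")) {     // skip shingles that span a removed stop word
                System.out.println(shingle);
            }
        }
        shingles.end();
        shingles.close();
    }
}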

MySQL Fulltext Stopwords Rationale

Submitted by 人走茶凉 on 2019-12-04 01:22:34
I am currently trying to develop a basic fulltext search for my website, and I noticed that certain words like "regarding" are listed as stopwords for MySQL fulltext searches. This doesn't bother me too much right now, since people searching for a given news item wouldn't necessarily search using the word "regarding" (but I certainly can't speak for everyone!). However, I was hoping someone here could enlighten me about the rationale for having a stopwords list. Thanks! For clarification: I'm

Extract Relevant Tag/Keywords from Text block

Submitted by 蹲街弑〆低调 on 2019-12-03 10:39:08
I want a particular implementation in which the user provides a block of text like: "Requirements - Working knowledge, on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5, - Knowledge of Web 2.0 Standards - Comfortable with JSON - Hands on Experience on working with Frameworks, Zend, OOPs - Cross Browser Javascripting, JQuery etc. - Knowledge of Version Control Software such as sub-version will be preferable." What I want to do is automatically select relevant keywords and create
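
A naive baseline for this (a sketch only; real tag extraction usually needs a phrase dictionary or TF-IDF over a corpus, and the stop word list here is just a placeholder): lowercase the text, split on non-word characters, drop stop words and very short tokens, and rank the remaining terms by frequency.

// Hedged sketch: frequency-based keyword extraction after stop word removal.
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class KeywordExtractor {
    private static final Set<String> STOP_WORDS = Set.of(
            "a", "an", "and", "as", "at", "for", "from", "in", "of", "on",
            "or", "such", "the", "to", "using", "with", "will", "be");

    public static List<String> topKeywords(String text, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9+#]+")) {
            if (token.length() < 2 || STOP_WORDS.contains(token)) {
                continue;   // skip noise and stop words
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts.entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String block = "Working knowledge on LAMP Environment using Linux, Apache 2, MySQL 5 and PHP 5";
        System.out.println(topKeywords(block, 5));
    }
}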

Stop words and stemmer in java

Submitted by 纵饮孤独 on 2019-12-03 04:49:25
I'm thinking of adding stop word removal to my similarity program and then a stemmer (going for Porter 1 or 2, depending on which is easiest to implement). Since I read my text from files as whole lines and save them as one long string, suppose I have two strings, e.g. String one = "I decided buy something from the shop."; String two = "Nevertheless I decidedly bought something from a shop."; Now that I have those strings: Stemming: can I just run the stemmer algorithm directly on them, save the result as a String, and then continue working on the similarity as I did before implementing the stemmer
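
In principle yes: a stop list and a stemmer both work token by token, so you can normalize each whole string once and then feed the cleaned strings to the existing similarity code unchanged. A sketch using Lucene's analysis chain (one option among many; the stop set and Lucene version are assumptions):

// Hedged sketch: normalize a whole string (lowercase, stop words out, Porter stems) before computing similarity.
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class Normalizer {
    public static String normalize(String text) throws IOException {
        TokenStream ts = new StandardTokenizer(Version.LUCENE_43, new StringReader(text));
        ts = new LowerCaseFilter(Version.LUCENE_43, ts);
        ts = new StopFilter(Version.LUCENE_43, ts, StandardAnalyzer.STOP_WORDS_SET);
        ts = new PorterStemFilter(ts);
        StringBuilder sb = new StringBuilder();
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            sb.append(term.toString()).append(' ');
        }
        ts.end();
        ts.close();
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // e.g. "decided" becomes "decid"; stop words such as "the" and "a" are removed.
        System.out.println(normalize("I decided buy something from the shop."));
        System.out.println(normalize("Nevertheless I decidedly bought something from a shop."));
    }
}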

Tokenize, remove stop words using Lucene with Java

Submitted by 北慕城南 on 2019-12-03 04:41:16
I am trying to tokenize and remove stop words from a txt file with Lucene. I have this: public String removeStopWords(String string) throws IOException { Set<String> stopWords = new HashSet<String>(); stopWords.add("a"); stopWords.add("an"); stopWords.add("I"); stopWords.add("the"); TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(string)); tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords); StringBuilder sb = new StringBuilder(); CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class); while (tokenStream
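
The excerpt is cut off; a complete version of the same pattern might look like the sketch below (note that in Lucene 4.x StopFilter takes a CharArraySet rather than a plain HashSet, and the stream must be reset before iterating):

// Hedged sketch: a complete version of the pattern above for Lucene 4.3.
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class StopWordRemover {
    public String removeStopWords(String input) throws IOException {
        CharArraySet stopWords = new CharArraySet(Version.LUCENE_43, 4, true); // ignoreCase = true
        stopWords.add("a");
        stopWords.add("an");
        stopWords.add("i");
        stopWords.add("the");

        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_43, new StringReader(input));
        tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, stopWords);

        StringBuilder sb = new StringBuilder();
        CharTermAttribute token = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();                     // mandatory in Lucene 4.x before incrementToken()
        while (tokenStream.incrementToken()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token.toString());
        }
        tokenStream.end();
        tokenStream.close();
        return sb.toString();
    }
}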