stop-words

How to select stop words using tf-idf? (non-English corpus)

Submitted by 别等时光非礼了梦想 on 2019-12-02 22:53:54
I have managed to evaluate the tf-idf function for a given corpus. How can I find the stopwords and the best words for each document? I understand that a low tf-idf for a given word and document means that it is not a good word for selecting that document.

Stop words are those words that appear very commonly across the documents, therefore losing their representativeness. The best way to observe this is to measure the number of documents a term appears in and filter out those that appear in more than 50% of them, or the top 500, or some other threshold that you will have to tune. The best (as in …
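As a rough illustration of the document-frequency heuristic described in the answer, here is a minimal Python sketch; the 50% ratio and the top-500 cut-off are the tunable thresholds the answer mentions, not fixed values, and the tiny corpus is invented for the example.

from collections import Counter

def candidate_stopwords(docs, max_doc_ratio=0.5, top_n=500):
    # docs: a list of already-tokenised documents (lists of terms).
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term at most once per document
    widespread = [(term, df) for term, df in doc_freq.items()
                  if df / n_docs > max_doc_ratio]
    widespread.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in widespread[:top_n]]

corpus = [["el", "gato", "come", "pescado"],
          ["el", "perro", "come", "carne"],
          ["la", "casa", "es", "grande"]]
print(candidate_stopwords(corpus))  # e.g. ['el', 'come']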

Can InnoDB use a stopword file?

Submitted by 烂漫一生 on 2019-12-02 08:30:29
With fulltext search for MyISAM, I know that I can specify a stopword file in my.cnf with the following: ft_stopword_file = '/etc/stopword.txt'. Can the same also be done with fulltext search for InnoDB? I'd like to do something like the following if possible: ft_stopword_file_innodb = '/etc/stopword.txt'. However, I haven't seen any documentation indicating that stopwords for InnoDB can be stored in a file.

No, it cannot natively use a text file out of the box, that is, in MySQL as shipped. To achieve that you would need to write specialty UDFs, which would be absurd considering the …
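As a hedged aside: since MySQL 5.6, InnoDB full-text indexes take their stopwords from a database table rather than a file, via the innodb_ft_server_stopword_table (or per-session innodb_ft_user_stopword_table) variable; the table must be an InnoDB table with a single VARCHAR column named value. The sketch below loads /etc/stopword.txt into such a table using the mysql-connector-python package; the database name, the credentials and the table name my_stopwords are placeholders.

# Sketch: load a plain-text stopword file into an InnoDB stopword table
# and point the server at it (MySQL 5.6+, SUPER privilege required for SET GLOBAL).
import mysql.connector

conn = mysql.connector.connect(user="root", password="secret", database="mydb")
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS my_stopwords (value VARCHAR(30)) ENGINE=InnoDB")
with open("/etc/stopword.txt") as f:
    words = [(line.strip(),) for line in f if line.strip()]
cur.executemany("INSERT INTO my_stopwords (value) VALUES (%s)", words)

cur.execute("SET GLOBAL innodb_ft_server_stopword_table = 'mydb/my_stopwords'")
conn.commit()
cur.close()
conn.close()

Full-text indexes created after the change pick up the new list; existing InnoDB full-text indexes have to be rebuilt to use it.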

No results after removing mysql ft_stopword_file

Submitted by 隐身守侯 on 2019-12-01 19:35:27
I have a film database that contains information about a film called Yes, We're Open. When searching the database, I'm having an issue wherein a search for "yes we're open" returns another title that has the words "we're" and "open" but not "yes" in its description, even though I require all words in boolean mode (i.e. "yes we\'re open" is translated to '+yes +we\'re +open' before it's sent as a query). I assumed this was because "yes" is in the built-in stopwords list. However, when I set ft_stopword_file = "", restart MySQL, and then run repair table [tablename] quick, the table that I'm …
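For reference, a hedged Python sketch of the boolean-mode query the question describes, using the mysql-connector-python package; the table name films, the column description and the credentials are placeholders. One related caveat worth checking in this situation: MyISAM full-text indexing also drops words shorter than ft_min_word_len (4 by default), so a three-letter word like "yes" can be missing from the index for reasons unrelated to the stopword list.

# Sketch: require every search word with '+' prefixes in BOOLEAN MODE,
# as described in the question. Placeholders: films, description, credentials.
import mysql.connector

conn = mysql.connector.connect(user="root", password="secret", database="filmdb")
cur = conn.cursor()

search = "yes we're open"
boolean_query = " ".join("+" + word for word in search.split())   # "+yes +we're +open"

cur.execute(
    "SELECT title FROM films WHERE MATCH(description) AGAINST (%s IN BOOLEAN MODE)",
    (boolean_query,),
)
for (title,) in cur:
    print(title)

cur.close()
conn.close()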

elasticsearch: how to index terms which are stopwords only?

Submitted by 青春壹個敷衍的年華 on 2019-12-01 11:22:40
I had much success building my own little search with Elasticsearch in the background, but there is one thing I couldn't find in the documentation. I'm indexing the names of musicians and bands. There is one band called "The The", and due to the stop words list this band is never indexed. I know I can ignore the stop words list completely, but this is not what I want, since the results when searching for other bands like "the who" would explode. So, is it possible to save "The The" in the index without disabling the stop words entirely?

You can use the synonym filter to convert The The into a single …
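Presumably the answer goes on to map the multi-word name onto a single token before the stop filter runs. A hedged sketch of what such index settings could look like, written as the Python dict you would pass when creating the index with the official elasticsearch client; the index, analyzer and filter names are placeholders, and older client versions take body={"settings": ...} rather than the settings= keyword.

# Sketch: a synonym filter collapses "the the" into one token ("the_the")
# before the stop filter removes "the", so the band name survives indexing.
from elasticsearch import Elasticsearch

settings = {
    "analysis": {
        "filter": {
            "band_synonyms": {"type": "synonym", "synonyms": ["the the => the_the"]},
            "english_stop": {"type": "stop", "stopwords": "_english_"},
        },
        "analyzer": {
            "band_name": {
                "tokenizer": "standard",
                "filter": ["lowercase", "band_synonyms", "english_stop"],
            }
        },
    }
}

es = Elasticsearch("http://localhost:9200")
es.indices.create(index="bands", settings=settings)

The same analyzer has to be applied at search time as well, so that a query for "the the" is rewritten to the same single token.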

Most used words in text with PHP

Submitted by 霸气de小男生 on 2019-11-30 16:04:39
I found the code below on Stack Overflow and it works well in finding the most common words in a string. But can I exclude common words like "a", "if", "you", "have", etc. from the count? Or would I have to remove those elements after counting? How would I do this? Thanks in advance.

<?php
$text = "A very nice to tot to text. Something nice to think about if you're into text.";
$words = str_word_count($text, 1);
$frequency = array_count_values($words);
arsort($frequency);
echo '<pre>';
print_r($frequency);
echo '</pre>';
?>

This is a function that extracts common words from a string. It takes three …
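For comparison, a rough Python sketch of the same idea with the exclusion applied before counting; the stopword set here is a tiny invented sample, not a complete list.

# Sketch: count word frequencies while skipping a small stopword set.
from collections import Counter
import re

text = "A very nice to tot to text. Something nice to think about if you're into text."
stopwords = {"a", "if", "you", "have", "to", "into", "about", "you're"}

words = re.findall(r"[a-z']+", text.lower())
frequency = Counter(word for word in words if word not in stopwords)

for word, count in frequency.most_common():
    print(word, count)   # nice 2, text 2, ...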

Why are these words considered stopwords?

Submitted by 偶尔善良 on 2019-11-30 12:49:10
I do not have a formal background in Natural Language Processing and was wondering if someone from the NLP side can shed some light on this. I am playing around with the NLTK library and I was specifically looking into the stopwords function provided by this package:

In [80]: nltk.corpus.stopwords.words('english')
Out[80]: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', …
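For anyone reproducing the listing above, a minimal sketch of fetching NLTK's English stopword list and filtering a sentence with it; it assumes the stopwords corpus has already been fetched once with nltk.download("stopwords"), and the sample sentence is invented.

# Sketch: load NLTK's English stopword list and drop those words from a sentence.
from nltk.corpus import stopwords   # needs a one-off nltk.download("stopwords")

english_stopwords = set(stopwords.words("english"))

sentence = "why are these words considered stopwords"
kept = [word for word in sentence.split() if word not in english_stopwords]
print(kept)   # ['words', 'considered', 'stopwords']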

Can I customize Elastic Search to use my own Stop Word list?

Submitted by 旧城冷巷雨未停 on 2019-11-30 12:09:15
Specifically, I want to index everything (e.g. "the who") with no stop word list. Is Elasticsearch flexible enough and easy enough to change?

By default, the analyzer Elasticsearch uses is a standard analyzer with the default Lucene English stopwords. I have configured Elasticsearch to use the same analyzer but without stopwords by adding the following to the elasticsearch.yml file:

# Index Settings
index:
  analysis:
    analyzer:
      # set the standard analyzer with no stop words as the default
      # for both indexing and searching
      default:
        type: standard
        stopwords: _none_

Yes, you can do this using …
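A hedged sketch of the per-index equivalent, created through the Python client instead of editing elasticsearch.yml; the index name music is a placeholder, and older client versions pass body={"settings": ...} rather than the settings= keyword.

# Sketch: give one index a default "standard" analyzer with its stopword
# list disabled, leaving the rest of the cluster untouched.
from elasticsearch import Elasticsearch

settings = {
    "analysis": {
        "analyzer": {
            "default": {"type": "standard", "stopwords": "_none_"}
        }
    }
}

es = Elasticsearch("http://localhost:9200")
es.indices.create(index="music", settings=settings)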

Adding custom stopwords in R tm

Submitted by 假如想象 on 2019-11-30 11:46:50
I have a Corpus in R using the tm package. I am applying the removeWords function to remove stopwords:

tm_map(abs, removeWords, stopwords("english"))

Is there a way to add my own custom stop words to this list?

stopwords just provides you with a vector of words, so simply combine your own ones with it:

tm_map(abs, removeWords, c(stopwords("english"), "my", "custom", "words"))

Reza Rahimi: Save your custom stop words in a CSV file (e.g. word.csv).

library(tm)
stopwords <- read.csv("word.csv", header = FALSE)
stopwords <- as.character(stopwords$V1)
stopwords <- c(stopwords, stopwords())

Then you can …
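For readers outside R, a hedged Python analogue of the second answer: merge a custom list read from a CSV file with NLTK's built-in English stopwords before filtering. The file name word.csv mirrors the R example, one word per line is assumed, and the stopwords corpus must have been downloaded once with nltk.download("stopwords").

# Sketch: combine custom stopwords from word.csv with NLTK's English list.
import csv
from nltk.corpus import stopwords

with open("word.csv", newline="") as f:
    custom = {row[0].strip().lower() for row in csv.reader(f) if row}

all_stopwords = custom | set(stopwords.words("english"))

tokens = ["this", "is", "my", "corpus"]
print([t for t in tokens if t not in all_stopwords])   # ['corpus'], assuming "corpus" is not in word.csv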