stop-words

How to select stop words using tf-idf? (non-English corpus)

Submitted by 别等时光非礼了梦想 on 2019-12-02 22:53:54
I have managed to evaluate the tf-idf function for a given corpus. How can I find the stopwords and the best words for each document? I understand that a low tf-idf for a given word and document means that it is not a good word for selecting that document.

Stop words are those words that appear very commonly across the documents, therefore losing their representativeness. The best way to observe this is to measure the number of documents a term appears in and filter out those that appear in more than 50% of them, or the top 500, or some other threshold that you will have to tune. The best (as in …
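As a rough illustration of the document-frequency heuristic described in the answer, here is a minimal Python sketch; the 50% ratio and the top-500 cut-off are the tunable thresholds the answer mentions, not fixed values, and the tiny corpus is invented for the example.

from collections import Counter

def candidate_stopwords(docs, max_doc_ratio=0.5, top_n=500):
    # docs: a list of already-tokenised documents (lists of terms).
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term at most once per document
    widespread = [(term, df) for term, df in doc_freq.items()
                  if df / n_docs > max_doc_ratio]
    widespread.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in widespread[:top_n]]

corpus = [["el", "gato", "come", "pescado"],
          ["el", "perro", "come", "carne"],
          ["la", "casa", "es", "grande"]]
print(candidate_stopwords(corpus))  # e.g. ['el', 'come']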

Can InnoDB use a stopword file?

Submitted by 烂漫一生 on 2019-12-02 08:30:29
With fulltext search for MyISAM, I know that I can specify a stopword file in my.cnf with the following: ft_stopword_file = '/etc/stopword.txt'. Can the same also be done with fulltext search for InnoDB? I'd like to do something like the following if possible: ft_stopword_file_innodb = '/etc/stopword.txt'. However, I haven't seen any documentation indicating that stopwords for InnoDB can be stored in a file.

No, it cannot natively use a text file out of the box, that is, in MySQL as shipped. To achieve that you would need to write specialty UDFs, which would be absurd considering the …
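As a hedged aside: since MySQL 5.6, InnoDB full-text indexes take their stopwords from a database table rather than a file, via the innodb_ft_server_stopword_table (or per-session innodb_ft_user_stopword_table) variable; the table must be an InnoDB table with a single VARCHAR column named value. The sketch below loads /etc/stopword.txt into such a table using the mysql-connector-python package; the database name, the credentials and the table name my_stopwords are placeholders.

# Sketch: load a plain-text stopword file into an InnoDB stopword table
# and point the server at it (MySQL 5.6+, SUPER privilege required for SET GLOBAL).
import mysql.connector

conn = mysql.connector.connect(user="root", password="secret", database="mydb")
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS my_stopwords (value VARCHAR(30)) ENGINE=InnoDB")
with open("/etc/stopword.txt") as f:
    words = [(line.strip(),) for line in f if line.strip()]
cur.executemany("INSERT INTO my_stopwords (value) VALUES (%s)", words)

cur.execute("SET GLOBAL innodb_ft_server_stopword_table = 'mydb/my_stopwords'")
conn.commit()
cur.close()
conn.close()

Full-text indexes created after the change pick up the new list; existing InnoDB full-text indexes have to be rebuilt to use it.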

No results after removing mysql ft_stopword_file

Submitted by 隐身守侯 on 2019-12-01 19:35:27
I have a film database that contains information about a film called Yes, We're Open. When searching the database, I'm having an issue wherein a search for "yes we're open" returns another title that has the words "we're" and "open" but not "yes" in its description, even though I require all words in boolean mode (i.e. "yes we\'re open" is translated to '+yes +we\'re +open' before it's sent as a query). I assumed this was because "yes" is in the built-in stopwords list. However, when I set ft_stopword_file = "", restart MySQL, and then run repair table [tablename] quick, the table that I'm …
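For reference, a hedged Python sketch of the boolean-mode query the question describes, using the mysql-connector-python package; the table name films, the column description and the credentials are placeholders. One related caveat worth checking in this situation: MyISAM full-text indexing also drops words shorter than ft_min_word_len (4 by default), so a three-letter word like "yes" can be missing from the index for reasons unrelated to the stopword list.

# Sketch: require every search word with '+' prefixes in BOOLEAN MODE,
# as described in the question. Placeholders: films, description, credentials.
import mysql.connector

conn = mysql.connector.connect(user="root", password="secret", database="filmdb")
cur = conn.cursor()

search = "yes we're open"
boolean_query = " ".join("+" + word for word in search.split())   # "+yes +we're +open"

cur.execute(
    "SELECT title FROM films WHERE MATCH(description) AGAINST (%s IN BOOLEAN MODE)",
    (boolean_query,),
)
for (title,) in cur:
    print(title)

cur.close()
conn.close()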

elasticsearch: how to index terms which are stopwords only?

Submitted by 青春壹個敷衍的年華 on 2019-12-01 11:22:40
I had much success building my own little search with Elasticsearch in the background, but there is one thing I couldn't find in the documentation. I'm indexing the names of musicians and bands. There is one band called "The The", and due to the stop words list this band is never indexed. I know I can ignore the stop words list completely, but this is not what I want, since the results when searching for other bands like "the who" would explode. So, is it possible to save "The The" in the index without disabling the stop words entirely?

You can use the synonym filter to convert The The into a single …
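Presumably the answer goes on to map the multi-word name onto a single token before the stop filter runs. A hedged sketch of what such index settings could look like, written as the Python dict you would pass when creating the index with the official elasticsearch client; the index, analyzer and filter names are placeholders, and older client versions take body={"settings": ...} rather than the settings= keyword.

# Sketch: a synonym filter collapses "the the" into one token ("the_the")
# before the stop filter removes "the", so the band name survives indexing.
from elasticsearch import Elasticsearch

settings = {
    "analysis": {
        "filter": {
            "band_synonyms": {"type": "synonym", "synonyms": ["the the => the_the"]},
            "english_stop": {"type": "stop", "stopwords": "_english_"},
        },
        "analyzer": {
            "band_name": {
                "tokenizer": "standard",
                "filter": ["lowercase", "band_synonyms", "english_stop"],
            }
        },
    }
}

es = Elasticsearch("http://localhost:9200")
es.indices.create(index="bands", settings=settings)

The same analyzer has to be applied at search time as well, so that a query for "the the" is rewritten to the same single token.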

Most used words in text with PHP

Submitted by 霸气de小男生 on 2019-11-30 16:04:39
I found the code below on Stack Overflow and it works well in finding the most common words in a string. But can I exclude common words like "a", "if", "you", "have", etc. from the count? Or would I have to remove those elements after counting? How would I do this? Thanks in advance.

<?php
$text = "A very nice to tot to text. Something nice to think about if you're into text.";
$words = str_word_count($text, 1);
$frequency = array_count_values($words);
arsort($frequency);
echo '<pre>';
print_r($frequency);
echo '</pre>';
?>

This is a function that extracts common words from a string. It takes three …
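For comparison, a rough Python sketch of the same idea with the exclusion applied before counting; the stopword set here is a tiny invented sample, not a complete list.

# Sketch: count word frequencies while skipping a small stopword set.
from collections import Counter
import re

text = "A very nice to tot to text. Something nice to think about if you're into text."
stopwords = {"a", "if", "you", "have", "to", "into", "about", "you're"}

words = re.findall(r"[a-z']+", text.lower())
frequency = Counter(word for word in words if word not in stopwords)

for word, count in frequency.most_common():
    print(word, count)   # nice 2, text 2, ...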

Why are these words considered stopwords?

Submitted by 偶尔善良 on 2019-11-30 12:49:10
I do not have a formal background in Natural Language Processing and was wondering if someone from the NLP side can shed some light on this. I am playing around with the NLTK library and I was specifically looking into the stopwords function provided by this package:

In [80]: nltk.corpus.stopwords.words('english')
Out[80]: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', …
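For anyone reproducing the listing above, a minimal sketch of fetching NLTK's English stopword list and filtering a sentence with it; it assumes the stopwords corpus has already been fetched once with nltk.download("stopwords"), and the sample sentence is invented.

# Sketch: load NLTK's English stopword list and drop those words from a sentence.
from nltk.corpus import stopwords   # needs a one-off nltk.download("stopwords")

english_stopwords = set(stopwords.words("english"))

sentence = "why are these words considered stopwords"
kept = [word for word in sentence.split() if word not in english_stopwords]
print(kept)   # ['words', 'considered', 'stopwords']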

Can I customize Elastic Search to use my own Stop Word list?

Submitted by 旧城冷巷雨未停 on 2019-11-30 12:09:15
Specifically, I want to index everything (e.g. "the who") with no stop word list. Is Elasticsearch flexible enough and easy enough to change?

By default, the analyzer Elasticsearch uses is a standard analyzer with the default Lucene English stopwords. I have configured Elasticsearch to use the same analyzer but without stopwords by adding the following to the elasticsearch.yml file:

# Index Settings
index:
  analysis:
    analyzer:
      # set the standard analyzer with no stop words as the default
      # for both indexing and searching
      default:
        type: standard
        stopwords: _none_

Yes, you can do this using …
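A hedged sketch of the per-index equivalent, created through the Python client instead of editing elasticsearch.yml; the index name music is a placeholder, and older client versions pass body={"settings": ...} rather than the settings= keyword.

# Sketch: give one index a default "standard" analyzer with its stopword
# list disabled, leaving the rest of the cluster untouched.
from elasticsearch import Elasticsearch

settings = {
    "analysis": {
        "analyzer": {
            "default": {"type": "standard", "stopwords": "_none_"}
        }
    }
}

es = Elasticsearch("http://localhost:9200")
es.indices.create(index="music", settings=settings)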

Adding custom stopwords in R tm

Submitted by 假如想象 on 2019-11-30 11:46:50
I have a Corpus in R using the tm package. I am applying the removeWords function to remove stopwords:

tm_map(abs, removeWords, stopwords("english"))

Is there a way to add my own custom stop words to this list?

stopwords just provides you with a vector of words, so simply combine your own ones with it:

tm_map(abs, removeWords, c(stopwords("english"), "my", "custom", "words"))

Reza Rahimi: Save your custom stop words in a CSV file (e.g. word.csv).

library(tm)
stopwords <- read.csv("word.csv", header = FALSE)
stopwords <- as.character(stopwords$V1)
stopwords <- c(stopwords, stopwords())

Then you can …
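For readers outside R, a hedged Python analogue of the second answer: merge a custom list read from a CSV file with NLTK's built-in English stopwords before filtering. The file name word.csv mirrors the R example, one word per line is assumed, and the stopwords corpus must have been downloaded once with nltk.download("stopwords").

# Sketch: combine custom stopwords from word.csv with NLTK's English list.
import csv
from nltk.corpus import stopwords

with open("word.csv", newline="") as f:
    custom = {row[0].strip().lower() for row in csv.reader(f) if row}

all_stopwords = custom | set(stopwords.words("english"))

tokens = ["this", "is", "my", "corpus"]
print([t for t in tokens if t not in all_stopwords])   # ['corpus'], assuming "corpus" is not in word.csv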