text-mining

negation handling in R, how can I replace a word following a negation in R?

谁说我不能喝 · Submitted on 2020-01-13 20:23:10
Question: I'm doing sentiment analysis for financial articles. To enhance the accuracy of my naive Bayes classifier, I'd like to implement negation handling. Specifically, I want to add the prefix "not_" to the word following a "not" or "n't". So if there's something like this in my corpus: x <- "They didn't sell the company." I want to get the following: "they didn't not_sell the company." (the stopword "didn't" will be removed later) I could find only the gsub() function, but it doesn't seem to work
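
This replacement is a single regular-expression substitution with backreferences. A minimal sketch of the idea in Python (the function name mark_negation is my own; the pattern itself is Perl-compatible, so it could also be supplied to R's gsub() with perl = TRUE):

```python
import re

def mark_negation(text):
    # Prefix the word that follows "not", or a word ending in "n't",
    # with "not_". \1 keeps the negation word, \2 is the word after it.
    return re.sub(r"\b(not|\w+n't)\s+(\w+)", r"\1 not_\2", text,
                  flags=re.IGNORECASE)

print(mark_negation("They didn't sell the company."))
# They didn't not_sell the company.
```

Note this only marks the single word immediately after the negation; marking everything up to the next punctuation mark would need a wider second capture group.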

Lemmatization using a txt file with lemmas in R

↘锁芯ラ · Submitted on 2020-01-13 06:42:25
Question: I would like to use an external txt file with Polish lemmas, structured as follows (source of lemmas for many other languages: http://www.lexiconista.com/datasets/lemmatization/): Abadan Abadanem Abadan Abadanie Abadan Abadanowi Abadan Abadanu abadańczyk abadańczycy abadańczyk abadańczyka abadańczyk abadańczykach abadańczyk abadańczykami abadańczyk abadańczyki abadańczyk abadańczykiem abadańczyk abadańczykom abadańczyk abadańczyków abadańczyk abadańczykowi abadańczyk abadańczyku abadanka abadance
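
Dictionary-based lemmatization of this kind is a lookup table from inflected form to lemma. A sketch in Python, assuming the file holds one tab-separated "lemma, inflected form" pair per line (lemma first, as in the excerpt above; function names are my own):

```python
def load_lemmas(path):
    # Build a form -> lemma dictionary from a two-column file.
    # A tab separator and lemma-first column order are assumed.
    lemmas = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                lemma, form = parts
                lemmas[form] = lemma
    return lemmas

def lemmatize(tokens, lemmas):
    # Replace each token by its lemma; unknown tokens pass through unchanged.
    return [lemmas.get(t, t) for t in tokens]
```

If the columns in your copy of the file are reversed, swap lemma and form in the unpacking.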

Parse GATE Document to get Co-Reference Text

拈花ヽ惹草 · Submitted on 2020-01-13 06:04:28
Question: I'm creating a GATE app which is used to find co-reference text. It works fine, and I have created a zipped file of the app via the export option provided in GATE. Now I'm trying to use the same in my Java code. Gate.runInSandbox(true); Gate.setGateHome(new File(gateHome)); Gate.setPluginsHome(new File(gateHome, "plugins")); Gate.init(); URL applicationURL = new URL("file:" + new Path(gateHome, "application.xgapp").toString()); application = (CorpusController) PersistenceManager.loadObjectFromUrl

How to select stop words using tf-idf? (non english corpus)

戏子无情 · Submitted on 2020-01-11 20:01:10
Question: I have managed to evaluate the tf-idf function for a given corpus. How can I find the stopwords and the best words for each document? I understand that a low tf-idf for a given word and document means that it is not a good word for selecting that document. Answer 1: Stop-words are those words that appear very commonly across the documents, therefore losing their representativeness. The best way to observe this is to measure the number of documents a term appears in and filter those that appear in
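
The answer's criterion, document frequency, can be sketched as follows; the 0.8 threshold is an arbitrary example value, and the function name is my own:

```python
from collections import Counter

def stopword_candidates(docs, max_df=0.8):
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    # Terms present in more than max_df of all documents have a low idf
    # and little discriminative power, so they are stop-word candidates.
    n = len(docs)
    return {term for term, count in df.items() if count / n > max_df}

print(stopword_candidates(["the cat sat", "the dog ran", "the bird flew"]))
# {'the'}
```

Conversely, the best words for a given document are those with the highest tf-idf within that document.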

Remove all punctuation from text including apostrophes for tm package

谁说我不能喝 · Submitted on 2020-01-11 11:26:30
Question: I have a vector consisting of Tweets (just the message text) that I am cleaning for text-mining purposes. I have used removePunctuation from the tm package like so: clean_tweet_text = removePunctuation(tweet_text) This has resulted in a vector with all punctuation removed from the text except apostrophes, which ruins my keyword searches because words touching apostrophes are not registered. For example, one of my keywords is climate, but if a tweet has 'climate it won't be counted. How can
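
When apostrophes survive punctuation removal in tweets, a common cause is curly Unicode quotes ('\u2019'), which ASCII punctuation classes miss. One robust approach, sketched here in Python, is to delete every character that is neither a word character nor whitespace (the function name is my own):

```python
import re

def strip_punct(texts):
    # Remove every character that is not a word character or whitespace.
    # This also catches curly Unicode apostrophes (\u2019) that plain
    # ASCII punctuation patterns can leave behind.
    return [re.sub(r"[^\w\s]", "", t) for t in texts]

print(strip_punct(["We need \u2019climate action!"]))
# ['We need climate action']
```

The same negated-class idea can be expressed as a gsub() pattern in R if you prefer to stay inside the tm pipeline.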

CPU-and-memory efficient NGram extraction with R

不问归期 · Submitted on 2020-01-11 07:05:51
Question: I wrote an algorithm which extracts NGrams (bigrams, trigrams, ... up to 5-grams) from a list of 50,000 street addresses. My goal is to have, for each address, a boolean vector representing whether the NGrams are present or not in the address. Therefore each address will be characterized by a vector of attributes, and then I can carry out a clustering on the addresses. The algorithm works this way: I start with the bi-grams and calculate all the combinations of (a-z and 0-9 and / and tabulation): for
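
Enumerating every combination of the alphabet grows exponentially with n (38^5 is already tens of millions of 5-grams), so it is usually far cheaper to collect only the n-grams that actually occur in the data. A sketch of that approach (function names are my own):

```python
from itertools import chain

def char_ngrams(text, n_min=2, n_max=5):
    # All character n-grams of lengths n_min..n_max that occur in the string.
    return {text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)}

def ngram_matrix(addresses, n_min=2, n_max=5):
    # Boolean presence vector per address over the union of observed n-grams.
    vocab = sorted(set(chain.from_iterable(
        char_ngrams(a, n_min, n_max) for a in addresses)))
    index = {g: j for j, g in enumerate(vocab)}
    rows = []
    for a in addresses:
        row = [False] * len(vocab)
        for g in char_ngrams(a, n_min, n_max):
            row[index[g]] = True
        rows.append(row)
    return vocab, rows
```

For 50,000 addresses the matrix is very sparse, so a sparse representation (storing only the True positions per row) saves most of the memory.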

converting a text corpus to a text document with vocabulary_id and respective tfidf score

我是研究僧i · Submitted on 2020-01-07 05:45:10
Question: I have a text corpus with, say, 5 documents; the documents are separated from each other by \n. I want to assign an id to every word in each document and calculate its respective tfidf score. For example, suppose we have a text corpus named "corpus.txt" as follows: "Stack over flow text vectorization scikit python scipy sparse csr" while calculating the tfidf using mylist =list("corpus.text") vectorizer= CountVectorizer x_counts = vectorizer_train.fit_transform(mylist) tfidf_transformer =
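
The intended computation (a word id per vocabulary entry plus a tf-idf score per document and word) can be sketched in plain Python without scikit-learn; the idf formula below, log(N/df) + 1, is one common variant and is an assumption here, since scikit-learn's smoothed default differs slightly:

```python
import math
from collections import Counter

def tfidf_table(docs):
    # Assign an id to every word and compute tf-idf per (document, word).
    tokenized = [d.lower().split() for d in docs]
    vocab = {}
    for toks in tokenized:
        for t in toks:
            vocab.setdefault(t, len(vocab))
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    table = []
    for doc_id, toks in enumerate(tokenized):
        for t, tf in Counter(toks).items():
            # idf = log(N / df) + 1, so terms in every document still score > 0.
            table.append((doc_id, vocab[t], t, tf * (math.log(n / df[t]) + 1.0)))
    return vocab, table
```

Note that the snippet quoted in the question never instantiates the vectorizer (CountVectorizer needs parentheses) and mixes the names vectorizer and vectorizer_train, which is likely why it fails.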