text-mining

negation handling in R, how can I replace a word following a negation in R?

谁说我不能喝 · Submitted on 2020-01-13 20:23:10
Question: I'm doing sentiment analysis for financial articles. To enhance the accuracy of my naive Bayes classifier, I'd like to implement negation handling. Specifically, I want to add the prefix "not_" to the word following a "not" or "n't". So if there's something like this in my corpus: x <- "They didn't sell the company." I want to get the following: "they didn't not_sell the company." (the stopword "didn't" will be removed later) I could find only the gsub() function, but it doesn't seem to work
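
This replacement is a single regular-expression substitution with backreferences. A minimal sketch of the idea in Python (the function name mark_negation is my own; the pattern itself is Perl-compatible, so it could also be supplied to R's gsub() with perl = TRUE):

```python
import re

def mark_negation(text):
    # Prefix the word that follows "not", or a word ending in "n't",
    # with "not_". \1 keeps the negation word, \2 is the word after it.
    return re.sub(r"\b(not|\w+n't)\s+(\w+)", r"\1 not_\2", text,
                  flags=re.IGNORECASE)

print(mark_negation("They didn't sell the company."))
# They didn't not_sell the company.
```

Note this only marks the single word immediately after the negation; marking everything up to the next punctuation mark would need a wider second capture group.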

Lemmatization using a txt file with lemmas in R

↘锁芯ラ · Submitted on 2020-01-13 06:42:25
Question: I would like to use an external txt file with Polish lemmas, structured as follows (source of lemmas for many other languages: http://www.lexiconista.com/datasets/lemmatization/): Abadan Abadanem Abadan Abadanie Abadan Abadanowi Abadan Abadanu abadańczyk abadańczycy abadańczyk abadańczyka abadańczyk abadańczykach abadańczyk abadańczykami abadańczyk abadańczyki abadańczyk abadańczykiem abadańczyk abadańczykom abadańczyk abadańczyków abadańczyk abadańczykowi abadańczyk abadańczyku abadanka abadance
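
Dictionary-based lemmatization of this kind is a lookup table from inflected form to lemma. A sketch in Python, assuming the file holds one tab-separated "lemma, inflected form" pair per line (lemma first, as in the excerpt above; function names are my own):

```python
def load_lemmas(path):
    # Build a form -> lemma dictionary from a two-column file.
    # A tab separator and lemma-first column order are assumed.
    lemmas = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                lemma, form = parts
                lemmas[form] = lemma
    return lemmas

def lemmatize(tokens, lemmas):
    # Replace each token by its lemma; unknown tokens pass through unchanged.
    return [lemmas.get(t, t) for t in tokens]
```

If the columns in your copy of the file are reversed, swap lemma and form in the unpacking.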

Parse GATE Document to get Co-Reference Text

拈花ヽ惹草 · Submitted on 2020-01-13 06:04:28
Question: I'm creating a GATE app which is used to find co-reference text. It works fine, and I have created a zipped file of the app via the export option provided in GATE. Now I'm trying to use the same in my Java code. Gate.runInSandbox(true); Gate.setGateHome(new File(gateHome)); Gate.setPluginsHome(new File(gateHome, "plugins")); Gate.init(); URL applicationURL = new URL("file:" + new Path(gateHome, "application.xgapp").toString()); application = (CorpusController) PersistenceManager.loadObjectFromUrl

How to select stop words using tf-idf? (non english corpus)

戏子无情 · Submitted on 2020-01-11 20:01:10
Question: I have managed to evaluate the tf-idf function for a given corpus. How can I find the stopwords and the best words for each document? I understand that a low tf-idf for a given word and document means that it is not a good word for selecting that document. Answer 1: Stop-words are those words that appear very commonly across the documents, therefore losing their representativeness. The best way to observe this is to measure the number of documents a term appears in and filter those that appear in
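
The answer's criterion, document frequency, can be sketched as follows; the 0.8 threshold is an arbitrary example value, and the function name is my own:

```python
from collections import Counter

def stopword_candidates(docs, max_df=0.8):
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    # Terms present in more than max_df of all documents have a low idf
    # and little discriminative power, so they are stop-word candidates.
    n = len(docs)
    return {term for term, count in df.items() if count / n > max_df}

print(stopword_candidates(["the cat sat", "the dog ran", "the bird flew"]))
# {'the'}
```

Conversely, the best words for a given document are those with the highest tf-idf within that document.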

Remove all punctuation from text including apostrophes for tm package

谁说我不能喝 · Submitted on 2020-01-11 11:26:30
Question: I have a vector consisting of Tweets (just the message text) that I am cleaning for text-mining purposes. I have used removePunctuation from the tm package like so: clean_tweet_text = removePunctuation(tweet_text) This has resulted in a vector with all punctuation removed from the text except apostrophes, which ruins my keyword searches because words touching apostrophes are not registered. For example, one of my keywords is climate, but if a tweet has 'climate it won't be counted. How can
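
When apostrophes survive punctuation removal in tweets, a common cause is curly Unicode quotes ('\u2019'), which ASCII punctuation classes miss. One robust approach, sketched here in Python, is to delete every character that is neither a word character nor whitespace (the function name is my own):

```python
import re

def strip_punct(texts):
    # Remove every character that is not a word character or whitespace.
    # This also catches curly Unicode apostrophes (\u2019) that plain
    # ASCII punctuation patterns can leave behind.
    return [re.sub(r"[^\w\s]", "", t) for t in texts]

print(strip_punct(["We need \u2019climate action!"]))
# ['We need climate action']
```

The same negated-class idea can be expressed as a gsub() pattern in R if you prefer to stay inside the tm pipeline.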

CPU-and-memory efficient NGram extraction with R

不问归期 · Submitted on 2020-01-11 07:05:51
Question: I wrote an algorithm which extracts NGrams (bigrams, trigrams, ... up to 5-grams) from a list of 50,000 street addresses. My goal is to have, for each address, a boolean vector representing whether the NGrams are present or not in the address. Therefore each address will be characterized by a vector of attributes, and then I can carry out a clustering on the addresses. The algorithm works this way: I start with the bi-grams and calculate all the combinations of (a-z and 0-9 and / and tabulation): for
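
Enumerating every combination of the alphabet grows exponentially with n (38^5 is already tens of millions of 5-grams), so it is usually far cheaper to collect only the n-grams that actually occur in the data. A sketch of that approach (function names are my own):

```python
from itertools import chain

def char_ngrams(text, n_min=2, n_max=5):
    # All character n-grams of lengths n_min..n_max that occur in the string.
    return {text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)}

def ngram_matrix(addresses, n_min=2, n_max=5):
    # Boolean presence vector per address over the union of observed n-grams.
    vocab = sorted(set(chain.from_iterable(
        char_ngrams(a, n_min, n_max) for a in addresses)))
    index = {g: j for j, g in enumerate(vocab)}
    rows = []
    for a in addresses:
        row = [False] * len(vocab)
        for g in char_ngrams(a, n_min, n_max):
            row[index[g]] = True
        rows.append(row)
    return vocab, rows
```

For 50,000 addresses the matrix is very sparse, so a sparse representation (storing only the True positions per row) saves most of the memory.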

converting a text corpus to a text document with vocabulary_id and respective tfidf score

我是研究僧i · Submitted on 2020-01-07 05:45:10
Question: I have a text corpus with, say, 5 documents; the documents are separated from each other by \n. I want to assign an id to every word in each document and calculate its respective tfidf score. For example, suppose we have a text corpus named "corpus.txt" as follows: "Stack over flow text vectorization scikit python scipy sparse csr" while calculating the tfidf using mylist =list("corpus.text") vectorizer= CountVectorizer x_counts = vectorizer_train.fit_transform(mylist) tfidf_transformer =
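
The intended computation (a word id per vocabulary entry plus a tf-idf score per document and word) can be sketched in plain Python without scikit-learn; the idf formula below, log(N/df) + 1, is one common variant and is an assumption here, since scikit-learn's smoothed default differs slightly:

```python
import math
from collections import Counter

def tfidf_table(docs):
    # Assign an id to every word and compute tf-idf per (document, word).
    tokenized = [d.lower().split() for d in docs]
    vocab = {}
    for toks in tokenized:
        for t in toks:
            vocab.setdefault(t, len(vocab))
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    table = []
    for doc_id, toks in enumerate(tokenized):
        for t, tf in Counter(toks).items():
            # idf = log(N / df) + 1, so terms in every document still score > 0.
            table.append((doc_id, vocab[t], t, tf * (math.log(n / df[t]) + 1.0)))
    return vocab, table
```

Note that the snippet quoted in the question never instantiates the vectorizer (CountVectorizer needs parentheses) and mixes the names vectorizer and vectorizer_train, which is likely why it fails.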