text-mining

How to filter meta data by user-defined statements in R?

Submitted by 一曲冷凌霜 on 2019-12-13 01:27:05

Question: There is a function called sFilter in R for filtering metadata, but it belongs to an old version of the tm package (0.5-10). Is there a replacement for it in the current version? My code block is:

    query <- "LEWISSPLIT == 'TRAIN'"
    trainData <- tm_filter(Corpus, FUN = sFilter, query)

It means: get the documents whose LEWISSPLIT attribute has the value "TRAIN".

    <REUTERS TOPICS=?? LEWISSPLIT=?? CGISPLIT=?? OLDID=?? NEWID=??>

Answer 1: Just write your own filtering function: trainData <- tm_filter
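A minimal sketch of such a filter (the answer above is cut off), assuming tm >= 0.6, where per-document metadata is read with meta(); the tag name "LEWISSPLIT" is taken from the question and may be spelled differently in your corpus:

    library(tm)

    # Keep only the documents whose LEWISSPLIT metadata tag equals "TRAIN".
    # Check names(meta(Corpus[[1]])) for the tags actually present.
    trainData <- tm_filter(Corpus, FUN = function(doc) {
      isTRUE(meta(doc, "LEWISSPLIT") == "TRAIN")
    })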

Text mining with tm.plugin.webmining package using GoogleFinanceSource function

Submitted by 岁酱吖の on 2019-12-12 18:49:07

Question: I am studying text mining with the online book http://tidytextmining.com/. The fifth chapter (http://tidytextmining.com/dtm.html#financial) contains the following code:

    library(tm.plugin.webmining)
    library(purrr)

    company <- c("Microsoft", "Apple", "Google", "Amazon", "Facebook",
                 "Twitter", "IBM", "Yahoo", "Netflix")
    symbol <- c("MSFT", "AAPL", "GOOG", "AMZN", "FB", "TWTR", "IBM", "YHOO", "NFLX")

    download_articles <- function(symbol) {
      WebCorpus(GoogleFinanceSource(paste0("NASDAQ:", symbol)))
    }

    stock
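The question is cut off above; in the book, download_articles() is then mapped over the symbols to build a stock_articles table. A sketch of that mapping (not from the original post), wrapped in purrr::possibly() so that one failing symbol does not abort the whole run, assuming the Google Finance source still responds at all:

    library(dplyr)

    # Failures yield NULL instead of stopping the map with an error.
    safe_download <- possibly(download_articles, otherwise = NULL)

    stock_articles <- tibble(company = company, symbol = symbol) %>%
      mutate(corpus = map(symbol, safe_download))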

replacement of words in strings

Submitted by 家住魔仙堡 on 2019-12-12 17:43:26

Question: I have a list of phrases in which I want to replace certain words with a similar word in case they are misspelled. How can I search a string for a matching word and replace it? The expected result is shown in the following example:

    a1 <- c(" the classroom is ful ")
    a2 <- c(" full")

In this case I would replace "ful" with "full" in a1.

Answer 1: Take a look at the hunspell package. As the comments have already suggested, your problem is much more difficult than it seems, unless you already have a dictionary of
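A sketch of the hunspell approach (not from the truncated answer above); note that the first suggestion returned by hunspell_suggest() is not guaranteed to be the intended word:

    library(hunspell)

    a1 <- c(" the classroom is ful ")

    words <- strsplit(trimws(a1), "\\s+")[[1]]
    bad <- !hunspell_check(words)

    # Replace each misspelled word with hunspell's top suggestion,
    # keeping the original word when no suggestion is returned.
    words[bad] <- mapply(function(w, s) if (length(s)) s[1] else w,
                         words[bad], hunspell_suggest(words[bad]))
    paste(words, collapse = " ")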

Removing all “H” within the strings, EXCEPT the ones including “CH”

Submitted by 旧巷老猫 on 2019-12-12 17:05:46

Question: I am trying to remove all "H"s within the strings, EXCEPT the ones that are part of "CH", as in the following example:

    strings <- c("Cash", "Wishes", "Chain", "Chip", "Check")

I found that the code below removes every "H", with no exception for "CH":

    data <- gsub("H", "", strings)

Answer 1: You can do this with a negative lookbehind:

    gsub("(?<!c)h", "", strings, perl = TRUE, ignore.case = TRUE)

Source: https://stackoverflow.com/questions/47538826/removing-all-h-within-the-strings-except-the-ones-including-ch
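For reference, on the example vector the lookbehind pattern yields:

    gsub("(?<!c)h", "", strings, perl = TRUE, ignore.case = TRUE)
    # [1] "Cas"   "Wises" "Chain" "Chip"  "Check"

The lookbehind (?<!c) rejects any "h" immediately preceded by "c"; with ignore.case = TRUE both the letter and its lookbehind are matched case-insensitively.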

tm.package: findAssocs vs Cosine

Submitted by Deadly on 2019-12-12 12:09:17

Question: I'm new here, and my question is mathematical rather than programming in nature; I would like a second opinion on whether my approach makes sense. I was trying to find associations between words in my corpus using findAssocs() from the tm package. Even though it appears to perform reasonably well on the data available through the package, such as the New York Times and US Congress corpora, I was disappointed with its performance on my own, less tidy dataset. It appears to be
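For comparison, a sketch (not from the original post) of cosine similarity between one term and all the others in a DocumentTermMatrix; findAssocs() itself reports Pearson correlations between term vectors. The object dtm and the term "price" are hypothetical:

    library(tm)

    m <- as.matrix(dtm)    # dense conversion; fine for small matrices

    # Cosine similarity of one term's document vector against every column.
    cosine_with <- function(m, term) {
      v <- m[, term]
      sims <- apply(m, 2, function(w)
        sum(v * w) / (sqrt(sum(v^2)) * sqrt(sum(w^2))))
      sort(sims, decreasing = TRUE)
    }

    head(cosine_with(m, "price"))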

n-grams from text in PostgreSQL

Submitted by 柔情痞子 on 2019-12-12 11:34:32

Question: I am looking to create n-grams from a text column in PostgreSQL. I currently split the data (sentences) in the text column on whitespace into an array:

    SELECT regexp_split_to_array(sentenceData, E'\\s+')
    FROM tableName;

Once I have this array, how do I go about creating a loop to find the n-grams and writing each one to a row in another table? Using unnest I can obtain all the elements of all the arrays on separate rows, and maybe I can then think of a way to get n-grams from a single column, but
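A sketch (not from the original post) that builds bigrams without a loop, using generate_subscripts() to index adjacent array elements; the table and column names are the hypothetical ones from the question, and widening the slice w[i:i+1] generalizes to other n:

    -- One bigram per output row.
    WITH words AS (
      SELECT regexp_split_to_array(sentenceData, E'\\s+') AS w
      FROM tableName
    )
    SELECT array_to_string(w[i:i+1], ' ') AS bigram
    FROM words,
         generate_subscripts(w, 1) AS i
    WHERE i + 1 <= array_length(w, 1);

Prefixing the SELECT with INSERT INTO some_ngrams_table (bigram) writes the rows to another table.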

text mining with tm package in R, remove words starting with [http] or any other specific word

Submitted by 一世执手 on 2019-12-12 07:28:44

Question: I am new to R and text mining. I made a word cloud out of a Twitter feed related to some term. The problem I'm facing is that the word cloud shows "http:..." or "htt...". How do I deal with this issue? I tried using the metacharacter *, but I doubt I'm applying it right:

    tw.text = removeWords(tw.text, c(stopwords("en"), "rt", "http\\*"))

Somebody into text mining, please help me with this.

Answer 1: If you are looking to remove URLs from your string, you may use:

    gsub("(f|ht)tp(s?)://(.*)[.
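The answer's pattern is cut off above. A simpler sketch (not from the original answer) that drops whole http/https tokens before the stop-word pass:

    library(tm)

    # \\S+ consumes everything up to the next whitespace, so the whole URL
    # token is removed, not just its "http" prefix.
    tw.text <- gsub("http\\S+", "", tw.text)
    tw.text <- removeWords(tw.text, c(stopwords("en"), "rt"))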

removing stop words without using nltk corpus

Submitted by 倖福魔咒の on 2019-12-12 04:56:18

Question: I am trying to remove stop words from a text file without using nltk. I have three text files, f1, f2, and f3: f1 has text line by line, f2 has a list of stop words, and f3 is an empty file. I want to read f1 line by line, and in turn word by word, and check whether each word is in f2 (the stop words). If a word is not a stop word, I write it to f3. At the end, f3 should contain the text of f1, but with the stop words from f2 removed from each line.

    f1 = open("file1.txt","r")
    f2 = open("stop
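A sketch of that loop (not from the truncated post above); the stop-word file name is cut off, so "file2.txt" is a placeholder:

    # Load the stop words once into a set for fast membership tests.
    with open("file2.txt") as f2:    # placeholder name; the post's is truncated
        stop_words = set(f2.read().split())

    # Copy file1.txt to file3.txt, dropping stop words line by line.
    with open("file1.txt") as f1, open("file3.txt", "w") as f3:
        for line in f1:
            kept = [w for w in line.split() if w not in stop_words]
            f3.write(" ".join(kept) + "\n")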

How to do text classification with label probabilities?

Submitted by 試著忘記壹切 on 2019-12-12 04:48:03

Question: I'm trying to solve a text classification problem for academic purposes. I need to classify tweets into labels like "cloud", "cold", "dry", "hot", "humid", "hurricane", "ice", "rain", "snow", "storms", "wind" and "other". Each tweet in the training data has probabilities for all of the labels. Say the message "Can already tell it's going to be a tough scoring day. It's as windy right now as it was yesterday afternoon." has a 21% chance of being "hot" and a 79% chance of being "wind". I have worked on the
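One common way to handle such soft labels with an ordinary classifier is to expand each tweet into one row per label, weighted by that label's probability. A sketch (not from the original post) using nnet::multinom in R; the objects features (a data frame of predictors) and probs (an n-by-k matrix whose column names are the labels) are assumptions:

    library(nnet)

    n <- nrow(probs)
    k <- ncol(probs)

    # One row per (tweet, label) pair, weighted by the label's probability.
    expanded <- features[rep(seq_len(n), each = k), , drop = FALSE]
    expanded$label <- factor(rep(colnames(probs), times = n))
    w <- as.vector(t(probs))    # probabilities in matching row order

    fit <- multinom(label ~ ., data = expanded, weights = w)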

R: compilation failed for package 'slam'

Submitted by 为君一笑 on 2019-12-12 04:29:35

Question: I am using R 2.15.2 and I want to install the tm package to do some text analysis. I downloaded a compatible tm package, tm_0.5-9, from the CRAN archives. I tried to install it with

    install.packages("/Downloads/tm_0.5-9.tar.gz", repos = NULL, type = "source", dependencies = TRUE)

and got the following error:

    Installing package(s) into '/Documents/R/win-library/2.15'
    (as 'lib' is unspecified)
    ERROR: dependency 'slam' is not available for package 'tm'
    * removing '/Documents/R
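A sketch of the usual fix (not from the original post): install a slam version contemporary with R 2.15.x from the CRAN archive first, then retry tm. The version number 0.1-32 is an assumption; pick one whose release date matches your R version, and note that compiling from source on Windows requires Rtools:

    # Install slam from the CRAN archive, then retry tm.
    download.file("https://cran.r-project.org/src/contrib/Archive/slam/slam_0.1-32.tar.gz",
                  destfile = "slam_0.1-32.tar.gz")   # 0.1-32 is an assumed version
    install.packages("slam_0.1-32.tar.gz", repos = NULL, type = "source")
    install.packages("/Downloads/tm_0.5-9.tar.gz", repos = NULL, type = "source")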