text-mining

Quanteda - Extracting identified dictionary words

回眸只為那壹抹淺笑 Submitted on 2019-12-12 04:24:04
Question: I am trying to extract the identified dictionary words from a quanteda dfm, but have been unable to find a solution. Does someone have a solution for this?

Sample input:

dict <- dictionary(list(season = c("spring", "summer", "fall", "winter")))
dfm <- dfm("summer is great", dictionary = dict)

Output:

> dfm
Document-feature matrix of: 1 document, 1 feature.
1 x 1 sparse Matrix of class "dfmSparse"
       features
docs    season
  text1      1

I now know that a seasonality dict word has been identified in the
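A minimal sketch of one way to recover the matched words, assuming a recent quanteda where lookups are done on tokens: `tokens_select()` with the dictionary as the pattern keeps only the tokens that the dictionary matched, and `kwic()` shows each match in context.

```r
library(quanteda)

dict <- dictionary(list(season = c("spring", "summer", "fall", "winter")))
toks <- tokens("summer is great")

# Keep only the tokens that match the dictionary; the surviving tokens
# are exactly the dictionary words identified in the text.
matched <- tokens_select(toks, pattern = dict, selection = "keep")
as.list(matched)

# kwic() shows each dictionary hit in its surrounding context.
kwic(toks, pattern = dict)
```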

set encoding for reading text files into tm Corpora

家住魔仙堡 Submitted on 2019-12-12 03:45:42
Question: When loading a bunch of documents with tm's Corpus I need to specify the encoding. All documents are UTF-8 encoded. If opened via a text editor the content is OK, but the corpus contents are full of strange symbols (indicioâ., ‘sœs....). The source text is in Spanish (es_ES).

library(tm)
cname <- file.path("C:", "Users", "john", "Documents", "texts")
docs <- Corpus(DirSource(cname), encoding = "UTF-8")
> Error in Corpus(DirSource(cname), encoding = "UTF-8") :
  unused argument (encoding = "UTF-8")

EDITED: Getting str
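A sketch of the likely fix: in current tm, `encoding` is an argument of `DirSource()`, not of `Corpus()`, which is why the call above fails with "unused argument".

```r
library(tm)

cname <- file.path("C:", "Users", "john", "Documents", "texts")

# Pass the encoding to DirSource(), and the language via readerControl.
docs <- VCorpus(DirSource(cname, encoding = "UTF-8"),
                readerControl = list(language = "es"))
```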

How to apply grepl for data frame

99封情书 Submitted on 2019-12-12 03:38:10
Question: I want to use grepl with multiple patterns defined in a data frame. df_sen contains sentences:

"She would like to go there"
"I had it few days ago"
"We have spent few millions"

df_triggers is as follows:

trigger
few days
few millions

I want to create a matrix of sentences x triggers, with 1 at the intersection if the trigger was found in the sentence and 0 if it was not. I have tried to do it like this:

matrix <- grepl(df_triggers$trigger, df_sen$sentence)

But I see the
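A sketch of one way to get the sentence-by-trigger matrix: `grepl()` only accepts a single pattern, so loop over the triggers, letting each `sapply()` iteration produce one column. The data frames here are reconstructed from the question.

```r
# Data reconstructed from the question
df_sen <- data.frame(sentence = c("She would like to go there",
                                  "I had it few days ago",
                                  "We have spent few millions"),
                     stringsAsFactors = FALSE)
df_triggers <- data.frame(trigger = c("few days", "few millions"),
                          stringsAsFactors = FALSE)

# One column per trigger; fixed = TRUE treats triggers as literal text.
m <- sapply(df_triggers$trigger,
            function(p) as.integer(grepl(p, df_sen$sentence, fixed = TRUE)))
rownames(m) <- df_sen$sentence
m
```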

randomForest in R object not found error

江枫思渺然 Submitted on 2019-12-12 03:24:27
Question:

# init
libs <- c("tm", "plyr", "class", "RTextTools", "randomForest")
lapply(libs, require, character.only = TRUE)

# set options
options(stringsAsFactors = FALSE)

# set parameters
labels <- read.table('labels.txt')
path <- paste(getwd(), "/data", sep = "")

# clean text
cleanCorpus <- function(corpus) {
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, removeNumbers)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, content

Splitting strings in R

我与影子孤独终老i Submitted on 2019-12-12 00:53:07
Question: I have the following line:

x <- "CUST_Id_8Name:Mr.Praveen KumarDOB:Mother's Name:Contact Num:Email address:Owns Car:Products held with Bank:Company Name:Salary per. month:Background:"

I want to extract "CUST_Id_8", "Mr. Praveen Kumar", and anything written after DOB:, Mother's Name:, Contact Num:, and so on, stored in variables like Customer Id, Name, DOB, and so on. Please help. I used

strsplit(x, ":")

But the result is a list containing the texts, and I need blanks if there is nothing after the
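A sketch of one approach: instead of splitting on ":" alone (which merges each value with the next label), split in front of each known label, so that empty fields survive as blank strings. The label list is taken from the string in the question.

```r
x <- "CUST_Id_8Name:Mr.Praveen KumarDOB:Mother's Name:Contact Num:Email address:Owns Car:Products held with Bank:Company Name:Salary per. month:Background:"

labels <- c("Name", "DOB", "Mother's Name", "Contact Num", "Email address",
            "Owns Car", "Products held with Bank", "Company Name",
            "Salary per. month", "Background")

# Build an alternation of the literal labels (escaping the ".") and split
# on "<label>:"; consecutive labels then yield empty strings.
pattern <- paste0("(", paste(gsub("\\.", "\\\\.", labels), collapse = "|"), "):")
parts <- strsplit(x, pattern)[[1]]

parts  # parts[1] is the customer id; the rest align with `labels`
       # (note strsplit drops a trailing blank field)
```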

How to scan values for each word based on tables, calculate them, and build the VSM (Vector Space Model) from them

只愿长相守 Submitted on 2019-12-12 00:07:40
Question: Say that I have a table that contains probabilities for each word from another table. This table has 2 classes, actual and non_actual. I will name it master_table:

actual = [0.5;0.4;0.6;0.75;0.23;0.96;0.532]; % the probabilities sum to 1: actual + non_actual = 1
non_actual = [0.5;0.6;0.4;0.25;0.77;0.04;0.468];
words = {'finn';'jake';'iceking';'marceline';'shelby';'bmo';'naptr'};
master_table = table(actual,non_actual,...
    'RowNames',words)

And then I have a table that contains sentences. I

Getting Text From Tweets

江枫思渺然 Submitted on 2019-12-11 20:36:35
Question: I am trying to read my tweets from a CSV file (which I downloaded previously), and I am having some problems:

sia.list <- searchTwitter('#singaporeair', n=10, since=NULL, until=NULL, cainfo="cacert.pem")
sia.df = twListToDF(sia.list)
write.csv(sia.df, file='C:/temp/siaTweets.csv', row.names=F)

I am trying to extract the text from the list, and the problem is with the third line below:

sia.df <- read.csv(file=paste(path,"siaTweets.csv",sep=""))
sia.list <- as.list(t(sia.df))
sia_txt =
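A sketch of the usual way to get the tweet text back: `twListToDF()` stores the message in a column named `text`, so after reading the CSV it can be pulled out directly, with no need to transpose the data frame.

```r
# Read the saved tweets back and take the text column produced by
# twListToDF(); the result is a character vector, one element per tweet.
sia.df <- read.csv("C:/temp/siaTweets.csv", stringsAsFactors = FALSE)
sia_txt <- sia.df$text
```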

Extracting the POS tags in R using openNLP

匆匆过客 Submitted on 2019-12-11 17:28:41
Question: In my dataset I am trying to create variables containing the number of nouns, verbs, and adjectives, respectively, for each observation. Using the openNLP package I have managed to get this far:

s <- paste(c("Pierre Vinken, 61 years old, will join the board as a ",
             "nonexecutive director Nov. 29.\n",
             "Mr. Vinken is chairman of Elsevier N.V., ",
             "the Dutch publishing group."), collapse = "")
s <- as.String(s)
s
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent
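A sketch completing the standard openNLP recipe: run the sentence and word annotators, add the POS annotator on top, then count tags by their Penn Treebank prefixes (NN* nouns, VB* verbs, JJ* adjectives).

```r
library(NLP)
library(openNLP)

s <- as.String(paste(c("Pierre Vinken, 61 years old, will join the board as a ",
                       "nonexecutive director Nov. 29.\n",
                       "Mr. Vinken is chairman of Elsevier N.V., ",
                       "the Dutch publishing group."), collapse = ""))

sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
pos_tag_annotator    <- Maxent_POS_Tag_Annotator()

a <- annotate(s, list(sent_token_annotator, word_token_annotator))
a <- annotate(s, pos_tag_annotator, a)

# Pull the POS tag out of each word annotation and count by tag prefix.
w <- subset(a, type == "word")
tags <- sapply(w$features, `[[`, "POS")
c(nouns      = sum(grepl("^NN", tags)),
  verbs      = sum(grepl("^VB", tags)),
  adjectives = sum(grepl("^JJ", tags)))
```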

How to compute similarity between two sentences (syntactical and semantical)

人走茶凉 Submitted on 2019-12-11 16:55:38
Question: I'm supposed to take two sentences at a time and compute whether they are similar. By similar I mean both syntactically and semantically.

INPUT1: Obama signs the law. A new law is signed by Obama.
INPUT2: A Bus is stopped here. A vehicle stops here.
INPUT3: Fire in NY. NY is burnt down.
INPUT4: Fire in NY. 50 died in NY fire.

I don't want to use an ontology tree as the sole approach. I wrote code to compute the Levenshtein distance (LD) between sentences and then decide if the 2nd sentence: can be ignored
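A sketch of the surface-similarity half of this: base R's `adist()` computes the Levenshtein distance, and normalising by the longer string's length gives a rough 0-1 score. Note this captures only string overlap; paraphrases like INPUT2 or INPUT3 would additionally need a semantic signal (word embeddings, WordNet, etc.), which this sketch does not attempt.

```r
# Normalised Levenshtein similarity: 1 = identical strings, 0 = disjoint.
lev_sim <- function(s1, s2) {
  1 - adist(s1, s2) / max(nchar(s1), nchar(s2))
}

lev_sim("Obama signs the law.", "A new law is signed by Obama.")
lev_sim("Fire in NY.", "NY is burnt down.")  # low despite related meaning
```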

Sentiment analysis R syuzhet NRC Word-Emotion Association Lexicon

半城伤御伤魂 Submitted on 2019-12-11 16:39:13
Question: How do you find the words associated with the eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) of the NRC Word-Emotion Association Lexicon when using get_nrc_sentiment of the syuzhet package?

a <- c("I hate going to work it is dull","I love going to work it is fun")
a_corpus = Corpus(VectorSource(a))
a_tm <- TermDocumentMatrix(a_corpus)
a_tmx <- as.matrix(a_tm)
a_df <- data.frame(text=unlist(sapply(a, `[`)), stringsAsFactors=F)
a_sent <- get_nrc
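One sketch of a way to recover which words drive each emotion score: run `get_nrc_sentiment()` on the individual words rather than the whole sentences, since it returns one row per input element with one column per NRC emotion; a word scoring above 0 in a column is associated with that emotion.

```r
library(syuzhet)

a <- c("I hate going to work it is dull", "I love going to work it is fun")

# Score each individual word instead of each sentence.
words  <- unique(unlist(strsplit(tolower(a), "\\s+")))
scores <- get_nrc_sentiment(words)

# For each of the eight emotions, list the words that score above 0.
emotions <- c("anger", "anticipation", "disgust", "fear",
              "joy", "sadness", "surprise", "trust")
sapply(emotions, function(e) words[scores[[e]] > 0])
```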