corpus

Print first line of one element of Corpus in R using tm package

半世苍凉 submitted on 2019-12-07 09:56:28
How do you print a small sample, or the first line, of a corpus in R using the tm package? I have a very large corpus (> 1 GB) and am doing some text cleaning. I would like to test as I apply cleaning procedures, so printing just the first line, or the first few lines, of a corpus would be ideal.

# Load libraries
library(tm)

# Read in corpus
corp <- SimpleCorpus(DirSource("C:/TextDocument"))

# Remove punctuation
corp <- removePunctuation(corp, preserve_intra_word_contractions = TRUE, preserve_intra_word_dashes = TRUE)

I have tried accessing the corpus several ways: # Print first line of first element of

What is the difference between corpus and lexicon in NLTK (python) [closed]

為{幸葍}努か submitted on 2019-12-07 09:36:33
Question (closed as needing more focus): Can someone tell me the difference between corpora, a corpus and a lexicon in NLTK? What is the movie data set? What is WordNet?

Answer 1: Corpora is the plural of corpus. Corpus basically means a body, and in the context of Natural Language Processing (NLP), it means a body of text
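
A minimal NLTK sketch (not from the original thread) contrasting a corpus, i.e. a body of text such as the Brown corpus or the movie_reviews data set, with a lexicon, i.e. a word list or lexical resource such as WordNet. It assumes the corresponding NLTK data packages (brown, movie_reviews, words, wordnet) have been downloaded:

from nltk.corpus import brown, movie_reviews, words, wordnet

# Corpora: bodies of running text
print(brown.words()[:10])            # first tokens of the Brown corpus
print(movie_reviews.categories())    # the movie data set: ['neg', 'pos'] reviews

# Lexicons: lists of words / lexical resources rather than running text
print('dog' in words.words())        # simple word-list lookup
print(wordnet.synsets('dog'))        # WordNet synsets (dictionary-like entries) for 'dog'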

DocumentTermMatrix wrong counting when using a dictionary

家住魔仙堡 submitted on 2019-12-06 08:38:43
I am trying to do sentiment analysis on Twitter data using the naive Bayes algorithm, looking at 2,000 tweets. After getting the data into RStudio I split and preprocess the data as follows:

train_size = floor(0.75 * nrow(Tweets_Model_Input))
set.seed(123)
train_sub = sample(seq_len(nrow(Tweets_Model_Input)), size = train_size)
Tweets_Model_Input_Train = Tweets_Model_Input[train_sub, ]
Tweets_Model_Input_Test = Tweets_Model_Input[-train_sub, ]
myCorpus = Corpus(VectorSource(Tweets_Model_Input_Train$SentimentText))
myCorpus <- tm_map(myCorpus, removeWords, stopwords(

NLTK - Get and Simplify List of Tags

和自甴很熟 submitted on 2019-12-06 04:52:57
Question: I'm using the Brown Corpus. I want some way to print out all the possible tags and their names (not just tag abbreviations). There are also quite a few tags; is there a way to 'simplify' the tags? By simplify I mean combine two extremely similar tags into one and re-tag the merged words with the other tag?

Answer 1: It's been discussed previously in:
Java Stanford NLP: Part of Speech labels?
Simplifying the French POS Tag Set with NLTK
https://linguistics.stackexchange.com/questions/2249/turn
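
A minimal sketch of both parts using standard NLTK facilities (assumes the brown, tagsets and universal_tagset data packages are installed): nltk.help.brown_tagset() prints each Brown tag with its full name and example words, and requesting tagset='universal' collapses the many fine-grained Brown tags into a dozen coarse ones, which is one ready-made way to "simplify" them:

import nltk
from nltk.corpus import brown

# Print every Brown tag together with its description and example words
nltk.help.brown_tagset()

# Map the fine-grained Brown tags onto the coarse universal tagset
# (NOUN, VERB, ADJ, ...), merging closely related tags
tagged = brown.tagged_words(tagset='universal')
print(tagged[:10])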

What is the use of Brown Corpus in measuring Semantic Similarity based on WordNet

耗尽温柔 submitted on 2019-12-05 17:37:45
I came across several methods for measuring semantic similarity that use the structure and hierarchy of WordNet, e.g. the Jiang and Conrath measure (JCN), the Resnik measure (RES), the Lin measure (LIN), etc. The way they are computed using NLTK is:

sim2 = wn.jcn_similarity(entry1, entry2, brown_ic)
sim3 = entry1.res_similarity(entry2, brown_ic)
sim4 = entry1.lin_similarity(entry2, brown_ic)

If WordNet is the basis of calculating semantic similarity, what is the use of the Brown Corpus here?

Answer (arturomp): Take a look at the explanation in the NLTK howto for WordNet. Specifically, the *_ic notation is information content.
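
A brief sketch (not the original poster's code) of where the Brown corpus comes in: WordNet provides the concept hierarchy, while brown_ic is an information-content table estimated from word frequencies in the Brown corpus, and the IC-based measures need both. It assumes the wordnet and wordnet_ic NLTK data packages are installed:

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Information content estimated from the Brown corpus:
# WordNet supplies the structure, Brown supplies the word probabilities.
brown_ic = wordnet_ic.ic('ic-brown.dat')

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

print(dog.jcn_similarity(cat, brown_ic))   # Jiang-Conrath (JCN)
print(dog.res_similarity(cat, brown_ic))   # Resnik (RES)
print(dog.lin_similarity(cat, brown_ic))   # Lin (LIN)

Using a different corpus for the IC counts (e.g. SemCor via 'ic-semcor.dat') changes the probabilities and therefore the scores, which is exactly the role the Brown corpus plays here.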

Creating a subset of words from a corpus in R

自闭症网瘾萝莉.ら submitted on 2019-12-05 16:43:18
I have a 1,500-row vector created from a Twitter search using the XML package. I have then converted it to a Corpus to be used with the tm package. I ultimately want to create a wordcloud with some (the most frequent) of those words, so I converted it to a TermDocumentMatrix to be able to find terms with a minimum frequency. I create the object "a", which is a list of those terms:

a <- findFreqTerms(mydata.dtm, 10)

The wordcloud package does not work on document matrices. So now, I want to filter the original vector to include only the words included in the "a" object (If I use the object

Fake reviews datasets

浪子不回头ぞ submitted on 2019-12-05 09:16:32
There are datasets with the usual mail spam on the Internet, but I need datasets with fake reviews to conduct some research, and I can't find any. Can anybody advise me on where fake-review datasets can be obtained?

Answer (Myle Ott): Our dataset is available on my Cornell homepage: http://www.cs.cornell.edu/~myleott/

A recent ACL paper, where the authors compiled such a data set: "Finding Deceptive Opinion Spam by Any Stretch of the Imagination" by Myle Ott, Yejin Choi, Claire Cardie, Jeffrey T. Hancock. You might be able to find something in the references. Alternatively, you can mail the authors

Converting NLTK phrase structure trees to BRAT .ann standoff

流过昼夜 submitted on 2019-12-05 05:55:23
Question: I'm trying to annotate a corpus of plain text. I'm working with systemic functional grammar, which is fairly standard in terms of part-of-speech annotation but differs in terms of phrases/chunks. Accordingly, I've POS-tagged my data with the NLTK defaults and made a regex chunker with nltk.RegexpParser. Basically, the output now is an NLTK-style phrase structure tree:

Tree('S', [Tree('Clause', [Tree('Process-dependencies', [Tree('Participant', [('This', 'DT')]), Tree('Verbal-group', [('is',
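
Not from the original thread, but a minimal sketch of one way to get from an NLTK chunk tree (whose leaves are (token, POS) pairs) to BRAT .ann standoff: recover character offsets by locating each token in the raw text left to right, then emit one T line per chunk. The tiny tree and label names below are only illustrative; a real systemic-functional label set would differ.

from nltk.tree import Tree

def tree_to_ann(tree, raw_text):
    # BRAT entity lines look like: T1<TAB>Label start end<TAB>covered text
    offsets, cursor = [], 0
    for token, _pos in tree.leaves():
        start = raw_text.index(token, cursor)   # assumes every token occurs verbatim in the text
        end = start + len(token)
        offsets.append((start, end))
        cursor = end

    ann_lines, tid, leaf_idx = [], 1, 0

    def walk(node):
        nonlocal tid, leaf_idx
        if isinstance(node, Tree):
            first = leaf_idx
            for child in node:
                walk(child)
            last = leaf_idx - 1
            if node.label() != 'S':              # emit chunks, skip the root
                start, end = offsets[first][0], offsets[last][1]
                ann_lines.append(f"T{tid}\t{node.label()} {start} {end}\t{raw_text[start:end]}")
                tid += 1
        else:
            leaf_idx += 1

    walk(tree)
    return ann_lines

text = "This is a test"
tree = Tree('S', [Tree('Participant', [('This', 'DT')]),
                  Tree('Verbal-group', [('is', 'VBZ')]),
                  Tree('Participant', [('a', 'DT'), ('test', 'NN')])])
print("\n".join(tree_to_ann(tree, text)))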

Can I control the way the CountVectorizer vectorizes the corpus in scikit learn?

安稳与你 submitted on 2019-12-04 13:03:02
Question: I am working with a CountVectorizer from scikit-learn, and I'm possibly attempting to do some things that the object was not made for... but I'm not sure. In terms of getting counts for occurrence:

vocabulary = ['hi', 'bye', 'run away!']
corpus = ['run away!']
cv = CountVectorizer(vocabulary=vocabulary)
X = cv.fit_transform(corpus)
print X.toarray()

gives:

[[0 0 0 0]]

What I'm realizing is that the CountVectorizer will break the corpus into what I believe are unigrams:

vocabulary = ['hi', 'bye'
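
A short sketch (not from the original thread) of how to influence that behaviour: by default the analyzer lowercases, strips punctuation via token_pattern, and counts unigrams, so a multi-word vocabulary entry like 'run away!' can never match; widening ngram_range (with a punctuation-free phrase) or supplying a custom analyzer changes how the corpus is cut up. The phrases below are purely illustrative:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['run away!']

# Generate unigrams *and* bigrams so the phrase 'run away' is a countable feature
# (the default token_pattern drops the '!', so the vocabulary entry omits it too)
cv = CountVectorizer(vocabulary=['hi', 'bye', 'run away'], ngram_range=(1, 2))
print(cv.fit_transform(corpus).toarray())   # [[0 0 1]]

# Or take over tokenization entirely with a custom analyzer,
# here treating each whole document as a single feature
cv2 = CountVectorizer(vocabulary=['hi', 'bye', 'run away!'], analyzer=lambda doc: [doc])
print(cv2.fit_transform(corpus).toarray())  # [[0 0 1]]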