corpus

R: inspecting a DocumentTermMatrix results in "Error: Repeated indices currently not allowed"

倾然丶 夕夏残阳落幕 submitted on 2019-12-11 09:29:52
Question: I have the following dummy data:

final6 <- data.frame(docname = paste0("doc", 1:6),
                     articles = c("Catalonia independence in matter of days",
                                  "Anger over Johnson Libya bodies comment",
                                  "Man admits frenzied mum and son murder",
                                  "The headache that changed my life",
                                  "Las Vegas killer sick, demented - Trump",
                                  "Instagram baby photo scammer banned"))

I want to create a DocumentTermMatrix that keeps a reference to the document names (so that I can later link rows back to the original article text). To achieve this, I
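For reference, one way to carry document names into the matrix is to give tm a data frame source; a minimal sketch, assuming the final6 data frame above and a recent tm version (which expects the columns to be named doc_id and text):

library(tm)

df <- data.frame(doc_id = final6$docname,
                 text   = as.character(final6$articles),
                 stringsAsFactors = FALSE)

corp <- VCorpus(DataframeSource(df))  # document IDs are taken from doc_id
dtm  <- DocumentTermMatrix(corp)

Docs(dtm)  # "doc1" ... "doc6", which can be joined back to final6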

R tm: reloading a 'PCorpus' backend filehash database as corpus (e.g. in restarted session/script)

心不动则不痛 submitted on 2019-12-10 15:58:14
Question: Having learned loads from answers on this site (thanks!), it's finally time to ask my own question. I'm using R (the tm and lsa packages) to create, clean, simplify, and then run LSA (latent semantic analysis) on a corpus of about 15,000 text documents. I'm doing this in R 3.0.0 under Mac OS X 10.6. For efficiency (and to cope with having too little RAM), I've been trying to use either the 'PCorpus' option in tm (backend database support provided by the 'filehash' package), or the newer 'tm
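For context, a minimal sketch of creating such an on-disk corpus (the directory name and language below are placeholders, not from the question):

library(tm)

# documents are kept in a filehash database on disk rather than in RAM
pc <- PCorpus(DirSource("texts/"),
              readerControl = list(language = "en"),
              dbControl = list(dbName = "pcorpus.db", dbType = "DB1"))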

How to import an NLTK corpus into HDFS when using Hadoop streaming

℡╲_俬逩灬. submitted on 2019-12-10 12:17:35
Question: I have a little problem: I want to use an NLTK corpus in HDFS, but I failed. For example, I want to load nltk.corpus.stopwords in my Python code. I followed http://eigenjoy.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/ and did everything it says, but I don't know how to adapt it to my job. My NLTK file name is nltk-2.0.1.rc1 and my PyYAML file name is PyYAML.3.0.1, so my command is: zip -r nltkandyaml.zip nltk-2.0.1.rc1 PyYAML.3.0.1 Then it said "mv ntlkandyaml.zip /path/to/where/your/mapper/will/be
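The usual streaming pattern is to ship the zip alongside the job and import from it inside the mapper. A sketch, assuming the archive is rebuilt so that the nltk/ and yaml/ package directories sit at the root of nltkandyaml.zip, and that the file is passed to Hadoop streaming with -file nltkandyaml.zip:

# mapper.py
import sys

# Python can import packages directly from a zip archive on sys.path
sys.path.insert(0, 'nltkandyaml.zip')
import yaml  # PyYAML, which nltk 2.x depends on
import nltk

Note that corpus data such as the stopword lists live in nltk_data, not in the nltk package itself, so they have to be shipped (or inlined) separately.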

Need to set up a categorized corpus reader in NLTK and Python; corpus texts are in one file, one text per line

扶醉桌前 submitted on 2019-12-10 10:05:21
Question: I am getting familiar with NLTK and text categorization through Jacob Perkins's book "Python Text Processing with NLTK 2.0 Cookbook". Each of my corpus documents/texts consists of a single paragraph, so each sits on a separate line of one file rather than in a separate file. There are about 2 million such paragraphs/lines, and therefore about 2 million machine-learning instances. Each line in my file (a paragraph of text - a combination of domain title, description, keywords),
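NLTK's categorized corpus readers assume one document per file, so for a line-per-document corpus it can be simpler to build labelled instances directly; a sketch, assuming a hypothetical tab-separated layout of category<TAB>text on each line:

from nltk.tokenize import word_tokenize

def labeled_instances(path):
    # yields one (tokens, category) pair per line of the corpus file
    with open(path, encoding='utf-8') as f:
        for line in f:
            category, text = line.rstrip('\n').split('\t', 1)
            yield word_tokenize(text), category

# e.g. train = list(labeled_instances('corpus.txt'))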

Wordnet (Word Sense Annotated) Corpus

爱⌒轻易说出口 submitted on 2019-12-09 13:26:28
Question: I've been using lots of different corpora for natural language processing, and I've been looking for a corpus annotated with WordNet word senses. I understand that there probably is not a big corpus with this information, since such a corpus has to be built manually, but there must be something to go off of. Also, if no such corpus exists, is there at least a sense-annotated n-gram database (with what percentage of the time a word is each of its definitions, or

Convert a corpus into a data.frame in R

女生的网名这么多〃 submitted on 2019-12-09 13:11:38
Question: I'm using the tm package to apply stemming, and I need to convert the result into a data frame. A solution can be found in "R tm package vcorpus: Error in converting corpus to data frame", but in my case the corpus content prints as

[[2195]]
i was very impress

instead of

[[2195]]
"i was very impress"

and because of this, if I apply data.frame(text=unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors=FALSE) the result is <NA>. Any help is much appreciated!
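One approach that sidesteps the content-printing difference is to coerce each document with as.character(), which tm defines for its document classes; a minimal sketch, assuming mycorpus is the stemmed corpus from the question:

library(tm)

df <- data.frame(text = sapply(mycorpus, as.character),
                 stringsAsFactors = FALSE)
head(df)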

Encoding a corpus with CWB

早过忘川 submitted on 2019-12-09 06:40:29
According to the Corpus Workbench documentation, to encode a corpus I need to use the cwb-encode tool: "encode the corpus, i.e. convert the verticalized text to CWB binary format with the cwb-encode tool. Note that the command below has to be entered on a single line." http://cogsci.uni-osnabrueck.de/~korpora/ws/CWBdoc/CWB_Encoding_Tutorial/node3.html

$ cwb-encode -d /corpora/data/example -f example.vrt -R /usr/local/share/cwb/registry/example -P pos -S s

When I tried it, it said the file was missing, but I'm sure the file is in $HOME/corpora/data/example. The error was: $ cwb-encode -d /corpora/data
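The symptom is consistent with a path mix-up: -d /corpora/data/example is an absolute path starting at the filesystem root, while the data actually lives under $HOME. A sketch of the likely fix, with the paths assumed from the question:

$ cwb-encode -d "$HOME/corpora/data/example" -f "$HOME/corpora/data/example/example.vrt" -R /usr/local/share/cwb/registry/example -P pos -S s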

How to read a corpus of parsed sentences using NLTK in Python?

孤街醉人 submitted on 2019-12-08 11:29:28
I am working with the BLLIP 1987-89 WSJ Corpus Release 1 (https://catalog.ldc.upenn.edu/LDC2000T43). I am trying to use NLTK's SyntaxCorpusReader class to read in the parsed sentences, starting with a simple example of just one file. Here is my code:

from nltk.corpus.reader import SyntaxCorpusReader
path = '/corpus/wsj'
filename = 'wsj1'
reader = SyntaxCorpusReader('/corpus/wsj', 'wsj1')

I am able to see the raw text from the file; it returns a string of the parsed sentences:

reader.raw()
u"(S1 (S (PP-LOC (IN In)\n\t(NP (NP (DT a) (NN move))\n\t (SBAR (WHNP#0 (WDT that))\n
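One thing worth noting: SyntaxCorpusReader is an abstract base class whose parsing hooks are left to subclasses, so for Penn-Treebank-style bracketed parses the concrete BracketParseCorpusReader is usually the class to instantiate. A sketch using the path and filename from the question:

from nltk.corpus.reader import BracketParseCorpusReader

reader = BracketParseCorpusReader('/corpus/wsj', 'wsj1')
for tree in reader.parsed_sents()[:3]:  # each item is an nltk.Tree
    print(tree)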

How to reconnect to the PCorpus in the R tm package?

匆匆过客 submitted on 2019-12-08 09:07:30
Question: I create a PCorpus, which as far as I understand is stored on disk, with the following code:

pc = PCorpus(vs, readerControl = list(language = "pl"), dbControl = list(dbName = "pcorpus", dbType = "DB1"))

How can I reconnect to that database later?

Answer 1: You can't, as far as I'm aware. The 'database' is actually a filehash object, which you can reconnect to and load as follows:

db <- dbInit("pcorpus")
pc <- dbLoad(db)

but that loads each file as its own object. You need to save to disk explicitly
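Expanding the answer slightly, a sketch of inspecting what the filehash database behind the PCorpus contains ("pcorpus" is the dbName used at creation; dbList and dbFetch are standard filehash functions):

library(filehash)

db <- dbInit("pcorpus", type = "DB1")  # reopen the on-disk database
dbList(db)                             # one key per stored document
doc1 <- dbFetch(db, dbList(db)[1])     # pull back an individual document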

R DocumentTermMatrix drops terms with values less than 100

此生再无相见时 submitted on 2019-12-08 05:58:09
Question: I'm trying to feed a corpus into DocumentTermMatrix (DTM for short) to get term frequencies, but I noticed that the DTM doesn't keep all the terms, and I don't know why. Check it out:

A <- c(" 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107")
B <- c(" 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107")
C <- Corpus(VectorSource(c(A, B)))
inspect(C)

> A corpus with 2 text documents
>
> The metadata consists of 2 tag-value pairs and a data frame
> Available tags are:
>   create_date
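The likely culprit is DocumentTermMatrix's default minimum term length: its tokenizer keeps only terms of three or more characters (wordLengths = c(3, Inf)), so two-digit numbers such as "95" are silently dropped while three-digit numbers survive. A sketch of the fix, relaxing the lower bound:

dtm <- DocumentTermMatrix(C, control = list(wordLengths = c(1, Inf)))
inspect(dtm)  # now keeps the two-digit terms as well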