corpus

Using the first field in AWK as file name

对着背影说爱祢, submitted on 2019-12-04 08:47:44
The dataset is one big file with three columns: the ID of a section, something irrelevant, and a line of text. An example could look like the following: A01 001 This is a simple test. A01 002 Just for exemplary purpose. A01 003 A02 001 This is another text I want to use the first column (in this example A01 and A02, which represent different texts) as the file name, whose content is everything in that line after the second column. The example above should result in two files, one named A01 with content: This is a simple test. Just for exemplary purpose. and another one, A02, with content: This
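
In awk the usual trick is to redirect print to a file named by the first field (print ... > $1). Since this page mixes several languages, here is a minimal sketch of the same idea in Python; the input file name sections.txt is a placeholder, not something from the original question:

    # Append the text after the first two fields to a file named after the
    # first field.  "sections.txt" stands in for the real input file.
    outputs = {}
    with open("sections.txt") as src:
        for line in src:
            parts = line.split(None, 2)
            if len(parts) < 3:
                continue  # lines such as "A01 003" carry no text
            section, _, text = parts
            outputs.setdefault(section, []).append(text.rstrip("\n"))

    for section, lines in outputs.items():
        with open(section, "w") as dst:
            dst.write("\n".join(lines) + "\n")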

NLTK - Get and Simplify List of Tags

独自空忆成欢, submitted on 2019-12-04 08:36:26
I'm using the Brown Corpus. I want some way to print out all the possible tags and their names (not just the tag abbreviations). There are also quite a few tags; is there a way to 'simplify' them? By simplify I mean combining two extremely similar tags into one and re-tagging the affected words with the merged tag. alvas: This has been discussed previously in: Java Stanford NLP: Part of Speech labels? Simplifying the French POS Tag Set with NLTK https://linguistics.stackexchange.com/questions/2249/turn-penn-treebank-into-simpler-pos-tags The POS tags output by nltk.pos_tag use the Penn Treebank tagset, https
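
A minimal sketch of both parts of the question in Python/NLTK: nltk.help can print full tag descriptions, and mapping to the coarse universal tagset is one standard way to 'simplify' the Brown tags. The download calls assume the data packages are not installed yet:

    import nltk
    from nltk.corpus import brown

    nltk.download('brown')
    nltk.download('tagsets')            # tag documentation used by nltk.help
    nltk.download('universal_tagset')   # mapping tables for tagset='universal'

    # Full names and descriptions of the Brown tags, not just abbreviations.
    nltk.help.brown_tagset()

    # Collapse the fine-grained Brown tags into ~12 coarse universal tags.
    print(brown.tagged_words(tagset='universal')[:10])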

How do I tag textfiles with hunpos in nltk?

送分小仙女□, submitted on 2019-12-04 02:33:22
Question: Can someone help me with the syntax for hunpos tagging a corpus in nltk? What do I import for the hunpos.HunPosTagger module? How do I tag the corpus with it? See the code below. import nltk from nltk.corpus import PlaintextCorpusReader from nltk.corpus.util import LazyCorpusLoader corpus_root = './' reader = PlaintextCorpusReader(corpus_root, '.*') ntuen = LazyCorpusLoader('ntumultien', PlaintextCorpusReader, reader) ntuen.fileids() isinstance(ntuen, PlaintextCorpusReader) # So how do I
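
A minimal sketch of how the tagger is usually wired up (note the NLTK class is spelled HunposTagger). The model and binary paths below are placeholders: hunpos itself and the en_wsj.model file are separate downloads from the hunpos project, not part of NLTK:

    from nltk.corpus import PlaintextCorpusReader
    from nltk.tag.hunpos import HunposTagger

    # Placeholder paths: point these at your own hunpos binary and model.
    ht = HunposTagger('en_wsj.model', path_to_bin='./hunpos-tag')

    reader = PlaintextCorpusReader('./', r'.*\.txt')
    for fileid in reader.fileids():
        for sent in reader.sents(fileid):   # already tokenised sentences
            print(ht.tag(sent))             # list of (word, tag) pairs

    ht.close()  # terminate the external hunpos process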

Can I control the way the CountVectorizer vectorizes the corpus in scikit learn?

蹲街弑〆低调, submitted on 2019-12-03 08:28:14
I am working with a CountVectorizer from scikit-learn, and I'm possibly attempting to do some things that the object was not made for... but I'm not sure. In terms of getting counts of occurrences: vocabulary = ['hi', 'bye', 'run away!'] corpus = ['run away!'] cv = CountVectorizer(vocabulary=vocabulary) X = cv.fit_transform(corpus) print X.toarray() gives: [[0 0 0 0]] What I'm realizing is that the CountVectorizer breaks the corpus into what I believe are unigrams: vocabulary = ['hi', 'bye', 'run'] corpus = ['run away!'] cv = CountVectorizer(vocabulary=vocabulary) X = cv.fit_transform(corpus
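
The usual way to let a multi-word vocabulary entry match is to widen ngram_range so the analyzer emits bigrams as well as unigrams. A minimal sketch (the vocabulary entry is 'run away' without the '!', since the default tokenizer strips punctuation; get_feature_names_out assumes scikit-learn >= 1.0):

    from sklearn.feature_extraction.text import CountVectorizer

    # With the default analyzer only single-word tokens are produced, so a
    # multi-word vocabulary entry can never be counted.  ngram_range=(1, 2)
    # makes the analyzer emit unigrams and bigrams.
    vocabulary = ['hi', 'bye', 'run away']
    corpus = ['run away!']

    cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2))
    X = cv.fit_transform(corpus)
    print(cv.get_feature_names_out())   # ['hi' 'bye' 'run away']
    print(X.toarray())                  # [[0 0 1]]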

Speeding up the processing of large data frames in R

浪子不回头ぞ, submitted on 2019-12-03 07:03:58
Context: I have been trying to implement the algorithm recently proposed in this paper. Given a large amount of text (a corpus), the algorithm is supposed to return characteristic n-grams (i.e., sequences of n words) of the corpus. The user can decide the appropriate n, and at the moment I am trying n = 2-6 as in the original paper. In other words, using the algorithm, I want to extract 2- to 6-grams that characterize the corpus. I was able to implement the part that calculates the score based on which characteristic n-grams are identified, but have been struggling to eliminate non
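
The paper's scoring step is not reproduced here; as a point of reference, extracting and counting all 2- to 6-grams can itself be sketched in a few lines (Python shown for illustration, since this page mixes R and Python):

    from collections import Counter

    def extract_ngrams(tokens, n_min=2, n_max=6):
        """Count every n-gram with n_min <= n <= n_max in a token list."""
        counts = Counter()
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
        return counts

    # Toy input; the real input would be the tokenised text of the corpus.
    tokens = "the quick brown fox jumps over the lazy dog".split()
    print(extract_ngrams(tokens).most_common(5))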

Is there any Treebank for free? [closed]

。_饼干妹妹, submitted on 2019-12-02 15:53:57
Is there any place I can download a treebank of English phrases for free or for less than $100? I need training data containing a bunch of syntactically parsed sentences (>1000) in English, in any format. Basically, all I need is the words in these sentences recognized by part of speech. cyborg: NLTK (for Python) offers several treebanks for free. Here are some English treebanks available for free: American National Corpus: MASC; Questions: QuestionBank and Stanford's corrections; British news: BNC; TED talks: NAIST-NTT TED Treebank; Georgetown University Multilayer Corpus: GUM; Biomedical: NaCTeM GENIA
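
For the specific ">1000 POS-tagged sentences" requirement, the free Penn Treebank sample bundled with NLTK is often enough on its own; a minimal sketch, assuming the data package has not been downloaded yet:

    import nltk

    nltk.download('treebank')        # free sample of the Penn Treebank (WSJ)
    from nltk.corpus import treebank

    sents = treebank.tagged_sents()  # several thousand POS-tagged sentences
    print(len(sents))
    print(sents[0])                  # [('Pierre', 'NNP'), ('Vinken', 'NNP'), ...]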

creating corpus from multiple html text files

…衆ロ難τιáo~, submitted on 2019-12-02 10:21:54
I have a list of HTML files; I have taken some texts from the web and read them with read_html. My file names are like: a1 <- read_html(link of the text) a2 <- read_html(link of the text) . . . ## until: a100 <- read_html(link of the text) I am trying to create a corpus with these. Any ideas how I can do it? Thanks. Answer 1: You could allocate the vector beforehand: text <- rep(NA, 100) text[1] <- read_html(link1) ... text[100] <- read_html(link100) Even better, organize your links as a vector. Then you can use, as suggested in the comments, lapply: text <- lapply(links, read_html)
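
The lapply pattern above is the idiomatic R answer; since this page mixes languages, here is the same "loop over a list instead of a1..a100 variables" idea sketched in Python, using hypothetical local copies of the pages (a1.html, a2.html, ...) rather than live links:

    import glob

    # Read every saved page matching a*.html into a list; this is the
    # Python counterpart of text <- lapply(links, read_html).
    corpus = []
    for path in sorted(glob.glob('a*.html')):
        with open(path, encoding='utf-8') as page:
            corpus.append(page.read())

    print(len(corpus), 'documents in the corpus')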