corpus

Looking for dataset to test FULLTEXT style searches on [closed]

北城以北 submitted on 2019-12-01 05:17:19
I am looking for a corpus of text to run some trial full-text style searches across: either something I can download, or a system that generates it. Something a bit more random would be better, e.g. 1,000,000 Wikipedia articles in a format that is easy to insert into a two-column database (id, text). Any ideas or suggestions?
I'll throw this out there since I'm familiar with it: Prosper.com makes their member loan listings available for analysis through an XML export. The export would have about 50,000 loan requests with descriptions and over 1,000,000 member profiles (although many of those are
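For the two-column layout the question describes (id, text), a minimal sketch of loading documents and running a trial search might look like the following. It uses Python's built-in sqlite3 module with an in-memory database and a plain LIKE substring query as a stand-in. The table name docs, the helper names load_corpus and search, and the sample strings are all hypothetical; a real trial would swap in the target database's own full-text index.

```python
import sqlite3

def load_corpus(conn, texts):
    """Insert an iterable of strings into a two-column (id, text) table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, text TEXT NOT NULL)"
    )
    # The id column is filled in automatically by SQLite.
    conn.executemany("INSERT INTO docs (text) VALUES (?)", ((t,) for t in texts))
    conn.commit()

def search(conn, term):
    """Return ids of rows whose text contains term (substring match only)."""
    rows = conn.execute(
        "SELECT id FROM docs WHERE text LIKE ? ORDER BY id", (f"%{term}%",)
    )
    return [row[0] for row in rows]

# Hypothetical sample documents, used only to exercise the helpers.
conn = sqlite3.connect(":memory:")
load_corpus(conn, ["loan request for a used car", "member profile: teacher", "car repair loan"])
print(search(conn, "car"))
```

The same load step should work for any source that yields one string per document, such as parsed article dumps or exported listings.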

The similar method from the nltk module produces different results on different machines. Why?

送分小仙女□ submitted on 2019-12-01 02:36:12
I have taught a few introductory classes on text mining with Python, and the class tried the similar method with the provided practice texts. Some students got different results for text1.similar() than others, even though all versions etc. were the same. Does anyone know why these differences would occur? Thanks. Code used at the command line:
python
>>> import nltk
>>> nltk.download() # here you use the pop-up window to download texts
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it
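The excerpt does not include an answer, so the following rests on an assumption about one possible cause: if similar() ranks candidate words by a count and many candidates are tied, the order among the tied words can depend on iteration order rather than on the data. The sketch below does not call NLTK. The function rank_with_ties_broken and the sample counts are hypothetical, and the code only illustrates how an explicit tie-break makes such a ranking reproducible.

```python
from collections import Counter

def rank_with_ties_broken(counts, n):
    """Return the top-n words by count, breaking ties alphabetically."""
    # Sort by descending count first, then by the word itself, so equal
    # counts always come out in the same order regardless of insertion order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ordered[:n]]

# Two counters with identical contents, built in different insertion orders.
a = Counter({"true": 1, "mystifying": 1, "loving": 1, "wise": 2})
b = Counter({"loving": 1, "wise": 2, "true": 1, "mystifying": 1})

print(rank_with_ties_broken(a, 3))
print(rank_with_ties_broken(b, 3))
```

When students compare their outputs, checking whether the sets of returned words match, rather than their order, may help confirm or rule out this explanation.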

The similar method from the nltk module produces different results on different machines. Why?

为君一笑 submitted on 2019-11-30 22:15:19
Question: I have taught a few introductory classes on text mining with Python, and the class tried the similar method with the provided practice texts. Some students got different results for text1.similar() than others, even though all versions etc. were the same. Does anyone know why these differences would occur? Thanks. Code used at the command line:
python
>>> import nltk
>>> nltk.download() # here you use the pop-up window to download texts
>>> from nltk.book import *
*** Introductory Examples for the NLTK

Adding custom stopwords in R tm

假如想象 submitted on 2019-11-30 11:46:50
I have a Corpus in R using the tm package. I am applying the removeWords function to remove stopwords:
tm_map(abs, removeWords, stopwords("english"))
Is there a way to add my own custom stop words to this list?
stopwords just provides you with a vector of words, so just combine your own with it using c():
tm_map(abs, removeWords, c(stopwords("english"),"my","custom","words"))
Reza Rahimi
Save your custom stop words in a csv file (ex: word.csv).
library(tm)
stopwords <- read.csv("word.csv", header = FALSE)
stopwords <- as.character(stopwords$V1)
stopwords <- c(stopwords, stopwords())
Then you can

How to show corpus text in R tm package?

我与影子孤独终老i submitted on 2019-11-30 06:59:46
Question: I'm completely new to R and the tm package, so please excuse my stupid question ;-) How can I show the text of a plain text corpus in the R tm package? I've loaded 323 plain text files into a corpus:
src <- DirSource("Korpora/technologie")
corpus <- Corpus(src)
But when I call the corpus with:
corpus[[1]]
I always get output like this instead of the corpus text itself:
<<PlainTextDocument>>
Metadata: 7
Content: chars: 144
Content: chars: 141
Content: chars: 224
Content: chars: 75

R tm removeWords function not removing words

被刻印的时光 ゝ submitted on 2019-11-30 03:21:51
Question: I am trying to remove some words from a corpus I have built, but it doesn't seem to be working. I first run through everything and create a data frame that lists my words in order of frequency. I use this list to identify words I am not interested in and then try to create a new list with those words removed. However, the words remain in my dataset. What am I doing wrong, and why aren't the words being removed? I have included the full code below: install.packages("rvest")

Adding custom stopwords in R tm

白昼怎懂夜的黑 submitted on 2019-11-29 17:10:48
Question: I have a Corpus in R using the tm package. I am applying the removeWords function to remove stopwords:
tm_map(abs, removeWords, stopwords("english"))
Is there a way to add my own custom stop words to this list?
Answer 1: stopwords just provides you with a vector of words, so just combine your own with it using c():
tm_map(abs, removeWords, c(stopwords("english"),"my","custom","words"))
Answer 2: Save your custom stop words in a csv file (ex: word.csv).
library(tm)
stopwords <- read.csv("word.csv", header =

Need free English dictionary or Corpus, ultimately for a MySQL database [closed]

霸气de小男生 submitted on 2019-11-29 08:08:31
I'm trying to find a free downloadable dictionary (or corpus might be the better word) which I can import into MySQL. I need the words to have their type (noun, verb, adjective) associated with them. Any tips on where I can find one? I found one several years ago that worked nicely, but I no longer have it around. Thanks! Chris
Project Gutenberg has public domain books you can download. This includes 'The Gutenberg Webster's Unabridged Dictionary', but nothing modern, and not in a format immediately suitable for import into a MySQL database. Not without some work, anyway. What was the one you

R Corpus Is Messing Up My UTF-8 Encoded Text

回眸只為那壹抹淺笑 submitted on 2019-11-29 05:12:05
I am simply trying to create a corpus from Russian, UTF-8 encoded text. The problem is that the Corpus method from the tm package is not encoding the strings correctly. Here is a reproducible example of my problem:
Load in the Russian text:
> data <- c("Renault Logan, 2005","Складское помещение, 345 м²", "Су-шеф","3-к квартира, 64 м², 3/5 эт.","Samsung galaxy S4 mini GT-I9190 (чёрный)")
Create a VectorSource:
> vs <- VectorSource(data)
> vs # outputs correctly
Then, create the corpus:
> corp <- Corpus(vs)
> inspect(corp) # output is not encoded properly
The output that I get is:
> inspect(corp)
<

How to show corpus text in R tm package?

☆樱花仙子☆ submitted on 2019-11-28 23:44:58
I'm completely new to R and the tm package, so please excuse my stupid question ;-) How can I show the text of a plain text corpus in the R tm package? I've loaded 323 plain text files into a corpus:
src <- DirSource("Korpora/technologie")
corpus <- Corpus(src)
But when I call the corpus with:
corpus[[1]]
I always get output like this instead of the corpus text itself:
<<PlainTextDocument>>
Metadata: 7
Content: chars: 144
Content: chars: 141
Content: chars: 224
Content: chars: 75
Content: chars: 105
How can I show the text of the corpus? Thanks!
UPDATE Reproducible sample: I've tried