text-mining

How to sum up the word count for each person in a dialogue?

旧时模样 submitted on 2019-12-22 13:52:23
Question: I'm starting to learn Python and I'm trying to write a program that imports a text file, counts the total number of words, counts the number of words in a specific paragraph (spoken by each participant, marked 'P1', 'P2', etc.), excludes those markers (i.e. 'P1' etc.) from the word count, and prints the paragraphs separately. Thanks to @James Hurford I got this code: words = None with open('data.txt') as f: words = f.read().split() total_words = len(words) print 'Total words:', total_words in
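Below is a minimal Python 3 sketch of the full task, assuming each paragraph of data.txt begins with a speaker tag such as 'P1' or 'P2' followed by that speaker's words; the file layout is an assumption, not taken from the question.

```python
# Minimal sketch (Python 3). Assumes each paragraph of data.txt starts with
# a speaker tag like "P1" or "P2", followed by the words that speaker said.
from collections import Counter

per_speaker = Counter()
total_words = 0

with open('data.txt') as f:
    for line in f:
        tokens = line.split()
        if not tokens:
            continue                              # skip blank lines between paragraphs
        speaker, words = tokens[0], tokens[1:]    # exclude the 'P1'/'P2' tag itself
        per_speaker[speaker] += len(words)
        total_words += len(words)

print('Total words:', total_words)
for speaker, count in per_speaker.items():
    print(speaker, 'said', count, 'words')
```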

Lucene Entity Extraction

江枫思渺然 submitted on 2019-12-22 08:07:02
Question: Given a finite dictionary of entity terms, I'm looking for a way to do entity extraction with intelligent tagging using Lucene. Currently I've been able to use Lucene for: searching for complex phrases with some fuzziness, and highlighting results. However, I'm not aware of how to: get accurate offsets of the matched phrases, or do entity-specific annotations per match (not just tags for every single hit). I have tried using the explain() method, but this only gives the terms in the query which got
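Lucene itself is a Java library, so the following is only a conceptual sketch in plain Python (no Lucene involved) of dictionary-driven matching that yields exact character offsets and a per-entity annotation for each match; the example dictionary is hypothetical.

```python
# Illustrative sketch only: dictionary-based matching with exact character
# offsets and per-entity annotations, using plain Python regex (not Lucene).
import re

entity_dict = {                      # hypothetical entity dictionary
    'aspirin': 'DRUG',
    'ibuprofen': 'DRUG',
    'headache': 'SYMPTOM',
}

pattern = re.compile(r'\b(' + '|'.join(map(re.escape, entity_dict)) + r')\b',
                     re.IGNORECASE)

text = "She took aspirin for a headache."
for m in pattern.finditer(text):
    term = m.group(1).lower()
    # start/end are exact character offsets of the matched phrase
    print(m.start(), m.end(), m.group(1), '->', entity_dict[term])
```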

R Text Mining with quanteda

情到浓时终转凉″ submitted on 2019-12-22 00:28:03
Question: I have a data set (Facebook posts, via Netvizz) and I use the quanteda package in R. Here is my R code. # Load the relevant dictionary (relevant for analysis) liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC") # Read file # Facebook posts can be generated by FB Netvizz # https://apps.facebook.com/netvizz # Load FB posts as .csv file from a .zip file fbpost <- read.csv("D:/FB-com.csv", sep=";") # Define the relevant column(s) fb_test <- as.character(FB_com$comment
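As a rough illustration of the dictionary step (the question itself uses R/quanteda with a LIWC dictionary), here is a Python sketch that applies a small stand-in category dictionary to a 'comment' column read from a semicolon-separated CSV; the file name, column name, and dictionary contents are assumptions.

```python
# Conceptual sketch in Python: count dictionary-category hits over a column
# of Facebook comments read from a semicolon-separated CSV.
import csv
from collections import Counter

category_dict = {                    # stand-in for the LIWC dictionary
    'posemo': {'good', 'great', 'love'},
    'negemo': {'bad', 'hate', 'awful'},
}

counts = Counter()
with open('FB-com.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f, delimiter=';'):
        for word in row['comment'].lower().split():
            for category, terms in category_dict.items():
                if word in terms:
                    counts[category] += 1

print(counts)
```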

Working with text classification and big sparse matrices in R

不问归期 submitted on 2019-12-21 22:22:49
Question: I'm working on a text multi-class classification project and I need to build the document/term matrices and train and test in the R language. I already have datasets that don't fit in the limited dimensionality of R's base matrix class, so I need to build big sparse matrices to be able to classify, for example, 100k tweets. I am using the quanteda package, as it has so far been more useful and reliable than the tm package, where creating a DocumentTermMatrix with a dictionary makes
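For comparison, the same idea in Python: scikit-learn's CountVectorizer returns the document-term matrix as a SciPy sparse matrix, so a corpus on the order of 100k short texts stays manageable in memory. This is a sketch of the general approach, not the quanteda workflow.

```python
# Sketch: build a sparse document-term matrix so that ~100k short texts
# fit comfortably in RAM.
from sklearn.feature_extraction.text import CountVectorizer

tweets = ["great product, would buy again",
          "terrible service, never again",
          "shipping was fast"]                  # in practice: ~100k tweets

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(tweets)          # SciPy sparse (CSR) matrix

print(dtm.shape)                                # (n_documents, n_terms)
print(dtm.nnz)                                  # non-zero entries actually stored
```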

R Tidytext and unnest_tokens error

自闭症网瘾萝莉.ら submitted on 2019-12-21 21:26:07
Question: I'm very new to R and have started to use the tidytext package. I'm trying to pass arguments into the unnest_tokens function so I can do multi-column analysis. So instead of this: library(janeaustenr) library(tidytext) library(dplyr) library(stringr) original_books <- austen_books() %>% group_by(book) %>% mutate(linenumber = row_number(), chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>% ungroup() original_books tidy_books <- original_books %>%
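A rough pandas analogue of what the question is after: a small helper that takes the output and input column names as arguments and returns one token per row. The helper name and data are illustrative only, not the tidytext API.

```python
# Rough Python analogue of unnest_tokens(output, input): split one text
# column into words and return one row per word, keeping the other columns.
import pandas as pd

def unnest_tokens(df, output_col, input_col):
    """Split input_col into lowercase words; one row per word in output_col."""
    tokens = df[input_col].str.lower().str.split()
    out = df.drop(columns=[input_col]).join(tokens.rename(output_col))
    return out.explode(output_col).reset_index(drop=True)

books = pd.DataFrame({'book': ['Emma', 'Emma'],
                      'text': ['It was a truth', 'universally acknowledged']})
print(unnest_tokens(books, 'word', 'text'))
```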

Sentence to Word Table with R

断了今生、忘了曾经 submitted on 2019-12-21 06:01:20
Question: I have some sentences, and from each sentence I want to separate the words to get one row vector per sentence. But the words are being recycled to match the row vector of the longest sentence, which I do not want. No matter how long a sentence is, I want each sentence's row vector to contain its words only once. sentence <- c("case sweden", "meeting minutes ht board meeting st march now also attachment added agenda today s board meeting", "draft meeting minutes board meeting final meeting minutes
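In Python the same goal can be sketched with a binary document-term matrix: one row per sentence, each word flagged at most once, and no recycling to the longest sentence. This illustrates the idea; it is not the R solution from the thread.

```python
# Sketch: one row vector per sentence, each word counted/flagged once.
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["case sweden",
             "meeting minutes ht board meeting st march",
             "draft meeting minutes board meeting"]

vectorizer = CountVectorizer(binary=True)   # 1 if the word occurs, else 0
dtm = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())   # the word columns
print(dtm.toarray())                        # one row vector per sentence
```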

Clustering text in MATLAB

烂漫一生 submitted on 2019-12-21 05:43:11
Question: I want to do hierarchical agglomerative clustering on texts in MATLAB. Say I have four sentences: I have a pen. I have a paper. I have a pencil. I have a cat. I want to cluster these four sentences to see which are more similar. I know the Statistics Toolbox has commands like pdist to measure pair-wise distances and linkage to calculate cluster similarity, etc. A simple piece of code like: X=[1 2; 2 3; 1 4]; Y=pdist(X, 'euclidean'); Z=linkage(Y, 'single'); H=dendrogram(Z) works fine and returns a
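The MATLAB pdist/linkage/dendrogram pipeline has a close Python counterpart in SciPy; the sketch below applies it to the four sentences via a TF-IDF representation and cosine distances (the choice of representation and distance is an assumption).

```python
# Python counterpart of the MATLAB pdist/linkage/dendrogram pipeline,
# applied to sentences via TF-IDF vectors and cosine distances.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

sentences = ["I have a pen.", "I have a paper.",
             "I have a pencil.", "I have a cat."]

X = TfidfVectorizer().fit_transform(sentences).toarray()
D = pdist(X, metric='cosine')        # pair-wise distances between sentences
Z = linkage(D, method='single')      # single-linkage agglomerative clustering
dendrogram(Z, labels=sentences)
plt.show()
```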

Use tm's Corpus function with big data in R

百般思念 submitted on 2019-12-21 04:48:06
Question: I'm trying to do text mining on big data in R with tm. I run into memory issues frequently (such as "cannot allocate vector of size ...") and use the established methods of troubleshooting them, such as: using 64-bit R; trying different OSes (Windows, Linux, Solaris, etc.); setting memory.limit() to its maximum; making sure that sufficient RAM and compute are available on the server (which there is); making liberal use of gc(); profiling the code for bottlenecks; breaking up big operations
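Independent of tm, a common workaround is to stream the corpus in chunks and keep only aggregate statistics in memory rather than holding every document at once. A minimal Python sketch of that pattern, assuming one document per line in each file (the file names are placeholders):

```python
# General pattern (not tm-specific): stream documents and keep only
# aggregate term counts in memory, never the whole corpus.
from collections import Counter

def stream_term_counts(paths):
    counts = Counter()
    for path in paths:
        with open(path, encoding='utf-8') as f:
            for line in f:                       # one document per line (assumed)
                counts.update(line.lower().split())
    return counts

term_counts = stream_term_counts(['corpus_part1.txt', 'corpus_part2.txt'])
print(term_counts.most_common(10))
```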

Apache Spark Naive Bayes-based Text Classification

∥☆過路亽.° submitted on 2019-12-20 10:15:06
Question: I'm trying to use Apache Spark for document classification. For example, I have two classes (C and J). The training data is: C, Chinese Beijing Chinese; C, Chinese Chinese Shanghai; C, Chinese Macao; J, Tokyo Japan Chinese. And the test data is: Chinese Chinese Chinese Tokyo Japan // Is it J or C? How can I train on and predict from the above data? I did Naive Bayes text classification with Apache Mahout, but not with Apache Spark. How can I do this with Apache Spark? Answer 1: Yes, it doesn't look like
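A PySpark sketch of this exact toy example, using the RDD-based MLlib API (HashingTF features plus multinomial Naive Bayes); labels 0.0 and 1.0 stand in for classes C and J.

```python
# PySpark sketch: hash the words into term-frequency vectors, train
# multinomial Naive Bayes, then predict the class of the test document.
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes

sc = SparkContext(appName="NaiveBayesTextDemo")
tf = HashingTF()

train = [(0.0, "Chinese Beijing Chinese"),      # label 0.0 = class C
         (0.0, "Chinese Chinese Shanghai"),
         (0.0, "Chinese Macao"),
         (1.0, "Tokyo Japan Chinese")]          # label 1.0 = class J

training_rdd = sc.parallelize(train).map(
    lambda lt: LabeledPoint(lt[0], tf.transform(lt[1].split())))

model = NaiveBayes.train(training_rdd)
test_vector = tf.transform("Chinese Chinese Chinese Tokyo Japan".split())
print(model.predict(test_vector))               # 0.0 -> C, 1.0 -> J
sc.stop()
```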

Text classification/categorization algorithm [closed]

时间秒杀一切 submitted on 2019-12-20 09:19:40
Question (closed as off-topic, no longer accepting answers): My objective is to [semi-]automatically assign texts to different categories. There is a set of user-defined categories and a set of texts for each category. The ideal algorithm should be able to learn from a human-defined classification and then classify new texts automatically. Can anybody suggest such an
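The standard supervised recipe is to vectorize the human-labelled texts and fit a classifier that then predicts categories for new texts. A small scikit-learn sketch with made-up training data, shown purely to illustrate the workflow:

```python
# Sketch of the usual supervised approach: learn from human-labelled texts,
# then predict categories for new ones (TF-IDF features + linear classifier).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["the striker scored twice", "parliament passed the bill",
               "the CPU overheated again", "the coach changed the lineup"]
train_labels = ["sports", "politics", "tech", "sports"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)

print(clf.predict(["new GPU benchmarks released"]))   # predicted category
```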