text-mining

How do I classify a word of a text into things like names, numbers, money, dates, etc.?

人走茶凉 submitted on 2020-01-01 07:30:52
Question: I asked some questions about text mining a week ago; I was a bit confused then, and still am, but now I know what I want to do. The situation: I have a lot of downloaded pages with HTML content. Some of them can be text from a blog, for example. They are not structured and come from different sites. What I want to do: I will split all the words on whitespace, and I want to classify each one, or a group of them, into pre-defined items like names, numbers, phone, email, url, date, money, …
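
One simple way to start, before reaching for a full NER model, is a rule-based tagger: run each token through a cascade of regular expressions and fall back to a capitalization heuristic for names. A minimal sketch in Python; the patterns and category names are illustrative assumptions, not from the question:

```python
import re

# Ordered (category, pattern) rules; earlier rules win. The patterns are
# deliberately simple illustrations, not production-grade validators.
RULES = [
    ("email",  re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")),
    ("url",    re.compile(r"^https?://\S+$")),
    ("money",  re.compile(r"^[$€£]\d[\d,]*(\.\d+)?$")),
    ("phone",  re.compile(r"^\+?\d[\d\s()-]{6,}\d$")),
    ("date",   re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),
    ("number", re.compile(r"^\d[\d,.]*$")),
]

def classify(token: str) -> str:
    for category, pattern in RULES:
        if pattern.match(token):
            return category
    # Crude heuristic: a capitalized token might be a name
    return "name" if token[:1].isupper() else "word"

for tok in ["john@example.com", "$1,200.50", "04/20/2009", "Alice", "hello"]:
    print(tok, "->", classify(tok))
```

For anything beyond these easy categories, especially person names, a trained named-entity recognizer (e.g. spaCy or Stanford NER) will go much further than regexes.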

Inconsistent behaviour with tm_map transformation functions when using multiple cores

邮差的信 submitted on 2019-12-30 07:51:07
Question: Another potential title for this post could be "When parallel processing in R, does the ratio between the number of cores, loop chunk size, and object size matter?" I have a corpus I am running some transformations on using the tm package. Since the corpus is large, I'm using parallel processing with the doParallel package. Sometimes the transformations do the task, but sometimes they do not; for example, tm::removeNumbers(). The very first document in the corpus has a content value of "n417". So if …
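
When a parallel tm pipeline behaves nondeterministically, a useful first step is to establish a serial baseline and compare the parallel output against it. A minimal sketch, with a toy two-document corpus standing in for the real one (an assumption):

```r
library(tm)

# Toy corpus; the corpus in the question is much larger
corpus <- VCorpus(VectorSource(c("n417 alpha", "page 12 of 99")))

# Serial baseline: content_transformer() wraps a plain character function
# so tm_map applies it to the document content, not the document object
serial <- tm_map(corpus, content_transformer(removeNumbers))
content(serial[[1]])  # expected: "n alpha"
```

If the serial result is correct but the parallel run diverges, one common culprit is how the corpus is chunked and reassembled across workers (documents dropped or reordered when the chunk count does not divide evenly by the core count), rather than the transformation itself.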

Use scikit-learn TfIdf with gensim LDA

自作多情 submitted on 2019-12-29 06:18:09
Question: I've used various versions of TF-IDF in scikit-learn to model some text data. vectorizer = TfidfVectorizer(min_df=1, stop_words='english') The resulting data X is in this format: <rows x columns sparse matrix of type '<type 'numpy.float64'>' with xyz stored elements in Compressed Sparse Row format>. I wanted to experiment with LDA as a way to reduce the dimensionality of my sparse matrix. Is there a simple way to feed the SciPy sparse matrix X into a gensim LDA model? lda = models.ldamodel.LdaModel …
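
gensim can wrap a SciPy sparse matrix directly via matutils.Sparse2Corpus, so nothing needs to be densified. A minimal sketch with a toy document list (the documents and the num_topics value are placeholder assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim import matutils, models

docs = ["the cat sat on the mat", "dogs and cats", "matrix factorization of text"]
vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(docs)  # CSR matrix, docs as rows

# documents_columns=False tells gensim that rows (not columns) are documents
corpus = matutils.Sparse2Corpus(X, documents_columns=False)

# Invert the vectorizer's vocabulary so gensim can map term ids back to words
id2word = {i: w for w, i in vectorizer.vocabulary_.items()}

lda = models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=2)
print(lda.print_topics())
```

One caveat worth noting: LDA is formulated over term counts, so feeding it tf-idf weights works mechanically but bends the model's assumptions; using a plain CountVectorizer for the LDA step is the more orthodox choice.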

R text mining documents from CSV file (one row per doc)

给你一囗甜甜゛ submitted on 2019-12-29 03:33:14
Question: I am trying to work with the tm package in R, and have a CSV file of customer feedback with each line being a different instance of feedback. I want to import all the content of this feedback into a corpus, but I want each line to be a different document within the corpus, so that I can compare the feedback in a document-term matrix. There are over 10,000 rows in my data set. Originally I did the following: fdbk_corpus <- Corpus(VectorSource(fdbk), readerControl = list(language="eng"), sep="\t") …
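
Since VectorSource() already treats each element of a character vector as its own document, reading the CSV into a data frame and passing the feedback column is usually enough. A minimal sketch, assuming the file is feedback.csv and the text lives in a column named feedback (both names are assumptions):

```r
library(tm)

fdbk <- read.csv("feedback.csv", stringsAsFactors = FALSE)

# One document per row: VectorSource maps each vector element to a document
fdbk_corpus <- VCorpus(VectorSource(fdbk$feedback))

dtm <- DocumentTermMatrix(fdbk_corpus)
dim(dtm)  # rows = documents (one per feedback line), columns = terms
```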

Extracting dates that are in different formats using regex and sorting them - pandas

﹥>﹥吖頭↗ submitted on 2019-12-28 06:26:15
Question: I am new to text mining and I need to extract the dates from a *.txt file and sort them. The dates appear within the sentences (one per line) and their formats can potentially be any of the following: 04/20/2009; 04/20/09; 4/20/09; 4/3/09; Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010. If the day is missing, consider the 1st …
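
One workable approach is to extract candidates with a few alternative regex patterns and let pandas parse each candidate. A minimal sketch covering only the numeric mm/dd/yy(yy) and "Mon DD, YYYY" variants; the sample lines are invented, and the remaining formats from the question would need extra alternatives in the pattern:

```python
import pandas as pd

lines = pd.Series([
    "seen on 04/20/2009 at the clinic",
    "follow-up Mar 20, 2009 scheduled",
])

# Two named alternatives; str.extract returns one column per group
pattern = (
    r"(?P<numeric>\d{1,2}/\d{1,2}/\d{2,4})"
    r"|(?P<textual>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"[a-z]*\.?\s+\d{1,2},?\s+\d{4})"
)
found = lines.str.extract(pattern)
dates = found["numeric"].fillna(found["textual"])

# Parse each candidate individually so mixed formats are handled
parsed = dates.apply(pd.to_datetime)
print(parsed.sort_values())
```

Once every line yields a Timestamp, sorting is just Series.sort_values(); the "missing day means the 1st" rule falls out naturally, since pd.to_datetime defaults missing day components to 1.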

R-Project no applicable method for 'meta' applied to an object of class “character”

可紊 submitted on 2019-12-27 11:47:01
Question: I am trying to run this code (Ubuntu 12.04, R 3.1.1): # Load requisite packages library(tm) library(ggplot2) library(lsa) # Place Enron email snippets into a single vector. text <- c( "To Mr. Ken Lay, I’m writing to urge you to donate the millions of dollars you made from selling Enron stock before the company declared bankruptcy.", "while you netted well over a $100 million, many of Enron's employees were financially devastated when the company declared bankruptcy and their retirement plans …
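
This error is the classic symptom of the API change in tm 0.6+: a corpus now holds document objects, so mapping a plain character function over it replaces each document with a bare character vector, and a later meta() call fails with exactly this message. A minimal sketch of the usual fix, with placeholder snippets standing in for the Enron text:

```r
library(tm)

text <- c("First Enron snippet.", "Second Enron snippet.")
corpus <- VCorpus(VectorSource(text))

# Wrong (in tm >= 0.6): tm_map(corpus, tolower) strips each document down
# to a character vector, so meta() later sees class "character" and fails.
# Right: wrap plain character functions in content_transformer().
corpus <- tm_map(corpus, content_transformer(tolower))

meta(corpus[[1]])  # still a PlainTextDocument, so meta() works
```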

Extracting textual content from XML documents using XSLT [closed]

孤街醉人 submitted on 2019-12-25 18:57:16
Question: How is it possible to extract the textual content of an XML document, preferably using XSLT? For a fragment such as <record> <tag1>textual content</tag1> <tag2>textual content</tag2> <tag2>textual content</tag2> </record> the desired result is: textual content, textual content, textual content. What's the …
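
In XSLT 1.0 this is a short text-output stylesheet: iterate over the child elements and join their string values with a comma separator. A minimal sketch matching the <record> fragment above (any structure beyond that fragment is an assumption):

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Emit each child element's text, comma-separated -->
  <xsl:template match="/record">
    <xsl:for-each select="*">
      <xsl:value-of select="."/>
      <xsl:if test="position() != last()">
        <xsl:text>, </xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```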

Why are the cluster word frequencies so small in a big dataset?

柔情痞子 submitted on 2019-12-25 12:56:12
Question: Referring to the question answered by @holzben, Clustering: how to extract most distinguishing features? Using the skmeans package, I managed to get the clusters. I couldn't figure out why the word frequencies in all clusters are so small; it didn't make sense to me, as I have about 10,000 tweets in my dataset. What am I doing wrong? My dataset is available at https://docs.google.com/a/siswa.um.edu.my/file/d/0B3-xuXnLwF0yTHAzbE5KbTlQWWM/edit > class(myCorpus) [1] "VCorpus" "Corpus" "list" > dtm< …
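
One common cause of surprisingly small "frequencies" is that the document-term matrix was built (or re-weighted) with tf-idf, whose entries are fractions rather than raw counts. A minimal sketch, assuming a VCorpus named myCorpus as in the question (the k = 4 choice is an arbitrary illustration):

```r
library(tm)
library(skmeans)

# Build the DTM with raw term frequencies; weightTfIdf would make every
# entry a small fraction, which is one common reason counts look tiny
dtm <- DocumentTermMatrix(myCorpus, control = list(weighting = weightTf))
m <- as.matrix(dtm)
m <- m[rowSums(m) > 0, , drop = FALSE]  # skmeans cannot handle empty docs

clusters <- skmeans(m, k = 4)

# Top five terms per cluster by summed raw counts
for (k in sort(unique(clusters$cluster))) {
  freq <- colSums(m[clusters$cluster == k, , drop = FALSE])
  print(head(sort(freq, decreasing = TRUE), 5))
}
```

With raw weightTf counts, the per-cluster sums over 10,000 tweets should be in the hundreds or thousands for common terms; if they are still tiny, the corpus preprocessing is likely discarding most tokens before the DTM is built.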