text-mining

How do I classify a word of a text into things like names, numbers, money, dates, etc.?

人走茶凉 submitted on 2020-01-01 07:30:52
Question: I asked some questions about text mining a week ago; I was a bit confused then, and still am, but now I know what I want to do. The situation: I have a lot of downloaded pages with HTML content. Some of them can be text from a blog, for example. They are not structured and come from different sites. What I want to do: I will split all the words on whitespace, and I want to classify each one, or a group of them, into pre-defined items like names, numbers, phone, email, url, date, money, …
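
One simple way to start, before reaching for a full NER model, is a rule-based tagger: run each token through a cascade of regular expressions and fall back to a capitalization heuristic for names. A minimal sketch in Python; the patterns and category names are illustrative assumptions, not from the question:

```python
import re

# Ordered (category, pattern) rules; earlier rules win. The patterns are
# deliberately simple illustrations, not production-grade validators.
RULES = [
    ("email",  re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")),
    ("url",    re.compile(r"^https?://\S+$")),
    ("money",  re.compile(r"^[$€£]\d[\d,]*(\.\d+)?$")),
    ("phone",  re.compile(r"^\+?\d[\d\s()-]{6,}\d$")),
    ("date",   re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),
    ("number", re.compile(r"^\d[\d,.]*$")),
]

def classify(token: str) -> str:
    for category, pattern in RULES:
        if pattern.match(token):
            return category
    # Crude heuristic: a capitalized token might be a name
    return "name" if token[:1].isupper() else "word"

for tok in ["john@example.com", "$1,200.50", "04/20/2009", "Alice", "hello"]:
    print(tok, "->", classify(tok))
```

For anything beyond these easy categories, especially person names, a trained named-entity recognizer (e.g. spaCy or Stanford NER) will go much further than regexes.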

Inconsistent behaviour with tm_map transformation functions when using multiple cores

邮差的信 submitted on 2019-12-30 07:51:07
Question: Another potential title for this post could be "When parallel processing in R, does the ratio between the number of cores, loop chunk size, and object size matter?" I have a corpus I am running some transformations on using the tm package. Since the corpus is large, I'm using parallel processing with the doParallel package. Sometimes the transformations do the task, but sometimes they do not; for example, tm::removeNumbers(). The very first document in the corpus has a content value of "n417". So if …
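
When a parallel tm pipeline behaves nondeterministically, a useful first step is to establish a serial baseline and compare the parallel output against it. A minimal sketch, with a toy two-document corpus standing in for the real one (an assumption):

```r
library(tm)

# Toy corpus; the corpus in the question is much larger
corpus <- VCorpus(VectorSource(c("n417 alpha", "page 12 of 99")))

# Serial baseline: content_transformer() wraps a plain character function
# so tm_map applies it to the document content, not the document object
serial <- tm_map(corpus, content_transformer(removeNumbers))
content(serial[[1]])  # expected: "n alpha"
```

If the serial result is correct but the parallel run diverges, one common culprit is how the corpus is chunked and reassembled across workers (documents dropped or reordered when the chunk count does not divide evenly by the core count), rather than the transformation itself.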

Use scikit-learn TfIdf with gensim LDA

自作多情 submitted on 2019-12-29 06:18:09
Question: I've used various versions of TF-IDF in scikit-learn to model some text data. vectorizer = TfidfVectorizer(min_df=1, stop_words='english') The resulting data X is in this format: <rows x columns sparse matrix of type '<type 'numpy.float64'>' with xyz stored elements in Compressed Sparse Row format>. I wanted to experiment with LDA as a way to reduce the dimensionality of my sparse matrix. Is there a simple way to feed the SciPy sparse matrix X into a gensim LDA model? lda = models.ldamodel.LdaModel …
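
gensim can wrap a SciPy sparse matrix directly via matutils.Sparse2Corpus, so nothing needs to be densified. A minimal sketch with a toy document list (the documents and the num_topics value are placeholder assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim import matutils, models

docs = ["the cat sat on the mat", "dogs and cats", "matrix factorization of text"]
vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(docs)  # CSR matrix, docs as rows

# documents_columns=False tells gensim that rows (not columns) are documents
corpus = matutils.Sparse2Corpus(X, documents_columns=False)

# Invert the vectorizer's vocabulary so gensim can map term ids back to words
id2word = {i: w for w, i in vectorizer.vocabulary_.items()}

lda = models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=2)
print(lda.print_topics())
```

One caveat worth noting: LDA is formulated over term counts, so feeding it tf-idf weights works mechanically but bends the model's assumptions; using a plain CountVectorizer for the LDA step is the more orthodox choice.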

R text mining documents from CSV file (one row per doc)

给你一囗甜甜゛ submitted on 2019-12-29 03:33:14
Question: I am trying to work with the tm package in R, and have a CSV file of customer feedback with each line being a different instance of feedback. I want to import all the content of this feedback into a corpus, but I want each line to be a different document within the corpus, so that I can compare the feedback in a document-term matrix. There are over 10,000 rows in my data set. Originally I did the following: fdbk_corpus <- Corpus(VectorSource(fdbk), readerControl = list(language="eng"), sep="\t") …
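
Since VectorSource() already treats each element of a character vector as its own document, reading the CSV into a data frame and passing the feedback column is usually enough. A minimal sketch, assuming the file is feedback.csv and the text lives in a column named feedback (both names are assumptions):

```r
library(tm)

fdbk <- read.csv("feedback.csv", stringsAsFactors = FALSE)

# One document per row: VectorSource maps each vector element to a document
fdbk_corpus <- VCorpus(VectorSource(fdbk$feedback))

dtm <- DocumentTermMatrix(fdbk_corpus)
dim(dtm)  # rows = documents (one per feedback line), columns = terms
```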

Extracting dates that are in different formats using regex and sorting them - pandas

﹥>﹥吖頭↗ submitted on 2019-12-28 06:26:15
Question: I am new to text mining and I need to extract the dates from a *.txt file and sort them. The dates appear within the sentences (one per line) and their formats can potentially be any of the following: 04/20/2009; 04/20/09; 4/20/09; 4/3/09; Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010. If the day is missing, consider the 1st …
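
One workable approach is to extract candidates with a few alternative regex patterns and let pandas parse each candidate. A minimal sketch covering only the numeric mm/dd/yy(yy) and "Mon DD, YYYY" variants; the sample lines are invented, and the remaining formats from the question would need extra alternatives in the pattern:

```python
import pandas as pd

lines = pd.Series([
    "seen on 04/20/2009 at the clinic",
    "follow-up Mar 20, 2009 scheduled",
])

# Two named alternatives; str.extract returns one column per group
pattern = (
    r"(?P<numeric>\d{1,2}/\d{1,2}/\d{2,4})"
    r"|(?P<textual>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"[a-z]*\.?\s+\d{1,2},?\s+\d{4})"
)
found = lines.str.extract(pattern)
dates = found["numeric"].fillna(found["textual"])

# Parse each candidate individually so mixed formats are handled
parsed = dates.apply(pd.to_datetime)
print(parsed.sort_values())
```

Once every line yields a Timestamp, sorting is just Series.sort_values(); the "missing day means the 1st" rule falls out naturally, since pd.to_datetime defaults missing day components to 1.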

R-Project no applicable method for 'meta' applied to an object of class “character”

可紊 submitted on 2019-12-27 11:47:01
Question: I am trying to run this code (Ubuntu 12.04, R 3.1.1): # Load requisite packages library(tm) library(ggplot2) library(lsa) # Place Enron email snippets into a single vector. text <- c( "To Mr. Ken Lay, I’m writing to urge you to donate the millions of dollars you made from selling Enron stock before the company declared bankruptcy.", "while you netted well over a $100 million, many of Enron's employees were financially devastated when the company declared bankruptcy and their retirement plans …
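
This error is the classic symptom of the API change in tm 0.6+: a corpus now holds document objects, so mapping a plain character function over it replaces each document with a bare character vector, and a later meta() call fails with exactly this message. A minimal sketch of the usual fix, with placeholder snippets standing in for the Enron text:

```r
library(tm)

text <- c("First Enron snippet.", "Second Enron snippet.")
corpus <- VCorpus(VectorSource(text))

# Wrong (in tm >= 0.6): tm_map(corpus, tolower) strips each document down
# to a character vector, so meta() later sees class "character" and fails.
# Right: wrap plain character functions in content_transformer().
corpus <- tm_map(corpus, content_transformer(tolower))

meta(corpus[[1]])  # still a PlainTextDocument, so meta() works
```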

Extracting textual content from XML documents using XSLT [closed]

孤街醉人 submitted on 2019-12-25 18:57:16
Question: How is it possible to extract the textual content of an XML document, preferably using XSLT? For a fragment such as <record> <tag1>textual content</tag1> <tag2>textual content</tag2> <tag2>textual content</tag2> </record> the desired result is: textual content, textual content, textual content. What's the …
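
In XSLT 1.0 this is a short text-output stylesheet: iterate over the child elements and join their string values with a comma separator. A minimal sketch matching the <record> fragment above (any structure beyond that fragment is an assumption):

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Emit each child element's text, comma-separated -->
  <xsl:template match="/record">
    <xsl:for-each select="*">
      <xsl:value-of select="."/>
      <xsl:if test="position() != last()">
        <xsl:text>, </xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```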

Why are the cluster word frequencies so small in a big dataset?

柔情痞子 submitted on 2019-12-25 12:56:12
Question: Referring to the question answered by @holzben, Clustering: how to extract most distinguishing features? Using the skmeans package, I managed to get the clusters. I couldn't figure out why the word frequencies in all clusters are so small; it didn't make sense to me, as I have about 10,000 tweets in my dataset. What am I doing wrong? My dataset is available at https://docs.google.com/a/siswa.um.edu.my/file/d/0B3-xuXnLwF0yTHAzbE5KbTlQWWM/edit > class(myCorpus) [1] "VCorpus" "Corpus" "list" > dtm< …
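
One common cause of surprisingly small "frequencies" is that the document-term matrix was built (or re-weighted) with tf-idf, whose entries are fractions rather than raw counts. A minimal sketch, assuming a VCorpus named myCorpus as in the question (the k = 4 choice is an arbitrary illustration):

```r
library(tm)
library(skmeans)

# Build the DTM with raw term frequencies; weightTfIdf would make every
# entry a small fraction, which is one common reason counts look tiny
dtm <- DocumentTermMatrix(myCorpus, control = list(weighting = weightTf))
m <- as.matrix(dtm)
m <- m[rowSums(m) > 0, , drop = FALSE]  # skmeans cannot handle empty docs

clusters <- skmeans(m, k = 4)

# Top five terms per cluster by summed raw counts
for (k in sort(unique(clusters$cluster))) {
  freq <- colSums(m[clusters$cluster == k, , drop = FALSE])
  print(head(sort(freq, decreasing = TRUE), 5))
}
```

With raw weightTf counts, the per-cluster sums over 10,000 tweets should be in the hundreds or thousands for common terms; if they are still tiny, the corpus preprocessing is likely discarding most tokens before the DTM is built.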