data-mining

Lift value calculation

拈花ヽ惹草 submitted on 2020-01-03 14:00:29
Question: I have a (symmetric) adjacency matrix, which was created from the co-occurrence of names (e.g. Greg, Mary, Sam, Tom) in newspaper articles (e.g. a, b, c, d); see below. How can I calculate the lift value for the non-zero matrix elements (http://en.wikipedia.org/wiki/Lift_(data_mining))? I would be interested in an efficient implementation that could also be used for very large matrices (e.g. a million non-zero elements). I appreciate any help. # Load package library(Matrix) # Data A <-
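The question's code is in R, but the computation itself is straightforward: for co-occurrence counts, lift(i, j) = P(i, j) / (P(i) * P(j)) = n * A[i, j] / (occ[i] * occ[j]). Below is a minimal sketch in Python with scipy.sparse, which operates only on the non-zero entries and so scales to millions of them; the marginal occurrence counts `occ` and the document count `n_docs` are hypothetical inputs not shown in the excerpt.

```python
import numpy as np
from scipy import sparse

def lift_matrix(A, occ, n_docs):
    """Lift for the non-zero entries of a sparse co-occurrence matrix.

    A      : sparse matrix, A[i, j] = number of articles mentioning both i and j
    occ    : 1-D array, occ[i]      = number of articles mentioning i
    n_docs : total number of articles
    """
    A = A.tocoo()
    # lift(i, j) = P(i, j) / (P(i) * P(j)) = n_docs * A[i, j] / (occ[i] * occ[j]),
    # computed vectorised over the non-zero entries only
    vals = n_docs * A.data / (occ[A.row] * occ[A.col])
    return sparse.coo_matrix((vals, (A.row, A.col)), shape=A.shape)

# toy data (invented): 4 names, 10 articles
A = sparse.coo_matrix(np.array([[0, 2, 0, 1],
                                [2, 0, 1, 0],
                                [0, 1, 0, 3],
                                [1, 0, 3, 0]], dtype=float))
occ = np.array([4.0, 5.0, 6.0, 5.0])
L = lift_matrix(A, occ, n_docs=10)
```

A lift above 1 means two names co-occur more often than their marginal frequencies would predict.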

How to cluster data with discrete binary attributes?

左心房为你撑大大i submitted on 2020-01-03 04:40:15
Question: My data has ten million binary attributes, but only some of them are informative; most of them are zeros. The format is as follows: data attribute1 attribute2 attribute3 attribute4 ......... A 0 1 0 1 ......... B 1 0 1 0 ......... C 1 1 0 1 ......... D 1 1 0 0 ......... What is a smart way to cluster this? I know k-means clustering, but I don't think it is suitable in this case, because the binary values make distances less meaningful, and it will suffer from the curse of high
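One common approach (not taken from the excerpt itself) is a distance that ignores shared zeros, such as Jaccard, combined with hierarchical clustering. A toy sketch using the four rows from the question; at the stated scale (ten million attributes) you would use sparse representations or MinHash rather than dense arrays:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# the four example rows A, B, C, D from the question
X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 1, 0, 0]], dtype=bool)

# Jaccard distance counts only positions where at least one row has a 1,
# so the many shared zeros in sparse binary data do not inflate similarity
D = pdist(X, metric="jaccard")

# average-linkage hierarchical clustering, cut into 2 clusters
labels = fcluster(linkage(D, method="average"), t=2, criterion="maxclust")
```

Here A, C, and D end up in one cluster and B (which shares no 1s with A) in another, which matches intuition about the rows.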

Dummy Coding of Nominal Attributes - Effect of Using K Dummies, Effect of Attribute Selection

荒凉一梦 submitted on 2020-01-02 18:04:43
Question: Summing up my understanding of the topic: 'Dummy Coding' is usually understood as coding a nominal attribute with K possible values as K-1 binary dummies. Using all K values would introduce redundancy and would have a negative impact, e.g. on logistic regression, as far as I have learned. So far, everything is clear to me. Yet, two issues remain unclear: 1) Bearing in mind the issue stated above, I am confused that the 'Logistic' classifier in WEKA actually uses K dummies (see picture). Why
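The K vs. K-1 contrast can be made concrete in pandas (a sketch; the `hair` attribute and its values are illustrative, not taken from the question):

```python
import pandas as pd

df = pd.DataFrame({"hair": ["long", "short", "medium", "long"]})

# K dummies: one column per value; each row sums to 1, so any one column
# is fully determined by the others (the redundancy the question mentions)
k_dummies = pd.get_dummies(df["hair"])

# K-1 dummies: drop the first level, avoiding perfect multicollinearity
# in models such as logistic regression
k_minus_1 = pd.get_dummies(df["hair"], drop_first=True)
```

With K=3 values this yields 3 and 2 columns respectively; the dropped level becomes the implicit reference category.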

how to write output from rapidminer to a txt file?

徘徊边缘 submitted on 2020-01-02 10:23:08
Question: I am using RapidMiner 5.3. I took a small document containing around three English sentences, tokenized it, and filtered the tokens by word length. I want to write the output into a different document. I tried using the Write Document operator, but it is not working; it simply writes the original document into the new one. However, when I write the output to the console, it gives me the expected answer, so something seems wrong with the Write Document operator. Here is my process: READ

Similarity matrix -> feature vectors algorithm?

余生长醉 submitted on 2020-01-02 03:56:05
Question: If we have a set of M words and know the similarity of the meaning of each pair of words in advance (an M x M matrix of similarities), which algorithm can we use to make one k-dimensional bit vector for each word, so that each pair of words can be compared just by comparing their vectors (e.g. taking the absolute difference of the vectors)? I don't know what this particular problem is called. If I knew, it would be much easier to find it among a bunch of algorithms with similar descriptions,
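The problem described is essentially multidimensional scaling (MDS): recovering k-dimensional coordinates from pairwise (dis)similarities. A classical-MDS-style sketch via eigendecomposition, producing real-valued rather than bit vectors (the toy similarity values are invented):

```python
import numpy as np

def embed_from_similarity(S, k):
    """Embed an M x M symmetric similarity matrix into M k-dimensional vectors."""
    M = S.shape[0]
    # double-center the similarity matrix to obtain a Gram-like matrix
    J = np.eye(M) - np.ones((M, M)) / M
    B = J @ S @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]      # keep the k largest eigenvalues
    w_top = np.clip(w[idx], 0, None)   # clamp small negative eigenvalues from noise
    return V[:, idx] * np.sqrt(w_top)  # coordinates: eigenvectors scaled by sqrt(eigenvalue)

# toy similarity matrix for 3 "words"; words 0 and 1 are very similar
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
X = embed_from_similarity(S, k=2)
```

In the embedding, similar words end up close together, so vector distance approximates the original similarity structure.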

ID3 and C4.5: How Does “Gain Ratio” Normalize “Gain”?

独自空忆成欢 submitted on 2020-01-01 03:39:30
Question: The ID3 algorithm uses the "Information Gain" measure. C4.5 uses the "Gain Ratio" measure, which is Information Gain divided by SplitInfo, where SplitInfo is high for a split whose records are spread evenly across the different outcomes and low otherwise. My question is: how does this help to solve the problem that Information Gain is biased towards splits with many outcomes? I can't see the reason. SplitInfo doesn't even take into account the number of outcomes, just the distribution of records in the
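A worked toy example may help: for a split into K evenly sized branches, SplitInfo equals log2(K), so it does grow with the number of outcomes. A sketch (the four-record dataset is invented for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, groups):
    """labels: class labels before the split; groups: label lists, one per branch."""
    n = len(labels)
    info_gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)
    # SplitInfo: entropy of the branch sizes; log2(K) for K evenly sized branches
    split_info = -sum((len(g) / n) * math.log2(len(g) / n) for g in groups)
    return info_gain / split_info

labels = ["+", "+", "-", "-"]
# splitting on a near-unique attribute: 4 singleton branches, info gain 1, SplitInfo 2
many = gain_ratio(labels, [["+"], ["+"], ["-"], ["-"]])
# a clean binary split: same info gain 1, but SplitInfo only 1
binary = gain_ratio(labels, [["+", "+"], ["-", "-"]])
```

Both splits have the same Information Gain, but the many-outcome split is penalised (gain ratio 0.5 vs. 1.0), which is exactly the correction the question asks about.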

extracting relations from text

不问归期 submitted on 2020-01-01 03:31:17
Question: I want to extract relations from unstructured text in the form of (SUBJECT, OBJECT, ACTION) triples. For instance, "The boy is sitting on the table eating the chicken" would give me (boy, chicken, eat), (boy, table, LOCATION), etc. Although a Python program plus NLTK can process a simple sentence like the one above, I'd like to know if any of you have used tools or libraries, preferably open source, to extract relations from a much wider domain such as a large collection of text documents or the web. Answer 1:
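To show the shape of the task, here is a toy extraction over hand-written dependency tuples. In a real pipeline the (token, dependency label, head index) tuples would come from a parser such as spaCy or Stanford CoreNLP; everything below, including the parse itself, is invented for illustration only.

```python
# hand-written parse of "The boy is sitting on the table eating the chicken"
# (token, dependency label, index of head token) -- illustrative, not parser output
parsed = [
    ("boy",     "nsubj", 2),   # subject of "sitting"
    ("is",      "aux",   2),
    ("sitting", "ROOT",  2),
    ("table",   "obl",   2),   # oblique/location attached to "sitting"
    ("eating",  "advcl", 2),
    ("chicken", "obj",   4),   # object of "eating"
]

def extract_triples(parsed):
    """Map dependency tuples to (SUBJECT, OBJECT, ACTION) triples."""
    triples = []
    subjects = {head: tok for tok, dep, head in parsed if dep == "nsubj"}
    for tok, dep, head in parsed:
        if dep == "obj":
            verb = parsed[head][0]  # note: not lemmatised ("eating", not "eat")
            # fall back to the sentence's main subject if the verb has none
            subj = subjects.get(head) or next(iter(subjects.values()), None)
            triples.append((subj, tok, verb))
        elif dep == "obl":
            triples.append((subjects.get(head), tok, "LOCATION"))
    return triples

triples = extract_triples(parsed)
```

This is deliberately simplistic; open-source systems such as Stanford OpenIE implement this idea robustly over arbitrary text.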

Finding the center of a cluster

血红的双手。 submitted on 2020-01-01 03:11:40
Question: I have the following problem, made abstract to bring out the key issues. I have 10 points, each some distance from the others. I want to be able to find the center of the cluster, i.e. the point for which the sum of pairwise distances to all other points is minimised. Let p(j) ~ p(k) represent the pairwise distance between points j and k. Then p(i) is the center-point of the cluster iff it minimises sum[p(i) ~ p(j)] over all points i, for 0 < i, j <= n, where we have n points in the cluster. determine how to split the
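The point described is usually called the medoid, and with a full pairwise-distance matrix it can be found directly by summing rows. A minimal sketch (the 1-D toy positions are invented):

```python
import numpy as np

def medoid_index(D):
    """Given a symmetric pairwise-distance matrix D, return the index of the
    point whose summed distance to all other points is minimal (the medoid)."""
    return int(np.argmin(D.sum(axis=1)))

# toy example: 5 points on a line at positions 0, 1, 2, 3 and an outlier at 10
pos = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
D = np.abs(pos[:, None] - pos[None, :])
center = medoid_index(D)
```

For n points this is O(n^2) given the distance matrix, which matches the question's setting where only pairwise distances (not coordinates) are available.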

Ways to calculate similarity

不羁岁月 submitted on 2019-12-29 13:19:06
Question: I am building a community website that requires me to calculate the similarity between any two users. Each user is described by the following attributes: age, skin type (oily, dry), hair type (long, short, medium), lifestyle (active outdoor lover, TV junkie), and others. Can anyone tell me how to approach this problem or point me to some resources? Answer 1: Another way of computing (in R) all the pairwise dissimilarities (distances) between observations in the data set. The original variables may be
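The R approach the answer refers to is typically Gower's coefficient (e.g. `daisy` in R's cluster package), which handles mixed numeric and categorical attributes. A minimal Python sketch of the same idea; the user records and attribute ranges below are hypothetical:

```python
def gower_dissimilarity(u, v, kinds, ranges):
    """Gower-style dissimilarity between two users with mixed attributes.

    kinds  : "num" or "cat" per attribute
    ranges : range of each numeric attribute (for scaling); ignored for "cat"
    """
    parts = []
    for a, b, kind, rng in zip(u, v, kinds, ranges):
        if kind == "num":
            parts.append(abs(a - b) / rng)        # scaled numeric difference in [0, 1]
        else:
            parts.append(0.0 if a == b else 1.0)  # simple matching for categories
    return sum(parts) / len(parts)

# hypothetical users: (age, skin type, hair type)
alice = (25, "oily", "long")
bob   = (35, "dry",  "long")
kinds  = ("num", "cat", "cat")
ranges = (50, None, None)   # assume ages span a range of 50 years

d = gower_dissimilarity(alice, bob, kinds, ranges)
```

Each attribute contributes a value in [0, 1], so numeric and categorical attributes are weighted comparably; similarity is then 1 - d.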
