data-mining | 易学教程

create aggregate column based on variables with R [duplicate]

Question: This question already has answers here: Calculating statistics on subsets of data [duplicate] (3 answers). Closed 3 years ago. I apologize in advance if this is somewhat of a noob question, but I looked in the forum and couldn't find a way to search for what I am trying to do. I have a training set and I am trying to reduce the number of levels in my categorical variables (in the example below, the category is the state). I would like to map the state to the mean or rate of
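The snippet above is cut off before it names the quantity being averaged, so the following is only a minimal sketch of the general idea (replacing a category with a per-category mean), written in Python rather than R. The column names `state` and `y`, the sample rows, and the helper `mean_by_category` are all hypothetical.

```python
from collections import defaultdict

def mean_by_category(rows, category_key, target_key):
    """Return {category: mean of target} computed over the given rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[category_key]] += row[target_key]
        counts[row[category_key]] += 1
    return {cat: totals[cat] / counts[cat] for cat in totals}

# Hypothetical training data: "state" is the categorical variable,
# "y" is an assumed numeric (here 0/1) target.
train = [
    {"state": "CA", "y": 1},
    {"state": "CA", "y": 0},
    {"state": "NY", "y": 1},
    {"state": "NY", "y": 1},
    {"state": "TX", "y": 0},
]

state_means = mean_by_category(train, "state", "y")

# Add the aggregate as a new column on every row.
for row in train:
    row["state_mean"] = state_means[row["state"]]
```

In R, a comparable per-group column is commonly built with base R's `ave()`, which returns a vector aligned with the original rows.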

Iterate through association rules using the header of an itemset

Question: I have a data frame of inputs which looks like this. I generate association rules using pandas:

    frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)

My output only contains the values in each itemset, without the header they belong to. It looks something like below. My questions are: 1. I want to label the antecedents and consequents with their header name (Age, AL, Sex, etc.) because I can't
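One possible way to get header names into the rules is to bake them into the item labels before one-hot encoding, so every item reads `Header=value`. The sketch below uses only the standard library; the helpers `label_items` and `one_hot` and the sample values are hypothetical, and the input data frame from the question is not shown, so its real shape is an assumption.

```python
def label_items(records):
    """Turn each record into a set of 'Header=value' items."""
    return [{f"{col}={val}" for col, val in rec.items()} for rec in records]

def one_hot(transactions):
    """One-hot encode labeled transactions; returns (columns, boolean rows)."""
    columns = sorted(set().union(*transactions))
    rows = [[col in t for col in columns] for t in transactions]
    return columns, rows

# Hypothetical records using the header names mentioned in the question.
records = [
    {"Age": "young", "Sex": "F", "AL": "high"},
    {"Age": "old", "Sex": "M", "AL": "high"},
]

transactions = label_items(records)
columns, rows = one_hot(transactions)
```

A boolean table with these column names could then be wrapped in a data frame and mined with `use_colnames=True`, so that each antecedent and consequent carries its header.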

R arules : Extract lhs items from rules

Question: I want to extract the LHS items from a rule generated by arules. For example, given {a,b,c} => {d}, I want to extract a, b, c and put them in a character vector so I can iterate over them and do further processing. At the moment, the only approach I can think of is converting the set of rules to a data frame and then separating the items with character manipulation or regex. I hope there's a better way of extracting these items.

Answer 1: Just coerce the LHS and/or the RHS into a list: data("Adult")

Text classification extract tags from text

Question: I have a Lucene index with a lot of text data; each item has a description. I want to extract the most common words from each description and generate tags to classify each item. Is there a Lucene.NET library for doing this, or any other library for text classification?

Answer 1: No. Lucene.NET provides search, indexing, text normalization, and "find more like this" functionality, but not text classification. What to suggest depends on your requirements. So, maybe more
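The "most common words as tags" idea from the question can be sketched without any search library. This is a rough illustration in Python, not a Lucene.NET feature; the stopword list, the length cutoff, and the function name `extract_tags` are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "with", "from"}

def extract_tags(description, max_tags=3):
    """Return the most frequent non-stopword terms in a description."""
    words = re.findall(r"[a-z]+", description.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(max_tags)]

tags = extract_tags("Red wine and red grapes: wine from the red valley", max_tags=2)
```

Raw frequency favors words that are common everywhere, so weighting terms against the whole collection (for example with TF-IDF) is a usual next step.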

Best XML format for log events in terms of tool support for data mining and visualization?

Question: We want to create log files from our Java application that are suited for later processing by tools, to help investigate bugs and gather performance statistics. Currently we use the traditional "log stuff which may or may not be flattened into text form and appended to a log file" approach, but this works best only for small amounts of information read by a human. After careful consideration, the best bet has been to store the log events as XML snippets in text files (which is then treated

java Open source projects for medical diagnose & data mining [closed]

Question: Closed. This question is off-topic. It is not currently accepting answers. Closed 4 years ago. I'm looking for open-source Java engines for medical diagnosis: engines that take a query from the user describing a patient's symptoms and return suggestions of potential diseases matching those symptoms. Do such engines exist somewhere? I prefer some Java OS engine in this

how to use different distance formula other than euclidean distance in k means

Question: I am working with latitude/longitude data. I have to make clusters based on the distance between two points. The distance between two points is

    =ACOS(SIN(lat1)*SIN(lat2)+COS(lat1)*COS(lat2)*COS(lon2-lon1))*6371

I want to use k-means in R. Is there any way I can override the distance calculation in that process?

Answer 1: K-means is not distance-based. It is based on variance minimization. The sum-of-variance formula equals the sum of squared Euclidean distances, but the converse, for other
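For reference, the spreadsheet formula in the question (the spherical law of cosines with an Earth radius of 6371 km) can be written as a small Python function. This sketch assumes the coordinates are given in degrees and converts them to radians first; the function name and the clamping step are additions for the example, not part of the original formula.

```python
import math

EARTH_RADIUS_KM = 6371.0  # radius used in the question's formula

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(dlon))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle) * EARTH_RADIUS_KM

# A quarter of the way around the equator.
quarter = great_circle_km(0.0, 0.0, 0.0, 90.0)
```

A function like this fits clustering methods that accept an arbitrary pairwise distance (k-medoids or hierarchical clustering, for instance), which sidesteps the k-means limitation the answer describes.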

How to find the minimum support in Apriori algorithm

Question: When the percentage values of support and confidence are given, how can I find the minimum support in the Apriori algorithm? For example, when support and confidence are given as 60% and 60% respectively, what is the minimum support?

Answer 1: Support and confidence are measures of how interesting a rule is. The minimum support and minimum confidence are set by the user and are parameters of the Apriori algorithm for association rule generation. These parameters are used to exclude rules
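To make the two measures concrete, here is a minimal sketch that computes support and confidence directly from a list of transactions and compares them with user-chosen thresholds. The transactions are made-up data, and the 60% thresholds simply mirror the figures in the question.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

min_support = 0.6     # chosen by the user, as in the question
min_confidence = 0.6  # chosen by the user, as in the question

rule_support = support(transactions, {"bread", "milk"})            # 3 of 5 transactions
rule_confidence = confidence(transactions, {"bread"}, {"milk"})    # 3 of the 4 with bread

keep_rule = rule_support >= min_support and rule_confidence >= min_confidence
```

Here the rule bread => milk is kept because both values reach their thresholds. Note that `confidence` divides by the antecedent's support, so it assumes the antecedent occurs at least once.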

In scikit-learn, can DBSCAN use sparse matrix?

Question: I got a memory error when running scikit-learn's DBSCAN algorithm. My data is a binary matrix of about 20000*10000. (Maybe DBSCAN is not suitable for such a matrix. I'm a beginner in machine learning; I just want a clustering method that doesn't need an initial number of clusters.) Anyway, I found the documentation on sparse matrices and on scikit-learn's feature extraction: http://scikit-learn.org/dev/modules/feature_extraction.html http://docs.scipy.org/doc/scipy/reference/sparse.html But I still have no idea

Plot the cluster member in r

Question: I use the DTW package in R, and I have finished hierarchical clustering. Now I want to plot the time series of each cluster separately, as in the picture below.

    sc <- read.table("D:/handling data/confirm.csv", header=TRUE, sep=",")
    rownames(sc) <- sc$STDR_YM_CD
    sc$STDR_YM_CD <- NULL
    col_n <- colnames(sc)
    hc <- hclust(dist(sc), method="average")
    plot(hc, main="")

How can I do it? My data is at http://blogattach.naver.com/e772fb415a6c6ddafd1370417f96e494346a9725/20170207_141_blogfile/khm2963_1486442387926_THgZRt_csv