data-mining

create aggregate column based on variables with R [duplicate]

佐手、 submitted on 2019-12-25 03:29:18
Question: This question already has answers here: Calculating statistics on subsets of data [duplicate] (3 answers). Closed 3 years ago. I apologize in advance if this is somewhat of a noob question, but I looked through the forum and couldn't work out how to search for what I am trying to do. I have a training set and I am trying to reduce the number of levels in my categorical variables (in the example below, the category is the state). I would like to map the state to the mean or rate of ...
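One common reading of "map the state to the mean or rate" is to replace each category level with the average of an outcome over the rows in that level. The following is a minimal sketch in plain Python rather than R, and it assumes a 0/1 outcome column; the names `mean_by_category`, `states`, and `outcome` are hypothetical and do not come from the question.

```python
from collections import defaultdict


def mean_by_category(categories, targets):
    """Return {category: mean of the targets observed for that category}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}


# Hypothetical training data: one state label and one 0/1 outcome per row.
states = ["CA", "CA", "NY", "NY", "NY", "TX"]
outcome = [1, 0, 1, 1, 0, 0]

state_rate = mean_by_category(states, outcome)

# New numeric column that stands in for the many-level categorical variable.
state_encoded = [state_rate[s] for s in states]
```

The rates should be computed on the training set only and then looked up for new rows; a level that never appears in training needs a fallback value, which this sketch does not handle.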

Iterate through association rules using the header of an itemset

人盡茶涼 submitted on 2019-12-24 11:23:00
Question: I have a data frame of inputs which looks like this. I generate association rules using pandas:
    frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
My output only lists the values in each itemset, without labeling them with their column header. It looks something like the output below. My questions are: 1. I want to label the antecedents and consequents with their header names (Age, AL, Sex, etc.) because I can't ...

R arules : Extract lhs items from rules

最后都变了- submitted on 2019-12-24 07:28:12
Question: I want to extract the lhs items from a rule generated by arules. For example, given {a,b,c} => {d}, I want to extract a, b, c and put them in a character vector so that I can iterate over them and do further processing. At the moment, the only approach I can think of is parsing the set of rules, converting it to a data frame, and then separating the items using character manipulation or regex. I hope there's a better way of extracting these items.
Answer 1: Just coerce the LHS and/or the RHS into a list: data("Adult") ...

Text classification extract tags from text

送分小仙女□ submitted on 2019-12-24 00:03:32
Question: I have a Lucene index with a lot of text data. Each item has a description, and I want to extract the most common words from the description and generate tags to classify each item. Is there a Lucene.NET library for doing this, or any other library for text classification?
Answer 1: No. Lucene.NET can handle search, indexing, text normalization, and "find more like this" functionality, but not text classification. What to suggest depends on your requirements. So, maybe more ...

Best XML format for log events in terms of tool support for data mining and visualization?

久未见 submitted on 2019-12-22 12:54:31
Question: We want to be able to create log files from our Java application that are suited for later processing by tools, to help investigate bugs and gather performance statistics. Currently we use the traditional "log stuff which may or may not be flattened into text form and appended to a log file" approach, but this works best for small amounts of information read by a human. After careful consideration, the best bet has been to store the log events as XML snippets in text files (which is then treated ...

java Open source projects for medical diagnose & data mining [closed]

对着背影说爱祢 submitted on 2019-12-22 10:06:14
Question: Closed. This question is off-topic and is not currently accepting answers. Closed 4 years ago. I'm looking for open-source Java engines for medical diagnosis. These are engines that take queries from a user describing a patient's symptoms and return suggestions of potential diseases according to those symptoms. Do such engines exist somewhere? I prefer some Java open-source engine in this ...

how to use different distance formula other than euclidean distance in k means

主宰稳场 submitted on 2019-12-21 20:42:28
Question: I am working with latitude/longitude data. I have to form clusters based on the distance between two points. The distance between two points is =ACOS(SIN(lat1)*SIN(lat2)+COS(lat1)*COS(lat2)*COS(lon2-lon1))*6371. I want to use k-means in R. Is there any way I can override the distance calculation in that process?
Answer 1: K-means is not distance based. It is based on variance minimization. The sum-of-variance formula equals the sum of squared Euclidean distances, but the converse, for other ...
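The distance in the question is the spherical law of cosines with an Earth radius of 6371 km. Below is a minimal sketch of that formula in Python rather than R, assuming the coordinates are supplied in degrees (the trigonometric functions themselves expect radians). The name `great_circle_km` is hypothetical, and the clamp before `acos` is an added safeguard against rounding, not part of the original formula.

```python
import math

EARTH_RADIUS_KM = 6371.0  # same constant as in the question's formula


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance via the spherical law of cosines (inputs in degrees)."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(dlon))
    # Rounding can push the value slightly outside [-1, 1], which acos rejects.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle) * EARTH_RADIUS_KM


# A quarter of the way around the equator.
quarter_equator = great_circle_km(0.0, 0.0, 0.0, 90.0)
```

A function like this can fill a pairwise distance matrix for clustering methods that accept arbitrary dissimilarities, such as hierarchical clustering. It does not turn k-means into a great-circle method, for the reason the answer gives.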

How to find the minimum support in Apriori algorithm

我与影子孤独终老i submitted on 2019-12-21 11:32:12
Question: When the percentage values of support and confidence are given, how can I find the minimum support in the Apriori algorithm? For example, when support and confidence are given as 60% and 60% respectively, what is the minimum support?
Answer 1: Support and confidence are measures of how interesting a rule is. The minimum support and minimum confidence are set by the user and are parameters of the Apriori algorithm for association rule generation. These parameters are used to exclude rules ...
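To make the two measures concrete, here is a minimal sketch in plain Python of how support and confidence are computed for a single rule and compared against user-chosen thresholds. The transactions and the function names `support` and `confidence` are hypothetical; the 60% thresholds are taken from the question. This covers only the filtering step, not the full Apriori candidate generation.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

min_support = 0.6      # 60%, chosen by the user
min_confidence = 0.6   # 60%, chosen by the user

sup = support(transactions, {"bread", "milk"})         # 3 of 5 transactions
conf = confidence(transactions, {"bread"}, {"milk"})   # 3 of the 4 bread transactions

# The rule {bread} => {milk} is kept only if it clears both thresholds.
keep_rule = sup >= min_support and conf >= min_confidence
```

With five transactions, a 60% minimum support means an itemset must appear in at least three of them. The `confidence` function assumes the antecedent occurs at least once; otherwise it would divide by zero.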

In scikit-learn, can DBSCAN use sparse matrix?

谁都会走 submitted on 2019-12-21 09:07:44
Question: I got a MemoryError when running the DBSCAN algorithm from scikit-learn. My data is about 20000*10000, and it's a binary matrix. (Maybe DBSCAN isn't suitable for such a matrix. I'm a beginner at machine learning and just want a clustering method that doesn't need an initial number of clusters.) Anyway, I found the sparse matrix and feature extraction documentation: http://scikit-learn.org/dev/modules/feature_extraction.html http://docs.scipy.org/doc/scipy/reference/sparse.html But I still have no idea ...

Plot the cluster member in r

梦想与她 submitted on 2019-12-21 06:21:06
Question: I use the DTW package in R, and I finally finished hierarchical clustering. But I want to plot each time-series cluster separately, like in the picture below.
    sc <- read.table("D:/handling data/confirm.csv", header=T, sep=",")
    rownames(sc) <- sc$STDR_YM_CD
    sc$STDR_YM_CD <- NULL
    col_n <- colnames(sc)
    hc <- hclust(dist(sc), method="average")
    plot(hc, main="")
How can I do it? My data is at http://blogattach.naver.com/e772fb415a6c6ddafd1370417f96e494346a9725/20170207_141_blogfile/khm2963_1486442387926_THgZRt_csv