data-mining

Plot the cluster members in R

眉间皱痕 submitted on 2019-12-21 06:20:53
Question: I use the dtw package in R, and I have finished the hierarchical clustering, but I want to plot the time series of each cluster separately, as in the picture below (image not included).

```r
sc <- read.table("D:/handling data/confirm.csv", header = TRUE, sep = ",")
rownames(sc) <- sc$STDR_YM_CD
sc$STDR_YM_CD <- NULL
col_n <- colnames(sc)
hc <- hclust(dist(sc), method = "average")
plot(hc, main = "")
```

How can I do this? My data is at http://blogattach.naver.com/e772fb415a6c6ddafd1370417f96e494346a9725/20170207_141_blogfile/khm2963_1486442387926_THgZRt_csv
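One way to get the per-cluster panels (a sketch, assuming each row of sc is one time series and a hypothetical k = 4; adjust both to your data) is to cut the dendrogram with cutree and plot each group's members:

```r
k <- 4                        # assumed number of clusters
groups <- cutree(hc, k = k)   # cluster label for every row of sc
par(mfrow = c(2, 2))          # one panel per cluster
for (g in sort(unique(groups))) {
  members <- sc[groups == g, , drop = FALSE]
  matplot(t(members), type = "l", lty = 1,
          main = paste("Cluster", g), xlab = "time", ylab = "value")
}
```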

Sentence to Word Table with R

断了今生、忘了曾经 submitted on 2019-12-21 06:01:20
Question: I have some sentences, and from each sentence I want to extract its words into a row vector. But the words are being repeated to pad each row vector out to the length of the largest sentence's row vector, which I do not want. No matter how large a sentence is, its row vector should contain each of its words only once.

```r
sentence <- c("case sweden",
              "meeting minutes ht board meeting st march now also attachment added agenda today s board meeting",
              "draft meeting minutes board meeting final meeting minutes")
```
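A minimal sketch of one way to do this: build a binary document-term matrix, so each sentence's row vector marks a word once no matter how often it occurs (the data here is a shortened stand-in for the vector above):

```r
sentences <- c("case sweden",
               "board meeting minutes board meeting")   # hypothetical data
words <- strsplit(sentences, "\\s+")                    # tokenize on whitespace
vocab <- sort(unique(unlist(words)))                    # one column per distinct word
dtm <- t(sapply(words, function(w) as.integer(vocab %in% w)))
colnames(dtm) <- vocab
dtm  # each row: 1 if the word occurs in that sentence, else 0
```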

Computing F-measure for clustering

99封情书 submitted on 2019-12-21 05:41:15
Question: Can anyone help me calculate the F-measure collectively? I know how to calculate recall and precision, but I don't know how to calculate a single F-measure value for a given algorithm. As an example, suppose my algorithm creates m clusters, but I know there are n clusters for the same data (as created by another benchmark algorithm). I found one PDF, but it is not useful, since the collective value I got is greater than 1. The reference of the PDF is "F Measure explained". Specifically, I have read some research
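One common convention (an assumption here, not necessarily the PDF's) is the pairwise F-measure: treat every pair of points as a decision, count a true positive when both the algorithm and the benchmark put the pair in the same cluster, and combine precision and recall as F = 2PR/(P + R), which is always in [0, 1]. A minimal sketch:

```r
pairwise_f <- function(pred, truth) {
  pairs <- combn(length(pred), 2)                      # all point pairs
  same_pred  <- pred[pairs[1, ]]  == pred[pairs[2, ]]
  same_truth <- truth[pairs[1, ]] == truth[pairs[2, ]]
  tp <- sum(same_pred & same_truth)                    # pairs both put together
  precision <- tp / sum(same_pred)
  recall    <- tp / sum(same_truth)
  2 * precision * recall / (precision + recall)
}
pairwise_f(pred = c(1, 1, 2, 2), truth = c(1, 1, 1, 2))  # 0.4
```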

Python, web log data mining for frequent patterns

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-21 05:16:09
Question: I need to develop a tool for web-log data mining. Given many sequences of URLs requested within a particular user session (retrieved from web-application logs), I need to figure out the usage patterns and the groups (clusters) of users of the website. I am new to data mining and am currently searching Google a lot. I have found some useful information; e.g., querying "Frequent Pattern Mining in Web Log Data" points to almost exactly similar studies. So my questions are: Are there any Python-based tools that do
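The question asks for Python-based tools, but to make the task itself concrete, here is a minimal frequent-itemset sketch in R (the language used elsewhere in this digest) with the arules package; the session data is hypothetical:

```r
library(arules)
# each session is the set of URLs requested by one user
sessions <- list(
  c("/home", "/products", "/cart"),
  c("/home", "/products"),
  c("/home", "/cart", "/checkout")
)
trans <- as(sessions, "transactions")
freq <- apriori(trans,
                parameter = list(supp = 0.5, target = "frequent itemsets"))
inspect(freq)  # URL sets that co-occur in at least half the sessions
```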

Cosine distance as vector distance function for k-means

醉酒当歌 submitted on 2019-12-21 03:41:32
Question: I have a graph of N vertices, where each vertex represents a place. I also have vectors, one per user, each with N coefficients, where a coefficient's value is the duration in seconds spent at the corresponding place, or 0 if that place was not visited. E.g., for the graph (image not included), the vector v1 = {100, 50, 0, 30, 0} would mean that we spent 100 s at vertex 1, 50 s at vertex 2, and 30 s at vertex 4 (vertices 3 and 5 were not visited, hence the 0s). I want to run a k-means clustering and I've
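A common trick (a sketch, not necessarily what the poster ended up doing): L2-normalize each user vector and run standard k-means, since on unit vectors the squared Euclidean distance equals 2(1 - cosine similarity), so the clustering ranks points the same way cosine distance does. The centroids are then only approximately unit-length, and all-zero rows would need special handling before normalizing:

```r
set.seed(42)
v <- matrix(runif(50 * 5), nrow = 50)   # 50 hypothetical user vectors, N = 5
v_unit <- v / sqrt(rowSums(v^2))        # project rows onto the unit sphere
km <- kmeans(v_unit, centers = 3, nstart = 20)
km$cluster                              # cluster assignment per user
```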

Exact implementation of RandomForest in Weka 3.7

天涯浪子 submitted on 2019-12-21 03:15:11
Question: Having reviewed the original Breiman (2001) paper as well as some other board posts, I am slightly confused about the actual procedure used by WEKA's random forest implementation. None of the sources was sufficiently detailed, and many even contradict each other. How does it work in detail, and which steps are carried out? My understanding so far:

- For each tree, a bootstrap sample of the same size as the training data is created.
- Only a random subset of the available features, of a defined size,
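Weka's source code is the authority on its exact steps, but for comparison, the same Breiman (2001) procedure is what R's randomForest package implements, with ntree controlling the number of bootstrap samples and mtry the size of the random feature subset tried at each split (a sketch on a stock dataset):

```r
library(randomForest)
data(iris)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500,                          # one bootstrap sample per tree
                   mtry  = floor(sqrt(ncol(iris) - 1)))  # random features per split
print(rf)  # reports the out-of-bag error estimate
```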

Is the triangle inequality necessary for k-means?

你。 submitted on 2019-12-21 02:44:34
Question: I wonder whether the triangle inequality is necessary for the distance measure used in k-means.

Answer 1: k-means is designed for Euclidean distance, which happens to satisfy the triangle inequality. Using other distance functions is risky, as the algorithm may stop converging. The reason, however, is not the triangle inequality, but that the mean might not minimize the distance function. (The arithmetic mean minimizes the sum of squares, not arbitrary distances!) There are faster methods for k-means that exploit the triangle inequality.
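A small demo of the answer's point, using a skewed one-dimensional sample: the center minimizing the sum of squared distances is the mean, while another distance (absolute value) is minimized by another center (the median):

```r
set.seed(1)
x <- rexp(100)                                    # skewed sample
grid <- seq(min(x), max(x), length.out = 10000)
sse <- sapply(grid, function(c) sum((x - c)^2))   # sum of squared distances
sad <- sapply(grid, function(c) sum(abs(x - c)))  # sum of absolute distances
c(argmin_sse = grid[which.min(sse)], mean   = mean(x))    # essentially equal
c(argmin_sad = grid[which.min(sad)], median = median(x))  # essentially equal
```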

What data mining tools do you use? [closed]

拟墨画扇 submitted on 2019-12-21 01:18:15
Question (closed 7 years ago as likely to solicit debate rather than answers): Besides the two well-known open-source tools RapidMiner and Weka, are there any other good tools (either open source or commercial),

Architecture for database analytics

喜欢而已 submitted on 2019-12-20 10:46:23
Question: We have an architecture where we provide each customer Business-Intelligence-like services for their website (an internet merchant). Now I need to analyze those data internally (for algorithmic improvement, performance tracking, etc.), and they are potentially quite heavy: we have up to millions of rows per customer per day, and I may want to know how many queries we had in the last month, compared week by week, etc. That is on the order of billions of entries, if not more. The way it is currently done is
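One common pattern at this volume (an assumption, not necessarily the poster's current setup) is to pre-aggregate the raw events into per-customer daily rollups, so that weekly and monthly reports scan the small rollup table rather than billions of raw rows. A minimal sketch with hypothetical data:

```r
log <- data.frame(                       # one row per raw query event
  customer = c("a", "a", "b", "a"),
  day      = as.Date(c("2019-12-01", "2019-12-01", "2019-12-02", "2019-12-02")),
  queries  = 1
)
daily <- aggregate(queries ~ customer + day, data = log, FUN = sum)
daily$week  <- format(daily$day, "%Y-%U")   # week-of-year bucket
daily$month <- format(daily$day, "%Y-%m")
weekly  <- aggregate(queries ~ customer + week,  data = daily, FUN = sum)
monthly <- aggregate(queries ~ customer + month, data = daily, FUN = sum)
```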

Javascript and Scientific Processing? [closed]

断了今生、忘了曾经 submitted on 2019-12-20 08:41:08
Question (closed 4 years ago as opinion-based): MATLAB, R, and Python are powerful, but either costly or slow for some data-mining work I'd like to do. I'm considering using JavaScript, both for speed and its good visualization libraries, and to be able to use the browser as an interface. The first question I faced is the obvious