data-mining | 易学教程

What makes the distance measure in k-medoid “better” than k-means?

Question: I am reading about the difference between k-means clustering and k-medoid clustering. Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm instead of the more familiar sum of squared Euclidean distances that k-means uses to evaluate variance. Apparently this different distance measure somehow reduces the influence of noise and outliers. I have seen this claim, but I have yet to see any good reasoning about the mathematics behind it.
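One commonly given argument can be checked on a toy example. The k-means centre is the mean, which minimises squared distances, so a single far-away point contributes a very large term and drags the centre toward it. The medoid must be an actual data point and minimises the sum of unsquared dissimilarities, so the same outlier shifts it far less. The sketch below is only an illustration on made-up 1-D data, not a proof; the `medoid` helper and the sample values are assumptions introduced here.

```python
from statistics import mean

def medoid(points):
    """Return the data point with the smallest total absolute distance to all points."""
    return min(points, key=lambda c: sum(abs(c - p) for p in points))

# Made-up 1-D cluster, plus the same cluster with one extreme outlier added.
cluster = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = cluster + [100.0]

mean_before, mean_after = mean(cluster), mean(with_outlier)
medoid_before, medoid_after = medoid(cluster), medoid(with_outlier)

print("mean:  ", mean_before, "->", mean_after)      # the mean is pulled toward 100
print("medoid:", medoid_before, "->", medoid_after)  # the medoid stays inside the bulk
```

Here the mean moves well outside the original cluster, while the medoid remains one of the original five points.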

How to apply DBSCAN algorithm on grouping of similar url [closed]

Question (closed 7 years ago): How can I group similar URLs using the DBSCAN algorithm? I have seen many datasets, but none consisted of URLs. I want to take URLs of a similar type and group them together. Here I am not able to determine the distance (eps) and

Error in extracting phrases using Gensim

Question: I am trying to get the bigrams in the sentences using Phrases in Gensim as follows.

    from gensim.models import Phrases
    from gensim.models.phrases import Phraser

    documents = ["the mayor of new york was there", "machine learning can be useful sometimes", "new york mayor was present"]
    sentence_stream = [doc.split(" ") for doc in documents]
    #print(sentence_stream)
    bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
    bigram_phraser = Phraser(bigram)
    for sent in sentence_stream

K-Medoids / K-Means Algorithm. Data point with the equal distances between two or more cluster representatives

Question: I have been researching partition-based clustering algorithms like K-means and K-medoids. I have learned that K-medoids is more robust to outliers than K-means. However, I am curious about what happens if, during the assignment of data points, two or more cluster representatives are at the same distance from a data point. Which cluster should the data point be assigned to? Will that assignment greatly affect the clustering results? Answer 1: To prevent

Data Mining Operation using SQL Query (Fuzzy Apriori Algorithm) - How do i code it using SQL?

Question: So I have this table:

    Trans_ID  Name  Fuzzy_Value  Total_Item
    100       I1    0.33333333   3
    100       I2    0.33333333   3
    100       I5    0.33333333   3
    200       I2    0.5          2
    200       I5    0.5          2
    300       I2    0.5          2
    300       I3    0.5          2
    400       I1    0.33333333   3
    400       I2    0.33333333   3
    400       I4    0.33333333   3
    500       I1    0.5          2
    500       I3    0.5          2
    600       I2    0.5          2
    600       I3    0.5          2
    700       I1    0.5          2
    700       I3    0.5          2
    800       I1    0.25         4
    800       I2    0.25         4
    800       I3    0.25         4
    800       I5    0.25         4
    900       I1    0.33333333   3
    900       I2    0.33333333   3
    900       I3    0.33333333   3
    1000      I1    0.2          5
    1000      I2    0.2          5
    1000      I4    0.2          5
    1000      I6    0.2          5
    1000      I8    0.2          5

How to group nearby latitude and longitude locations stored in SQL

Question: I'm trying to analyse data from cycle accidents in the UK to find statistical black spots. Here is an example of the data from another website: http://www.cycleinjury.co.uk/map I am currently using SQLite to store ~100k lat/lon locations. I want to group nearby locations together. This task is called cluster analysis. I would like to simplify the dataset by ignoring isolated incidents and instead showing only the origin of clusters where more than one accident has taken place in a small area.
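One simple way to approximate this is to bucket coordinates into a fixed grid and keep only the cells that contain more than one point. The sketch below does this in Python on rows that would first be read out of SQLite. It is a rough approach under stated assumptions: the function name, the cell size of 0.01 degrees (roughly 1 km north to south), and the sample coordinates are all made up for illustration, and points that sit close together on opposite sides of a cell border will land in different cells.

```python
import math
from collections import defaultdict

def grid_clusters(points, cell_deg=0.01, min_count=2):
    """Bucket (lat, lon) pairs into square grid cells and keep cells with enough points.

    Returns a list of ((centre_lat, centre_lon), count) tuples.
    """
    cells = defaultdict(list)
    for lat, lon in points:
        # Integer cell index along each axis.
        key = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        cells[key].append((lat, lon))

    clusters = []
    for members in cells.values():
        if len(members) >= min_count:  # drop isolated incidents
            centre_lat = sum(p[0] for p in members) / len(members)
            centre_lon = sum(p[1] for p in members) / len(members)
            clusters.append(((centre_lat, centre_lon), len(members)))
    return clusters

# Made-up sample: three nearby points and one isolated point.
accidents = [
    (51.5007, -0.1246),
    (51.5008, -0.1247),
    (51.5009, -0.1245),
    (53.4808, -2.2426),
]
clusters = grid_clusters(accidents)
print(clusters)
```

A density-based method such as DBSCAN avoids the cell-border problem, at the cost of needing a neighbourhood radius and a minimum point count to be chosen.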

How can I perform K-means clustering on time series data?

Question: How can I do K-means clustering of time series data? I understand how this works when the input data is a set of points, but I don't know how to cluster time series of size 1 × M, where M is the data length. In particular, I'm not sure how to update the mean of the cluster for time series data. I have a set of labelled time series, and I want to use the K-means algorithm to check whether I get back similar labels or not. My X matrix will be N × M, where N is the number of time series and M is
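One straightforward reading is to treat each length-M series as a point in M-dimensional space, so the cluster mean is just the element-wise average of its member series at each time index. The sketch below shows a single assignment and update step under that assumption, using plain Euclidean distance. The function names and the tiny data set are hypothetical, and this approach ignores time shifts between series, which may or may not matter for the data in question.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(series, centroids):
    """Label each series with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: euclidean(s, centroids[k]))
        for s in series
    ]

def update(series, labels, centroids):
    """Recompute each centroid as the element-wise mean of its member series."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [s for s, label in zip(series, labels) if label == k]
        if not members:
            new_centroids.append(list(old))  # keep an empty cluster's centroid
            continue
        # zip(*members) groups the values of all members at each time index.
        new_centroids.append([sum(vals) / len(members) for vals in zip(*members)])
    return new_centroids

# Made-up N x M matrix with N = 4 series of length M = 4.
X = [
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 1.0, 2.0, 4.0],
    [10.0, 9.0, 8.0, 7.0],
    [10.0, 9.0, 8.0, 6.0],
]
centroids = [X[0], X[2]]          # naive initialisation from two rows
labels = assign(X, centroids)
centroids = update(X, labels, centroids)
print(labels, centroids)
```

Repeating `assign` and `update` until the labels stop changing gives the usual K-means loop; the resulting labels can then be compared with the known ones.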

Choosing eps and minpts for DBSCAN (R)?

Question: I've been searching for an answer to this question for quite a while, so I'm hoping someone can help me. I'm using dbscan from the fpc library in R. For example, I am looking at the USArrests data set and am using dbscan on it as follows:

    library(fpc)
    ds <- dbscan(USArrests,eps=20)

Choosing eps was merely by trial and error in this case. However, I am wondering if there is a function or code available to automate the choice of the best eps/minpts. I know some books recommend producing a plot
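A widely used plot-based heuristic for eps is the sorted k-distance plot: compute each point's distance to its k-th nearest neighbour, sort those values, and look for the "knee" where they start rising sharply. Whether that is the plot the books in question recommend is an assumption here. The sketch below computes the sorted k-distances in plain Python (not R) on made-up 2-D points; the function name and the data are hypothetical.

```python
import math

def k_distances(points, k):
    """Return the sorted distance from each point to its k-th nearest neighbour."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Made-up data: four tightly packed points and one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
kd = k_distances(points, k=2)
print(kd)
```

Plotting `kd` against its index shows a flat stretch for the dense points and a jump for the isolated one; a value of eps just below the jump is a common starting choice, with k playing the role of minpts.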

FCM Clustering numeric data and csv/excel file

Question: Hi, I asked a previous question ("Fuzzy c-means tcp dump clustering in matlab") that got a reasonable answer, and I thought I was back on track. The problem is the preprocessing stage of the TCP/UDP data below, which I would like to run through MATLAB's fcm clustering algorithm. My question: 1) How do I, or what would be the best method to, convert the text data in the cells to numeric values? What should the numeric values be? Edit: My data in Excel looks like this now: 0,tcp,http,SF,239,486,0,0,0,0,0
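One common option for text columns like these is one-hot encoding: each distinct text value becomes its own 0/1 column, so no artificial ordering or distance is imposed between categories (as would happen if tcp were coded 1 and udp 2). The sketch below shows the idea in Python rather than MATLAB. The function name is hypothetical, only the first six fields of the sample row above are used, and the second row is invented purely so that each text column has more than one value.

```python
def one_hot_encode(rows, categorical_cols):
    """Replace each categorical column with one 0/1 column per distinct value."""
    # Sorted list of the values seen in each categorical column.
    vocab = {c: sorted({row[c] for row in rows}) for c in categorical_cols}
    encoded = []
    for row in rows:
        out = []
        for c, value in enumerate(row):
            if c in vocab:
                out.extend(1.0 if value == v else 0.0 for v in vocab[c])
            else:
                out.append(float(value))  # numeric columns pass through
        encoded.append(out)
    return encoded, vocab

rows = [
    ["0", "tcp", "http", "SF", "239", "486"],  # start of the row from the question
    ["0", "udp", "dns", "SF", "45", "44"],     # invented row, for illustration only
]
encoded, vocab = one_hot_encode(rows, categorical_cols={1, 2, 3})
print(vocab)
print(encoded[0])
```

Because fuzzy c-means relies on distances, it usually also helps to rescale the numeric columns so that large byte counts do not swamp the 0/1 columns.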

Is there a good way to do this type of mining?

Question: I am trying to find points that are closest in space in the X and Y directions (sample dataset given at the end), and I am looking to see if there are smarter approaches than my trivial (and untested) one. The plot of these points in space looks something like the following, and I am trying to find the sets of points marked inside the boxes, i.e. the output I am looking for is a set of groups: Group 1: (1,23), (2,23), (3,23)... Group 2: (68,200), (68,201), (68,203), (68,204), (68,100), (68