data-mining

Calculate ordering of dendrogram leaves

此生再无相见时 submitted on 2020-01-13 06:01:08
Question: I have five points and I need to create a dendrogram from them. The function 'dendrogram' can be used to find the ordering of these points, as shown below. However, I do not want to use dendrogram, as it is slow and results in an error for a large number of points (I asked about this in "Python alternate way to find dendrogram"). Can someone show me how to convert the 'linkage' output (Z) to the "dendrogram(Z)['ivl']" value? >>> from hcluster import pdist, linkage, dendrogram >>> import numpy >>>
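A minimal sketch of one way to read the leaf order straight from the linkage matrix without drawing anything: walk the merge tree from the root and visit the first child of each merge before the second. It assumes Z uses the SciPy/hcluster linkage layout, where row i merges the clusters named in columns 0 and 1 into a new cluster with id n + i. The name `leaf_order` is hypothetical. The result should match the dendrogram leaf order only under its default settings (no count or distance sorting).

```python
def leaf_order(Z):
    """Return leaf indices in left-to-right order for a linkage matrix Z.

    Z is assumed to be a sequence of rows [left_id, right_id, distance, count],
    as produced by linkage(); ids below n are original observations.
    """
    n = len(Z) + 1                  # number of original observations
    order = []
    stack = [2 * n - 2]             # id of the root cluster (last merge)
    while stack:
        node = stack.pop()
        if node < n:
            order.append(node)      # an original observation: emit it
        else:
            left = int(Z[node - n][0])
            right = int(Z[node - n][1])
            stack.append(right)     # pushed first so that left is popped first
            stack.append(left)
    return order


# Illustrative linkage for three points: 0 and 1 merge first, then 2 joins them.
Z_example = [[0, 1, 1.0, 2],
             [2, 3, 2.0, 3]]
order = leaf_order(Z_example)
labels = [str(i) for i in order]    # string labels, assuming no custom labels were passed
```

The traversal is iterative, so it does not hit Python's recursion limit on deep trees. SciPy also provides `scipy.cluster.hierarchy.leaves_list(Z)`, which returns the leaf ids in tree-traversal order.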

Web scraping, screen scraping, data mining tips? [closed]

ⅰ亾dé卋堺 submitted on 2020-01-11 19:50:27
Question: Closed 7 years ago. I'm working on a project and I need to do a lot of screen scraping to get a lot of data as fast as possible. I'm wondering if anyone

Scikit-learn: How to run KMeans on a one-dimensional array?

佐手、 submitted on 2020-01-09 19:08:20
Question: I have an array of 13,876 values between 0 and 1. I would like to apply sklearn.cluster.KMeans to just this vector to find the clusters into which the values are grouped. However, it seems KMeans works with a multidimensional array and not with one-dimensional ones. I guess there is a trick to make it work, but I don't know how. I saw that KMeans.fit() accepts "X : array-like or sparse matrix, shape=(n_samples, n_features)", but it wants the n_samples to be bigger than one I
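One common way to satisfy the (n_samples, n_features) shape is to treat each value as a sample with a single feature, that is, reshape the vector to a column. The sketch below assumes NumPy and scikit-learn are installed; the six values and the choice of two clusters are made up for illustration and are not from the question.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative 1-D data (assumption): two obvious groups between 0 and 1.
values = np.array([0.05, 0.10, 0.12, 0.88, 0.90, 0.95])

# reshape(-1, 1) turns shape (6,) into (6, 1): six samples, one feature each.
X = values.reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)           # one cluster label per original value
centers = km.cluster_centers_.ravel()
```

The labels come back in the same order as the input values, so `values[labels == k]` selects the members of cluster k.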

Fuzzy queries to database

时光总嘲笑我的痴心妄想 submitted on 2020-01-06 19:32:35
Question: I'm curious how a feature found on many social sites today works. For example, you enter a list of movies you like and the system suggests other movies you may like (based on the movies liked by other people who like the same movies as you). I think doing it the straight SQL way (joining the list of my movies with movies-users, joining with user-movies, grouping by movie title and applying a count to it) on large datasets would be just impossible to implement due to the "heaviness" of such a query. At the same time we don't need
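The counting logic described in the question can be sketched in a few lines, which helps separate the idea from the cost of running it as one large SQL join. This is a toy in-memory version under stated assumptions: `likes` maps each user to a set of movie titles, `recommend` is a hypothetical helper, and the users and movies are invented. It does not address the scaling concern.

```python
from collections import Counter


def recommend(likes, user, top_n=3):
    """Suggest movies liked by users who share at least one movie with `user`.

    Each candidate movie is scored by the summed overlap of the users who like it.
    """
    mine = likes[user]
    scores = Counter()
    for other, movies in likes.items():
        if other == user:
            continue
        overlap = len(mine & movies)     # how many movies we have in common
        if overlap == 0:
            continue                     # unrelated user, contributes nothing
        for movie in movies - mine:      # only movies the user has not listed
            scores[movie] += overlap
    return [movie for movie, _ in scores.most_common(top_n)]


# Invented example data (assumption).
likes = {
    "alice": {"Alien", "Blade Runner"},
    "bob": {"Alien", "Blade Runner", "Dune"},
    "carol": {"Alien", "Heat"},
    "dave": {"Heat", "Casino"},
}
suggestions = recommend(likes, "alice")
```

Weighting by overlap means a user who shares two movies with you counts twice as much as one who shares a single movie. Other weightings are possible.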

Ordered colored plot after clustering using python

浪子不回头ぞ submitted on 2020-01-05 18:26:40
Question: I have a 1D array called data=[5 1 100 102 3 4 999 1001 5 1 2 150 180 175 898 1012]. I am using Python's scipy.cluster.vq to find clusters within it. There are 3 clusters in the data. After clustering, when I try to plot the data, there is no order in it. It would be great if it were possible to plot the data in the same order as it is given and color the sections that belong to different groups or clusters. Here is my code: import numpy as np import matplotlib.pyplot as plt from scipy

Lift value calculation

假装没事ソ submitted on 2020-01-03 14:00:40
Question: I have a (symmetric) adjacency matrix, which has been created based on the co-occurrence of names (e.g.: Greg, Mary, Sam, Tom) in newspaper articles (e.g.: a, b, c, d). See below. How do I calculate the lift value for the non-zero matrix elements (http://en.wikipedia.org/wiki/Lift_(data_mining))? I would be interested in an efficient implementation, which could also be used for very large matrices (e.g. a million non-zero elements). I appreciate any help. # Load package library(Matrix) # Data A <-
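For reference, the lift of a pair is P(A and B) / (P(A) P(B)), which in counts becomes n * c_AB / (c_A * c_B), where n is the number of articles, c_AB the number of articles mentioning both names, and c_A, c_B the per-name article counts. The sketch below is in Python rather than the question's R and works from the article-to-names lists instead of the adjacency matrix, since the formula also needs n and the per-name counts. The function name `lift_values` and the article contents are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def lift_values(articles):
    """Return {(name_a, name_b): lift} for every pair that co-occurs at least once."""
    n = len(articles)
    single = Counter()    # number of articles mentioning each name
    pair = Counter()      # number of articles mentioning both names of a pair
    for names in articles.values():
        unique = sorted(set(names))              # count each name once per article
        single.update(unique)
        pair.update(combinations(unique, 2))     # pairs in sorted order: (a, b) with a < b
    return {
        (a, b): n * count / (single[a] * single[b])
        for (a, b), count in pair.items()
    }


# Invented article contents (assumption); only the names and article ids echo the question.
articles = {
    "a": ["Greg", "Mary"],
    "b": ["Greg", "Mary", "Sam"],
    "c": ["Sam", "Tom"],
    "d": ["Tom"],
}
lift = lift_values(articles)
```

Only co-occurring pairs are stored, so the result stays as sparse as the non-zero entries of the co-occurrence matrix.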