Question
I'm new to R and data analysis. I'm trying to create a simple custom recommendation system for a website. As input I have the user/session ID, item ID, and item price for each item a user clicked on:
c165c2ee-81cf-48cf-ba3f-83b70204c00c 161785 124.0
a886fdd5-7cee-4152-b1b7-77a2702687b0 643339 42.0
5e5fd670-b104-445b-a36d-b3798cd43279 131332 38.0
888d736f-99bc-49ca-969d-057e7d4bb8d1 1032763 39.0
I would like to apply cluster analysis to that data.
If I try to apply k-means clustering to my data:
> q <- kmeans(dat, centers=25)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(dat, centers = 25) : NAs introduced by coercion
If I try to apply hierarchical clustering to the data:
> m <- as.matrix(dat)
> d <- dist(m) # find distance matrix
Warning message:
In dist(m) : NAs introduced by coercion
The "NAs introduced by coercion" warning seems to occur because the first column is not numeric. So I tried running the code against dat[-1], but the result is the same.
What am I missing or doing wrong?
Thanks a lot in advance.
=== UPDATE #1 ===
Output of str(), before and after applying factor():
> str(dat)
'data.frame': 14634 obs. of 3 variables:
$ V3 : Factor w/ 10062 levels "000880bf-6cb7-4c4a-9a9d-1c0a975b52ba",..: 7548 6585 3670 5336 9181 6429 62 410 7386 9409 ...
$ V8 : Factor w/ 5561 levels "1000120","1000910",..: 835 3996 443 65 1289 2084 582 695 3666 4787 ...
$ V12: Factor w/ 395 levels "100.0","101.0",..: 25 278 249 256 352 249 1 88 361 1 ...
> dat[,1] = factor(dat[,1])
> str(dat)
'data.frame': 14634 obs. of 3 variables:
$ V3 : Factor w/ 10062 levels "000880bf-6cb7-4c4a-9a9d-1c0a975b52ba",..: 7548 6585 3670 5336 9181 6429 62 410 7386 9409 ...
$ V8 : Factor w/ 5561 levels "1000120","1000910",..: 835 3996 443 65 1289 2084 582 695 3666 4787 ...
$ V12: Factor w/ 395 levels "100.0","101.0",..: 25 278 249 256 352 249 1 88 361 1 ...
> dd <- dist(dat)
Warning message:
In dist(dat) : NAs introduced by coercion
> hc <- hclust(dd) # apply hierarchical clustering
Error in hclust(dd) : NA/NaN/Inf in foreign function call (arg 11)
=== UPDATE #2 ===
I would rather not remove the first column, since there can be multiple clicks from the same user, which I consider important for the analysis.
Answer 1:
It sounds like you want to retain the first column (even though 10062 levels for 14634 observations is quite high). One way to convert a factor to numeric values is with the model.matrix function, which expands it into 0/1 indicator columns. Before converting your factor:
data(iris)
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa
After model.matrix:
head(model.matrix(~.+0, data=iris))
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Speciessetosa Speciesversicolor Speciesvirginica
# 1          5.1         3.5          1.4         0.2             1                 0                0
# 2          4.9         3.0          1.4         0.2             1                 0                0
# 3          4.7         3.2          1.3         0.2             1                 0                0
# 4          4.6         3.1          1.5         0.2             1                 0                0
# 5          5.0         3.6          1.4         0.2             1                 0                0
# 6          5.4         3.9          1.7         0.4             1                 0                0
As you can see, it expands out your factor values. So you could then run k-means clustering on the expanded version of your data:
kmeans(model.matrix(~.+0, data=iris), centers=3)
# K-means clustering with 3 clusters of sizes 49, 50, 51
#
# Cluster means:
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Speciessetosa Speciesversicolor Speciesvirginica
# 1     6.622449    2.983673     5.573469    2.032653             0         0.0000000       1.00000000
# 2     5.006000    3.428000     1.462000    0.246000             1         0.0000000       0.00000000
# 3     5.915686    2.764706     4.264706    1.333333             0         0.9803922       0.01960784
# ...
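Applied to data shaped like the question's, this could look roughly like the sketch below. The toy rows and the short session ids (u1, u2, u3) are made up for illustration; only the column names V3, V8, V12 come from the str() output in the question. Note that the price column has to go through as.character() first, because as.numeric() on a factor returns the level codes rather than the values.

```r
# Hypothetical toy data in the same layout as the question's `dat`.
dat <- data.frame(
  V3  = factor(c("u1", "u2", "u1", "u3", "u2", "u3")),           # session id
  V8  = factor(c("161785", "643339", "131332",
                 "161785", "131332", "643339")),                 # item id
  V12 = factor(c("124.0", "42.0", "38.0", "124.0", "38.0", "42.0"))  # price
)

# Price is a real number stored as a factor: convert via character.
dat$V12 <- as.numeric(as.character(dat$V12))

# Expand the session-id factor into one 0/1 column per level and keep
# the numeric price; the item id is left out of this sketch.
m <- model.matrix(~ V3 + V12 + 0, data = dat)
colnames(m)

# k-means now receives a purely numeric matrix.
set.seed(1)
fit <- kmeans(m, centers = 2)
fit$cluster
```

Keep in mind that with 10062 session levels and 14634 rows the expanded matrix has about 147 million cells, which is over a gigabyte as a dense numeric matrix, so this approach may not scale comfortably to the full data.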
Answer 2:
Try dat[,1] = factor(dat[,1]). I think the NA comes from the session ID (first column), which is not a number. factor() would turn each session ID into an integer level index.
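As the question's Update #1 shows, calling factor() on its own did not make dist() work, so it may help to look at what the level codes actually are. The sketch below uses a made-up price vector (hypothetical values, named price_f here only for illustration) to contrast the integer codes with the underlying numbers.

```r
# Hypothetical prices stored as a factor, as in the question's V12 column.
price_f <- factor(c("124.0", "42.0", "38.0", "124.0"))

# Levels are sorted as strings: "124.0", "38.0", "42.0".
levels(price_f)

# Integer level codes: positions in the sorted level list, not prices.
codes <- as.integer(price_f)

# The actual numeric values, recovered via the character labels.
values <- as.numeric(as.character(price_f))

codes    # 1 3 2 1
values   # 124 42 38 124
```

For an id column such as the session id, the codes are just arbitrary labels, so distances computed on them carry no meaning even though they are numbers.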
Answer 3:
k-means only works for continuous data.
You have two id columns that must not be used for clustering; they will make your result meaningless.
But even then I doubt that k-means is the appropriate algorithm for your problem. You first need to understand your data, then preprocess and transform it into an appropriate representation.
Don't expect a push-button solution; those either don't exist or don't work.
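As one illustration of what "transforming into an appropriate representation" might mean, the sketch below collapses the click log into one row per session with two continuous features, then clusters those. The feature choice (click count and mean price), the toy rows, and the column names are all assumptions made for this example, not something the data itself dictates.

```r
# Hypothetical click log: one row per click.
clicks <- data.frame(
  session = c("s1", "s1", "s1", "s2", "s2", "s3", "s4", "s4"),
  item    = c("161785", "643339", "131332", "161785",
              "1032763", "643339", "131332", "1032763"),
  price   = c(124, 42, 38, 124, 39, 42, 38, 39),
  stringsAsFactors = FALSE
)

# One row per session: number of clicks and mean clicked price.
n_clicks   <- tapply(clicks$price, clicks$session, length)
mean_price <- tapply(clicks$price, clicks$session, mean)
features <- cbind(n_clicks   = as.vector(n_clicks),
                  mean_price = as.vector(mean_price))
rownames(features) <- names(n_clicks)

# Standardise so neither feature dominates the Euclidean distance.
scaled <- scale(features)

# k-means on continuous, scaled features; the ids are not used as inputs.
set.seed(42)
km <- kmeans(scaled, centers = 2, nstart = 10)

# Hierarchical clustering on the same features, cut into two groups.
hc <- hclust(dist(scaled))
groups <- cutree(hc, k = 2)
```

This keeps the information that a session has several clicks (the concern raised in Update #2) without feeding the id itself into the distance computation.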
Answer 4:
Don't use the Species column:
km <- kmeans(iris[, 1:4], 3)
km
K-means clustering with 3 clusters of sizes 50, 38, 62
Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     5.006000    3.428000     1.462000    0.246000
2     6.850000    3.073684     5.742105    2.071053
3     5.901613    2.748387     4.393548    1.433871
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3
[59] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2 2 2 3 3 2
[117] 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2 2 3
Within cluster sum of squares by cluster:
[1] 15.15100 23.87947 39.82097
(between_SS / total_SS = 88.4 %)
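Since k-means starts from random centers, the cluster numbering and sometimes the sizes can differ between runs. A small sketch of how one might make the run repeatable and compare the clusters against the held-out Species column follows; the seed and the nstart value are arbitrary choices for this example.

```r
# Fix the random seed so the result is repeatable, and try several
# random starts, keeping the best one.
set.seed(123)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Cross-tabulate cluster membership against the species that was left
# out of the clustering. Cluster numbers are arbitrary labels.
tab <- table(cluster = km$cluster, species = iris$Species)
tab
```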
Source: https://stackoverflow.com/questions/23203592/basic-clustering-with-r