Cosine distance as vector distance function for k-means

Submitted by 我只是一个虾纸丫 on 2019-12-03 11:52:16

Let's divide cosine similarity into parts and see how and why it works.

Cosine between 2 vectors - a and b - is defined as:

cos(a, b) = sum(a .* b) / (length(a) * length(b))

where .* is element-wise multiplication. The denominator is there just for normalization, so let's simply call it L. With it, our function turns into:

cos(a, b) = sum(a .* b) / L

which, in its turn, may be rewritten as:

cos(a, b) = (a[1]*b[1] + a[2]*b[2] + ... + a[n]*b[n]) / L =
          = a[1]*b[1]/L + a[2]*b[2]/L + ... + a[n]*b[n]/L

Let's get a bit more abstract and replace x * y / L with function g(x, y) (L here is constant, so we don't put it as function argument). Our cosine function thus becomes:

cos(a, b) = g(a[1], b[1]) + g(a[2], b[2]) + ... + g(a[n], b[n]) 

That is, each pair of elements (a[i], b[i]) is treated separately, and the result is simply the sum of all these per-pair treatments. This is good for your case, because you don't want different pairs (different vertices) to interfere with each other: if user1 visited only vertex2 and user2 visited only vertex1, then they have nothing in common, and the similarity between them should be zero. What you actually don't like is how the similarity between individual pairs - i.e. the function g() - is calculated.
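As a quick sanity check, here is a small sketch (using NumPy, with made-up visit-time vectors a and b) showing that summing the per-pair terms g(a[i], b[i]) reproduces the usual cosine similarity:

```python
import numpy as np

def g(x, y, L):
    """Per-pair contribution to cosine similarity; L is the normalizing constant."""
    return x * y / L

# Hypothetical visit times (seconds per vertex) for two users.
a = np.array([90.0, 10.0, 0.0])
b = np.array([70.0, 0.0, 30.0])

L = np.linalg.norm(a) * np.linalg.norm(b)

# Summing the per-pair terms...
pairwise_sum = sum(g(x, y, L) for x, y in zip(a, b))
# ...gives exactly the standard cosine similarity.
cosine = np.dot(a, b) / L

assert np.isclose(pairwise_sum, cosine)
```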

With the cosine function, the similarity between individual pairs looks like this:

g(x, y) = x * y / L

where x and y represent the time users spent on the vertex. And here's the main question: does multiplication represent similarity between individual pairs well? I don't think so. A user who spent 90 seconds on some vertex should be close to a user who spent, say, 70 or 110 seconds there, but much farther from users who spent 1000 or 0 seconds there. Multiplication (even normalized by L) is totally misleading here. What does it even mean to multiply 2 time periods?

The good news is that it is you who designs the similarity function. We have already decided that we are satisfied with independent treatment of pairs (vertices), and we only want the individual similarity function g(x, y) to do something reasonable with its arguments. And what is a reasonable function for comparing time periods? I'd say subtraction is a good candidate:

g(x, y) = abs(x - y)

This is not a similarity function but a distance function - the closer the values are to each other, the smaller the result of g() - but the idea is ultimately the same, so we can interchange them when we need to.

We may also want to increase impact of large mismatches by squaring the difference:

g(x, y) = (x - y)^2 

Hey! We've just reinvented (mean) squared error! We can now stick with MSE to calculate distance, or we can continue looking for a good g() function.
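To make that concrete, here is a sketch (again with made-up visit-time vectors) confirming that summing the squared per-pair differences is exactly the squared Euclidean distance - the quantity k-means minimizes:

```python
import numpy as np

def g(x, y):
    # Squared difference: large mismatches dominate the total.
    return (x - y) ** 2

# Hypothetical visit times (seconds per vertex) for two users.
a = np.array([90.0, 10.0, 0.0])
b = np.array([70.0, 0.0, 30.0])

total = sum(g(x, y) for x, y in zip(a, b))

# The sum of per-pair squared differences equals squared Euclidean distance.
assert np.isclose(total, np.sum((a - b) ** 2))
```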

Sometimes we may want not to increase but to smooth the difference. In this case we can use a log:

g(x, y) = log(1 + abs(x - y))   # the +1 avoids log(0) when x == y

We can give zeros special treatment like this:

g(x, y) = sign(x)*sign(y)*abs(x - y)   # sign(0) will turn whole expression to 0
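A tiny sketch of that zero-handling trick (with hypothetical visit times):

```python
import numpy as np

def g(x, y):
    # sign(0) == 0, so any pair where either user spent zero time
    # contributes nothing to the total distance.
    return np.sign(x) * np.sign(y) * abs(x - y)

assert g(0, 90) == 0    # one user never visited: no contribution
assert g(70, 90) == 20  # both visited: plain absolute difference
```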

Or we can go back from distance to similarity by inversing the difference:

g(x, y) = 1 / (1 + abs(x - y))   # the +1 avoids division by zero when x == y

Note that in the last few options we haven't used a normalization factor. In fact, you can come up with a good normalization for each case, or just omit it - normalization is not always needed or helpful. For example, in the cosine similarity formula, if you change the normalization constant L = length(a) * length(b) to L = 1, you will get different, but still reasonable, results. E.g., for some fixed vector v:

cos(v, [90, 90, 90]) == cos(v, [1000, 1000, 1000])                 # measuring angle only
cos_no_norm(v, [90, 90, 90]) < cos_no_norm(v, [1000, 1000, 1000])  # measuring both angle and magnitude
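A sketch of those two behaviors (v, cos, and cos_no_norm are illustrative names, not from any particular library):

```python
import numpy as np

def cos(a, b):
    # Standard cosine similarity: normalized, so only the angle matters.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cos_no_norm(a, b):
    # L = 1: plain dot product, so magnitude matters too.
    return np.dot(a, b)

v = np.array([1.0, 2.0, 3.0])
small = np.array([90.0, 90.0, 90.0])
large = np.array([1000.0, 1000.0, 1000.0])

assert np.isclose(cos(v, small), cos(v, large))       # same angle
assert cos_no_norm(v, small) < cos_no_norm(v, large)  # different magnitude
```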

Summarizing this long and mostly boring story, I would suggest rewriting cosine similarity/distance to use some kind of difference between the variables in the two vectors.

Cosine similarity is meant for the case where you do not want to take length into account, but only the angle. If you also want to include length, choose a different distance function.

Cosine distance is closely related to squared Euclidean distance (the only distance for which k-means is really defined), which is why spherical k-means works.

The relationship is quite simple:

Squared Euclidean distance, sum_i (x_i - y_i)^2, can be factored into sum_i x_i^2 + sum_i y_i^2 - 2 * sum_i x_i*y_i. If both vectors are normalized, i.e. length does not matter, then the first two terms are each 1. In this case, squared Euclidean distance is 2 - 2 * cos(x, y)!

In other words: Cosine distance is squared Euclidean distance with the data normalized to unit length.
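This identity is easy to verify numerically (a sketch with random vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
y = rng.normal(size=5)

# Normalize both vectors to unit length.
x /= np.linalg.norm(x)
y /= np.linalg.norm(y)

sq_euclidean = np.sum((x - y) ** 2)
cosine = np.dot(x, y)  # norms are 1, so the dot product is cos(x, y)

# Squared Euclidean distance on unit vectors is 2 - 2 * cos(x, y).
assert np.isclose(sq_euclidean, 2 - 2 * cosine)
```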

If you don't want to normalize your data, don't use Cosine.

Q1. Why is this assumption wrong?

As we can see from the definition, cosine similarity measures the angle between 2 vectors.

In your case, vector v1 lies flat along the first dimension, while c1 and c2 are both equally aligned away from the axes, and thus the cosine similarity has to be the same.

Note that the trouble lies with c1 and c2 pointing in the same direction. Any v1 will have the same cosine similarity with both of them.

Q2. Is the cosine distance a correct distance function for this case?

As we see from the example in hand, probably not.

Q3. What would be a better one given the nature of the problem?

Consider Euclidean Distance.
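A sketch of why Euclidean distance helps here. The specific vectors below are hypothetical (the original question's v1, c1, c2 are not shown), chosen so that c1 and c2 point in the same direction but differ in length:

```python
import numpy as np

v1 = np.array([1.0, 0.0])
c1 = np.array([1.0, 1.0])
c2 = np.array([10.0, 10.0])  # same direction as c1, 10x the length

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Cosine cannot tell c1 from c2 as seen from v1...
assert np.isclose(cos_sim(v1, c1), cos_sim(v1, c2))

# ...but Euclidean distance can.
assert np.linalg.norm(v1 - c1) < np.linalg.norm(v1 - c2)
```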
