Cosine similarity between a pair of arrays in BigQuery

Submitted by ↘锁芯ラ on 2019-12-11 08:14:46

Question


I have created a table that holds pairs of IDs and the coordinates for each of them, so that I can calculate the pairwise cosine similarity between them.

The table has four columns: id1, coord1, id2, and coord2.

The coords currently have 128 dimensions, but that can vary. Within a given table, the two coords of a pair always have the same number of dimensions.

coord1 and coord2 are repeated fields (arrays) of floating-point values.

Is there a way to calculate cosine similarity between them?

My expected output would have three columns: id1, id2, and cosine_similarity.


Answer 1:


The query below is for BigQuery Standard SQL:

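#standardSQL
SELECT id1, id2, (
  -- Correlated subquery: evaluated per row over that row's two arrays
  SELECT
    SUM(value1 * value2) /          -- dot product of coord1 and coord2
    SQRT(SUM(value1 * value1)) /    -- norm of coord1
    SQRT(SUM(value2 * value2))      -- norm of coord2
  FROM UNNEST(coord1) value1 WITH OFFSET pos1
  JOIN UNNEST(coord2) value2 WITH OFFSET pos2
  ON pos1 = pos2                    -- pair elements by array position
  ) cosine_similarity
FROM `project.dataset.table`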
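
For reference, the subquery above appears to be a direct transcription of the usual cosine similarity definition. Writing a for coord1 and b for coord2 (symbols introduced here only for illustration), with i running over the array offsets matched by pos1 = pos2:

```latex
\[
\text{cosine\_similarity}(\mathbf{a}, \mathbf{b})
  = \frac{\sum_{i} a_i \, b_i}
         {\sqrt{\sum_{i} a_i^{2}} \; \sqrt{\sum_{i} b_i^{2}}}
\]
```

The numerator corresponds to SUM(value1 * value2) and the two square roots to the SQRT(SUM(...)) terms. Dividing by them one after the other, as the query does, is mathematically the same as dividing by their product.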

Below is a dummy example for you to play with:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id1, [1.0, 2.0, 3.0, 4.0] coord1, 2 id2, [1.0, 2.0, 3.0, 4.0] coord2 UNION ALL
  SELECT 3, [2.0, 0.0, 1.0, 1.0, 0.0, 2.0, 1.0, 1.0], 4, [2.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
)
SELECT id1, id2, ( 
  SELECT 
    SUM(value1 * value2)/
    SQRT(SUM(value1 * value1))/ 
    SQRT(SUM(value2 * value2))
  FROM UNNEST(coord1) value1 WITH OFFSET pos1 
  JOIN UNNEST(coord2) value2 WITH OFFSET pos2 
  ON pos1 = pos2
  ) cosine_similarity
FROM `project.dataset.table`  

with the result:

Row  id1  id2  cosine_similarity
1    1    2    1.0
2    3    4    0.8215838362577491
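
One case the query above does not cover: if either array in a pair is all zeros, its norm is 0 and BigQuery's `/` operator raises a division-by-zero error. A minimal sketch of a guarded variant follows, assuming that returning NULL for such pairs is acceptable. The function name safe_cosine_similarity and the sample rows with ids 5 and 6 are hypothetical, introduced here only for illustration.

```sql
#standardSQL
-- Hypothetical helper (not part of the original answer): same dot product and
-- norms as the query above, in the same order of division, but SAFE_DIVIDE
-- returns NULL instead of raising an error when a norm is zero.
CREATE TEMP FUNCTION safe_cosine_similarity(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>) AS ((
  SELECT
    SAFE_DIVIDE(
      SAFE_DIVIDE(SUM(value1 * value2), SQRT(SUM(value1 * value1))),
      SQRT(SUM(value2 * value2)))
  FROM UNNEST(a) value1 WITH OFFSET pos1
  JOIN UNNEST(b) value2 WITH OFFSET pos2
  ON pos1 = pos2  -- pair elements by array position
));

WITH `project.dataset.table` AS (
  SELECT 1 id1, [1.0, 2.0, 3.0, 4.0] coord1, 2 id2, [1.0, 2.0, 3.0, 4.0] coord2 UNION ALL
  -- Illustrative zero vector: the unguarded query would fail on this row
  SELECT 5, [0.0, 0.0], 6, [1.0, 2.0]
)
SELECT id1, id2, safe_cosine_similarity(coord1, coord2) AS cosine_similarity
FROM `project.dataset.table`
```

With this sketch the zero-vector pair should come back as NULL rather than aborting the whole query. Wrapping the logic in a temp function is only a convenience; the same SAFE_DIVIDE calls could be placed directly in the original subquery.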


Source: https://stackoverflow.com/questions/53927630/cosine-similarity-between-pair-of-arrays-in-bigquery
