问题
I have created a table that has a pair of IDs and coordinate fro each of them so that I can calculate pairwise cosine similarity between them.
The table looks like this
The number of dimension for the coords are currently 128, but it can vary. But the number dimensions for a pair of ID are always same in the same table.
coord1
and coord2
are repeated field (array) with floating point values.
Is there a way to calculate cosine similarity between them?
My expected output would have three columns, id1
, id2
and cosine_similarity
.
回答1:
Below is for BigQuery Standard SQL
#standardSQL
SELECT id1, id2, (
SELECT
SUM(value1 * value2)/
SQRT(SUM(value1 * value1))/
SQRT(SUM(value2 * value2))
FROM UNNEST(coord1) value1 WITH OFFSET pos1
JOIN UNNEST(coord2) value2 WITH OFFSET pos2
ON pos1 = pos2
) cosine_similarity
FROM `project.dataset.table`
below is dummy example for you to play with
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id1, [1.0, 2.0, 3.0, 4.0] coord1, 2 id2, [1.0, 2.0, 3.0, 4.0] coord2 UNION ALL
SELECT 3, [2.0, 0.0, 1.0, 1.0, 0, 2.0, 1.0, 1.0], 4, [2.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
)
SELECT id1, id2, (
SELECT
SUM(value1 * value2)/
SQRT(SUM(value1 * value1))/
SQRT(SUM(value2 * value2))
FROM UNNEST(coord1) value1 WITH OFFSET pos1
JOIN UNNEST(coord2) value2 WITH OFFSET pos2
ON pos1 = pos2
) cosine_similarity
FROM `project.dataset.table`
with result
Row id1 id2 cosine_similarity
1 1 2 1.0
2 3 4 0.8215838362577491
来源:https://stackoverflow.com/questions/53927630/cosine-similarity-between-pair-of-arrays-in-bigquery