Question
I have a table id_vectors that contains ids and their corresponding coordinates. Each set of coordinates is a repeated field with 512 elements.
I am looking for the pairwise cosine similarity between all of those vectors. For example, if I have three ids 1, 2 and 3, I want a table with the cosine similarity between each pair (calculated from the 512 coordinates), like below:
id1 id2 similarity
1 2 0.5
1 3 0.1
2 3 0.99
My table has 424,970 unique IDs and their corresponding 512-dimension coordinates, which means I need to create around (424970 * 424969 / 2) unique pairs of IDs and calculate their similarity.
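For scale, the pair count can be worked out directly. This is a minimal sketch; the byte estimate assumes the coordinates are stored as FLOAT64 (8 bytes per element), which the question does not state:

```sql
#standardSQL
SELECT
  -- Number of unordered ID pairs: n * (n - 1) / 2
  DIV(424970 * 424969, 2) AS pair_count,
  -- Rough size if every pair carried both 512-element vectors
  -- (assumes FLOAT64, 8 bytes per element)
  DIV(424970 * 424969, 2) * 2 * 512 * 8 AS approx_coord_bytes
```

That comes to 90,299,537,965 pairs and, under the FLOAT64 assumption, roughly 7.4e14 bytes of coordinate data if both vectors are copied into every pair row.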
I first tried the following query, adapted from an existing reference:
#standardSQL
with pairwise as (
  SELECT t1.id as id_1, t1.coords as coord1, t2.id as id_2, t2.coords as coord2
  FROM `project.dataset.id_vectors` t1
  inner join `project.dataset.id_vectors` t2
  on t1.id < t2.id
)
SELECT id_1, id_2, (
  SELECT
    SUM(value1 * value2) /
    SQRT(SUM(value1 * value1)) /
    SQRT(SUM(value2 * value2))
  FROM UNNEST(coord1) value1 WITH OFFSET pos1
  JOIN UNNEST(coord2) value2 WITH OFFSET pos2
  ON pos1 = pos2
) cosine_similarity
FROM pairwise
But after running for 6 hours, I encountered the following error message:
Query exceeded resource limits. 2.2127481953201417E7 CPU seconds were used, and this query must use less than 428000.0 CPU seconds.
Then I thought that, rather than using the intermediate pairwise subquery, I could create that table first and then do the cosine similarity calculation. So I tried the following query:
SELECT t1.id as id_1, t1.coords as coord1, t2.id as id_2, t2.coords as coord2
FROM `project.dataset.id_vectors` t1
inner join `project.dataset.id_vectors` t2
  on t1.id < t2.id
But this time the query could not be completed and I encountered the following message:
Error: Quota exceeded: Your project exceeded quota for total shuffle size limit. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
Then I tried to create an even smaller table containing only the ID pairs, without the coordinates, using the following query:
SELECT t1.id as id_1, t2.id as id_2
FROM `project.dataset.id_vectors` t1
inner join `project.dataset.id_vectors` t2
  on t1.id < t2.id
Again, my query ended with an error message:
Query exceeded resource limits. 610104.3843576935 CPU seconds were used, and this query must use less than 3000.0 CPU seconds. (error code: billingTierLimitExceeded)
I totally understand that this is a huge query and my stopping point is my billing quota.
What I am asking is: is there a smarter way to execute the query so that I do not exceed resourceLimit, shuffleSizeLimit, or billingTierLimit?
Answer 1:
A quick idea: instead of joining the table to itself while carrying redundant coordinates, create a simple table of pairs (id1, id2) and then "dress" each id with its coordinate vector through two extra joins to project.dataset.id_vectors.
Below is a quick example of how this could look:
#standardSQL
WITH pairwise AS (
  SELECT t1.id AS id_1, t2.id AS id_2
  FROM `project.dataset.id_vectors` t1
  INNER JOIN `project.dataset.id_vectors` t2
  ON t1.id < t2.id
)
SELECT id_1, id_2, (
  SELECT
    SUM(value1 * value2) /
    SQRT(SUM(value1 * value1)) /
    SQRT(SUM(value2 * value2))
  FROM UNNEST(a.coords) value1 WITH OFFSET pos1
  JOIN UNNEST(b.coords) value2 WITH OFFSET pos2
  ON pos1 = pos2
) cosine_similarity
FROM pairwise t
JOIN `project.dataset.id_vectors` a ON a.id = id_1
JOIN `project.dataset.id_vectors` b ON b.id = id_2
It works on a small dummy set, as you can see below:
#standardSQL
WITH `project.dataset.id_vectors` AS (
  SELECT 1 id, [1.0, 2.0, 3.0, 4.0] coords UNION ALL
  SELECT 2, [1.0, 2.0, 3.0, 4.0] UNION ALL
  SELECT 3, [2.0, 0.0, 1.0, 1.0] UNION ALL
  SELECT 4, [0, 2.0, 1.0, 1.0] UNION ALL
  SELECT 5, [2.0, 1.0, 1.0, 0.0] UNION ALL
  SELECT 6, [1.0, 1.0, 1.0, 1.0]
), pairwise AS (
  SELECT t1.id AS id_1, t2.id AS id_2
  FROM `project.dataset.id_vectors` t1
  INNER JOIN `project.dataset.id_vectors` t2
  ON t1.id < t2.id
)
SELECT id_1, id_2, (
  SELECT
    SUM(value1 * value2) /
    SQRT(SUM(value1 * value1)) /
    SQRT(SUM(value2 * value2))
  FROM UNNEST(a.coords) value1 WITH OFFSET pos1
  JOIN UNNEST(b.coords) value2 WITH OFFSET pos2
  ON pos1 = pos2
) cosine_similarity
FROM pairwise t
JOIN `project.dataset.id_vectors` a ON a.id = id_1
JOIN `project.dataset.id_vectors` b ON b.id = id_2
with the result:
Row id_1 id_2 cosine_similarity
1 1 2 1.0
2 1 3 0.6708203932499369
3 1 4 0.819891591749923
4 1 5 0.521749194749951
5 1 6 0.9128709291752769
6 2 3 0.6708203932499369
7 2 4 0.819891591749923
8 2 5 0.521749194749951
9 2 6 0.9128709291752769
10 3 4 0.3333333333333334
11 3 5 0.8333333333333335
12 3 6 0.8164965809277261
13 4 5 0.5000000000000001
14 4 6 0.8164965809277261
15 5 6 0.8164965809277261
So, try it on your real data and let's see how it works for you :o)
And ... obviously you should pre-create / materialize the pairwise table.
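A minimal sketch of that materialization step, assuming a hypothetical destination table named `project.dataset.pairwise` (the name is not from the question). Note that the asker's own pairs-only join already hit billingTierLimitExceeded, so on the full data this may only be feasible in slices, for example by adding a range filter on t1.id and appending each slice in a separate run:

```sql
#standardSQL
-- `project.dataset.pairwise` is a hypothetical destination table name.
CREATE OR REPLACE TABLE `project.dataset.pairwise` AS
SELECT t1.id AS id_1, t2.id AS id_2
FROM `project.dataset.id_vectors` t1
INNER JOIN `project.dataset.id_vectors` t2
ON t1.id < t2.id
```

The main query can then read FROM `project.dataset.pairwise` t in place of the pairwise subquery.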
Another optimization idea is to keep pre-calculated values of SQRT(SUM(value1 * value1)) in your project.dataset.id_vectors table. This can save quite a bit of CPU, because the norm is then computed once per id instead of once per pair. It should be a simple adjustment, so I leave it to you :o)
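One way that adjustment might look, as a sketch on a three-row dummy set. The names with_norm and norm are hypothetical, and on real data norm would more likely be stored as an extra column in the table than recomputed in a subquery. It also assumes no vector is all zeros, since a zero norm would cause a division-by-zero error:

```sql
#standardSQL
WITH `project.dataset.id_vectors` AS (
  SELECT 1 id, [1.0, 2.0, 3.0, 4.0] coords UNION ALL
  SELECT 2, [1.0, 2.0, 3.0, 4.0] UNION ALL
  SELECT 3, [2.0, 0.0, 1.0, 1.0]
), with_norm AS (
  -- Compute each vector's Euclidean norm once per id instead of once per pair.
  -- `with_norm` and `norm` are hypothetical names.
  SELECT id, coords,
    SQRT((SELECT SUM(v * v) FROM UNNEST(coords) v)) AS norm
  FROM `project.dataset.id_vectors`
)
SELECT a.id AS id_1, b.id AS id_2, (
  -- Only the dot product remains inside the per-pair subquery.
  SELECT SUM(value1 * value2)
  FROM UNNEST(a.coords) value1 WITH OFFSET pos1
  JOIN UNNEST(b.coords) value2 WITH OFFSET pos2
  ON pos1 = pos2
) / (a.norm * b.norm) AS cosine_similarity
FROM with_norm a
JOIN with_norm b
ON a.id < b.id
```

For ids 1 to 3 this should agree with the corresponding rows of the dummy-set result, up to floating-point rounding.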
Source: https://stackoverflow.com/questions/53953848/calculating-pairwise-cosine-similarity-between-quite-a-large-number-of-vectors-i