Question
I want to index high-dimensional data (128-dimensional vectors of integers in the range [0, 254]):
| id | vector |
| --- | --- |
| 1 | {1, 0, ..., 254} |
| 2 | {2, 128, ..., 1} |
| . | {1, 0, ..., 252} |
| n | {1, 2, ..., 251} |
I saw that PostGIS implemented R-Trees. So can I use these trees in PostGIS to index and query multidimensional vectors in Postgres?
I also saw that there is an index implementation for int arrays.
Now I have questions about how to perform a query.
Can I perform a k-NN search and a radius search on an integer array?
Maybe I also need to define my own distance function. Is this possible? I want to use the Manhattan distance (block distance) for my queries.
I could also represent my vector as a string with the pattern v1;v2;...;vn. Does this help to perform the search?
For example, if I had these two strings:
1;2;1;1
1;3;2;2
The resulting distance between these two strings should be 3.
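That is just the component-wise Manhattan sum: |1-1| + |2-3| + |1-2| + |1-2| = 0 + 1 + 1 + 1 = 3.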
Answer 1:
Perhaps a better choice would be the cube extension, since your area of interest is not the individual integers but the full vector.
Cube supports GiST indexing, and Postgres 9.6 will also bring KNN indexing to cubes, supporting Euclidean, taxicab (aka Manhattan) and Chebyshev distances.
It is a bit annoying that 9.6 is still in development; however, there is no problem backporting the cube extension patch to 9.5, and I say that from experience.
Hopefully 128 dimensions will still be enough to get meaningful results.
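These metrics map onto the operators <-> (Euclidean), <#> (taxicab) and <=> (Chebyshev); the queries below use <#>, the block distance you asked for. As a quick sanity check against the example from your question (assuming a 9.6 or patched cube is installed):
select '(1,2,1,1)'::cube <#> '(1,3,2,2)'::cube;  -- taxicab distance, returns 3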
How to do this?
First, create an example table:
create extension cube;
create table vectors (id serial, vector cube);
Populate the table with example data:
insert into vectors
select id, cube(ARRAY[round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000),
                      round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000)])
from generate_series(1, 2000000) id;
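If your real vectors already live in an integer array column, a cast to float8[] is enough to build cubes from them (a sketch; the source_table and int_vector names are placeholders):
-- hypothetical source of the 128-dimensional integer vectors
insert into vectors (vector)
select cube(int_vector::float8[]) from source_table;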
Then try selecting:
explain analyze SELECT * from vectors
order by cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector asc limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Limit (cost=123352.07..123352.09 rows=10 width=76) (actual time=1705.499..1705.501 rows=10 loops=1)
-> Sort (cost=123352.07..129852.07 rows=2600000 width=76) (actual time=1705.496..1705.497 rows=10 loops=1)
Sort Key: (('(966, 82, 765, 343, 600, 718, 338, 505)'::cube <#> vector))
Sort Method: top-N heapsort Memory: 26kB
-> Seq Scan on vectors (cost=0.00..67167.00 rows=2600000 width=76) (actual time=0.038..998.864 rows=2600000 loops=1)
Planning time: 0.172 ms
Execution time: 1705.541 ms
(7 rows)
We should create an index:
create index vectors_vector_idx on vectors using gist (vector);  -- must be a GiST index: a btree index cannot serve the <#> ordering
Does it help?
explain analyze SELECT * from vectors
order by cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector asc limit 10;
--------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.41..1.93 rows=10 width=76) (actual time=41.339..143.915 rows=10 loops=1)
-> Index Scan using vectors_vector_idx on vectors (cost=0.41..393704.41 rows=2600000 width=76) (actual time=41.336..143.902 rows=10 loops=1)
Order By: (vector <#> '(966, 82, 765, 343, 600, 718, 338, 505)'::cube)
Planning time: 0.146 ms
Execution time: 145.474 ms
(5 rows)
At 8 dimensions, it does help.
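The same GiST index can also serve a radius-style filter via containment. cube_enlarge grows the query point by r in every dimension, so this is a bounding-box (Chebyshev-ball) filter rather than a true Manhattan radius; the radius 50 below is an arbitrary assumption:
select * from vectors
where vector <@ cube_enlarge(cube(ARRAY[966,82,765,343,600,718,338,505]), 50, 8);
A true Manhattan radius could be written as a predicate on cube(...) <#> vector, but that form is not accelerated by the index; only the ORDER BY ... <#> ... LIMIT form is.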
Answer 2:
(Addendum to selected answer)
For people wanting more than 100 dimensions, beware: there is a 100-dimension limit in the cube extension.
The tricky part is that Postgres lets you create cubes with more than 100 dimensions just fine; it is only when you try to restore a backup that they are refused (the worst time to realize it).
As recommended in the documentation, I patched the cube extension to support more dimensions. I made a Docker image for it, and you can look at the Dockerfile on the GitHub repo to see how to do it yourself.
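Concretely, the limit is the compile-time constant CUBE_MAX_DIM in contrib/cube/cubedata.h; the patch amounts to raising that value (to 128 or more for the vectors in the question) and rebuilding the extension.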
Source: https://stackoverflow.com/questions/35405147/indexing-and-query-high-dimensional-data-in-postgresql