Secondary in-memory index representations in Python

问题

I am searching for an efficient solution to build a secondary in-memory index in Python using a high-level optimised mathematical package such as numpy and arrow. I am excluding pandas for performance reasons.

Definition

"A secondary index contains an entry for each existing value of the attribute to be indexed. This entry can be seen as a key/value pair with the attribute value as key and as value a list of pointers to all records in the base table that have this value." - JV. D'Silva et al. (2017)

Let's take a simple example, we can scale this later on to produce some benchmarks:

import numpy as np

pk = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='uint32')
val = np.array([15.5, 3.75, 142.88, 142.88, None, None, None, 7.2, 2.1], dtype='float32')

Interestingly pyarrow.Array.dictionary_encode method can transform the value array into a dictionary encoded representation that is close to a secondary index.

val.dictionary_encode()
Out[55]: 
<pyarrow.lib.DictionaryArray object at 0x7ff430d8b4d0>
-- dictionary:
  [
    15.5,
    3.75,
    142.88,
    nan,
    7.2,
    2.1
  ]
-- indices:
  [
    0,
    1,
    2,
    2,
    3,
    3,
    3,
    4,
    5
  ]

I have opened an issue here

So, the question is about how fast you can build a secondary index in memory using Python data structures to hold efficiently values and indices. But this is half the story as the index will be useful if it serves well both filtering queries (point, range) and transformations - reconstruction of row, column and association a.k.a hyperedge in TRIADB. And even this quick description here does not cover how easy it will be to update this kind of index.

For many reasons, I have started investigating a possible PyArrow open-source solution. A sorted dictionary-encoded representation should generally meet the requirements of the problem with an excellent combination of smaller memory footprint and faster/flexible zero copy I/O processing.

回答1:

Solution

I have searched both in the past and in the present for an open-source solution to this problem but I have not found one that satisfies my appetite. This time I decided to start building my own and discuss openly its implementation that also covers the null case, i.e. missing data scenario.

Do notice that secondary index is very close to adjacency list representation, a core element in my TRIADB project and that is the main reason behind searching for a solution.

Let's start with one line code using numpy

idx = np.sort(np.array(list(zip(pk, val)), dtype=struct_type), order='val')

idx['val']
Out[68]: 
array([  2.1 ,   3.75,   7.2 ,  15.5 , 142.88, 142.88,    nan,    nan,
          nan], dtype=float32)

idx['pk']
Out[69]: array([8, 1, 7, 0, 2, 3, 4, 5, 6], dtype=uint32)

Faster solution (less generic)

this is the special but perfectly valid case where pk has values in range(n)

idx_pk = np.argsort(val)
idx_pk
Out[91]: array([8, 1, 7, 0, 2, 3, 4, 5, 6])

idx_val = val[idx_pk]
idx_val
Out[93]: array([  2.1 ,   3.75,   7.2 ,  15.5 , 142.88, 142.88,    nan,    nan,   nan], dtype=float32)

There are a few more steps to get a secondary index representation according to the definition of JV. D'Silva et al.

Get rid of nan
Calculate unique values of secondary index
For each unique value calculate the list of primary key indices to all rows of the table that contain that value

Unique Secondary Index with adjacency lists

def secondary_index_with_adjacency_list(arr):
    idx_pk = np.argsort(arr)
    idx_val = arr[idx_pk]
    cnt = np.count_nonzero(~np.isnan(idx_val))
    usec_ndx, split_ndx, cnt_arr = np.unique(idx_val[:cnt], return_index=True, return_counts=True)
    adj_list = np.split(idx_pk[:cnt], split_ndx)[1:]

    return usec_ndx, cnt_arr, adj_list

ndx, freq, adj = secondary_index_with_adjacency_list(val)

pd.DataFrame({'val': ndx, 'freq': freq, 'adj': adj})

Out[11]: 
      val  freq     adj
0    2.10     1     [8]
1    3.75     1     [1]
2    7.20     1     [7]
3   15.50     1     [0]
4  142.88     2  [2, 3]

Discussion

In practice it is faster to use the representation of secondary index with repeated values than the one with lists of pointers to records of a table but the second one has the interesting property of being closer to a hypergraph representation that I am using in TRIADB.

The kind of secondary index described in this solution is more suitable for analysis, filtering of big data sets that don't fit in memory but stored on disk with a column-store format. In that case for a specific set of columns it is possible to reconstruct a subset of records in memory (column-store) format and even present it on a hypergraph (stay tuned for the next release of TRIADB)

来源：https://stackoverflow.com/questions/59918440/secondary-in-memory-index-representations-in-python

标签

python

numpy

adjacency-list

pyarrow

secondary-indexes