Representing Sparse Data in PostgreSQL

前端 未结 4 1499
盖世英雄少女心
盖世英雄少女心 2021-01-31 04:58

What\'s the best way to represent a sparse data matrix in PostgreSQL? The two obvious methods I see are:

  1. Store data in a single a table with a separate column f

相关标签:
4条回答
  • 2021-01-31 05:09

    I know this is an old thread, but MadLib provides a sparse vector type for Postgres, along with several machine learning and statistical methods.

    0 讨论(0)
  • 2021-01-31 05:10

    A NULL value will take up no space when it's NULL. It'll take up one bit in a bitmap in the tuple header, but that will be there regardless.

    However, the system can't deal with millions of columns, period. There is a theoretical max of a bit over a thousand, IIRC, but you really don't want to go that far.

    If you really need that many, in a single table, you need to go the EAV method, which is basically what you're saying in (2).

    If each entry has only a relatively few keys, I suggest you look at the "hstore" contrib modules which lets you store this type of data very efficiently, as a third option. It's been enhanced further in the upcoming 9.0 version, so if you are a bit away from production deployment, you might want to look directly at that one. However, it's well worth it in 8.4 as well. And it does support some pretty efficient index based lookups. Definitely worth looking into.

    0 讨论(0)
  • 2021-01-31 05:27

    A few solutions spring to mind,

    1) Separate your features into groups that are usually set together, create a table for each group with a one-to-one foreign key relationship to the main data, only join on tables you need when querying

    2) Use the EAV anti-pattern, create a 'feature' table with a foreign key field from your primary table as well as a fieldname and a value column, and store the features as rows in that table instead of as attributes in your primary table

    3) Similarly to how PostgreDynamic does it, create a table for each 'column' in your primary table (they use a separate namespace for those tables), and create functions to simplify (as well as efficiently index) accessing and updating the data in those tables

    4) create a column in your primary data using XML, or VARCHAR, and store some structured text format within it representing your data, create indexes over the data with functional indexes, write functions to update the data (or use the XML functions if you are using that format)

    5) use the contrib/hstore module to create a column of type hstore that can hold key-value pairs, and can be indexed and updated

    6) live with lots of empty fields

    0 讨论(0)
  • 2021-01-31 05:30

    I'm assuming you're thinking of sparse matrices from mathematical context: http://en.wikipedia.org/wiki/Sparse_matrix (The storing techniques described there are for memory storage (fast arithmetic operation), not persistent storage (low disk usage).)

    Since one usually do operate on this matrices on client side rather than on server side a SQL-ARRAY[] is the best choice!

    The question is how to take advantage of the sparsity of the matrix? Here the results from some investigations.

    Setup:

    • Postgres 8.4
    • Matrices w/ 400*400 elements in double precision (8 Bytes) --> 1.28MiB raw size per matrix
    • 33% non-zero elements --> 427kiB effective size per matrix
    • averaged using ~1000 different random populated matrices

    Competing methods:

    • Rely on the automatic server side compression of columns with SET STORAGE MAIN or EXTENDED.
    • Only store the non-zero elements plus a bitmap (bit varying(xx)) describing where to locate the non-zero elements in the matrix. (One double precision is 64 times bigger than one bit. In theory (ignoring overheads) this method should be an improvement if <=98% are non-zero ;-).) Server side compression is activated.
    • Replace the zeros in the matrix with NULL. (The RDBMSs are very effective in storing NULLs.) Server side compression is activated.

    (Indexing of non-zero elements using a 2nd index-ARRAY[] is not very promising and therefor not tested.)

    Results:

    • Automatic compression
      • no extra implementation efforts
      • no reduced network traffic
      • minimal compression overhead
      • persistent storage = 39% of the raw size
    • Bitmap
      • acceptable implementation effort
      • network traffic slightly decreased; dependent on sparsity
      • persistent storage = 33.9% of the raw size
    • Replace zeros with NULLs
      • some implementation effort (API needs to know where and how to set the NULLs in the ARRAY[] while constructing the INSERT query)
      • no change in network traffic
      • persistent storage = 35% of the raw size

    Conclusion: Start with the EXTENDED/MAIN storage parameter. If you have some free time investigate your data and use my test setup with your sparsity level. But the effect may be lower than you expect.

    I suggest always to use the matrix serialization (e.g. Row-major order) plus two integer columns for the matrix dimensions NxM. Since most APIs use textual SQL you are saving a lot of network traffic and client memory for nested "ARRAY[ARRAY[..], ARRAY[..], ARRAY[..], ARRAY[..], ..]" !!!

    Tebas


    CREATE TABLE _testschema.matrix_dense
    (
      matdata double precision[]
    );
    ALTER TABLE _testschema.matrix_dense ALTER COLUMN matdata SET STORAGE EXTERN;
    
    
    CREATE TABLE _testschema.matrix_sparse_autocompressed
    (
      matdata double precision[]
    );
    
    CREATE TABLE _testschema.matrix_sparse_bitmap
    (
      matdata double precision[]
      bitmap bit varying(8000000)
    );
    

    Insert the same matrices into all tables. The concrete data depends on the certain table. Do not change the data on server side due to unused but allocated pages. Or do a VACUUM.

    SELECT 
    pg_total_relation_size('_testschema.matrix_dense') AS dense, 
    pg_total_relation_size('_testschema.matrix_sparse_autocompressed') AS autocompressed, 
    pg_total_relation_size('_testschema.matrix_sparse_bitmap') AS bitmap;
    
    0 讨论(0)
提交回复
热议问题