efficiently group many fields including large text and jsonb

Submitted by 做~自己de王妃 on 2020-01-25 04:02:26

Question


Apologies in advance... long-winded question.

Suppose I have a table table_x with 20 fields in it:

table_x_id (identity pk)
int1
int...
int8
text1
text...
text8
jsonb1
jsonb2
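
In DDL terms, roughly (a sketch only; the concrete column types are my assumptions):

CREATE TABLE table_x (
    table_x_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    int1       integer,
    -- ... int2 through int7 ...
    int8       integer,
    text1      text,
    -- ... text2 through text7 ...
    text8      text,
    jsonb1     jsonb,
    jsonb2     jsonb
);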

Now suppose I want to maintain rapid access to grouped data in table_x, say over fields int1, int2, text1, text2 and jsonb1. Call that Report 1. The actual data doesn't matter much for this question, but here's an imaginary snippet from Report 1:

+-----------------------------------------------------------------------+
| int1value int2value text1value text2value jsonb1->item1 jsonb1->item2 |
+-----------------------------------------------------------------------+
|                                                       (table_x_id) 12 |
|                                                       (table_x_id) 20 |
|                                                       (table_x_id) 34 |
+-----------------------------------------------------------------------+

Now imagine I have three or more such reporting needs, and that each report involves grouping many (but not all) of the fields in table_x.

Each text field can easily reach up to, say, 1,000 characters, and the jsonb fields, while not large, just add to the problem.

The challenge: speed up the grouping for reporting.

To make grouping operations faster and table_x rows smaller, I broke the distinct text field values (which do indeed overlap a lot across rows) out into a separate text_table.
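
A sketch of that normalization, assuming a simple dedup table (the names and the upsert-style lookup are mine):

-- One row per distinct text value; table_x then stores text1_id, etc.
CREATE TABLE text_table (
    text_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    txt     text NOT NULL UNIQUE
);

-- Find-or-create the id for a given value:
INSERT INTO text_table (txt)
VALUES ('some long text ...')
ON CONFLICT (txt) DO NOTHING;

SELECT text_id FROM text_table WHERE txt = 'some long text ...';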

Now table_x is:

table_x_id (identity pk)
int1
int...
int8
text1_id (fk lookup)
text..._id (fk lookup)
text8_id (fk lookup)
jsonb1
jsonb2

Where grouping is concerned, I was then thinking to maintain a hash of the relevant columns within table_x itself, using digest() calls in an insert/update trigger, as sketched after the field list below. (The idea: convert all the fields in a given grouping to strings, concatenate them, and hash the resulting string.)

Now table_x is:

table_x_id (identity pk)
int1
int...
int8
text1_id (lookup)
text..._id (lookup)
text8_id (lookup)
jsonb1
jsonb2
hash1_bytea (based on int1, int2, text1_id, text2_id and jsonb1)
hash2_bytea (based on int3, int7, text3_id, jsonb1 and jsonb2)
hash3_bytea (based on int2, int5, text1_id and jsonb2)
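
A sketch of that trigger, using digest() from the pgcrypto extension (only hash1_bytea is shown; the delimiter and NULL placeholders are my choices, made so that distinct tuples can't concatenate to the same string):

CREATE EXTENSION IF NOT EXISTS pgcrypto;  -- provides digest()

CREATE OR REPLACE FUNCTION table_x_hashes() RETURNS trigger AS $$
BEGIN
    NEW.hash1_bytea := digest(
        concat_ws('|',
            coalesce(NEW.int1::text,     '<null>'),
            coalesce(NEW.int2::text,     '<null>'),
            coalesce(NEW.text1_id::text, '<null>'),
            coalesce(NEW.text2_id::text, '<null>'),
            coalesce(NEW.jsonb1::text,   '<null>')),
        'sha256');
    -- hash2_bytea and hash3_bytea would be computed the same way.
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER table_x_hashes_trg
    BEFORE INSERT OR UPDATE ON table_x
    FOR EACH ROW EXECUTE FUNCTION table_x_hashes();  -- PostgreSQL 11+ syntax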

The report now requires more lookups, but those are fast, and I only have to group by hash1_bytea to achieve the same Report 1 output.
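
For illustration, Report 1 then reduces to something like this (the aggregates are just examples):

-- Group by the precomputed hash; every row in a group carries the same
-- underlying values, which can then be resolved via the lookup tables.
SELECT hash1_bytea,
       count(*)        AS rows_in_group,
       min(table_x_id) AS sample_table_x_id
FROM   table_x
GROUP  BY hash1_bytea;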

The fear: equivalent jsonb values in different rows might not compare as equal via their jsonb::text representations. From what I read here, that fear seems justified.
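
A quick illustration of what jsonb does and does not normalize (relevant here because, as question 2 below explains, my jsonb values are arrays of pairs):

-- Key order and insignificant whitespace are normalized away:
SELECT '{"b":1, "a":2}'::jsonb::text;   -- {"a": 2, "b": 1}

-- But array element order is significant, so two arrays holding the
-- same pairs in different orders compare (and hash) as different:
SELECT '[{"k":1},{"k":2}]'::jsonb = '[{"k":2},{"k":1}]'::jsonb;   -- false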

But if I cannot convert jsonb values to text deterministically, my "hash field inside the table" solution falls apart at the seams.

I then decided to maintain the jsonb values in a separate jsonb_table where I guarantee that any one row has a unique jsonb object.

jsonb_table is:

jsonb_id (identity pk)
jsonb (unique jsonb)

For any unique jsonb object (disregarding key ordering within its text representation) there is now exactly one row in jsonb_table that represents it.
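
A sketch of that table (I've named the value column doc to avoid shadowing the type name; jsonb has a default btree operator class, so a plain UNIQUE constraint enforces the guarantee, though very large documents could exceed the btree row-size limit, in which case a unique index over a hash of the value would be needed instead):

CREATE TABLE jsonb_table (
    jsonb_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    doc      jsonb NOT NULL UNIQUE
);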

Now table_x is:

table_x_id (identity pk)
int1
int...
int8
text1_id (fk lookup)
text..._id (fk lookup)
text8_id (fk lookup)
jsonb1_id (fk lookup)
jsonb2_id (fk lookup)
hash1_bytea (based on int1, int2, text1_id, text2_id and jsonb1_id )
hash2_bytea (based on int3, int7, text3_id, jsonb1_id  and jsonb2_id )
hash3_bytea (based on int2, int5, text1_id and jsonb2_id )

Yes, maintaining text_table and jsonb_table is a hassle, but it's doable, and table_x now seems quite efficient, able to maintain multiple hashes cheaply.

It seems I've accomplished fast, accurate grouping on multiple flavors of many-field groupings.

I have two questions to pose at this point:

  1. Is my approach reasonable and relatively well-designed? Or is there a better way to accomplish my goals?

  2. The json in jsonb1 and jsonb2 is really just an array of less frequently used, ad-hoc key-value pairs. However, the data in jsonb1 and jsonb2 needs referential integrity with data maintained in normalized relational tables. That being the case, would it be a bad idea to create a jsonb_child_table?

jsonb_child_table is:

jsonb_child_id (pk identity)
jsonb_id (fk to jsonb_table)
key_lookup_id (fk lookup)
value_lookup_id (fk lookup)
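
In DDL terms, roughly (the key_lookup and value_lookup tables are the normalized lookup tables implied above; their names and shapes are my assumptions):

CREATE TABLE jsonb_child_table (
    jsonb_child_id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    jsonb_id        bigint NOT NULL REFERENCES jsonb_table (jsonb_id),
    key_lookup_id   bigint NOT NULL REFERENCES key_lookup (key_lookup_id),
    value_lookup_id bigint NOT NULL REFERENCES value_lookup (value_lookup_id),
    UNIQUE (jsonb_id, key_lookup_id, value_lookup_id)
);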

Again, it's a hassle to make sure that records in jsonb_child_table are correct breakouts of the jsonb field in jsonb_table, but by doing it this way I can:

  • quickly maintain all that grouping info discussed before
  • guarantee good referential integrity
  • report on jsonb1 (for example) using the fields in jsonb_child_table, ordered (for example) by metadata that is not stored in jsonb1 itself, via a SQL join on key_lookup_id (see the sketch after this list).
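
The sketch for that last point, assuming hypothetical key_name, value_text and sort_order columns on the lookup tables:

-- Report the pairs behind each row's jsonb1, ordered by metadata that
-- lives on the key lookup table rather than inside the json itself.
SELECT x.table_x_id, k.key_name, v.value_text
FROM   table_x           AS x
JOIN   jsonb_child_table AS c ON c.jsonb_id        = x.jsonb1_id
JOIN   key_lookup        AS k ON k.key_lookup_id   = c.key_lookup_id
JOIN   value_lookup      AS v ON v.value_lookup_id = c.value_lookup_id
ORDER  BY x.table_x_id, k.sort_order;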

That last bullet point seems to echo what I've read elsewhere on SO: maintaining an array of key-value pairs in jsonb invites a rethink, because if you want guaranteed pair ordering, referential integrity and fast access, jsonb may be a poor choice. In my case, however, maintaining a jsonb "header" table (providing a single foreign-key identity field) allows fast grouping of disparate collections of value pairs in table_x. So I see benefits to maintaining the same data both in jsonb (for easy grouping) and in real tables (for RI and faster, cleaner reporting).

Yes, this second question is worthy of another SO question by itself, but the whole pile holds together as interrelated, so I present it all here in one (sorry) long post.

Thanks in advance for feedback!

Source: https://stackoverflow.com/questions/58487208/efficiently-group-many-fields-including-large-text-and-jsonb
