问题
This is more like a puzzling question for me and would like to understand why.
I have two tables, almost identical the only differences are one column's data type and sortkey.
table mbytes rows
stg_user_event_properties_hist 460948 2378751028
stg_user_event_properties_hist_1 246442 2513860837
Even though they have almost same number of rows, size is close to double.
Here are the table structures
stg.stg_user_event_properties_hist
(
id bigint,
source varchar(20),
time_of_txn timestamp,
product varchar(50),
region varchar(50),
city varchar(100),
state varchar(100),
zip varchar(10),
price integer,
category varchar(50),
model varchar(50),
origin varchar(50),
l_code varchar(10),
d_name varchar(100),
d_id varchar(10),
medium varchar(255),
network varchar(255),
campaign varchar(255),
creative varchar(255),
event varchar(255),
property_name varchar(100),
property_value varchar(4000),
source_file_name varchar(255),
etl_batch_id integer,
etl_row_id integer,
load_date timestamp
);
stg.stg_user_event_properties_hist_1
(
id bigint,
source varchar(20),
time_of_txn timestamp,
product varchar(50),
region varchar(50),
city varchar(100),
state varchar(100),
zip varchar(10),
price integer,
category varchar(50),
model varchar(50),
origin varchar(50),
l_code varchar(10),
d_name varchar(100),
d_id varchar(10),
medium varchar(255),
network varchar(255),
campaign varchar(255),
creative varchar(255),
event varchar(255),
property_name varchar(100),
property_value varchar(4000),
source_file_name varchar(255),
etl_batch_id integer,
etl_row_id varchar(20),
load_date timestamp
);
The differences again etl_row_id
has data type varchar(20) in _1, integer in the other table, and the first table has a sortkey on source column.
What would be the explanation for the size difference?
UPDATE: The problem was both compression and sort keys, even though _1 table created with CTAS 11 of 26 had different compression settings, also the first table was created with Compound SortKey of 14 columns, recreated the table with no sort keys (it's a history table after all) size went down to 231GB.
回答1:
Suspect that the larger table has different compression settings or no compression at all. You can use our view v_generate_tbl_ddl to generate table DDL that includes the compression settings.
Even with the same compression settings table size can vary with different sort keys. The sort key is use to place the data into blocks on disk. If one sort key places lots of similar column values together it will compress better and require less space.
回答2:
The sizes are different for these two tables because one table is being allocated more blocks than the other based on sortkeys. For your bigger table, the distribution is happening in such a way that the disk blocks are not fully occupied, thus needing more blocks to store the same amount of data.
This happens because of the 1MB block size of Redshift and the way it stores data across slices and nodes. In general, data gets distributed across different nodes and slices based on the diststyle. For your case I am assuming this distribution is happening in a round robin way. So slice1 gets the first record, slice2 gets second record etc. As the minimum block size is 1MB for Redshift, every time a new record goes to a new slice, 1MB gets allocated (even if the record only takes a few KBs). For subsequent records to the same slice, data goes to the same 1MB block till it is possible, after which a new 1MB block gets allocated on the slice. But, if there are no more records after the first record for this slice, it still occupies the first block of 1MB size. The total size of the table is the sum of all blocks being occupied (irrespective of how much data in present in the blocks)
回答3:
The difference in table size could be due to the following reasons.
- The encoding used for each column. (query PG_TABLE_DEF)
- Distribution key used for the table. (query PG_TABLE_DEF)
- Vaccum performed on the table. (query SVV_VACUUM_SUMMARY)
If I’ve made a bad assumption please comment and I’ll refocus my answer.
来源:https://stackoverflow.com/questions/51623529/redshift-table-size