Hive external table optimal partition size

迷失自我 2021-01-15 02:32

What is the optimal size for an external table partition? I am planning to partition the table by year/month/day, and we are getting about 2 GB of data daily.

3 Answers
  • 2021-01-15 02:58

    The optimal partitioning is the one that matches how your table is used. Choose it based on:

    1. How the data is queried (if you mostly work with daily data, partition by date).
    2. How the data is loaded (parallel threads should each load their own partitions, without overlap).

    2 GB is not too much even for a single file, though again it depends on your usage scenario. Avoid unnecessarily complex and redundant partitions such as (year, month, date); in this case date alone is enough for partition pruning.
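
To make the difference between the two layouts concrete, here is a minimal Python sketch of the directory each scheme would produce for one day of data. The table root `/data/events` and the key names `dt`, `year`, `month`, `day` are hypothetical, chosen only for illustration; the `key=value` directory naming is the usual Hive convention.

```python
from datetime import date


def flat_partition_path(table_root: str, day: date) -> str:
    """Single partition key: <root>/dt=YYYY-MM-DD (one directory level)."""
    return f"{table_root}/dt={day.isoformat()}"


def nested_partition_path(table_root: str, day: date) -> str:
    """Three partition keys: <root>/year=YYYY/month=MM/day=DD (three levels)."""
    return (
        f"{table_root}/year={day.year:04d}"
        f"/month={day.month:02d}/day={day.day:02d}"
    )


if __name__ == "__main__":
    root = "/data/events"  # hypothetical table location
    d = date(2021, 1, 15)
    print(flat_partition_path(root, d))    # /data/events/dt=2021-01-15
    print(nested_partition_path(root, d))  # /data/events/year=2021/month=01/day=15
```

Both layouts identify the same day of data. The single-key form does it with one directory level and one value to filter on.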

  • 2021-01-15 03:14

    Hive partition definitions are stored in the metastore, so too many partitions take up a lot of space there.

    Partitions are stored as directories in HDFS, so many partition keys produce hierarchical directories, which makes scanning them slower.

    Your query will be executed as a MapReduce job, so there is no point in making partitions too small.

    It depends on the case; think about how your data will be queried. For your case I would prefer a single key in 'yyyymmdd' format. That gives 365 partitions per year, only one level in the table directory, and 2 GB of data per partition, which suits a MapReduce job well.
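
The arithmetic behind these numbers can be checked with a short Python sketch. It assumes a steady 2 GB per day (the figure from the question) and a 365-day year; the weekly and monthly rows are added only for comparison and are not part of the answer above.

```python
DAILY_GB = 2          # daily volume stated in the question
DAYS_IN_YEAR = 365    # assumption: non-leap year

# Partitions created per year for each granularity (a year spans 53 calendar weeks at most).
PARTITIONS_PER_YEAR = {"daily": 365, "weekly": 53, "monthly": 12}


def yearly_volume_gb() -> int:
    """Total data written in a year at the assumed daily rate."""
    return DAILY_GB * DAYS_IN_YEAR


def average_partition_gb(granularity: str) -> float:
    """Average partition size if the yearly volume is spread evenly."""
    return yearly_volume_gb() / PARTITIONS_PER_YEAR[granularity]


if __name__ == "__main__":
    for name, count in PARTITIONS_PER_YEAR.items():
        size = average_partition_gb(name)
        print(f"{name:8s} {count:4d} partitions/year, ~{size:.1f} GB each")
```

Under these assumptions, daily partitioning gives 365 metastore entries per year at 2 GB each, while monthly partitioning gives 12 entries of roughly 61 GB each.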

    For completeness: if you use Hive < 0.12, make your partition key string typed, see here.

    Useful blog here.

  • 2021-01-15 03:16

    Hive partitioning is most effective in cases where the data is sparse. By sparse I mean that the data internally has visible partitions such as by year, month or day.

    In your case, partitioning by date doesn't make much sense, as each day will have only 2 GB of data, which is not too big to handle. Partitioning by week or month makes more sense, as it will optimize query time and will not create too many small partitions.
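
Whether coarser partitions help depends on the queries, as the other answers note. The Python sketch below estimates how much data a query for a single day would have to read under each granularity. It assumes 2 GB per day, that pruning happens only at partition level, and that a selected partition is read in full; file formats with their own indexes or statistics could read less.

```python
import calendar
from datetime import date

DAILY_GB = 2  # daily volume stated in the question


def scanned_gb_for_one_day(day: date, granularity: str) -> int:
    """Data read by a query filtering on one day, with partition-level pruning only."""
    if granularity == "day":
        return DAILY_GB
    if granularity == "week":
        return 7 * DAILY_GB
    if granularity == "month":
        # monthrange returns (weekday of first day, number of days in month)
        days_in_month = calendar.monthrange(day.year, day.month)[1]
        return days_in_month * DAILY_GB
    raise ValueError(f"unknown granularity: {granularity}")


if __name__ == "__main__":
    d = date(2021, 1, 15)
    for g in ("day", "week", "month"):
        print(g, scanned_gb_for_one_day(d, g), "GB")
```

For 2021-01-15 this gives 2 GB, 14 GB and 62 GB respectively, so coarser partitions mean fewer directories but more data read by single-day queries.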
