Hive query to quickly find table size (number of rows)

遇见更好的自我 2021-01-31 09:20

Is there a Hive query to quickly find table size (i.e. number of rows) without launching a time-consuming MapReduce job? (Which is why I want to avoid COUNT(*).)

6 answers
  • 2021-01-31 09:34

    Store the table's data in Parquet format, whether the table is external or internal. Queries on it will generally return faster.

  • 2021-01-31 09:34

    It is a good question. count(*) takes a long time to return, but unfortunately it is the only way to get an exact row count.

    There is one way to reduce the latency, although it is not a true alternative to counting:

    Set the property

    set hive.exec.mode.local.auto=true;

    and run the same command (select count(*) from tbl). With this setting Hive can run sufficiently small jobs locally, which gives better latency than before.

  • 2021-01-31 09:39

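    The table properties include the table's size statistics, and you can request a single property if that is all you need. The values are only as current as the last time statistics were gathered for the table.

    -- gives all properties
    show tblproperties yourTableName;

    -- show just the raw data size
    show tblproperties yourTableName("rawDataSize");

    -- show just the row count (present only if statistics have been gathered)
    show tblproperties yourTableName("numRows");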
    
  • 2021-01-31 09:40

    How about checking the table's directory in HDFS? This reports the size on disk, not the number of rows:

        hdfs dfs -du -s -h /path/to/table/name
    
  • 2021-01-31 09:45

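    Here is the quick command:

    ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [noscan];

    For example, if the table is partitioned:

     hive> ANALYZE TABLE ops_bc_log PARTITION(day) COMPUTE STATISTICS noscan;

    According to the Hive documentation, noscan skips reading the files and gathers only the file count and physical size, so omit it if numRows needs to be recomputed.

    The output is: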

    Partition logdata.ops_bc_log{day=20140523} stats: [numFiles=37, numRows=26095186, totalSize=654249957, rawDataSize=58080809507]

    Partition logdata.ops_bc_log{day=20140521} stats: [numFiles=30, numRows=21363807, totalSize=564014889, rawDataSize=47556570705]

    Partition logdata.ops_bc_log{day=20140524} stats: [numFiles=35, numRows=25210367, totalSize=631424507, rawDataSize=56083164109]

    Partition logdata.ops_bc_log{day=20140522} stats: [numFiles=37, numRows=26295075, totalSize=657113440, rawDataSize=58496087068]
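
If you want one table-wide row count from per-partition lines like the ones above, you can total the numRows values outside Hive. The following is a minimal Python sketch, assuming the output has exactly the format shown in this answer. The names `parse_partition_stats` and `total_rows` are hypothetical helpers, not part of Hive.

```python
import re

# Assumption: each stats line looks like the ANALYZE TABLE output above, e.g.
# Partition logdata.ops_bc_log{day=20140523} stats: [numFiles=37, numRows=26095186, ...]
STATS_LINE = re.compile(r"Partition\s+(\S+?)\{(.*?)\}\s+stats:\s+\[(.*?)\]")


def parse_partition_stats(text):
    """Return {partition_spec: {stat_name: int}} for every stats line in text."""
    result = {}
    for match in STATS_LINE.finditer(text):
        _table, spec, stats = match.groups()
        result[spec] = {
            key.strip(): int(value)
            for key, value in (item.split("=", 1) for item in stats.split(","))
        }
    return result


def total_rows(text):
    """Sum numRows over all partitions found in text (hypothetical helper)."""
    return sum(s.get("numRows", 0) for s in parse_partition_stats(text).values())


# The four lines quoted in this answer.
SAMPLE = """
Partition logdata.ops_bc_log{day=20140523} stats: [numFiles=37, numRows=26095186, totalSize=654249957, rawDataSize=58080809507]
Partition logdata.ops_bc_log{day=20140521} stats: [numFiles=30, numRows=21363807, totalSize=564014889, rawDataSize=47556570705]
Partition logdata.ops_bc_log{day=20140524} stats: [numFiles=35, numRows=25210367, totalSize=631424507, rawDataSize=56083164109]
Partition logdata.ops_bc_log{day=20140522} stats: [numFiles=37, numRows=26295075, totalSize=657113440, rawDataSize=58496087068]
"""

if __name__ == "__main__":
    print(total_rows(SAMPLE))  # prints 98964435
```

The total is only as accurate as the stored statistics, so treat it as an estimate unless the statistics were gathered after the last data load.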

  • 2021-01-31 09:49

    Here is a solution, though not a quick one.
    If the table is partitioned, we can count the rows in each partition and add up the per-partition counts.
    For example, if the table is partitioned by date (mm-dd-yyyy):

    select <partition_column_name>, count(*) from <table_name> where <partition_column_name> >= '05-14-2018' group by <partition_column_name>;
    