I know that MSCK REPAIR TABLE updates the metastore with the current partitions of an external table. To do that, you only need to do ls on the
You are right in the sense that it reads the directory structure, creates partitions from it, and then updates the Hive metastore. In fact, the command was more recently improved to remove non-existent partitions from the metastore as well. The example you give is very simple because it has only one level of partition keys. Consider a table with multiple partition keys (2-3 partition keys are common in practice). msck repair will have to do a full-tree traversal of all the sub-directories under the table directory, parse the file names, make sure that the file names are valid, check whether each partition already exists in the metastore, and then add only the partitions that are not present in the metastore.

Note that each listing on the filesystem is an RPC to the namenode (in the case of HDFS) or a web-service call (in the case of S3 or ADLS), which can add up to a significant amount of time. Additionally, in order to figure out whether a partition is already present in the metastore, it needs to do a full listing of all the partitions the metastore knows of for the table. Both of these steps can increase the time taken by the command on large tables. The performance of msck repair table was improved considerably in Hive 2.3.0 (see HIVE-15879 for more details). You may want to tune hive.metastore.fshandler.threads and hive.metastore.batch.retrieve.max to improve the performance of the command.
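To make the cost concrete, here is a minimal Python sketch of the two steps described above: walking the whole directory tree to find valid key=value partition paths, then diffing them against the partitions the metastore already knows. This is only an illustration on a local filesystem, not Hive's actual implementation. The function names (discover_partitions, diff_partitions, make_demo_table) and the demo layout are hypothetical, and the "metastore" is just a Python set standing in for the real partition listing.

```python
import os
import tempfile


def discover_partitions(table_dir, partition_keys):
    """Walk table_dir and return the set of valid partition specs found on disk.

    A spec is a tuple of (key, value) pairs, one per partition key, in order.
    Every directory visited costs one listing; on HDFS or S3 that would be a
    remote call, which is why deep trees are slow.
    """
    found = set()
    depth = len(partition_keys)
    for root, dirs, _files in os.walk(table_dir):
        rel = os.path.relpath(root, table_dir)
        parts = [] if rel == "." else rel.split(os.sep)
        if len(parts) < depth:
            continue  # not at partition depth yet, keep descending
        dirs[:] = []  # do not descend below the partition level
        spec = []
        for key, part in zip(partition_keys, parts):
            name, sep, value = part.partition("=")
            if sep != "=" or name != key or not value:
                break  # invalid directory name, skip this path
            spec.append((key, value))
        else:
            found.add(tuple(spec))
    return found


def diff_partitions(on_disk, in_metastore):
    """Return (partitions to add, partitions to drop) as sorted lists."""
    return sorted(on_disk - in_metastore), sorted(in_metastore - on_disk)


def make_demo_table(root):
    """Create a small hypothetical table layout with two partition keys."""
    for rel in ("year=2023/month=01", "year=2023/month=02",
                "year=2024/month=01", "junk/month=01"):
        os.makedirs(os.path.join(root, *rel.split("/")))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_demo_table(root)
        on_disk = discover_partitions(root, ["year", "month"])
        # Stand-in for the full partition listing fetched from the metastore.
        known = {
            (("year", "2023"), ("month", "01")),
            (("year", "2022"), ("month", "12")),
        }
        to_add, to_drop = diff_partitions(on_disk, known)
        print("add:", to_add)
        print("drop:", to_drop)
```

Even in this toy version, the work grows with the number of directories in the tree plus the number of partitions already registered, which is the same shape of cost the real command pays per remote listing call.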