I know du -sh works on common Linux filesystems, but how do I do the same with HDFS?
When you try to calculate the total size of a particular group of files within a directory, the -s
option does not help (in Hadoop 2.7.1): it summarizes each matched path separately instead of printing one total. For example:
Directory structure:
some_dir
├abc.txt
├count1.txt
├count2.txt
└def.txt
Assume each file is 1 KB in size. You can summarize the entire directory with:
hdfs dfs -du -s some_dir
4096 some_dir
However, if I want the sum of all files whose names contain "count", the command falls short:
hdfs dfs -du -s some_dir/count*
1024 some_dir/count1.txt
1024 some_dir/count2.txt
To get around this, I usually pass the output through awk:
hdfs dfs -du some_dir/count* | awk '{ total+=$1 } END { print total }'
2048
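If you also want the human-readable part of du -sh, the same awk step can scale the total. This is a minimal sketch, assuming the first column of the du output is the size in bytes (as in the output above). sum_du is a hypothetical helper name, not part of Hadoop.

```shell
# sum_du (hypothetical helper): reads `hdfs dfs -du`-style lines on stdin,
# adds up the first column (bytes) and prints the total in binary units.
sum_du() {
  LC_ALL=C awk '
    { total += $1 }
    END {
      split("B K M G T P", unit, " ")
      i = 1
      # Divide by 1024 until the value fits the current unit.
      while (total >= 1024 && i < 6) { total /= 1024; i++ }
      printf "%.1f %s\n", total, unit[i]
    }'
}

# Usage against a cluster (needs the hdfs client on PATH):
#   hdfs dfs -du some_dir/count* | sum_du
```

With the two 1 KB files from the example, this should print 2.0 K. For a whole directory you may not need awk at all, since hdfs dfs -du -s -h some_dir prints a human-readable summary directly.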