How do you find the size of an HDFS file? What command should be used to find the size of a file in HDFS?
If you want to do it through the API, you can use the getFileStatus() method.
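For example, a minimal sketch (class name and argument handling are illustrative; it assumes the default Configuration picks up your cluster settings from the classpath) that reads a file's length with getFileStatus():

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSizeExample
{
    public static void main(String[] args) throws IOException
    {
        // Load the default HDFS configuration (core-site.xml / hdfs-site.xml on the classpath)
        Configuration conf = new Configuration();
        Path file = new Path(args[0]);
        FileSystem fs = file.getFileSystem(conf);
        // getFileStatus() returns metadata for a single path; getLen() is its size in bytes
        FileStatus status = fs.getFileStatus(file);
        System.out.println(file + " is " + status.getLen() + " bytes");
    }
}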
The commands below pipe hadoop fs -du through a small awk script to report the total size (in GB) of the matched files in HDFS:
hadoop fs -du -s /data/ClientDataNew/*A* | awk '{s+=$1} END {printf "%.3fGB\n", s/1000000000}'
output ---> 2.089GB
hadoop fs -du -s /data/ClientDataNew/*B* | awk '{s+=$1} END {printf "%.3fGB\n", s/1000000000}'
output ---> 1.724GB
hadoop fs -du -s /data/ClientDataNew/*C* | awk '{s+=$1} END {printf "%.3fGB\n", s/1000000000}'
output ---> 0.986GB
You can use the hadoop fs -ls command to list the files in the current directory along with their details. The 5th column of the output contains the file size in bytes.
For example, the command hadoop fs -ls input gives the following output:
Found 1 items
-rw-r--r-- 1 hduser supergroup 45956 2012-07-19 20:57 /user/hduser/input/sou
The size of the file sou is 45956 bytes.
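If you need the same information programmatically, a rough Java equivalent (class name and directory argument are just placeholders) walks the directory with listStatus() and reads each entry's length:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListSizes
{
    public static void main(String[] args) throws IOException
    {
        Configuration conf = new Configuration();
        Path dir = new Path(args[0]);
        FileSystem fs = dir.getFileSystem(conf);
        // listStatus() returns one FileStatus per entry, like the rows of 'hadoop fs -ls'
        for (FileStatus status : fs.listStatus(dir))
        {
            // getLen() corresponds to the 5th column of the -ls output (size in bytes)
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
    }
}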
hdfs dfs -du -s -h /directory
This is the human-readable version; without -h the sizes come out in raw bytes. (Note that -h uses binary units, so its figures come out slightly smaller than the decimal GB computed in the awk examples above.)
I used the function below to get the file size.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetflStatus
{
    // Returns the total length in bytes of the file (or directory tree) at the given path
    public long getflSize(String args) throws IOException
    {
        Configuration config = new Configuration();
        Path path = new Path(args);
        FileSystem hdfs = path.getFileSystem(config);
        // getContentSummary() aggregates the length of every file under the path
        ContentSummary cSummary = hdfs.getContentSummary(path);
        long length = cSummary.getLength();
        return length;
    }
}
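A quick caller for the class above might look like this (the class name GetflStatusDemo is just a placeholder):

import java.io.IOException;

public class GetflStatusDemo
{
    public static void main(String[] args) throws IOException
    {
        // Prints the aggregate size in bytes of the file or directory given as the first argument
        long size = new GetflStatus().getflSize(args[0]);
        System.out.println(args[0] + " -> " + size + " bytes");
    }
}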
I also find myself using hadoop fs -dus <path> a great deal (note that in newer Hadoop releases -dus is deprecated in favor of hadoop fs -du -s). For example, if a directory on HDFS named "/user/frylock/input" contains 100 files and you need the total size of all of those files, you could run:
hadoop fs -dus /user/frylock/input
and you would get back the total size (in bytes) of all of the files in the "/user/frylock/input" directory.
Also, keep in mind that HDFS stores data redundantly, so with the default replication factor of 3 the actual physical storage used by a file can be 3x or more than what is reported by hadoop fs -ls and hadoop fs -dus.
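If you want to see both numbers programmatically, the same ContentSummary used above exposes them side by side; here is a minimal sketch (class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicatedSize
{
    public static void main(String[] args) throws IOException
    {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);
        FileSystem fs = path.getFileSystem(conf);
        ContentSummary summary = fs.getContentSummary(path);
        // getLength(): logical size of the data, as reported by -ls / -dus
        // getSpaceConsumed(): raw bytes used on disk across all replicas
        System.out.println("logical size:   " + summary.getLength() + " bytes");
        System.out.println("space consumed: " + summary.getSpaceConsumed() + " bytes");
    }
}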