In Java code, I want to connect to a directory in HDFS, learn the number of files in that directory, get their names, and read them. I can already read the files, but I couldn't figure out how to count the files in a directory and get the file names like an ordinary directory.
hadoop fs -du [-s] [-h] [-x] URI [URI ...]
Displays sizes of files and directories contained in the given directory, or the length of a file in case it's just a file.
Options:
The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files. Without the -s option, calculation is done by going 1-level deep from the given path.
The -h option will format file sizes in a “human-readable” fashion (e.g., 64.0m instead of 67108864).
The -x option will exclude snapshots from the result calculation. Without the -x option (default), the result is always calculated from all INodes, including all snapshots under the given path.
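If you need the same size information from the Java side, getContentSummary() exposes it. Here is a minimal sketch; the class name and the path /user/hadoop/dir are placeholders for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DuExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Aggregate size under the path, roughly what `hadoop fs -du -s` prints.
        // "/user/hadoop/dir" is a placeholder path.
        ContentSummary cs = fs.getContentSummary(new Path("/user/hadoop/dir"));
        System.out.println("total length (bytes): " + cs.getLength());
        System.out.println("space consumed (with replication): " + cs.getSpaceConsumed());
    }
}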
You can use the following to check the file count in that particular directory:
hadoop fs -count /directoryPath/* | awk '{print $2}' | wc -l
count : counts the number of directories, files, and bytes under the paths
awk '{print $2}' : prints the second column (the file count) of each output line
wc -l : counts the resulting lines
On the command line, you can do it as below (the path is the 8th column of hdfs dfs -ls output):
hdfs dfs -ls $parentdirectory | awk '{system("hdfs dfs -count " $8) }'
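The Java counterpart of that loop is to list the parent directory and summarize each child yourself. A sketch, with the class name and /user/hadoop/parent made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CountPerChild {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // "/user/hadoop/parent" is a placeholder for $parentdirectory.
        for (FileStatus child : fs.listStatus(new Path("/user/hadoop/parent"))) {
            ContentSummary cs = fs.getContentSummary(child.getPath());
            // Mirrors `hdfs dfs -count`: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
            System.out.printf("%d %d %d %s%n",
                cs.getDirectoryCount(), cs.getFileCount(), cs.getLength(), child.getPath());
        }
    }
}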
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path pt = new Path("/path");
// getContentSummary() aggregates counts for the whole subtree under the path
ContentSummary cs = fs.getContentSummary(pt);
long fileCount = cs.getFileCount();
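A nice property of getContentSummary() on HDFS is that the recursion happens on the NameNode side, so you get the count in a single call instead of listing the tree from the client; the same ContentSummary also exposes getDirectoryCount() and getLength() if you need more than the file count.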
count
Usage: hadoop fs -count [-q] <paths>
Count the number of directories, files, and bytes under the paths that match the specified file pattern. The output columns are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME.
The output columns with -q are: QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME.
Example:
hadoop fs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hadoop fs -count -q hdfs://nn1.example.com/file1
Exit Code:
Returns 0 on success and -1 on error.
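If you would rather reuse the shell's -count implementation from Java instead of reimplementing it, FsShell can be driven programmatically via ToolRunner. A sketch, with a placeholder path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

public class CountFromJava {
    public static void main(String[] args) throws Exception {
        // Runs the equivalent of `hadoop fs -count /user/hadoop/dir` and
        // prints DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME to stdout.
        int exitCode = ToolRunner.run(new Configuration(), new FsShell(),
                new String[] { "-count", "/user/hadoop/dir" });
        System.exit(exitCode);
    }
}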
You can just use the FileSystem API and iterate over the files under the path. Here is some example code:
int count = 0;
FileSystem fs = FileSystem.get(getConf());
boolean recursive = false;
RemoteIterator<LocatedFileStatus> ri = fs.listFiles(new Path("hdfs://my/path"), recursive);
while (ri.hasNext()) {
    ri.next(); // each element is one file; directories are not returned
    count++;
}
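Since the question also asks for the file names and their contents, the same iterator gives you both. A sketch that assumes the files are plain text and uses /user/hadoop/dir as a placeholder path:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListAndRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        int count = 0;
        RemoteIterator<LocatedFileStatus> ri =
                fs.listFiles(new Path("/user/hadoop/dir"), false); // placeholder path
        while (ri.hasNext()) {
            LocatedFileStatus status = ri.next();
            count++;
            System.out.println("name: " + status.getPath().getName());
            // Read the file, assuming text content.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(status.getPath()), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
        System.out.println("file count: " + count);
    }
}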
To do a quick and simple count, you can also try the following one-liner:
hdfs dfs -ls -R /path/to/your/directory/ | grep -E '^-' | wc -l
Quick explanation:
grep -E '^-' (or egrep '^-') : keeps only the files, since file entries start with '-' whereas directories start with 'd';
wc -l : counts the matching lines.
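For completeness, here is a Java counterpart of this one-liner that walks the tree with listStatus() and counts only the file entries (the '^-' lines); the class name and path are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecursiveFileCount {
    // Walk the tree like `-ls -R` and count entries that are files.
    static long countFiles(FileSystem fs, Path dir) throws java.io.IOException {
        long count = 0;
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isDirectory()) {
                count += countFiles(fs, status.getPath());
            } else {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        System.out.println(countFiles(fs, new Path("/path/to/your/directory")));
    }
}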