Question
I have to get the size of an HDFS folder (one that contains subdirectories) from Java.
From the command line we can use the -dus option, but can anyone help me with how to get the same using Java?
Answer 1:
The getSpaceConsumed() function in the ContentSummary class returns the actual space the file/directory occupies in the cluster, i.e. it takes into account the replication factor set for the cluster.
For instance, if the replication factor in the Hadoop cluster is set to 3 and the directory size is 1.5 GB, getSpaceConsumed() will return the value 4.5 GB.
Using the getLength() function in the ContentSummary class will return the actual file/directory size.
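The arithmetic behind that 1.5 GB / 4.5 GB example can be sketched directly (a toy illustration only; the real numbers come from ContentSummary, and under-replicated or small files can deviate from this simple product):

```java
public class ReplicationMath {
    // For a fully replicated directory, the space consumed on disk is the
    // logical length (getLength()) multiplied by the replication factor.
    static long spaceConsumed(long logicalBytes, int replicationFactor) {
        return logicalBytes * replicationFactor;
    }

    public static void main(String[] args) {
        long dirSize = 1_610_612_736L; // 1.5 GB logical size, as getLength() would report
        // prints 4831838208, i.e. 4.5 GB, matching getSpaceConsumed() at replication 3
        System.out.println(spaceConsumed(dirSize, 3));
    }
}
```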
Answer 2:
You could use the getContentSummary(Path f) method provided by the FileSystem class. It returns a ContentSummary object, on which the getSpaceConsumed() method can be called to give you the size of the directory in bytes.
Usage:
package org.myorg.hdfsdemo;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetDirSize {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        Configuration config = new Configuration();
        config.addResource(new Path(
                "/hadoop/projects/hadoop-1.0.4/conf/core-site.xml"));
        config.addResource(new Path(
                "/hadoop/projects/hadoop-1.0.4/conf/hdfs-site.xml"));
        FileSystem fs = FileSystem.get(config);
        Path filenamePath = new Path("/inputdir");
        System.out.println("SIZE OF THE HDFS DIRECTORY : "
                + fs.getContentSummary(filenamePath).getSpaceConsumed());
    }
}
HTH
Answer 3:
Thank you, guys. Here is a Scala version:
package com.beloblotskiy.hdfsstats.model.hdfs

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

import com.beloblotskiy.hdfsstats.common.Settings

/**
 * HDFS utilities
 * @author v-abelablotski
 */
object HdfsOps {
  private val conf = new Configuration()
  conf.addResource(new Path(Settings.pathToCoreSiteXml))
  conf.addResource(new Path(Settings.pathToHdfsSiteXml))
  private val fs = FileSystem.get(conf)

  /**
   * Calculates disk usage with the replication factor.
   * If this function returns 3 GB for a folder with replication factor = 3,
   * it means HDFS holds 1 GB of file data, tripled by the 3 stored copies.
   */
  def duWithReplication(path: String): Long = {
    fs.getContentSummary(new Path(path)).getSpaceConsumed()
  }

  /**
   * Calculates disk usage without taking the replication factor into account.
   * The result matches `hadoop fs -du /hdfs/path/to/directory`.
   */
  def du(path: String): Long = {
    fs.getContentSummary(new Path(path)).getLength()
  }

  //...
}
Answer 4:
Spark-shell tool to show all tables and their consumption
A typical, illustrative spark-shell script that loops over all databases, tables, and partitions to collect their sizes and report them into a CSV file:
// sshell -i script.scala > ls.csv
import org.apache.hadoop.fs.{FileSystem, Path}

def cutPath(thePath: String, toCut: Boolean = true): String =
  if (toCut) thePath.replaceAll("^.+/", "") else thePath

val warehouse = "/apps/hive/warehouse" // the Hive default location for all databases
val fs = FileSystem.get(sc.hadoopConfiguration)

println(s"base,table,partitions,bytes")
fs.listStatus(new Path(warehouse)).foreach(x => {
  val b = x.getPath.toString
  fs.listStatus(new Path(b)).foreach(x => {
    val t = x.getPath.toString
    var parts = 0; var size = 0L // var size3 = 0L
    fs.listStatus(new Path(t)).foreach(x => {
      // the partition path is x.getPath.toString
      val p_cont = fs.getContentSummary(x.getPath)
      parts = parts + 1
      size = size + p_cont.getLength
      // size3 = size3 + p_cont.getSpaceConsumed
    }) // t loop
    println(s"${cutPath(b)},${cutPath(t)},${parts},${size}")
    // display option: org.apache.commons.io.FileUtils.byteCountToDisplaySize(size)
  }) // b loop
}) // warehouse loop
System.exit(0) // get out of spark-shell
PS: I checked: size3 is always 3*size, so it carries no extra information.
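The commons-io helper mentioned in the comment above turns raw byte counts into readable strings. If you would rather avoid that dependency, a plain-Java equivalent is only a few lines (a sketch; it uses the same binary-unit, whole-number convention as commons-io's byteCountToDisplaySize):

```java
public class ByteFormat {
    // Formats a byte count using binary units (1 KB = 1024 bytes),
    // truncating to a whole number of the largest fitting unit.
    static String toDisplaySize(long bytes) {
        final long K = 1024L, M = K * 1024, G = M * 1024, T = G * 1024;
        if (bytes >= T) return (bytes / T) + " TB";
        if (bytes >= G) return (bytes / G) + " GB";
        if (bytes >= M) return (bytes / M) + " MB";
        if (bytes >= K) return (bytes / K) + " KB";
        return bytes + " bytes";
    }

    public static void main(String[] args) {
        System.out.println(toDisplaySize(1_610_612_736L)); // prints "1 GB"
    }
}
```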
Source: https://stackoverflow.com/questions/16581327/get-folder-size-of-hdfs-from-java