Question
I am new to Spark, and I want to do a broadcast join. Before that, I am trying to get the size of the DataFrame that I want to broadcast.
Is there any way to find the size of a DataFrame?
I am using Python as my programming language for Spark.
Any help is much appreciated.
Answer 1:
If you are looking for the size in bytes as well as the size in row count, follow this (the examples below are in Scala, but the SQL statements can be issued unchanged through `spark.sql` from PySpark):
Alternative-1
// ### Alternative 1
/**
* file content
* spark-test-data.json
* --------------------
* {"id":1,"name":"abc1"}
* {"id":2,"name":"abc2"}
* {"id":3,"name":"abc3"}
*/
import org.apache.spark.sql.functions.col

val fileName = "spark-test-data.json"
val path = getClass.getResource("/" + fileName).getPath
spark.catalog.createTable("df", path, "json")
.show(false)
/**
* +---+----+
* |id |name|
* +---+----+
* |1 |abc1|
* |2 |abc2|
* |3 |abc3|
* +---+----+
*/
// Collect only statistics that do not require scanning the whole table (that is, size in bytes).
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS NOSCAN")
spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)
/**
* +----------+---------+-------+
* |col_name |data_type|comment|
* +----------+---------+-------+
* |Statistics|68 bytes | |
* +----------+---------+-------+
*/
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS")
spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)
/**
* +----------+----------------+-------+
* |col_name |data_type |comment|
* +----------+----------------+-------+
* |Statistics|68 bytes, 3 rows| |
* +----------+----------------+-------+
*/
Alternative-2
// ### Alternative 2
val df = spark.range(10)
df.createOrReplaceTempView("myView")
spark.sql("explain cost select * from myView").show(false)
/**
* +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
* |plan |
* +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
* |== Optimized Logical Plan ==
* Range (0, 10, step=1, splits=Some(2)), Statistics(sizeInBytes=80.0 B, hints=none)
*
* == Physical Plan ==
* *(1) Range (0, 10, step=1, splits=2)|
* +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
*/
Alternative-3
// ### Alternative 3
println(spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats.sizeInBytes)
// 80
Source: https://stackoverflow.com/questions/62461550/how-to-get-the-size-of-a-data-frame-before-doing-the-broadcast-join-in-pyspark