问题
I want to iterate across the columns of dataframe in my Spark program and calculate min and max value. I'm new to Spark and scala and not able to iterate over the columns once I fetch it in a dataframe.
I have tried running the below code but it needs column number to be passed to it, question is how do I fetch it from dataframe and pass it dynamically and store the result in a collection.
val parquetRDD = spark.read.parquet("filename.parquet")
parquetRDD.collect.foreach ({ i => parquetRDD_subset.agg(max(parquetRDD(parquetRDD.columns(2))), min(parquetRDD(parquetRDD.columns(2)))).show()})
Appreciate any help on this.
回答1:
You should not be iterating on rows or records. You should be using aggregation function
import org.apache.spark.sql.functions._
val df = spark.read.parquet("filename.parquet")
val aggCol = col(df.columns(2))
df.agg(min(aggCol), max(aggCol)).show()
First when you do spark.read.parquet you are reading a dataframe. Next we define the column we want to work on using the col function. The col function translate a column name to a column. You could instead use df("name") where name is the name of the column.
The agg function takes aggregation columns so min and max are aggregation functions which take a column and return a column with an aggregated value.
Update
According to the comments, the goal is to have min and max for all columns. You can therefore do this:
val minColumns = df.columns.map(name => min(col(name)))
val maxColumns = df.columns.map(name => max(col(name)))
val allMinMax = minColumns ++ maxColumns
df.agg(allMinMax.head, allMinMax.tail: _*).show()
You can also simply do:
df.describe().show()
which gives you statistics on all columns including min, max, avg, count and stddev
来源:https://stackoverflow.com/questions/45171920/iterate-across-columns-in-spark-dataframe-and-calculate-min-max-value