Fetching distinct values on a column using Spark DataFrame

余生分开走 2021-01-31 15:07

Using Spark 1.6.1, I need to fetch the distinct values of a column and then perform a specific transformation on them. The column contains more than 50 million records.

2 answers
  •  庸人自扰
    2021-01-31 15:35

    To get all the distinct values of a DataFrame column, select that column and call distinct. As the documentation shows, that method returns another DataFrame. You can then define a UDF to transform each record.

    For example:

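    // In spark-shell, sc and sqlContext are already defined.
    import org.apache.spark.sql.functions.{col, udf}
    import sqlContext.implicits._ // needed for toDF

    val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("age", "salary")

    // Get the distinct values of the column. Calling show() on this should list only 1 and 3.
    val distinctValuesDF = df.select(df("age")).distinct

    // Define your UDF. This one is trivial, but a UDF can be as complex as you need.
    // The parameter type must be explicit; with Int, value / 10 is integer division.
    val myTransformationUDF = udf((value: Int) => value / 10)

    // Apply the transformation to the distinct values
    val afterTransformationDF = distinctValuesDF.select(myTransformationUDF(col("age")))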
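
    For a transformation as simple as the one above, a UDF may not be needed at all. The following is a minimal sketch, assuming a local Spark 1.6 setup; the object name DistinctWithoutUdf, the helper scaledDistinctAges and the output column name age_scaled are made up for illustration. It applies a built-in column expression to the distinct values instead of a UDF.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

// Hypothetical object and method names, used only for this sketch.
object DistinctWithoutUdf {

  // Select the distinct values of "age", then transform them with a
  // built-in column expression instead of a UDF.
  def scaledDistinctAges(df: DataFrame): DataFrame = {
    val distinctAges = df.select(col("age")).distinct()
    // Note: the Column "/" operator performs fractional division, so this
    // does not truncate the way an Int-based UDF with value / 10 does.
    distinctAges.select((col("age") / 10).as("age_scaled"))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for trying the sketch outside spark-shell.
    val conf = new SparkConf().setAppName("distinct-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // needed for toDF

    val df = sc.parallelize(Seq((1, 2), (3, 4), (1, 6))).toDF("age", "salary")

    // The result stays a distributed DataFrame; nothing is collected to the driver.
    scaledDistinctAges(df).show()

    sc.stop()
  }
}
```

    Because both distinct and the column expression return DataFrames, the work stays distributed, which matters with tens of millions of records. Keep the UDF approach when the transformation cannot be expressed with built-in functions.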
    
