Fetching distinct values on a column using Spark DataFrame

余生分开走 2021-01-31 15:07

Using Spark 1.6.1, I need to fetch the distinct values of a column and then perform a specific transformation on them. The column contains more than 50 million records.

2 Answers
  •  清酒与你
    2021-01-31 15:50

    This solution demonstrates how to transform data with Spark native functions, which are generally preferable to UDFs because the Catalyst optimizer can optimize them. It also demonstrates how dropDuplicates is more suitable than distinct for certain queries.

    Suppose you have this DataFrame:

    +-------+-------------+
    |country|    continent|
    +-------+-------------+
    |  china|         asia|
    | brazil|south america|
    | france|       europe|
    |  china|         asia|
    +-------+-------------+
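
To follow along, the example DataFrame can be built by hand. This is a minimal sketch, assuming Spark 2.x or later and a local session (on Spark 1.6 a SQLContext would play the role of SparkSession); the app name is hypothetical and only there for illustration. It is written to be pasted into spark-shell or run as a script.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+ running locally. In spark-shell, getOrCreate()
// simply returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("distinct-example") // hypothetical name
  .getOrCreate()

// Brings toDF into scope for local Scala collections.
import spark.implicits._

// Same four rows as the table above, including the duplicate "china" row.
val df = Seq(
  ("china", "asia"),
  ("brazil", "south america"),
  ("france", "europe"),
  ("china", "asia")
).toDF("country", "continent")
```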
    

    Here's how to take all the distinct countries and run a transformation:

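    import org.apache.spark.sql.functions.{col, concat, lit}

    df
      .select("country")  // keep only the column of interest
      .distinct           // one row per distinct country
      .withColumn("country", concat(col("country"), lit(" is fun!")))
      .show()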
    
    +--------------+
    |       country|
    +--------------+
    |brazil is fun!|
    |france is fun!|
    | china is fun!|
    +--------------+
    

    You can use dropDuplicates instead of distinct if you don't want to lose the continent information:

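    df
      // The Seq overload is available on Spark 1.6 as well as 2.x;
      // dropDuplicates("country") requires Spark 2.0 or later.
      .dropDuplicates(Seq("country"))
      .withColumn("description", concat(col("country"), lit(" is a country in "), col("continent")))
      .show(false)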
    
    +-------+-------------+------------------------------------+
    |country|continent    |description                         |
    +-------+-------------+------------------------------------+
    |brazil |south america|brazil is a country in south america|
    |france |europe       |france is a country in europe       |
    |china  |asia         |china is a country in asia          |
    +-------+-------------+------------------------------------+
    

    See here for more information about filtering DataFrames and here for more information on dropping duplicates.

    Ultimately, you'll want to wrap your transformation logic in custom transformations that can be chained with the Dataset#transform method.
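
One way this could look is sketched below. The function names `withDistinctCountries` and `withDescription` are hypothetical, as are the local session and the hand-built DataFrame; the sketch assumes Spark 2.x or later and is meant for spark-shell or a script, not as the definitive way to structure the code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat, lit}

// Assumption: a local Spark 2.x+ session; the app name is hypothetical.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("transform-example")
  .getOrCreate()
import spark.implicits._

// Same rows as the example DataFrame in this answer.
val df = Seq(
  ("china", "asia"),
  ("brazil", "south america"),
  ("france", "europe"),
  ("china", "asia")
).toDF("country", "continent")

// Hypothetical custom transformation: keep one row per country.
def withDistinctCountries(df: DataFrame): DataFrame =
  df.dropDuplicates(Seq("country"))

// Hypothetical custom transformation: add a description column
// built from the country and continent columns.
def withDescription(df: DataFrame): DataFrame =
  df.withColumn(
    "description",
    concat(col("country"), lit(" is a country in "), col("continent"))
  )

// Each step is a plain DataFrame => DataFrame function, so steps can be
// chained with transform and tested in isolation.
val result = df
  .transform(withDistinctCountries)
  .transform(withDescription)

result.show(false)
```

Keeping each step as a small function from DataFrame to DataFrame makes the pipeline easier to reorder and to unit test than one long chain of column expressions.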
