How to melt Spark DataFrame?

日久生厌 2020-11-22 02:57

Is there an equivalent of the pandas melt function in Apache Spark, either in PySpark or at least in Scala?

I was running a sample dataset till now in python and now I want to u

4 Answers
  •  长发绾君心
    2020-11-22 03:32

    There is no built-in function. (If you work with SQL and have Hive support enabled you can use the stack function, but it is not exposed in Spark and has no native implementation.) It is, however, trivial to roll your own. Required imports:

    from pyspark.sql.functions import array, col, explode, lit, struct
    from pyspark.sql import DataFrame
    from typing import Iterable 
    

    Example implementation:

    def melt(
            df: DataFrame,
            id_vars: Iterable[str], value_vars: Iterable[str],
            var_name: str="variable", value_name: str="value") -> DataFrame:
        """Convert :class:`DataFrame` from wide to long format."""
    
        # Create array<struct<variable: str, value: ...>>
        # (all value_vars must share a type, since array elements must match)
        _vars_and_vals = array(*(
            struct(lit(c).alias(var_name), col(c).alias(value_name))
            for c in value_vars))
    
        # Add to the DataFrame and explode
        _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
    
        # list() so that id_vars may be any iterable, not only a list
        cols = list(id_vars) + [
                col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
        return _tmp.select(*cols)
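
    To see what the array-plus-explode step does without starting a Spark session, the same reshaping can be modelled on plain Python dictionaries. This is only an illustrative sketch: melt_rows is a hypothetical helper, not part of PySpark, and it mirrors the logic of the function above rather than Spark's actual execution.

```python
from typing import Any, Dict, Iterable, List


def melt_rows(
        rows: Iterable[Dict[str, Any]],
        id_vars: Iterable[str], value_vars: Iterable[str],
        var_name: str = "variable", value_name: str = "value") -> List[Dict[str, Any]]:
    """Plain-Python model of melt: one output row per (input row, value column)."""
    id_vars, value_vars = list(id_vars), list(value_vars)
    out = []
    for row in rows:
        # Models array(struct(lit(c), col(c)) for c in value_vars)
        vars_and_vals = [{var_name: c, value_name: row[c]} for c in value_vars]
        # Models explode: emit one row per struct, keeping the id columns
        for pair in vars_and_vals:
            out.append({**{c: row[c] for c in id_vars}, **pair})
    return out


rows = [{"A": "a", "B": 1, "C": 2}, {"A": "b", "B": 3, "C": 4}]
for r in melt_rows(rows, id_vars=["A"], value_vars=["B", "C"]):
    print(r)
```

    In this model the long rows come out grouped by input row (a/B, a/C, b/B, b/C), whereas pandas groups them by variable. The content is the same; only the row order differs.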
    

    And some tests (based on Pandas doctests):

    import pandas as pd
    
    pdf = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                       'B': {0: 1, 1: 3, 2: 5},
                       'C': {0: 2, 1: 4, 2: 6}})
    
    pd.melt(pdf, id_vars=['A'], value_vars=['B', 'C'])
    
       A variable  value
    0  a        B      1
    1  b        B      3
    2  c        B      5
    3  a        C      2
    4  b        C      4
    5  c        C      6
    
    sdf = spark.createDataFrame(pdf)
    melt(sdf, id_vars=['A'], value_vars=['B', 'C']).show()
    
    +---+--------+-----+
    |  A|variable|value|
    +---+--------+-----+
    |  a|       B|    1|
    |  a|       C|    2|
    |  b|       B|    3|
    |  b|       C|    4|
    |  c|       B|    5|
    |  c|       C|    6|
    +---+--------+-----+
    

    Note: For use with legacy Python versions remove type annotations.
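
    The stack alternative mentioned at the top of this answer can be sketched as a SQL expression string. This assumes a Spark setup whose SQL dialect provides the stack generator; stack_expr is a hypothetical helper that only builds the string, so it runs without Spark, and it does not escape column names containing quotes or backticks.

```python
from typing import Iterable


def stack_expr(
        value_vars: Iterable[str],
        var_name: str = "variable", value_name: str = "value") -> str:
    """Build a SQL `stack` expression that unpivots `value_vars`."""
    value_vars = list(value_vars)
    # Each column contributes a (label, value) pair, e.g. 'B', `B`
    pairs = ", ".join("'{0}', `{0}`".format(c) for c in value_vars)
    return "stack({n}, {pairs}) as (`{var}`, `{val}`)".format(
        n=len(value_vars), pairs=pairs, var=var_name, val=value_name)


expr = stack_expr(["B", "C"])
print(expr)

# Assumed usage with the `sdf` DataFrame from the tests above:
# sdf.selectExpr("A", expr).show()
```

    As with the array-based version, the stacked columns need a common type.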

    Related:

    • r sparkR - equivalent to melt function
    • Gather in sparklyr
