PySpark - rename more than one column using withColumnRenamed

别那么骄傲 2020-12-02 09:23

I want to change the names of two columns using the Spark withColumnRenamed function. Of course, I can write:

data = sqlContext.createDataFrame([(1,2), (3,4)], ['x1', 'x2'])
data = (data
       .withColumnRenamed('x1', 'x3')
       .withColumnRenamed('x2', 'x4'))

but I would like to do this in one step (having a list/tuple of new names).

7 Answers
  • 2020-12-02 09:36

    It is not possible to use a single withColumnRenamed call.

    • You can use the DataFrame.toDF method*:

      data.toDF('x3', 'x4')
      

      or

      new_names = ['x3', 'x4']
      data.toDF(*new_names)
      
    • It is also possible to rename with a simple select:

      from pyspark.sql.functions import col
      
      mapping = dict(zip(['x1', 'x2'], ['x3', 'x4']))
      # mapping.get(c, c) falls back to the original name, so columns
      # not listed in the mapping are left unchanged
      data.select([col(c).alias(mapping.get(c, c)) for c in data.columns])
      

    Similarly in Scala you can:

    • Rename all columns:

      val newNames = Seq("x3", "x4")
      
      data.toDF(newNames: _*)
      
    • Rename from mapping with select:

      val mapping = Map("x1" -> "x3", "x2" -> "x4")
      
      df.select(
        df.columns.map(c => df(c).alias(mapping.get(c).getOrElse(c))): _*
      )
      

      or foldLeft + withColumnRenamed:

      mapping.foldLeft(data){
        case (data, (oldName, newName)) => data.withColumnRenamed(oldName, newName) 
      }
      

    * Not to be confused with RDD.toDF, which is not a variadic function and takes column names as a list.
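
    To make the distinction concrete, here is a minimal sketch (assuming an active SparkSession named spark):

      # DataFrame.toDF is variadic: pass the new names as separate arguments
      df = spark.createDataFrame([(1, 2), (3, 4)], ['x1', 'x2'])
      df.toDF('x3', 'x4')
      
      # RDD.toDF takes the column names as a single list instead
      rdd = spark.sparkContext.parallelize([(1, 2), (3, 4)])
      rdd.toDF(['x3', 'x4'])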

  • 2020-12-02 09:47

    Why do you want to perform it in a single call? If you print the execution plan, you can see it is done in a single projection anyway:

    data = spark.createDataFrame([(1,2), (3,4)], ['x1', 'x2'])
    data = (data
       .withColumnRenamed('x1','x3')
       .withColumnRenamed('x2', 'x4'))
    data.explain()
    

    OUTPUT

    == Physical Plan ==
    *(1) Project [x1#1548L AS x3#1552L, x2#1549L AS x4#1555L]
    +- Scan ExistingRDD[x1#1548L,x2#1549L]
    

    If you want to do it with a list of tuples, you can use a simple map function:

    from pyspark.sql import functions as F
    
    data = spark.createDataFrame([(1,2), (3,4)], ['x1', 'x2'])
    new_names = [("x1","x3"),("x2","x4")]
    # zip(*new_names) splits the pairs into ('x1', 'x2') and ('x3', 'x4'),
    # which map walks in parallel, pairing each old name with its new one
    data = data.select(list(
           map(lambda old, new: F.col(old).alias(new), *zip(*new_names))
           ))
    
    data.explain()
    

    This still produces the same plan:

    OUTPUT

    == Physical Plan ==
    *(1) Project [x1#1650L AS x3#1654L, x2#1651L AS x4#1655L]
    +- Scan ExistingRDD[x1#1650L,x2#1651L]
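
    The same select can also be written as a plain list comprehension; a small sketch, equivalent to the map/zip version above:

    data = data.select([F.col(old).alias(new) for old, new in new_names])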
    
  • 2020-12-02 09:49

    This should work if you want to rename multiple columns by adding the same prefix (PREFIX, any string you choose) to each existing name:

    from pyspark.sql import functions as f
    
    df.select([f.col(c).alias(PREFIX + c) for c in df.columns])
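
    For example, a hypothetical run (assuming a SparkSession named spark):

    PREFIX = 'my_'  # hypothetical prefix
    df = spark.createDataFrame([(1, 2)], ['a', 'b'])
    df.select([f.col(c).alias(PREFIX + c) for c in df.columns]).show()
    # the renamed columns come out as my_a and my_b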
    
  • 2020-12-02 09:51

    The easiest way to do this is as follows:

    Explanation:

    1. Get all the columns in the PySpark DataFrame using df.columns
    2. Create a list, looping through each column from step 1
    3. The list will contain entries like col("col1").alias("col1_x"). Do this only for the required columns.
    4. *[list] unpacks the list for the select statement in PySpark

    from pyspark.sql import functions as F
    
    (df
     .select(*[F.col(c).alias(f"{c}_x") for c in df.columns])
     .toPandas().head())

    Hope this helps

  • 2020-12-02 09:54

    I have this hack in all of my PySpark programs:

    from pyspark.sql import DataFrame
    
    def rename_sdf(df, mapper={}, **kwargs_mapper):
        ''' Rename column names of a dataframe
            mapper: a dict mapping from the old column names to new names
            Usage:
                df.rename({'old_col_name': 'new_col_name', 'old_col_name2': 'new_col_name2'})
                df.rename(old_col_name='new_col_name')
        '''
        # apply renames from the dict first, then from any keyword arguments
        for before, after in mapper.items():
            df = df.withColumnRenamed(before, after)
        for before, after in kwargs_mapper.items():
            df = df.withColumnRenamed(before, after)
        return df
    
    # monkey-patch the method onto DataFrame
    DataFrame.rename = rename_sdf
    

    Now you can easily rename any Spark DataFrame the pandas way!

    df.rename({'old1':'new1', 'old2':'new2'})
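
    For instance, a hypothetical session (assuming an active SparkSession named spark):

    df = spark.createDataFrame([(1, 2)], ['old1', 'old2'])
    df.rename({'old1': 'new1'}, old2='new2').columns
    # ['new1', 'new2']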
    
  • 2020-12-02 09:55

    The accepted answer by zero323 is efficient. Most of the other answers should be avoided.

    Here's another efficient solution that leverages the quinn library and is well suited for production codebases:

    import quinn
    
    df = spark.createDataFrame([(1,2), (3,4)], ['x1', 'x2'])
    
    def rename_col(s):
        mapping = {'x1': 'x3', 'x2': 'x4'}
        return mapping[s]
    
    actual_df = df.transform(quinn.with_columns_renamed(rename_col))
    actual_df.show()
    

    Here's the DataFrame that's outputted:

    +---+---+
    | x3| x4|
    +---+---+
    |  1|  2|
    |  3|  4|
    +---+---+
    

    Let's take a look at the logical plans that are outputted with actual_df.explain(True) and verify they're efficient:

    == Parsed Logical Plan ==
    'Project ['x1 AS x3#52, 'x2 AS x4#53]
    +- LogicalRDD [x1#48L, x2#49L], false
    
    == Analyzed Logical Plan ==
    x3: bigint, x4: bigint
    Project [x1#48L AS x3#52L, x2#49L AS x4#53L]
    +- LogicalRDD [x1#48L, x2#49L], false
    
    == Optimized Logical Plan ==
    Project [x1#48L AS x3#52L, x2#49L AS x4#53L]
    +- LogicalRDD [x1#48L, x2#49L], false
    
    == Physical Plan ==
    *(1) Project [x1#48L AS x3#52L, x2#49L AS x4#53L]
    

    The parsed logical plan and physical plan are basically equal, so Catalyst isn't doing any heavy lifting to optimize the plan.

    Calling withColumnRenamed multiple times should be avoided because it creates an inefficient parsed plan that needs to be optimized.

    Let's look at an unnecessarily complex parsed plan:

    def rename_columns(df, columns):
        for old_name, new_name in columns.items():
            df = df.withColumnRenamed(old_name, new_name)
        return df
    
    actual_df = rename_columns(df, {'x1': 'x3', 'x2': 'x4'})
    actual_df.explain(True)
    
    == Parsed Logical Plan ==
    Project [x3#52L, x2#49L AS x4#55L]
    +- Project [x1#48L AS x3#52L, x2#49L]
       +- LogicalRDD [x1#48L, x2#49L], false
    
    == Analyzed Logical Plan ==
    x3: bigint, x4: bigint
    Project [x3#52L, x2#49L AS x4#55L]
    +- Project [x1#48L AS x3#52L, x2#49L]
       +- LogicalRDD [x1#48L, x2#49L], false
    
    == Optimized Logical Plan ==
    Project [x1#48L AS x3#52L, x2#49L AS x4#55L]
    +- LogicalRDD [x1#48L, x2#49L], false
    
    == Physical Plan ==
    *(1) Project [x1#48L AS x3#52L, x2#49L AS x4#55L]
    

    Read this blog post for a detailed description of the different approaches to renaming PySpark columns.
