How to change a column position in a Spark DataFrame?

无人共我 2020-12-08 19:33

I was wondering if it is possible to change the position of a column in a dataframe, that is, to change the schema?

Precisely if I have got a dataframe like [f

6 Answers
  • 2020-12-08 19:40

    If you are starting from a dynamic frame (e.g. an AWS Glue DynamicFrame), first convert it to a DataFrame so you can use standard PySpark functions:

    data_frame = dynamic_frame.toDF()
    

    Now rearrange the columns into a new DataFrame using select:

    data_frame_temp = data_frame.select(["col_5","col_1","col_2","col_3","col_4"])
    
  • 2020-12-08 19:43

    Here's what you can do in pyspark:

    As with MySQL queries, you can simply re-select the columns in the order you want; the resulting DataFrame comes back with its columns in the same order you passed to select.

    from pyspark.sql import SparkSession
    
    data = [
        {'id': 1, 'sex': 1, 'name': 'foo', 'age': 13},
        {'id': 1, 'sex': 0, 'name': 'bar', 'age': 12},
    ]
    
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .getOrCreate()
    
    # init df
    df = spark.createDataFrame(data)
    df.show()
    

    The output is as follows

    +---+---+----+---+
    |age| id|name|sex|
    +---+---+----+---+
    | 13|  1| foo|  1|
    | 12|  1| bar|  0|
    +---+---+----+---+
    

    Pass the columns to select in the order you want them to appear:

    # change columns position
    df = df.select(df.id, df.name, df.age, df.sex)
    df.show()
    

    The output is as follows

    +---+----+---+---+
    | id|name|age|sex|
    +---+----+---+---+
    |  1| foo| 13|  1|
    |  1| bar| 12|  0|
    +---+----+---+---+
    

    I hope this helps.

  • 2020-12-08 19:44

    A slightly different version compared to @Tzach Zohar's:

    val cols = df.columns.map(df(_)).reverse
    val reversedColDF = df.select(cols:_*)
    
  • 2020-12-08 19:50

    Like others have commented, I'm curious why you would do this, since the order is not relevant when you can query the columns by their names.

    Anyway, using a select should give the impression that the columns have moved in the schema description:

    // toDF on a Seq needs the SparkSession implicits (pre-imported in spark-shell)
    import spark.implicits._

    val data = Seq(
      ("a", "hello", 1),
      ("b", "spark", 2)
    ).toDF("field1", "field2", "field3")
    
    data
     .show()
    
    data
     .select("field3", "field2", "field1")
     .show()
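
    For a quick check, printSchema on the re-selected DataFrame should confirm that the schema now lists the fields in the new order:

    data
     .select("field3", "field2", "field1")
     .printSchema()   // root should list field3, then field2, then field1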
    
  • 2020-12-08 19:53

    You can get the column names, reorder them however you want, and then use select on the original DataFrame to get a new one with this new order:

    val columns: Array[String] = dataFrame.columns
    val reorderedColumnNames: Array[String] = ??? // do the reordering you want
    val result: DataFrame = dataFrame.select(reorderedColumnNames.head, reorderedColumnNames.tail: _*)
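
    For example (purely illustrative, assuming the DataFrame has a column named "field3"), the reordering step could move that column to the front while keeping the relative order of the rest:

    // hypothetical reordering: put "field3" first, keep the other columns as they were
    val reorderedColumnNames: Array[String] =
      Array("field3") ++ dataFrame.columns.filterNot(_ == "field3")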
    
  • 2020-12-08 20:04

    The spark-daria library has a reorderColumns method that makes it easy to reorder the columns in a DataFrame.

    import com.github.mrpowers.spark.daria.sql.DataFrameExt._
    
    val actualDF = sourceDF.reorderColumns(
      Seq("field1", "field3", "field2")
    )
    

    The reorderColumns method uses @Rockie Yang's solution under the hood.

    If you want to get the column ordering of df1 to equal the column ordering of df2, something like this should work better than hardcoding all the columns:

    df1.reorderColumns(df2.columns)
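
    Without spark-daria, a roughly equivalent plain-Spark sketch (assuming df1 contains every column of df2) is:

    import org.apache.spark.sql.functions.col

    // select df1's columns in the order they appear in df2
    val df1Reordered = df1.select(df2.columns.map(col): _*)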
    

    The spark-daria library also defines a sortColumns transformation to sort columns in ascending or descending order (if you don't want to specify all the columns in a sequence).

    import com.github.mrpowers.spark.daria.sql.transformations._
    
    df.transform(sortColumns("asc"))
    