Is it possible to change the position of a column in a DataFrame, that is, to change its schema?
Specifically, if I have a DataFrame like [f
For any AWS Glue DynamicFrame, first convert the dynamic frame to a DataFrame so you can use standard PySpark functions:
data_frame = dynamic_frame.toDF()
Then rearrange the columns into a new DataFrame using a select operation:
data_frame_temp = data_frame.select(["col_5","col_1","col_2","col_3","col_4"])
Here's what you can do in PySpark.
As with a MySQL query, you can re-select the columns in the desired order; the resulting DataFrame has its columns in the same order you passed to select.
from pyspark.sql import SparkSession
data = [
    {'id': 1, 'sex': 1, 'name': 'foo', 'age': 13},
    {'id': 1, 'sex': 0, 'name': 'bar', 'age': 12},
]
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()
# init df
df = spark.createDataFrame(data)
df.show()
The output is as follows:
+---+---+----+---+
|age| id|name|sex|
+---+---+----+---+
| 13| 1| foo| 1|
| 12| 1| bar| 0|
+---+---+----+---+
Pass the columns to select in the order you want them to appear:
# change columns position
df = df.select(df.id, df.name, df.age, df.sex)
df.show()
The output is as follows:
+---+----+---+---+
| id|name|age|sex|
+---+----+---+---+
| 1| foo| 13| 1|
| 1| bar| 12| 0|
+---+----+---+---+
I hope this helps.
A slightly different version compared to @Tzach Zohar's:
val cols = df.columns.map(df(_)).reverse
val reversedColDF = df.select(cols:_*)
Like others have commented, I'm curious why you would do this, since the order is not relevant when you can query columns by their names.
Anyway, using a select should give the impression that the columns have moved in the schema description:
import spark.implicits._  // needed for toDF on a Seq

val data = Seq(
  ("a", "hello", 1),
  ("b", "spark", 2)
)
  .toDF("field1", "field2", "field3")
data
  .show()

data
  .select("field3", "field2", "field1")
  .show()
You can get the column names, reorder them however you want, and then use select on the original DataFrame to get a new one with this new order:
val columns: Array[String] = dataFrame.columns
val reorderedColumnNames: Array[String] = ??? // do the reordering you want
val result: DataFrame = dataFrame.select(reorderedColumnNames.head, reorderedColumnNames.tail: _*)
The spark-daria library has a reorderColumns method that makes it easy to reorder the columns in a DataFrame.
import com.github.mrpowers.spark.daria.sql.DataFrameExt._
val actualDF = sourceDF.reorderColumns(
Seq("field1", "field3", "field2")
)
The reorderColumns method uses @Rockie Yang's solution under the hood.
If you want the column ordering of df1 to equal the column ordering of df2, something like this should work better than hardcoding all the columns:
df1.reorderColumns(df2.columns)
The spark-daria library also defines a sortColumns transformation to sort columns in ascending or descending order (if you don't want to specify all the columns in a sequence).
import com.github.mrpowers.spark.daria.sql.transformations._
df.transform(sortColumns("asc"))