Question
I have a Spark DataFrame like this:
id  start_time  feature
1   01-01-2018  3.567
1   01-02-2018  4.454
1   01-03-2018  6.455
2   01-02-2018  343.4
2   01-08-2018  45.4
3   02-04-2018  43.56
3   02-07-2018  34.56
3   03-07-2018  23.6
I want to split this into two DataFrames based on the id column. For each id, I should sort the rows by start_time and put the first 70% into one DataFrame and the remaining 30% into another, preserving the order. The result should look like:
Dataframe1:
id  start_time  feature
1   01-01-2018  3.567
1   01-02-2018  4.454
2   01-02-2018  343.4
3   02-04-2018  43.56
3   02-07-2018  34.56
Dataframe2:
id  start_time  feature
1   01-03-2018  6.455
2   01-08-2018  45.4
3   03-07-2018  23.6
I am using Spark 2.0 with Python. What is the best way to implement this?
Answer 1:
The way I had to do it was to create two windows: an ordered one (w1) to number each id's rows by start_time, and an unordered one (w2) to count the total rows per id, so each row's rank divided by the count gives its position within the group as a fraction:
from pyspark.sql import Window
from pyspark.sql import functions as F

# w1 numbers the rows within each id chronologically; w2 counts the rows per id
w1 = Window.partitionBy(df.id).orderBy(df.start_time)
w2 = Window.partitionBy(df.id)

df = (df.withColumn("row_number", F.row_number().over(w1))
        .withColumn("count", F.count("id").over(w2))
        .withColumn("percent", F.col("row_number") / F.col("count")))

# the first 70% of each id's rows (by start_time) go to train, the rest to test
train = df.filter(df.percent <= 0.70)
test = df.filter(df.percent > 0.70)
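
For reference, here is a minimal end-to-end sketch of this approach against the question's sample data. It assumes start_time is stored as a string in MM-dd-yyyy format, so it adds a helper ts column (an assumption, not part of the original answer) via unix_timestamp, which is available in Spark 2.0, to make the ordering chronological rather than lexicographic:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; start_time is an MM-dd-yyyy string
df = spark.createDataFrame(
    [(1, "01-01-2018", 3.567), (1, "01-02-2018", 4.454),
     (1, "01-03-2018", 6.455), (2, "01-02-2018", 343.4),
     (2, "01-08-2018", 45.4),  (3, "02-04-2018", 43.56),
     (3, "02-07-2018", 34.56), (3, "03-07-2018", 23.6)],
    ["id", "start_time", "feature"],
)

# Parse the date string so ordering is chronological, not lexicographic
df = df.withColumn("ts", F.unix_timestamp("start_time", "MM-dd-yyyy"))

w1 = Window.partitionBy("id").orderBy("ts")
w2 = Window.partitionBy("id")

df = (df.withColumn("row_number", F.row_number().over(w1))
        .withColumn("count", F.count("id").over(w2))
        .withColumn("percent", F.col("row_number") / F.col("count")))

cols = ["id", "start_time", "feature"]
train = df.filter(F.col("percent") <= 0.70).select(cols)  # 5 rows, first ~70% per id
test = df.filter(F.col("percent") > 0.70).select(cols)    # 3 rows, remaining ~30%

One caveat: the split is rounded per group, so an id with only two rows ends up split 1/1 (50/50) rather than exactly 70/30.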
Source: https://stackoverflow.com/questions/52958225/split-spark-dataframe-into-two-dataframes-70-and-30-based-on-id-column-by-p