Question
Suppose I have a data set like:
Name | Subject | Y1   | Y2
A    | math    | 1998 | 2000
B    |         | 1996 | 1999
     | science | 2004 | 2005
I want to split the rows of this data set so that the Y2 column is eliminated, like this:
Name | Subject | Y1
A    | math    | 1998
A    | math    | 1999
A    | math    | 2000
B    |         | 1996
B    |         | 1997
B    |         | 1998
B    |         | 1999
     | science | 2004
     | science | 2005
Can someone suggest something here? I hope I have made my query clear. Thanks in advance.
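For reference, here is a minimal sketch that builds the sample data above as a DataFrame for testing the answers below (the SparkSession setup and the variable name df are assumptions, not part of the original question):

from pyspark.sql import SparkSession

# Assumed setup: a local SparkSession and a DataFrame named `df`
# holding the sample rows from the question.
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [("A", "math", 1998, 2000),
     ("B", "", 1996, 1999),
     ("", "science", 2004, 2005)],
    ["Name", "Subject", "Y1", "Y2"])
df.show()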
Answer 1:
I think you only need to create a UDF to generate the range. Then you can use explode to create the necessary rows:
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._ // for the $"colName" column syntax
// UDF that builds the inclusive list of years from Y1 to Y2
val createRange = udf { (yearFrom: Int, yearTo: Int) =>
  (yearFrom to yearTo).toList
}
df.select($"Name", $"Subject", explode(createRange($"Y1", $"Y2")).as("Y1")).show()
EDIT: The Python version of this code would be something like:
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, IntegerType
# The UDF must return an array (not a plain integer), and the range has to be inclusive
createRange = udf(lambda yearFrom, yearTo: list(range(yearFrom, yearTo + 1)),
                  ArrayType(IntegerType()))
df.select("Name", "Subject", explode(createRange("Y1", "Y2")).alias("Y1")).show()
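As a side note, on Spark 2.4+ the built-in sequence function can replace the hand-written UDF entirely; a minimal sketch, assuming the same df as above:

from pyspark.sql.functions import explode, expr

# sequence(Y1, Y2) builds the inclusive array [Y1, Y1+1, ..., Y2] natively,
# so no Python UDF (and none of its serialization overhead) is needed.
df.select("Name", "Subject", explode(expr("sequence(Y1, Y2)")).alias("Y1")).show()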
Answer 2:
I have tested this code in PySpark and it works as expected:
data = sc.parallelize([["A", "math", 1998, 2000], ["B", "", 1996, 1999], ["", "science", 2004, 2005]])
# Key on (Name, Subject), build the inclusive year range, then flatten it per key
data.map(lambda reg: ((reg[0], reg[1]), range(reg[2], reg[3] + 1))) \
    .flatMapValues(lambda reg: reg).collect()
In more detail, you need to convert the input data to a pair RDD in the form (key, value), where the key is composed of the first two fields, since the result will be flattened while keeping the key intact with flatMapValues. The values are constructed as an inclusive range from Y1 to Y2. All of this is done in the first map. flatMapValues then returns each of the values in the range associated with its key.
The output looks like this:
[(('A', 'math'), 1998),
(('A', 'math'), 1999),
(('A', 'math'), 2000),
(('B', ''), 1996),
(('B', ''), 1997),
(('B', ''), 1998),
(('B', ''), 1999),
(('', 'science'), 2004),
(('', 'science'), 2005)]
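If a flat (Name, Subject, Y1) DataFrame is the desired end result, the pairs can be unpacked and converted back; a sketch, assuming an active SparkSession so that toDF is available on the RDD:

# Flatten ((name, subject), year) pairs into (name, subject, year) rows
result = (data.map(lambda reg: ((reg[0], reg[1]), range(reg[2], reg[3] + 1)))
              .flatMapValues(lambda reg: reg)
              .map(lambda kv: (kv[0][0], kv[0][1], kv[1]))
              .toDF(["Name", "Subject", "Y1"]))
result.show()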
Answer 3:
Here is one way in which you can implement this:
import spark.implicits._ // needed for .toDF on the resulting RDD
val resultantDF = df.rdd.flatMap { row =>
  val rangeInitial = row.getInt(2)
  val rangeEnd = row.getInt(3)
  val array = rangeInitial to rangeEnd
  // Repeat Name and Subject once per year and zip the three sequences together
  (List.fill(array.size)(row.getString(0)), List.fill(array.size)(row.getString(1)), array).zipped.toList
}.toDF("Name", "Subject", "Y1")
resultantDF.show()
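A rough PySpark equivalent of the same row-level flatMap, assuming df is the sample DataFrame from the question:

# Emit one (Name, Subject, year) tuple per year in the inclusive range
resultantDF = df.rdd.flatMap(
    lambda row: [(row[0], row[1], y) for y in range(row[2], row[3] + 1)]
).toDF(["Name", "Subject", "Y1"])
resultantDF.show()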
Answer 4:
You can easily use Spark's select to get what you want in a DataFrame, or even in an RDD.
Dataset<Row> sqlDF = spark.sql("SELECT Name,Subject,Y1 FROM tableName");
If you are starting from an already existing DataFrame, say usersDF, you can use something like this:
resultDF = usersDF.select("Name","Subject","Y1");
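Note that the spark.sql variant above requires the DataFrame to be registered as a view first; a minimal sketch in PySpark, assuming df is the sample DataFrame (tableName is just an illustrative view name):

# Register the DataFrame so it can be queried by name with spark.sql
df.createOrReplaceTempView("tableName")
spark.sql("SELECT Name, Subject, Y1 FROM tableName").show()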
Source: https://stackoverflow.com/questions/40586307/how-to-split-rows-to-different-columns-in-spark-dataframe-dataset