Question
I want to fill a field of one struct from another, existing struct: field A11 of my data1 column should get the value of x1.f2.
I tried several approaches without success. Does anyone have an idea?
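from pyspark.sql import Row
from pyspark.sql.functions import col, lit, struct
from pyspark.sql.types import (BooleanType, IntegerType, StringType,
                               StructField, StructType)

# Creation of df (target schema)
schema = StructType([
    StructField('data1', StructType([
        StructField('A1', StructType([
            StructField('A11', StringType(), True),
            StructField('A12', IntegerType(), True)
        ])),
        StructField('A2', IntegerType(), True)
    ]))
])
df = sqlCtx.createDataFrame([], schema)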
# Creation of df1
schema1 = StructType([
    StructField('x1', StructType([
        StructField('f1', IntegerType(), True),
        StructField('f2', IntegerType(), True),
        StructField('x12', StructType([
            StructField('f5', StringType(), True)
        ])),
    ])),
    StructField('x2', StructType([
        StructField('f3', StringType(), True),
        StructField('f4', BooleanType(), True)
    ]))
])
df1 = sqlCtx.createDataFrame([Row(Row(10, 3, Row('xv')), Row("tmp", True))], schema1)
df1.printSchema()
df = df1.withColumn("data1", struct(struct((col("A1")("x1")("f2")).as("A11"), lit(None).cast(IntegerType()).as("A12")), lit(None).cast(IntegerType()).as("A2"))).select ("data1")
Answer 1:
You're very close! To access nested fields, use a dotted path such as col("x1.f1"), not col("A1")("x1")("f2").
Try this:
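from pyspark.sql.functions import col, lit, struct
from pyspark.sql.types import IntegerType

df = df1.withColumn(
    "data1",
    struct(
        # inner struct A1: A11 comes from the nested source field, A12 is a typed null
        struct(
            col("x1.f1").alias("A11"),
            lit(None).cast(IntegerType()).alias("A12")
        ).alias("A1"),
        lit(None).cast(IntegerType()).alias("A2")
    )
)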
EDIT:
As per the comments, if you have a list of case/when rules for each field of the structs, you can do something like this.
Put your transformations in a list:
# each tuple: (source field, value to match, replacement value, target field)
transformation_list = [
    ("x1.f1", "x", "a", "data1.A1.A11"),
    ("x1.f1", "xa", "xa", "data1.A1.A11"),
    # ....
]
Then, using that list, group the entries by source field and target field:
from itertools import groupby

# group by (source field, target field); groupby only merges adjacent items, so sort first
key_fn = lambda x: (x[0], x[3])
grouped_transf = groupby(sorted(transformation_list, key=key_fn), key_fn)
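One caveat about this step: itertools.groupby only merges adjacent items that share a key, so a list whose rules for the same target are not contiguous would be split into several groups. A minimal sketch of the difference, using made-up tuples (the sample values and the group_key helper are hypothetical, not taken from the question):

```python
from itertools import groupby

# Hypothetical sample data: (source field, match value, replacement, target field)
transformation_list = [
    ("x1.f1", "x", "a", "data1.A1.A11"),
    ("x1.f2", "y", "b", "data1.A1.A12"),
    ("x1.f1", "xa", "xa", "data1.A1.A11"),
]

def group_key(t):
    # Group on (source field, target field)
    return (t[0], t[3])

# Without sorting, the two A11 rules are not adjacent and end up in separate groups
unsorted_groups = [(k, list(g)) for k, g in groupby(transformation_list, group_key)]

# Sorting by the same key first gives one group per (source, target) pair
sorted_groups = [
    (k, list(g))
    for k, g in groupby(sorted(transformation_list, key=group_key), group_key)
]

print(len(unsorted_groups))  # 3
print(len(sorted_groups))    # 2
```

If the list is already ordered by target field, the sort is harmless; Python's sort is stable, so the order of the rules within each group is preserved.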
Now you can loop through the grouped list and construct the case/when expressions:
from pyspark.sql.functions import expr

cols = {}
for key, transform in grouped_transf:
    field_name = key[1].split(".")[2]  # A11, A12, ...
    case_expr = f"CASE {key[0]} "
    for t in transform:
        case_expr += f"WHEN '{t[1]}' THEN '{t[2]}' "
    # keep the dotted path unquoted: inside backticks Spark reads x1.f1 as one column name
    case_expr += f"ELSE {key[0]} END"
    cols[field_name] = expr(case_expr).alias(field_name)
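To inspect what the loop produces without a Spark session, the string-building part can be isolated. The sketch below is one possible way to do that; build_case_sql is a hypothetical helper name, the sample tuples are illustrative, and values are interpolated without escaping, so it assumes trusted input.

```python
from itertools import groupby

def build_case_sql(transformations):
    """Return {leaf target name: CASE expression string} for the given rules.

    Each rule is (source field, match value, replacement value, target field).
    """
    def key(t):
        return (t[0], t[3])

    result = {}
    # Sort first so that groupby sees all rules of one (source, target) pair together
    for (source, target), rules in groupby(sorted(transformations, key=key), key):
        leaf = target.split(".")[-1]  # e.g. "data1.A1.A11" -> "A11"
        whens = " ".join(f"WHEN '{m}' THEN '{r}'" for _, m, r, _ in rules)
        # Fall back to the untouched source value when no rule matches
        result[leaf] = f"CASE {source} {whens} ELSE {source} END"
    return result

sample = [
    ("x1.f1", "x", "a", "data1.A1.A11"),
    ("x1.f1", "xa", "xa", "data1.A1.A11"),
]
case_sql = build_case_sql(sample)
print(case_sql["A11"])
# CASE x1.f1 WHEN 'x' THEN 'a' WHEN 'xa' THEN 'xa' ELSE x1.f1 END
```

Each resulting string could then be wrapped as expr(sql).alias(leaf) to obtain the column objects used in the next step.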
Finally, use the dict of columns to add the new column to the dataframe:
df = df1.withColumn(
    "data1",
    struct(
        struct(cols["A11"], cols["A12"]).alias("A1"),
        struct(cols["A21"], cols["A22"]).alias("A2")
    )
)
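If the target layout is only known from the dotted target paths, the nesting can also be derived rather than written out by hand. The following is a rough sketch of that idea, not part of the original answer: nest_targets and to_named_struct are hypothetical helpers, the leaf expressions are illustrative, and it assumes no path is both a leaf and a parent. It renders a Spark SQL named_struct(...) expression string.

```python
def nest_targets(leaf_exprs):
    """Turn {'data1.A1.A11': sql, ...} into a nested dict keyed by path segment."""
    tree = {}
    for path, sql in leaf_exprs.items():
        node = tree
        *parents, leaf = path.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = sql
    return tree

def to_named_struct(node):
    """Render a nested dict as a Spark SQL named_struct(...) expression string."""
    parts = []
    for name, value in node.items():
        rendered = to_named_struct(value) if isinstance(value, dict) else value
        parts.append(f"'{name}', {rendered}")
    return f"named_struct({', '.join(parts)})"

# Illustrative leaf expressions; real ones could be the CASE strings built earlier
leaf_exprs = {
    "data1.A1.A11": "x1.f1",
    "data1.A1.A12": "CAST(NULL AS INT)",
    "data1.A2": "CAST(NULL AS INT)",
}
tree = nest_targets(leaf_exprs)
data1_sql = to_named_struct(tree["data1"])
print(data1_sql)
# named_struct('A1', named_struct('A11', x1.f1, 'A12', CAST(NULL AS INT)), 'A2', CAST(NULL AS INT))
```

The resulting string would be passed to Spark as df1.withColumn("data1", expr(data1_sql)); field order follows the insertion order of leaf_exprs.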
You can print the intermediate steps to follow the logic.
Source: https://stackoverflow.com/questions/59545687/change-values-of-structure-dataframe