Change values of a struct in a DataFrame

Question

I want to fill a struct column from another, existing struct: field A11 of data1 should get the value of x1.f2. I have tried several approaches without success. Does anyone have an idea?

from pyspark.sql import Row
from pyspark.sql.functions import col, lit, struct
from pyspark.sql.types import (BooleanType, IntegerType, StringType,
                               StructField, StructType)

# Target schema: data1.A1.{A11, A12} and data1.A2
schema = StructType([
    StructField('data1', StructType([
        StructField('A1', StructType([
            StructField('A11', StringType(), True),
            StructField('A12', IntegerType(), True)
        ])),
        StructField('A2', IntegerType(), True)
    ]))
])
df = sqlCtx.createDataFrame([], schema)

# Creation of df1
schema1 = StructType([
    StructField('x1', StructType([
        StructField('f1', IntegerType(), True),
        StructField('f2', IntegerType(), True),
        StructField('x12', StructType([
            StructField('f5', StringType(), True)
        ]))
    ])),
    StructField('x2', StructType([
        StructField('f3', StringType(), True),
        StructField('f4', BooleanType(), True)
    ]))
])
df1 = sqlCtx.createDataFrame([Row(Row(10, 3, Row('xv')), Row("tmp", True))], schema1)
df1.printSchema()

# My attempt, which does not work
df = df1.withColumn("data1", struct(struct((col("A1")("x1")("f2")).as("A11"), lit(None).cast(IntegerType()).as("A12")), lit(None).cast(IntegerType()).as("A2"))).select ("data1")


Answer 1:

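You're very close! To access nested fields, pass a dotted path such as col("x1.f2") rather than chaining calls like col("A1")("x1")("f2"). Also note that in PySpark the column method is alias(), not as(); as is a reserved word in Python, so .as(...) is a syntax error.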

Try this:

df = df1.withColumn(
    "data1",
    struct(
        struct(
            col("x1.f2").alias("A11"),  # the question asks for x1.f2
            lit(None).cast(IntegerType()).alias("A12")
        ).alias("A1"),
        lit(None).cast(IntegerType()).alias("A2")
    )
)

EDIT:

As per the comments, if you have a list of case/when for each field of the structs you can do something like this:

Put your transformations in a list, one tuple per rule in the form (source field, value to match, replacement value, target field):

transformation_list = [("x1.f1", "x", "a", "data1.A1.A11"),
                       ("x1.f1", "xa", "xa", "data1.A1.A11"),
                       ....
                       ]

Then, using that list, group the rules by source field and target field:

from itertools import groupby

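# group by source field and target field; groupby only merges adjacent
# items, so sort by the same key first
key_fn = lambda x: (x[0], x[3])
grouped_transf = groupby(sorted(transformation_list, key=key_fn), key_fn)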
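
One thing worth checking at this step: itertools.groupby only merges neighbouring items that share a key, so rules for the same field that are scattered through the list end up in separate groups. The sketch below uses a small made-up rule list (the `rules` values and the `rule_key` helper are illustrative, not from the original post) to show the difference that sorting makes:

```python
from itertools import groupby

# Hypothetical rules: (source field, match value, replacement, target field)
rules = [
    ("x1.f1", "x", "a", "data1.A1.A11"),
    ("x2.f3", "tmp", "t", "data1.A1.A12"),
    ("x1.f1", "xa", "xa", "data1.A1.A11"),
]


def rule_key(rule):
    # Same grouping key as above: (source field, target field)
    return (rule[0], rule[3])


# Without sorting, the two x1.f1 rules are not adjacent and form two groups.
unsorted_groups = [(k, list(g)) for k, g in groupby(rules, rule_key)]

# Sorting by the same key first makes equal keys adjacent.
sorted_groups = [(k, list(g)) for k, g in groupby(sorted(rules, key=rule_key), rule_key)]

print(len(unsorted_groups))  # 3
print(len(sorted_groups))    # 2
```

Because sorted is stable, rules that share a key keep their original relative order, which matters if the order of the WHEN branches is significant.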

Now you can loop through the grouped list and construct the case/when expressions:

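from pyspark.sql.functions import expr

cols = {}
for key, transform in grouped_transf:
    field_name = key[1].split(".")[2]  # A11, A12, ...
    case_expr = f"CASE {key[0]} "
    for t in transform:
        case_expr += f"WHEN '{t[1]}' THEN '{t[2]}' "

    # fall back to the source field; no backticks, since `x1.f1` in backticks
    # would be read as one column name rather than a nested field
    case_expr += f"ELSE {key[0]} END"

    cols[field_name] = expr(case_expr).alias(field_name)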
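
To inspect the generated SQL without a Spark session, the string-building part of the loop can be pulled into a plain function. This is only a sketch: `build_case_expr` is a hypothetical helper name, and it assumes the match and replacement values contain no single quotes, since they are interpolated without escaping.

```python
def build_case_expr(source, rules):
    """Return a Spark SQL CASE string for one source field.

    `rules` is a list of (match_value, replacement) pairs; values that
    match no rule fall through to the source field itself.
    """
    whens = " ".join(f"WHEN '{match}' THEN '{repl}'" for match, repl in rules)
    return f"CASE {source} {whens} ELSE {source} END"


case_sql = build_case_expr("x1.f1", [("x", "a"), ("xa", "xa")])
print(case_sql)
```

The resulting string could then be passed to expr(...).alias(field_name) in the same way as case_expr in the loop.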

Finally, use the dict of columns to add the new column to the DataFrame:

df = df1.withColumn("data1",
                    struct(struct(cols["A11"], cols["A12"]).alias("A1"),
                           struct(cols["A21"], cols["A22"]).alias("A2")
                           )
                    )
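
The nesting in that last call is written out by hand. If the target paths always have the three-part form column.struct.field, the layout could also be derived from them. A rough sketch, using a hypothetical `targets` list rather than anything from the original post:

```python
from collections import defaultdict

# Hypothetical target paths in the "column.struct.field" form used above
targets = ["data1.A1.A11", "data1.A1.A12", "data1.A2.A21", "data1.A2.A22"]

# Map each inner struct name to the list of its field names, in input order
layout = defaultdict(list)
for path in targets:
    _, parent, field = path.split(".")
    layout[parent].append(field)

print(dict(layout))  # {'A1': ['A11', 'A12'], 'A2': ['A21', 'A22']}
```

With a cols dict like the one built earlier, each entry of layout would correspond to one inner struct(...).alias(parent) call; that part is left out here because it needs a running Spark session.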

You can print the intermediate values (for example, each case_expr) to follow the logic.

Source: https://stackoverflow.com/questions/59545687/change-values-of-structure-dataframe

Tags

python

apache-spark

pyspark

apache-spark-sql

pyspark-sql