Question
Apologies for the lengthy post for a seemingly simple curiosity, but I wanted to give full context...
In Databricks, I am creating a "row" of data based on a specific schema definition, and then inserting that row into an empty dataframe (also based on the same specific schema).
The schema definition looks like this:
myschema_xb = StructType(
    [
        StructField("_xmlns", StringType(), True),
        StructField("_Version", DoubleType(), True),
        StructField(
            "MyIds",
            ArrayType(
                StructType(
                    [
                        StructField("_ID", StringType(), True),
                        StructField("_ID_Context", StringType(), True),
                        StructField("_Type", LongType(), True),
                    ]
                ),
                True,
            ),
            True,
        ),
    ]
)
And the row entry is as follows:
myRow = Row(
    _xmlns="http://some.where.com",
    _Version=12.3,
    MyIds=[
        Row(_ID="XY", _ID_Context="Exxwhy", _Type=9),
        Row(_ID="9152", _ID_Context="LNUMB", _Type=21),
    ],
)
Lastly, the Databricks notebook code is:
mydf = spark.createDataFrame(sc.emptyRDD(), myschema_xb)
rows = [myRow]
rdf = spark.createDataFrame(rows, myschema_xb)
appended = mydf.union(rdf)
The call to rdf = spark.createDataFrame(rows, myschema_xb) raises this exception:

ValueError: Unexpected tuple 'h' with StructType
Now, the part I am curious about: if I change the element MyIds to myIds (i.e., lower-case the first letter), the code works, and my new dataframe (appended) contains the single row of data.

What does this exception mean, and why does it go away when I change the case of my element?
(FYI: our Databricks runtime environment uses Scala 2.11.)
Thanks.
Answer 1:
The issue stems from how Row objects sort their keys/fields. From the documentation:
Row can be used to create a row object by using named arguments, the fields will be sorted by names.
In myschema_xb, the three columns are defined in the order [_xmlns, _Version, MyIds]. When you define myRow with the keys (_xmlns, _Version, MyIds), the actual Row object generated will be:

Row(MyIds=[Row(_ID='XY', _ID_Context='Exxwhy', _Type=9), Row(_ID='9152', _ID_Context='LNUMB', _Type=21)], _Version=12.3, _xmlns='http://some.where.com')
This moves MyIds to the first column, which does not match the schema and thus raises the error. When you use the lowercase column name myIds instead, the keys of the Row object are sorted as ['_Version', '_xmlns', 'myIds'], which puts myIds in the right column but leaves _Version and _xmlns switched. That does not raise an error, since simple data types survive the implicit type casting, but the resulting dataframe is incorrect.
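The sort order itself explains why only the capitalized name breaks things: Python sorts the field names lexicographically by code point, and an uppercase 'M' (0x4D) sorts before '_' (0x5F), while a lowercase 'm' (0x6D) sorts after it. A minimal, Spark-free sketch of the two orderings:

```python
# Field names from the schema, with the capitalized and lowercased variants.
# sorted() uses plain code-point order, just as Row does when sorting fields.
capital = sorted(["_xmlns", "_Version", "MyIds"])
lower = sorted(["_xmlns", "_Version", "myIds"])

print(capital)  # ['MyIds', '_Version', '_xmlns'] -> MyIds lands in column 1
print(lower)    # ['_Version', '_xmlns', 'myIds'] -> myIds stays in column 3
```

With the capitalized name, the array of structs lands in the first column, where the schema expects a plain string, hence the StructType error; with the lowercase name, only the two simple columns are swapped, so no error is raised.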
To overcome this issue, set up a Row-like class with an explicit key order, so that the order of the fields matches exactly the order in your schema:
from pyspark.sql import Row

# Row-like classes with an explicit, schema-matching field order
MyOuterRow = Row('_xmlns', '_Version', 'MyIds')
MyInnerRow = Row('_ID', '_ID_Context', '_Type')

myRow = MyOuterRow(
    "http://some.where.com",
    12.3,
    [
        MyInnerRow("XY", "Exxwhy", 9),
        MyInnerRow("9152", "LNUMB", 21),
    ],
)
print(myRow)
#Row(_xmlns='http://some.where.com', _Version=12.3, MyIds=[Row(_ID='XY', _ID_Context='Exxwhy', _Type=9), Row(_ID='9152', _ID_Context='LNUMB', _Type=21)])

rdf = spark.createDataFrame([myRow], schema=myschema_xb)
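Another way to sidestep the sorting entirely (a hedged sketch, not from the original answer) is to pass plain Python tuples: tuples carry no field names, so createDataFrame matches values to the schema purely by position.

```python
# Plain tuples have no field names, so createDataFrame maps values to the
# schema strictly by position and Row's name-sorting never comes into play.
inner_rows = [("XY", "Exxwhy", 9), ("9152", "LNUMB", 21)]
data = [("http://some.where.com", 12.3, inner_rows)]

# In a Spark session (assuming myschema_xb from above is in scope):
# rdf = spark.createDataFrame(data, schema=myschema_xb)
```

The trade-off is readability: with tuples there are no keyword names to document which value goes where, so the schema is the only source of truth for column order.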
Source: https://stackoverflow.com/questions/59142230/creating-dataframe-specific-schema-structfield-starting-with-capital-letter