Question
Apologies for the lengthy post for a seemingly simple curiosity, but I wanted to give full context...
In Databricks, I am creating a "row" of data based on a specific schema definition, and then inserting that row into an empty dataframe (also based on the same specific schema).
The schema definition looks like this:
myschema_xb = StructType(
    [
        StructField("_xmlns", StringType(), True),
        StructField("_Version", DoubleType(), True),
        StructField(
            "MyIds",
            ArrayType(
                StructType(
                    [
                        StructField("_ID", StringType(), True),
                        StructField("_ID_Context", StringType(), True),
                        StructField("_Type", LongType(), True),
                    ]
                ),
                True,
            ),
            True,
        ),
    ]
)
And the row entry is as follows:
myRow = Row(
    _xmlns="http://some.where.com",
    _Version=12.3,
    MyIds=[
        Row(_ID="XY", _ID_Context="Exxwhy", _Type=9),
        Row(_ID="9152", _ID_Context="LNUMB", _Type=21),
    ],
)
Lastly, the Databricks notebook code is:
mydf = spark.createDataFrame(sc.emptyRDD(), myschema_xb)
rows = [myRow]
rdf = spark.createDataFrame(rows, myschema_xb)
appended = mydf.union(rdf)
The call to rdf = spark.createDataFrame(rows, myschema_xb) raises this exception:

ValueError: Unexpected tuple 'h' with StructType
Now, the part I am curious about: if I change the element MyIds to myIds (i.e., lower-case the first letter), the code works, and my new dataframe (appended) contains the single row of data.

What does this exception mean, and why does it go away when I change the case of my element?
(FYI: our Databricks runtime environment uses Scala 2.11.)
Thanks.
Answer 1:
The issue stems from how Row objects sort their keys/fields. From the documentation:
Row can be used to create a row object by using named arguments, the fields will be sorted by names.
In myschema_xb, the three columns are defined in the order [_xmlns, _Version, MyIds]. When you define myRow with the keys (_xmlns, _Version, MyIds), the actual Row object generated will be:

Row(MyIds=[Row(_ID='XY', _ID_Context='Exxwhy', _Type=9), Row(_ID='9152', _ID_Context='LNUMB', _Type=21)], _Version=12.3, _xmlns='http://some.where.com')
This moves MyIds to the first column, which does not match the schema and thus raises the error. When you use the lowercase column name myIds instead, the keys of the Row object are sorted as ['_Version', '_xmlns', 'myIds'], which puts myIds in the right column but leaves _Version and _xmlns switched. That does not raise an error, since simple data types survive the implicit type casting, but the resulting dataframe is incorrect.
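The sort order itself explains why only the capitalized name breaks things: Python sorts the field names lexicographically by code point, and an uppercase 'M' (0x4D) sorts before '_' (0x5F), while a lowercase 'm' (0x6D) sorts after it. A minimal, Spark-free sketch of the two orderings:

```python
# Field names from the schema, with the capitalized and lowercased variants.
# sorted() uses plain code-point order, just as Row does when sorting fields.
capital = sorted(["_xmlns", "_Version", "MyIds"])
lower = sorted(["_xmlns", "_Version", "myIds"])

print(capital)  # ['MyIds', '_Version', '_xmlns'] -> MyIds lands in column 1
print(lower)    # ['_Version', '_xmlns', 'myIds'] -> myIds stays in column 3
```

With the capitalized name, the array of structs lands in the first column, where the schema expects a plain string, hence the StructType error; with the lowercase name, only the two simple columns are swapped, so no error is raised.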
To overcome this issue, set up a Row-like class with an explicit key order, so that the order of the fields matches exactly the order in your schema:
from pyspark.sql import Row

# Row-like classes with an explicit, schema-matching field order
MyOuterRow = Row('_xmlns', '_Version', 'MyIds')
MyInnerRow = Row('_ID', '_ID_Context', '_Type')

myRow = MyOuterRow(
    "http://some.where.com",
    12.3,
    [
        MyInnerRow("XY", "Exxwhy", 9),
        MyInnerRow("9152", "LNUMB", 21),
    ],
)
print(myRow)
#Row(_xmlns='http://some.where.com', _Version=12.3, MyIds=[Row(_ID='XY', _ID_Context='Exxwhy', _Type=9), Row(_ID='9152', _ID_Context='LNUMB', _Type=21)])

rdf = spark.createDataFrame([myRow], schema=myschema_xb)
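Another way to sidestep the sorting entirely (a hedged sketch, not from the original answer) is to pass plain Python tuples: tuples carry no field names, so createDataFrame matches values to the schema purely by position.

```python
# Plain tuples have no field names, so createDataFrame maps values to the
# schema strictly by position and Row's name-sorting never comes into play.
inner_rows = [("XY", "Exxwhy", 9), ("9152", "LNUMB", 21)]
data = [("http://some.where.com", 12.3, inner_rows)]

# In a Spark session (assuming myschema_xb from above is in scope):
# rdf = spark.createDataFrame(data, schema=myschema_xb)
```

The trade-off is readability: with tuples there are no keyword names to document which value goes where, so the schema is the only source of truth for column order.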
Source: https://stackoverflow.com/questions/59142230/creating-dataframe-specific-schema-structfield-starting-with-capital-letter