PySpark trying to apply previous field's schema to next field


Question


Having this weird issue with PySpark. It seems to be trying to apply the previous field's schema to the next field as it's processing.

Simplest test case I could come up with:

%pyspark
from pyspark.sql.types import (
    DateType,
    StructType,
    StructField,
    StringType,
)

from datetime import date
from pyspark.sql import Row


schema = StructType(
    [
        StructField("date", DateType(), True),
        StructField("country", StringType(), True),
    ]
)

test = spark.createDataFrame(
    [
        Row(
            date=date(2019, 1, 1),
            country="RU",
        ),
    ],
    schema
)

Stacktrace:

Fail to execute line 26:     schema
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8579306903394369208.py", line 380, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 26, in <module>
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 691, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 423, in _createFromLocal
    data = [schema.toInternal(row) for row in data]
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 601, in toInternal
    for f, v, c in zip(self.fields, obj, self._needConversion))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 601, in <genexpr>
    for f, v, c in zip(self.fields, obj, self._needConversion))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 439, in toInternal
    return self.dataType.toInternal(obj)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 175, in toInternal
    return d.toordinal() - self.EPOCH_ORDINAL
AttributeError: 'str' object has no attribute 'toordinal'

Bonus information from running it locally rather than in Zeppelin:

self = DateType, d = 'RU'

    def toInternal(self, d):
        if d is not None:
>           return d.toordinal() - self.EPOCH_ORDINAL
E           AttributeError: 'str' object has no attribute 'toordinal'

i.e., it's trying to apply DateType to country. If I get rid of date, it's fine. If I get rid of country, it's fine. Both together is a no-go.

Any ideas? Am I missing something obvious?


Answer 1:


If you're going to use a list of Rows, you don't need to specify the schema as well. This is because the Row already knows the schema.

The problem is happening because the pyspark.sql.Row object does not maintain the order that you specified for the fields.

print(Row(date=date(2019, 1, 1), country="RU"))
#Row(country='RU', date=datetime.date(2019, 1, 1))

From the docs:

Row can be used to create a row object by using named arguments, the fields will be sorted by names.

As you can see, the country field is being put first. When Spark tries to create the DataFrame with the specified schema, it expects the first item to be a DateType.
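
Concretely, the zip(self.fields, obj, ...) in the traceback pairs schema fields with Row values purely by position. A rough illustration of that pairing (a sketch, not Spark's actual code):

fields = ["date: DateType", "country: StringType"]  # schema order from the question
values = ("RU", date(2019, 1, 1))                   # Row's alphabetical order
print(list(zip(fields, values)))
#[('date: DateType', 'RU'), ('country: StringType', datetime.date(2019, 1, 1))]

So DateType.toInternal() receives 'RU', which is exactly the AttributeError above.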

One way to fix this is to put the fields in your schema in alphabetical order:

schema = StructType(
    [
        StructField("country", StringType(), True),
        StructField("date", DateType(), True)
    ]
)

test = spark.createDataFrame(
    [
        Row(date=date(2019, 1, 1), country="RU")
    ],
    schema
)
test.show()
#+-------+----------+
#|country|      date|
#+-------+----------+
#|     RU|2019-01-01|
#+-------+----------+
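
Another option (a sketch, not part of the original answer): skip Row entirely and pass plain tuples, whose positional order is preserved, so the schema from the question works as written (repeated here under the name original_schema):

original_schema = StructType(
    [
        StructField("date", DateType(), True),
        StructField("country", StringType(), True),
    ]
)

test = spark.createDataFrame(
    [(date(2019, 1, 1), "RU")],  # tuple values in the same order as the schema fields
    original_schema
)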

Or in this case, there's no need to even pass in the schema to createDataFrame. It will be inferred from the Rows:

test = spark.createDataFrame(
    [
        Row(date=date(2019, 1, 1), country="RU")
    ]
)
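
As a sanity check (assuming the Spark/Python versions in this question, where Row sorts its fields by name; newer versions preserve insertion order), the inferred schema should put country first:

test.printSchema()
#root
# |-- country: string (nullable = true)
# |-- date: date (nullable = true)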

And if you wanted to reorder the columns, use select:

test = test.select("date", "country")
test.show()
#+----------+-------+
#|      date|country|
#+----------+-------+
#|2019-01-01|     RU|
#+----------+-------+


Source: https://stackoverflow.com/questions/54484067/pyspark-trying-to-apply-previous-fields-schema-to-next-field
