PySpark flatmap should return tuples with typed values

蹲街弑〆低调 submitted on 2019-12-10 12:18:34

Question


I'm using Jupyter Notebook with PySpark. I have a dataframe whose schema defines column names and types (integer, ...) for those columns. When I use methods like flatMap, the result is a list of tuples whose values no longer have a fixed type. Is there a way to keep the types?

df.printSchema()
root
 |-- name: string (nullable = true)
 |-- ...
 |-- ...
 |-- ratings: integer (nullable = true)

Then I use flatMap to do some calculations with the rating values (obfuscated here):

y_rate = df.flatMap(lambda row: (row.id, 5 if (row.ratings > 5) else row.ratings))
y_rate.toDF().printSchema()

And now I get an error:

TypeError: Can not infer schema for type:

Is there any way to use map/flatMap/reduce while keeping the schema, or at least to return tuples whose values have a specific type?


Answer 1:


First of all, you're using the wrong function. flatMap maps and then flattens, so assuming your data looks like this:

df = sc.parallelize([("foo", 0), ("bar", 10)]).toDF(["id", "ratings"])

the output of flatMap will be equivalent to:

sc.parallelize(['foo', 0, 'bar', 5])
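The difference between the two operations can be sketched without Spark at all. The snippet below is only an illustration in plain Python: the helper name cap and the use of itertools.chain are assumptions chosen to mimic what map and flatMap do to each element, not part of the PySpark API.

```python
from itertools import chain

# Same sample data as in the answer: (id, ratings) pairs.
rows = [("foo", 0), ("bar", 10)]

def cap(row):
    """Hypothetical helper: cap the rating at 5 and return an (id, rating) tuple."""
    ident, rating = row
    return (ident, 5 if rating > 5 else rating)

# map keeps one output element per input element, so the tuples survive.
mapped = [cap(r) for r in rows]

# flatMap flattens each returned tuple into the output, so the pairs are lost.
flat_mapped = list(chain.from_iterable(cap(r) for r in rows))

print(mapped)       # [('foo', 0), ('bar', 5)]
print(flat_mapped)  # ['foo', 0, 'bar', 5]
```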

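Hence the error you see: the elements are bare strings and integers rather than row-like tuples, so toDF cannot infer a schema from them. If you really want to make it work you should use map: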

df.rdd.map(lambda row: (row.id, 5 if (row.ratings > 5) else row.ratings)).toDF()
## DataFrame[_1: string, _2: bigint]

Next, mapping over a DataFrame directly is no longer supported in Spark 2.0. You should extract the RDD first (see df.rdd.map above).

Finally, passing data between Python and the JVM is extremely inefficient. It requires serialization and deserialization on both sides as well as schema inference (if a schema is not explicitly provided), and the inference step also breaks laziness. It is better to use SQL expressions for things like this:

from pyspark.sql.functions import when

df.select(df.id, when(df.ratings > 5, 5).otherwise(df.ratings))

If for some reason you need plain Python code, a UDF could be a better choice.



Source: https://stackoverflow.com/questions/37225223/pyspark-flatmap-should-return-tuples-with-typed-values
