Pandas scalar UDF failing, IllegalArgumentException

落爺英雄遲暮 submitted on 2020-06-16 07:58:16

Question


First off, I apologize if my issue is simple. I did spend a lot of time researching it.

I am trying to set up a scalar Pandas UDF in a PySpark script as described here.

Here is my code:

from pyspark import SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import SQLContext
# Running in an EMR Notebook, where `sc` (SparkContext) and
# `spark` (SparkSession) are predefined
sc.install_pypi_package("pandas")
import pandas as pd
sc.install_pypi_package("pyarrow")

df = spark.createDataFrame(
    [("a", 1, 0), ("a", -1, 42), ("b", 3, -1), ("b", 10, -2)],
    ("key", "value1", "value2")
)

df.show()

@F.pandas_udf("double", F.PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return pd.Series(v + 1)

df.select(pandas_plus_one(df.value1)).show()
# Also fails
#df.select(pandas_plus_one(df["value1"])).show()
#df.select(pandas_plus_one("value1")).show()
#df.select(pandas_plus_one(F.col("value1"))).show()

The script fails at the last statement:

An error occurred while calling o209.showString. :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 8.0 failed 4 times, most recent failure: Lost task 2.3 in stage 8.0 (TID 30, ip-10-160-2-53.ec2.internal, executor 3):
java.lang.IllegalArgumentException
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
    at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
    at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
    at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
    at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
    at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
    at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
    ...

What am I missing here? I am just following the manual. Thanks for your help.


Answer 1:


PyArrow released version 0.15.0 on October 5, 2019. That release changed the Arrow IPC stream format, which causes pandas UDFs to fail on Spark versions built against older Arrow Java libraries. Spark needs to be upgraded to be compatible with it (which might take some time). You can follow the progress here: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-29367?filter=allissues
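To make the version boundary concrete, here is a small helper (my own illustration, not part of the original answer) that checks whether a given PyArrow version string falls on the incompatible side of the 0.15.0 change:

```python
def uses_new_ipc_format(pyarrow_version: str) -> bool:
    """Return True if this PyArrow version (0.15.0 or later) writes the
    new Arrow IPC stream format that affected Spark releases cannot read."""
    major, minor = (int(part) for part in pyarrow_version.split(".")[:2])
    return (major, minor) >= (0, 15)

print(uses_new_ipc_format("0.15.1"))  # True  -> triggers the error above
print(uses_new_ipc_format("0.14.1"))  # False -> works with the affected Spark
```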

Solution:

1) Install PyArrow 0.14.1 or lower: `sc.install_pypi_package("pyarrow==0.14.1")`, or
2) Set the environment variable `ARROW_PRE_0_15_IPC_FORMAT=1` wherever your Python workers run.
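A minimal sketch of option 2, assuming the flag is set before Spark starts. On a cluster the executors need it too; passing it via `spark.executorEnv.*` is a standard Spark configuration mechanism, though the answer does not spell this out:

```python
import os

# Tell PyArrow >= 0.15 to write the legacy (pre-0.15) Arrow IPC
# stream format that the affected Spark releases expect.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

# On a cluster, forward the same flag to the executors when building
# the session (builder shown for illustration only):
# spark = (SparkSession.builder
#          .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
#          .getOrCreate())

print(os.environ["ARROW_PRE_0_15_IPC_FORMAT"])
```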



Source: https://stackoverflow.com/questions/58458415/pandas-scalar-udf-failing-illegalargumentexception
