I have recently started getting a bunch of errors on a number of pyspark
jobs running on EMR clusters. The errors are
java.lang.IllegalArgumentException
It's not a bug. We made an important protocol change in 0.15.0 that makes the default behavior of pyarrow incompatible with older versions of Arrow in Java -- your Spark environment seems to be using an older version.
Your options are to set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 in the environment where you are using Python, or to downgrade pyarrow to a version earlier than 0.15.0. Hopefully the Spark community will be able to upgrade to 0.15.0 in Java soon so this issue goes away.
This is discussed in http://arrow.apache.org/blog/2019/10/06/0.15.0-release/
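For reference, here is a minimal sketch of one way to propagate that environment variable from a PySpark job on a YARN cluster such as EMR. The spark.executorEnv.* and spark.yarn.appMasterEnv.* keys are standard Spark configuration properties, but the right place to set the variable depends on your deploy mode; exporting it in conf/spark-env.sh (or via an EMR configuration classification) is an alternative.

    # Sketch: keep pyarrow >= 0.15.0 emitting the legacy Arrow IPC format
    # so it stays compatible with the older Arrow Java version used by Spark.
    import os
    from pyspark.sql import SparkSession

    # Set it for the local Python process (the driver in client mode).
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

    spark = (
        SparkSession.builder
        .appName("arrow-compat-example")
        # Propagate the variable to the executor Python workers.
        .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
        # Propagate it to the application master / driver in YARN cluster mode.
        .config("spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
        .getOrCreate()
    )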