Spark now offers predefined functions that can be used in dataframes, and it seems they are highly optimized. My original question was going to be on which is faster, but I
After years, when I have a more spark knowledge and had second look on the question, just realized what @alfredox really want to ask. So I revised again, and divide the answer into two parts:
To answer Why native DF function (native Spark-SQL function) is faster:
Basically, why native Spark function is ALWAYS faster than Spark UDF, regardless your UDF is implemented in Python or Scala.
Firstly, we need to understand what Tungsten, which is firstly introduced in Spark 1.4.
It is a backend and what it focus on:
- Off-Heap Memory Management using binary in-memory data representation aka Tungsten row format and managing memory explicitly,
- Cache Locality which is about cache-aware computations with cache-aware layout for high cache hit rates,
- Whole-Stage Code Generation (aka CodeGen).
One of the biggest Spark performance killer is GC. The GC would pause the every threads in JVM until the GC finished. This is exactly why Off-Heap Memory Management being introduced.
When executing Spark-SQL native functions, the data will stays in tungsten backend. However, in Spark UDF scenario, the data will be moved out from tungsten into JVM (Scala scenario) or JVM and Python Process (Python) to do the actual process, and then move back into tungsten. As a result of that:
To answer if Python would necessarily slower than Scala:
Since 30th October, 2017, Spark just introduced vectorized udfs for pyspark.
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
The reason that Python UDF is slow, is probably the PySpark UDF is not implemented in a most optimized way:
According to the paragraph from the link.
Spark added a Python API in version 0.7, with support for user-defined functions. These user-defined functions operate one-row-at-a-time, and thus suffer from high serialization and invocation overhead.
However the newly vectorized udfs seem to be improving the performance a lot:
ranging from 3x to over 100x.