问题
I want to sum different columns in a spark dataframe.
Code
from pyspark.sql import functions as F
cols = ["A.p1","B.p1"]
df = spark.createDataFrame([[1,2],[4,89],[12,60]],schema=cols)
# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
#2. Doesnt work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
#3. Doesnt work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))
Why isn't approach #2. & #3. not working? I am on Spark 2.2
回答1:
Because,
# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
Here you are using python in-built sum function which takes iterable as input,so it works. https://docs.python.org/2/library/functions.html#sum
#2. Doesnt work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
Here you are using pyspark sum function which takes column as input but you are trying to get it at row level. http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sum
#3. Doesnt work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))
Here, df.select() returns a dataframe and trying to sum over a dataframe. In this case, I think, you got to iterate rowwise and apply sum over it.
回答2:
TL;DR builtins.sum
is just fine.
Following your comments:
Using native python sum() is not benefitting from spark optimization. so whats the spark way of doing it
and
its not a pypark function so it wont be really be completely benefiting from spark right.
I can see you are making incorrect assumptions.
Let's decompose the problem:
[df[col] for col in ["`A.p1`","`B.p1`"]]
creates a list of Columns
:
[Column<b'A.p1'>, Column<b'B.p1'>]
Let's call it iterable
.
sum
reduces output by taking elements of this list and calling __add__
method (+
). Imperative equivalent is:
accum = iterable[0]
for element in iterable[1:]:
accum = accum + element
This gives Column
:
Column<b'(A.p1 + B.p1)'>
which is the same as calling
df["`A.p1`"] + df["`B.p1`"]
No data has been touched and when evaluated it is benefits from all Spark optimizations.
来源:https://stackoverflow.com/questions/47690615/whats-is-the-correct-way-to-sum-different-dataframe-columns-in-a-list-in-pyspark