Whats is the correct way to sum different dataframe columns in a list in pyspark?

问题

I want to sum different columns in a spark dataframe.

Code

from pyspark.sql import functions as F
cols = ["A.p1","B.p1"]
df = spark.createDataFrame([[1,2],[4,89],[12,60]],schema=cols)

# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

#2. Doesnt work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

#3. Doesnt work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))

Why isn't approach #2. & #3. not working? I am on Spark 2.2

回答1:

Because,

# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

Here you are using python in-built sum function which takes iterable as input,so it works. https://docs.python.org/2/library/functions.html#sum

#2. Doesnt work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

Here you are using pyspark sum function which takes column as input but you are trying to get it at row level. http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sum

#3. Doesnt work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))

Here, df.select() returns a dataframe and trying to sum over a dataframe. In this case, I think, you got to iterate rowwise and apply sum over it.

回答2:

TL;DR builtins.sum is just fine.

Following your comments:

Using native python sum() is not benefitting from spark optimization. so whats the spark way of doing it

and

its not a pypark function so it wont be really be completely benefiting from spark right.

I can see you are making incorrect assumptions.

Let's decompose the problem:

[df[col] for col in ["`A.p1`","`B.p1`"]]

creates a list of Columns:

[Column<b'A.p1'>, Column<b'B.p1'>]

Let's call it iterable.

sum reduces output by taking elements of this list and calling __add__ method (+). Imperative equivalent is:

accum = iterable[0]
for element in iterable[1:]:
    accum = accum + element

This gives Column:

Column<b'(A.p1 + B.p1)'>

which is the same as calling

df["`A.p1`"] + df["`B.p1`"]

No data has been touched and when evaluated it is benefits from all Spark optimizations.

来源：https://stackoverflow.com/questions/47690615/whats-is-the-correct-way-to-sum-different-dataframe-columns-in-a-list-in-pyspark

标签

python

apache-spark

pyspark

apache-spark-sql

pyspark-sql