I have a concept I hope you can help to clarify:
What\'s the difference between the following three ways of referring to a column in PySpark dataframe. I know diffe
In most practical applictions, there is almost no difference. However, they are implemented by calls to different underlying functions (source) and thus are not exactly the same.
We can illustrate with a small example:
df = spark.createDataFrame(
[(1,'a', 0), (2,'b',None), (None,'c',3)],
['col', '2col', 'third col']
)
df.show()
#+----+----+---------+
#| col|2col|third col|
#+----+----+---------+
#| 1| a| 0|
#| 2| b| null|
#|null| c| 3|
#+----+----+---------+
df.col
This is the least flexible. You can only reference columns that are valid to be accessed using the .
operator. This rules out column names containing spaces or special characters and column names that start with an integer.
This syntax makes a call to df.__getattr__("col")
.
print(df.__getattr__.__doc__)
#Returns the :class:`Column` denoted by ``name``.
#
# >>> df.select(df.age).collect()
# [Row(age=2), Row(age=5)]
#
# .. versionadded:: 1.3
Using the .
syntax, you can only access the first column of this example dataframe.
>>> df.2col
File "<ipython-input-39-8e82c2dd5b7c>", line 1
df.2col
^
SyntaxError: invalid syntax
Under the hood, it checks to see if the column name is contained in df.columns
and then returns the pyspark.sql.Column
specified.
df["col"]
This makes a call to df.__getitem__
. You have some more flexibility in that you can do everything that __getattr__
can do, plus you can specify any column name.
df["2col"]
#Column<2col>
Once again, under the hood some conditionals are checked and in this case the pyspark.sql.Column
specified by the input string is returned.
In addition, you can as pass in multiple columns (as a list
or tuple
) or column expressions.
from pyspark.sql.functions import expr
df[['col', expr('`third col` IS NULL')]].show()
#+----+-------------------+
#| col|(third col IS NULL)|
#+----+-------------------+
#| 1| false|
#| 2| true|
#|null| false|
#+----+-------------------+
Note that in the case of multiple columns, __getitem__
is just making a call to pyspark.sql.DataFrame.select.
Finally, you can also access columns by index:
df[2]
#Column<third col>
pyspark.sql.functions.col
This is the Spark native way of selecting a column and returns a expression
(this is the case for all column functions) which selects the column on based on the given name. This is useful shorthand when you need to specify that you want a column and not a string literal.
For example, supposed we wanted to make a new column that would take on the either the value from "col"
or "third col"
based on the value of "2col"
:
from pyspark.sql.functions import when
df.withColumn(
'new',
f.when(df['2col'].isin(['a', 'c']), 'third col').otherwise('col')
).show()
#+----+----+---------+---------+
#| col|2col|third col| new|
#+----+----+---------+---------+
#| 1| a| 0|third col|
#| 2| b| null| col|
#|null| c| 3|third col|
#+----+----+---------+---------+
Oops, that's not what I meant. Spark thought I wanted the literal strings "col"
and "third col"
. Instead, what I should have written is:
from pyspark.sql.functions import col
df.withColumn(
'new',
when(df['2col'].isin(['a', 'c']), col('third col')).otherwise(col('col'))
).show()
#+----+----+---------+---+
#| col|2col|third col|new|
#+----+----+---------+---+
#| 1| a| 0| 0|
#| 2| b| null| 2|
#|null| c| 3| 3|
#+----+----+---------+---+
Because is col() creates the column expression without checking there's two interesting side effects of this.
age = col('dob') / 365
if_expr = when(age < 18, 'underage').otherwise('adult')
df1 = df.read.csv(path).withColumn('age_category', if_expr)
df2 = df.read.parquet(path)\
.select('*', age.alias('age'), if_expr.alias('age_category'))
age
generates Column<b'(dob / 365)'>
if_expr
generates Column<b'CASE WHEN ((dob / 365) < 18) THEN underage ELSE adult END'>