PySpark DataFrame Column Reference: df.col vs. df['col'] vs. F.col('col')?

后端 未结 1 469
[愿得一人]
[愿得一人] 2020-12-06 02:10

I have a concept I hope you can help to clarify:

What\'s the difference between the following three ways of referring to a column in PySpark dataframe. I know diffe

相关标签:
1条回答
  • In most practical applictions, there is almost no difference. However, they are implemented by calls to different underlying functions (source) and thus are not exactly the same.

    We can illustrate with a small example:

    df = spark.createDataFrame(
        [(1,'a', 0), (2,'b',None), (None,'c',3)], 
        ['col', '2col', 'third col']
    )
    
    df.show()
    #+----+----+---------+
    #| col|2col|third col|
    #+----+----+---------+
    #|   1|   a|        0|
    #|   2|   b|     null|
    #|null|   c|        3|
    #+----+----+---------+
    

    1. df.col

    This is the least flexible. You can only reference columns that are valid to be accessed using the . operator. This rules out column names containing spaces or special characters and column names that start with an integer.

    This syntax makes a call to df.__getattr__("col").

    print(df.__getattr__.__doc__)
    #Returns the :class:`Column` denoted by ``name``.
    #
    #        >>> df.select(df.age).collect()
    #        [Row(age=2), Row(age=5)]
    #
    #        .. versionadded:: 1.3
    

    Using the . syntax, you can only access the first column of this example dataframe.

    >>> df.2col
      File "<ipython-input-39-8e82c2dd5b7c>", line 1
        df.2col
           ^
    SyntaxError: invalid syntax
    

    Under the hood, it checks to see if the column name is contained in df.columns and then returns the pyspark.sql.Column specified.

    2. df["col"]

    This makes a call to df.__getitem__. You have some more flexibility in that you can do everything that __getattr__ can do, plus you can specify any column name.

    df["2col"]
    #Column<2col> 
    

    Once again, under the hood some conditionals are checked and in this case the pyspark.sql.Column specified by the input string is returned.

    In addition, you can as pass in multiple columns (as a list or tuple) or column expressions.

    from pyspark.sql.functions import expr
    df[['col', expr('`third col` IS NULL')]].show()
    #+----+-------------------+
    #| col|(third col IS NULL)|
    #+----+-------------------+
    #|   1|              false|
    #|   2|               true|
    #|null|              false|
    #+----+-------------------+
    

    Note that in the case of multiple columns, __getitem__ is just making a call to pyspark.sql.DataFrame.select.

    Finally, you can also access columns by index:

    df[2]
    #Column<third col>
    

    3. pyspark.sql.functions.col

    This is the Spark native way of selecting a column and returns a expression (this is the case for all column functions) which selects the column on based on the given name. This is useful shorthand when you need to specify that you want a column and not a string literal.

    For example, supposed we wanted to make a new column that would take on the either the value from "col" or "third col" based on the value of "2col":

    from pyspark.sql.functions import when
    
    df.withColumn(
        'new', 
        f.when(df['2col'].isin(['a', 'c']), 'third col').otherwise('col')
    ).show()
    #+----+----+---------+---------+
    #| col|2col|third col|      new|
    #+----+----+---------+---------+
    #|   1|   a|        0|third col|
    #|   2|   b|     null|      col|
    #|null|   c|        3|third col|
    #+----+----+---------+---------+
    

    Oops, that's not what I meant. Spark thought I wanted the literal strings "col" and "third col". Instead, what I should have written is:

    from pyspark.sql.functions import col
    df.withColumn(
        'new', 
        when(df['2col'].isin(['a', 'c']), col('third col')).otherwise(col('col'))
    ).show()
    #+----+----+---------+---+
    #| col|2col|third col|new|
    #+----+----+---------+---+
    #|   1|   a|        0|  0|
    #|   2|   b|     null|  2|
    #|null|   c|        3|  3|
    #+----+----+---------+---+
    

    Because is col() creates the column expression without checking there's two interesting side effects of this.

    1. It can be re-used as it's not df specific
    2. It can be used before the df is assigned
    age = col('dob') / 365
    if_expr = when(age < 18, 'underage').otherwise('adult')
    
    df1 = df.read.csv(path).withColumn('age_category', if_expr)
    
    df2 = df.read.parquet(path)\
        .select('*', age.alias('age'), if_expr.alias('age_category'))
    

    age generates Column<b'(dob / 365)'>
    if_expr generates Column<b'CASE WHEN ((dob / 365) < 18) THEN underage ELSE adult END'>

    0 讨论(0)
提交回复
热议问题