How to delete columns in a PySpark DataFrame

滥情空心 2021-01-30 01:55
>>> a
DataFrame[id: bigint, julian_date: string, user_id: bigint]
>>> b
DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigi         


        
8 Answers
  • 2021-01-30 02:27

    Reading the Spark documentation, I found an easier solution.

    Since Spark 1.4 there is a drop(col) method that can be called on a DataFrame in PySpark.

    You can use it in two ways:

    1. df.drop('age').collect()
    2. df.drop(df.age).collect()

    Pyspark Documentation - Drop

  • 2021-01-30 02:36

    You can delete a column like this:

    df.drop("column Name").columns
    

    In your case:

    df.drop("id").columns
    

    If you want to drop more than one column you can do:

    dfWithLongColName.drop("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME")
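
When the names to remove are already in a list, one option is to unpack the list into drop. The sketch below assumes a PySpark version whose drop accepts several string names, as the call above does. The helper name drop_columns and the sample row in demo are made up for illustration.

```python
def drop_columns(df, names):
    """Return df without the listed columns.

    The list is unpacked so that each name becomes a separate
    string argument to drop().
    """
    return df.drop(*names)


def demo():
    # Needs a local PySpark installation; the row below is invented sample data.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("drop-demo").getOrCreate()
    df = spark.createDataFrame(
        [("US", "CA", 10)],
        ["ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME", "count"],
    )
    slim = drop_columns(df, ["ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME"])
    # Inspect the schema only; no rows are collected.
    print(slim.columns)
    spark.stop()
```

Call demo() in an environment where PySpark is available to see the remaining column names.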
    
  • 2021-01-30 02:37

    There are two ways to do this:

    1: Keep only the columns you need:

    drop_column_list = ["drop_column"]
    df = df.select([column for column in df.columns if column not in drop_column_list])  
    

    2: This is the more elegant way.

    df = df.drop("col_name")
    

    Avoid the collect() version: it sends the complete dataset to the driver, which is very expensive for a large DataFrame.
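
The two approaches above can be compared without collecting any rows by looking only at the resulting column names. This is a minimal sketch: columns_to_keep is a hypothetical helper, and the sample row in demo is invented and merely borrows the column names from the question.

```python
def columns_to_keep(all_columns, to_drop):
    """Return the names in all_columns that are not in to_drop, order preserved."""
    dropped = set(to_drop)
    return [name for name in all_columns if name not in dropped]


def demo():
    # Needs a local PySpark installation; the row below is invented sample data.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("keep-demo").getOrCreate()
    df = spark.createDataFrame([(1, "2457000", 42)], ["id", "julian_date", "user_id"])

    via_select = df.select(columns_to_keep(df.columns, ["id"]))  # approach 1
    via_drop = df.drop("id")                                     # approach 2

    # Compare schemas only; nothing is pulled back to the driver.
    print(via_select.columns == via_drop.columns)
    spark.stop()
```

Both variants are lazy transformations, so comparing .columns does not trigger a job over the data.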

  • 2021-01-30 02:40

    You could either explicitly name the columns you want to keep, like so:

    keep = [a.id, a.julian_date, a.user_id, b.quan_created_money, b.quan_created_cnt]
    

    Or in a more general approach you'd include all columns except for a specific one via a list comprehension. For example like this (excluding the id column from b):

    keep = [a[c] for c in a.columns] + [b[c] for c in b.columns if c != 'id']
    

    Finally you make a selection on your join result:

    d = a.join(b, a.id==b.id, 'outer').select(*keep)
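
The list comprehension above could be wrapped in a small function so the excluded names become a parameter. This is only a sketch under assumptions: build_keep is a hypothetical name, and the two single-row DataFrames in demo are invented stand-ins for a and b.

```python
def build_keep(left, right, skip_from_right):
    """All columns of left, plus the columns of right whose names are not skipped.

    Works with any objects that expose .columns and support obj[name],
    which PySpark DataFrames do.
    """
    skip = set(skip_from_right)
    return [left[c] for c in left.columns] + [
        right[c] for c in right.columns if c not in skip
    ]


def demo():
    # Needs a local PySpark installation; the rows below are invented sample data.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("join-keep").getOrCreate()
    a = spark.createDataFrame([(1, "x")], ["id", "left_val"])
    b = spark.createDataFrame([(1, "y")], ["id", "right_val"])

    # Select a's columns plus b's columns except b.id after the join.
    d = a.join(b, a.id == b.id, "outer").select(*build_keep(a, b, ["id"]))
    print(d.columns)
    spark.stop()
```

Because the kept columns are taken from a and b themselves, the duplicate id from b never reaches the result.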
    
  • 2021-01-30 02:40

    Consider two DataFrames:

    >>> aDF.show()
    +---+----+
    | id|datA|
    +---+----+
    |  1|  a1|
    |  2|  a2|
    |  3|  a3|
    +---+----+
    

    and

    >>> bDF.show()
    +---+----+
    | id|datB|
    +---+----+
    |  2|  b2|
    |  3|  b3|
    |  4|  b4|
    +---+----+
    

    There are two ways to accomplish what you are looking for:

    1. Use a different join condition. Instead of joining on aDF.id == bDF.id

    aDF.join(bDF, aDF.id == bDF.id, "outer")
    

    Write this:

    aDF.join(bDF, "id", "outer").show()
    +---+----+----+
    | id|datA|datB|
    +---+----+----+
    |  1|  a1|null|
    |  3|  a3|  b3|
    |  2|  a2|  b2|
    |  4|null|  b4|
    +---+----+----+
    

    This automatically removes the duplicate id column, so no separate drop step is needed.

    2. Use aliasing. Note that this loses the id values that exist only in bDF (the b4 row below ends up with a null id).

    >>> from pyspark.sql.functions import col
    >>> aDF.alias("a").join(bDF.alias("b"), aDF.id == bDF.id, "outer").drop(col("b.id")).show()
    
    +----+----+----+
    |  id|datA|datB|
    +----+----+----+
    |   1|  a1|null|
    |   3|  a3|  b3|
    |   2|  a2|  b2|
    |null|null|  b4|
    +----+----+----+
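
To apply the first approach without hard-coding the key, one could compute the column names the two DataFrames share and pass that list as the join key. This is only a sketch: shared_columns is a hypothetical helper, and joining on every shared name is an assumption that happens to fit aDF and bDF above, not a general rule.

```python
def shared_columns(left_columns, right_columns):
    """Names present in both lists, in the order they appear on the left."""
    right = set(right_columns)
    return [name for name in left_columns if name in right]


def demo():
    # Needs a local PySpark installation; the rows mirror aDF and bDF above.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("join-names").getOrCreate()
    aDF = spark.createDataFrame([(1, "a1"), (2, "a2"), (3, "a3")], ["id", "datA"])
    bDF = spark.createDataFrame([(2, "b2"), (3, "b3"), (4, "b4")], ["id", "datB"])

    keys = shared_columns(aDF.columns, bDF.columns)
    # Joining on a list of names keeps a single copy of each key column.
    joined = aDF.join(bDF, keys, "outer")
    print(joined.columns)
    spark.stop()
```

If the two DataFrames share a non-key column by accident, this helper would join on it too, so it is worth checking the returned list before using it.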
    
  • 2021-01-30 02:51

    This may be a little off topic, but here is a solution in Scala. Take the array of column names from oldDataFrame, remove the ones you want to drop ("colExclude"), and map the remaining names to Column objects. Then pass the resulting Array[Column] to select, unpacking it with : _*.

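    import org.apache.spark.sql.{Column, DataFrame}

    // Keep every column except "colExclude", then select the remaining ones.
    val columnsToKeep: Array[Column] = oldDataFrame.columns.diff(Array("colExclude"))
                                                   .map(x => oldDataFrame.col(x))
    val newDataFrame: DataFrame = oldDataFrame.select(columnsToKeep: _*)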
    