How to “negative select” columns in spark's dataframe

野的像风 2020-12-15 05:35

I can't figure it out, but guess it's simple. I have a Spark DataFrame df. This df has columns "A", "B" and "C". Now let's say I have an Array containing the name of

9 answers
  • 2020-12-15 05:54

    In PySpark you can do

    # Note: a set difference does not preserve the original column order
    df.select(list(set(df.columns) - set(["B"])))
    

    Using more than one line you can also do

    cols = df.columns   # plain Python list of column names
    cols.remove("B")    # raises ValueError if "B" is not a column
    df.select(cols)
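
Both snippets above come down to building a list of column names without the unwanted ones. As a minimal sketch, assuming the excluded names arrive as a list, the filtering step can be written in plain Python so that the original column order is kept. The helper name `columns_except` is hypothetical, and the hard-coded list stands in for `df.columns`.

```python
from typing import Iterable, List


def columns_except(all_columns: Iterable[str], excluded: Iterable[str]) -> List[str]:
    """Return the names in all_columns that are not in excluded, in original order."""
    excluded_set = set(excluded)  # set membership checks are fast
    return [name for name in all_columns if name not in excluded_set]


# Stand-in for df.columns on a DataFrame with columns "A", "B" and "C".
all_columns = ["A", "B", "C"]
kept = columns_except(all_columns, ["B"])
print(kept)  # ['A', 'C']
```

With a real DataFrame this would presumably be used as `df.select(columns_except(df.columns, ["B"]))`. If the goal is only to get rid of a column, `df.drop("B")` does the same thing without building a list first.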
    
  • 2020-12-15 05:56

    OK, it's ugly, but this quick spark shell session shows something that works:

    scala> val myRDD = sc.parallelize(List.range(1,10))
    myRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:21
    
    scala> val myDF = myRDD.toDF("a")
    myDF: org.apache.spark.sql.DataFrame = [a: int]
    
    scala> val myOtherRDD = sc.parallelize(List.range(1,10))
    myOtherRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:21
    
    scala> val myotherDF = myOtherRDD.toDF("b")
    myotherDF: org.apache.spark.sql.DataFrame = [b: int]
    
    scala> myDF.unionAll(myotherDF)
    res2: org.apache.spark.sql.DataFrame = [a: int]
    
    scala> myDF.join(myotherDF)
    res3: org.apache.spark.sql.DataFrame = [a: int, b: int]
    
    scala> val twocol = myDF.join(myotherDF)
    twocol: org.apache.spark.sql.DataFrame = [a: int, b: int]
    
    scala> val cols = Array("a", "b")
    cols: Array[String] = Array(a, b)
    
    scala> val selectedCols = cols.filter(_!="b")
    selectedCols: Array[String] = Array(a)
    
    scala> twocol.select(selectedCols.head, selectedCols.tail: _*)
    res4: org.apache.spark.sql.DataFrame = [a: int]
    

    Providing varargs to a function that requires them is covered in other SO questions. The signature of select is there to ensure your list of selected columns is not empty, which makes the conversion from the list of selected columns to varargs a bit more complex.
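
The head/tail split used in the session above can be illustrated in plain Python. This is only a sketch: the `select` function here is a hypothetical stand-in that mimics the shape of the Scala signature (one required name followed by varargs), not PySpark's own `select`, which accepts a list directly.

```python
from typing import List


def select(col: str, *cols: str) -> List[str]:
    """Hypothetical stand-in: one required column name, then any number of extras."""
    return [col, *cols]


# Same filtering step as in the Scala session: keep everything except "b".
selected_cols = [name for name in ["a", "b"] if name != "b"]

# The required first argument means an empty list has to be handled explicitly.
if not selected_cols:
    raise ValueError("no columns left to select")

head, *tail = selected_cols   # split into first name and the rest
result = select(head, *tail)  # Python's *tail plays the role of Scala's tail: _*
print(result)  # ['a']
```

Calling `select()` with no arguments fails immediately, which is the guarantee the required first parameter is meant to give.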

  • 2020-12-15 06:00
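    import org.apache.spark.sql.functions.col

    val columns = Seq("A", "B", "C")

    // select expects Column varargs, so map the remaining names to Columns and expand with : _*
    df.select(columns.diff(Seq("B")).map(col): _*)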
    