Dropping multiple columns from a Spark DataFrame by iterating through a Scala List of column names

情歌与酒 2020-12-29 00:35

I have a DataFrame with around 400 columns, and I want to drop 100 of them. So I have created a Scala List of the 100 column names. And then I want to ite

4 answers
  • 2020-12-29 01:21

    Answer:

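    import org.apache.spark.sql.Column

    // Names of the columns to drop; extend as needed.
    val colsToRemove = Seq("colA", "colB", "colC")

    // Keep only the columns whose names are not in colsToRemove.
    val filteredDF = df.select(
      df.columns
        .filter(colName => !colsToRemove.contains(colName))
        .map(colName => new Column(colName)): _*)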
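
    The name-filtering step in the answer above can be tried without Spark. The following is a minimal sketch on plain strings; `keepColumns` and the sample column names are hypothetical, and a `Set` is swapped in for the `Seq` on the assumption that lookups over 100 names are cheaper that way.

```scala
object ColumnFilterSketch {
  // Hypothetical helper: returns the names in `all` that are not in `toRemove`,
  // preserving the original column order.
  def keepColumns(all: Seq[String], toRemove: Set[String]): Seq[String] =
    all.filterNot(toRemove.contains)

  def main(args: Array[String]): Unit = {
    // Stand-in for df.columns; the names are made up for illustration.
    val allColumns = Seq("colA", "colB", "colC", "colD", "colE")
    val colsToRemove = Set("colA", "colC")

    println(keepColumns(allColumns, colsToRemove).mkString(", "))
    // prints: colB, colD, colE
  }
}
```

    The surviving names can then be mapped to `Column` objects and passed to `df.select`, as in the answer.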
    
  • 2020-12-29 01:25

    If all you need is to drop several named columns, rather than select columns by some condition, you can simply do the following:

    df.drop("colA", "colB", "colC")
    
  • 2020-12-29 01:26

    This should work fine:

    // dropList: List[String] holds the names of the columns to drop
    // df: DataFrame is the input DataFrame
    val test_df = df.drop(dropList: _*)
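
    For a version that can be run end to end, here is a small sketch, assuming Spark is on the classpath and a local session is acceptable. The object name, the `dropAll` helper, the sample data, and the column names are all hypothetical. It also includes a name that is not in the schema, which `drop` silently ignores.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DropListSketch {
  // Hypothetical helper: expands the list into the varargs form of drop.
  def dropAll(df: DataFrame, dropList: List[String]): DataFrame =
    df.drop(dropList: _*)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("drop-list-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up single-row DataFrame with four columns.
    val df = Seq((1, "x", true, 2.0)).toDF("colA", "colB", "colC", "colD")

    // "notPresent" is not a column of df; drop treats it as a no-op.
    val dropList = List("colA", "colC", "notPresent")

    val testDf = dropAll(df, dropList)
    println(testDf.columns.mkString(", "))
    // prints: colB, colD

    spark.stop()
  }
}
```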
    
  • 2020-12-29 01:27

    You can just do:

    import org.apache.spark.sql.DataFrame

    // Drops each column named in dropList, one at a time.
    def dropColumns(inputDF: DataFrame, dropList: List[String]): DataFrame =
        dropList.foldLeft(inputDF)((df, col) => df.drop(col))
    

    It returns the DataFrame without the columns listed in dropList.

    As an example of what happens behind the scenes, consider the same fold on a plain list:

    scala> val list = List(0, 1, 2, 3, 4, 5, 6, 7)
    list: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7)
    
    scala> val removeThese = List(0, 2, 3)
    removeThese: List[Int] = List(0, 2, 3)
    
    scala> removeThese.foldLeft(list)((l, r) => l.filterNot(_ == r))
    res2: List[Int] = List(1, 4, 5, 6, 7)
    

    The returned list (in your case, the DataFrame) is the result of the last filtering step. After each step, the intermediate result is passed as the accumulator (the first argument) to the next call of the function (_, _) => _.
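
    To see each intermediate result rather than only the last one, `scanLeft` can be swapped in for `foldLeft`. This is a small sketch on the same lists as above; the object and method names are hypothetical.

```scala
object FoldTraceSketch {
  // Returns every intermediate list, starting with the unmodified input.
  // The last element is what foldLeft with the same function would return.
  def trace(list: List[Int], removeThese: List[Int]): List[List[Int]] =
    removeThese.scanLeft(list)((l, r) => l.filterNot(_ == r))

  def main(args: Array[String]): Unit = {
    val steps = trace(List(0, 1, 2, 3, 4, 5, 6, 7), List(0, 2, 3))
    steps.foreach(println)
    // List(0, 1, 2, 3, 4, 5, 6, 7)   initial value
    // List(1, 2, 3, 4, 5, 6, 7)      after removing 0
    // List(1, 3, 4, 5, 6, 7)         after removing 2
    // List(1, 4, 5, 6, 7)            after removing 3
  }
}
```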
