Multiple conditions for filter in Spark DataFrames

醉酒成梦 2020-12-03 04:41

I have a data frame with four fields. One of the fields is named Status, and I am trying to use an OR condition in .filter on the DataFrame. I tried the queries below, but no luck.

11 answers
  • 2020-12-03 04:58

    In a Java Spark Dataset it can be done like this (with a static import of org.apache.spark.sql.functions.col):

    Dataset<Row> userfilter = user.filter(col("gender").isin("male", "female"));
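
    For the OR condition in the question itself, the Column API also offers equalTo and or. A minimal Scala sketch, assuming a DataFrame named df1 with an integer Status column (the Java API exposes the same methods):

    import org.apache.spark.sql.functions.col

    // df1 is assumed to hold the question's data, with an integer Status column
    val df2 = df1.filter(col("Status").equalTo(2).or(col("Status").equalTo(3)))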

  • 2020-12-03 05:01

    In Spark/Scala, it's easy to filter with varargs:

    val d = spark.read... // data contains a column named matid
    val ids = Seq("BNBEL0608AH", "BNBEL00608H")
    val filtered = d.filter($"matid".isin(ids: _*))
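
    The ids: _* splat expands the Seq into the varargs that isin expects, since its signature is isin(list: Any*).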
    
  • 2020-12-03 05:03
    df2 = df1.filter("Status = 2 OR Status = 3")
    

    Worked for me.
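
    Note that where is an alias for filter, so df1.where("Status = 2 OR Status = 3") is equivalent.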

  • 2020-12-03 05:05

    You can use filter; here is a full runnable example:

    package dataframe
    
    import org.apache.spark.sql.SparkSession
    /**
     * @author vaquar.khan@gmail.com
     */
    //
    
    object DataFrameExample{
      //
      case class Employee(id: Integer, name: String, address: String, salary: Double, state: String,zip:Integer)
      //
      def main(args: Array[String]) {
        val spark =
          SparkSession.builder()
            .appName("DataFrame-Basic")
            .master("local[4]")
            .getOrCreate()
    
        import spark.implicits._
    
        // create a sequence of case class objects (the case class is defined above)
    
        val emp = Seq( 
        Employee(1, "vaquar khan", "111 algoinquin road chicago", 120000.00, "AZ",60173),
        Employee(2, "Firdos Pasha", "1300 algoinquin road chicago", 2500000.00, "IL",50112),
        Employee(3, "Zidan khan", "112 apt abcd timesqure NY", 50000.00, "NY",55490),
        Employee(4, "Anwars khan", "washington dc", 120000.00, "VA",33245),
        Employee(5, "Deepak sharma ", "rolling edows schumburg", 990090.00, "IL",60172),
        Employee(6, "afaq khan", "saeed colony Bhopal", 1000000.00, "AZ",60173)
        )
    
        val employee = spark.sparkContext.parallelize(emp, 4).toDF()

        employee.printSchema()
    
        employee.show()
    
    
        employee.select("state", "zip").show()
    
        println("*** use filter() to choose rows")
    
        employee.filter($"state".equalTo("IL")).show()
    
        println("*** multi contidtion in filer || ")
    
        employee.filter($"state".equalTo("IL") || $"state".equalTo("AZ")).show()
    
        println("*** multi contidtion in filer &&  ")
    
        employee.filter($"state".equalTo("AZ") && $"zip".equalTo("60173")).show()
    
      }
    }
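
    As a side note, equalTo has the symbolic alias ===, so the OR filter above can equivalently be written as:

        employee.filter($"state" === "IL" || $"state" === "AZ").show()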
    
  • 2020-12-03 05:06

    For future reference: we can use isInCollection (available since Spark 2.4) to filter. Here is an example; note that it looks for exact matches:

      import org.apache.spark.sql.DataFrame
      import org.apache.spark.sql.functions.col

      def getSelectedTablesRows(allTablesInfoDF: DataFrame, tableNames: Seq[String]): DataFrame = {
        allTablesInfoDF.where(col("table_name").isInCollection(tableNames))
      }
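
    A hypothetical call, assuming a DataFrame of table metadata named allTablesDF and illustrative table names:

      val selected = getSelectedTablesRows(allTablesDF, Seq("users", "orders"))
      selected.show()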
    
  • 2020-12-03 05:09

    This question has been answered, but for future reference I would like to mention that, in the context of this question, the where and filter methods on a Dataset/DataFrame support two syntaxes. The SQL string parameters:

    df2 = df1.filter("Status = 2 or Status = 3")
    

    and Column-based parameters (mentioned by @David):

    df2 = df1.filter($"Status" === 2 || $"Status" === 3)
    

    It seems the OP combined these two syntaxes. Personally, I prefer the first syntax because it's cleaner and more generic.
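
    The SQL-string form accepts any valid SQL boolean expression, so the same filter can also be written with IN:

    df2 = df1.filter("Status IN (2, 3)")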
