Question
I am trying to get all rows in a dataframe where a column's value is not in a given list (i.e. filtering by exclusion).
As an example:
df = sqlContext.createDataFrame(
    [('1','a'), ('2','b'), ('3','b'), ('4','c'), ('5','d')],
    schema=('id', 'bar'))
I get the data frame:
+---+---+
| id|bar|
+---+---+
| 1| a|
| 2| b|
| 3| b|
| 4| c|
| 5| d|
+---+---+
I want to exclude the rows where bar is 'a' or 'b'.
Using an SQL expression string it would be:
df.filter('bar not in ("a","b")').show()
Is there a way of doing it without using the string for the SQL expression, or excluding one item at a time?
Edit:
In practice I will have a list of the excluded values, e.g. ['a','b'], that I would like to use.
Answer 1:
It looks like ~ gives the functionality that I need, but I have yet to find any appropriate documentation on it.
from pyspark.sql.functions import col
df.filter(~col('bar').isin(['a','b'])).show()
+---+---+
| id|bar|
+---+---+
| 4| c|
| 5| d|
+---+---+
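For reference, ~ is the negation operator on pyspark Column objects (it overloads Python's __invert__), which is why it works on the boolean Column returned by isin. With the exclusion list from the question's edit, a minimal sketch would be:
from pyspark.sql.functions import col
exclude = ['a', 'b']  # the list of values to filter out
# ~ negates the boolean Column built by isin, keeping rows whose
# bar value is NOT in the exclusion list
df.filter(~col('bar').isin(exclude)).show()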
Answer 2:
It could also be written like this:
df.filter(col('bar').isin(['a','b']) == False).show()
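Both spellings build the same NOT(bar IN ('a','b')) predicate; == False works because Column also overloads ==. A quick sanity check, assuming the df from the question:
from pyspark.sql.functions import col
a = df.filter(~col('bar').isin(['a', 'b']))
b = df.filter(col('bar').isin(['a', 'b']) == False)
# Both should return only the rows with bar 'c' and 'd'
assert sorted(a.collect()) == sorted(b.collect())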
Answer 3:
Here's a gotcha for those with their headspace in Pandas who are moving to pyspark:
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
spark_conf = SparkConf().setMaster("local").setAppName("MyAppName")
sc = SparkContext(conf=spark_conf)
sqlContext = SQLContext(sc)
records = [
    {"colour": "red"},
    {"colour": "blue"},
    {"colour": None},
]
pandas_df = pd.DataFrame.from_dict(records)
pyspark_df = sqlContext.createDataFrame(records)
So if we wanted the rows that are not red:
pandas_df[~pandas_df["colour"].isin(["red"])]
Looking good: pandas treats the None row as not matching "red", so the row is kept. Now the same filter on our pyspark DataFrame:
pyspark_df.filter(~pyspark_df["colour"].isin(["red"])).collect()
The None row has vanished. After some digging, I found this: https://issues.apache.org/jira/browse/SPARK-20617 So to include the null rows in our results:
pyspark_df.filter(~pyspark_df["colour"].isin(["red"]) | pyspark_df["colour"].isNull()).show()
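The underlying reason is SQL's three-valued logic: colour IN ('red') evaluates to NULL when colour is NULL, negating NULL still gives NULL, and filter only keeps rows where the predicate is true. An alternative sketch of the same fix, coalescing the NULL to False before negating (assuming the pyspark_df above):
from pyspark.sql.functions import coalesce, lit
# Turn the NULL that isin() produces for a NULL colour into False,
# so ~ turns it into True and the row survives the filter
pyspark_df.filter(~coalesce(pyspark_df["colour"].isin(["red"]), lit(False))).show()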
Answer 4:
df.filter((df.bar != 'a') & (df.bar != 'b'))
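This chains one inequality per excluded value with &. For the list-based case from the question's edit, the same idea can be folded over the list (a sketch assuming the df and list from the question):
from functools import reduce
from operator import and_
exclude = ['a', 'b']
# Builds (bar != 'a') & (bar != 'b') & ... from the exclusion list
df.filter(reduce(and_, [df.bar != v for v in exclude])).show()
Note that, like isin, != also evaluates to NULL for NULL values, so this form drops NULL rows too.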
Source: https://stackoverflow.com/questions/41775281/filtering-a-pyspark-dataframe-using-isin-by-exclusion