Question
I have a pyspark dataframe where occasionally the columns will have a wrong value that matches another column. It would look something like this:
| Date | Latitude |
| --- | --- |
| 2017-01-01 | 43.4553 |
| 2017-01-02 | 42.9399 |
| 2017-01-03 | 43.0091 |
| 2017-01-04 | 2017-01-04 |
Obviously, the last Latitude value is incorrect. I need to remove any and all rows like this. I thought about using .isin(), but I can't seem to get it to work. If I try
df['Date'].isin(['Latitude'])
I get:
Column<(Date IN (Latitude))>
Any suggestions?
Answer 1:
Yes. You can use the ~ (invert) operator to exclude rows matched by isin(). Passing another column to isin() is not really documented, but it builds a per-row Date IN (Latitude) comparison, so you can try it and check that it gives you the desired output:
df = df.filter(~df['Date'].isin(df['Latitude']))
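A minimal end-to-end sketch of this approach, assuming the toy data from the question and a local SparkSession; keeping both columns as strings is an assumption here (a mixed column like Latitude cannot stay numeric once a date leaks into it):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Toy data from the question; both columns are strings, since the bad
# row puts a date into Latitude.
df = spark.createDataFrame(
    [("2017-01-01", "43.4553"),
     ("2017-01-02", "42.9399"),
     ("2017-01-03", "43.0091"),
     ("2017-01-04", "2017-01-04")],
    ["Date", "Latitude"],
)

# isin() also accepts Column arguments, so this compares Date against
# the Latitude value on the same row; ~ keeps the non-matching rows.
clean = df.filter(~df["Date"].isin(df["Latitude"]))
clean.show()
```

Running this drops only the 2017-01-04 row, where the two columns carry the same value.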
Answer 2:
If you're more comfortable with SQL syntax, here is an alternative way using a pyspark-sql condition inside filter():
df = df.filter("Date NOT IN (Latitude)")
Or equivalently using pyspark.sql.DataFrame.where():
df = df.where("Date NOT IN (Latitude)")
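A short usage sketch reusing the df built above; in PySpark, where() is simply an alias for filter(), so the two forms are interchangeable:

```python
# Both calls parse the string as a Spark SQL expression; results match.
clean_sql = df.filter("Date NOT IN (Latitude)")
clean_sql.show()

assert clean_sql.collect() == df.where("Date NOT IN (Latitude)").collect()
```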
Source: https://stackoverflow.com/questions/49992272/find-and-remove-matching-column-values-in-pyspark