Find and remove matching column values in pyspark

痴心易碎 submitted on 2019-12-12 19:15:31

Question


I have a pyspark dataframe where a column occasionally contains a wrong value copied from another column. It looks something like this:

| Date       | Latitude   |
|------------|------------|
| 2017-01-01 | 43.4553    |
| 2017-01-02 | 42.9399    |
| 2017-01-03 | 43.0091    |
| 2017-01-04 | 2017-01-04 |

Obviously, the last Latitude value is incorrect. I need to remove all rows like this. I thought about using .isin(), but I can't seem to get it to work. If I try:

df['Date'].isin(['Latitude'])

I get:

Column<(Date IN (Latitude))>

Any suggestions?


Answer 1:


Yes. You can use ~ (the invert operator) to negate an isin() condition and exclude the matching rows. Passing a column to isin() is not really documented, but you can try it and check that it gives the desired output:

df = df.filter(~df['Date'].isin(df['Latitude']))



Answer 2:


If you're more comfortable with SQL syntax, here is an alternative way using a PySpark SQL condition string inside filter():

df = df.filter("Date NOT IN (Latitude)")

Or equivalently using pyspark.sql.DataFrame.where():

df = df.where("Date NOT IN (Latitude)")


Source: https://stackoverflow.com/questions/49992272/find-and-remove-matching-column-values-in-pyspark
