I'm trying to filter a PySpark dataframe that has None as a row value:

df.select('dt_mvmt').distinct().collect()
[Row(dt_mvmt=u'2016-03-27'), ...]
None/Null is an object of the class NoneType in PySpark/Python, so the following will not work, because it tries to compare a NoneType object with a string object.

Wrong way of filtering:

df[df.dt_mvmt == None].count()
0
df[df.dt_mvmt != None].count()
0

Correct way:

df = df.where(col("dt_mvmt").isNotNull())

This returns all records where dt_mvmt is not None/Null.
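A minimal sketch of both directions, assuming a DataFrame named df with a dt_mvmt column:

from pyspark.sql.functions import col

# Rows where dt_mvmt is null
df.where(col("dt_mvmt").isNull()).count()

# Rows where dt_mvmt is not null
df.where(col("dt_mvmt").isNotNull()).count()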
If the column contains the literal string 'None' (rather than a true null), for example:
COLUMN_OLD_VALUE
----------------
None
1
None
100
20
------------------
Create a temp table on the data frame and query it for that string value:

df.registerTempTable("tempTable")
sqlContext.sql("select * from tempTable where column_old_value = 'None'").show()

So use: column_old_value = 'None'
PySpark provides various filtering options based on arithmetic, logical and other conditions. The presence of NULL values can hamper downstream processing, so removing them or statistically imputing them may be the right choice.
The following code can be considered:
# Dataset is df
# Column name is dt_mvmt
# Before filtering make sure you have the right count of the dataset
df.count() # Some number
# Filter here
df = df.filter(df.dt_mvmt.isNotNull())
# Check the count again to confirm NULL rows were removed (important when dealing with large datasets)
df.count() # Count should be reduced if NULL values were present
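If imputation is preferred over dropping rows, a minimal sketch is below; the placeholder date, the numeric column name and the use of pyspark.ml.feature.Imputer are illustrative assumptions, not part of the original answer.

# Fill nulls in dt_mvmt with a placeholder date (illustrative value)
df = df.fillna({"dt_mvmt": "1970-01-01"})

# For numeric columns, Imputer can fill nulls with the mean or median
from pyspark.ml.feature import Imputer
imputer = Imputer(inputCols=["some_numeric_col"],
                  outputCols=["some_numeric_col_imputed"],
                  strategy="mean")
df = imputer.fit(df).transform(df)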
I would also try:
df = df.dropna(subset=["dt_mvmt"])
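For reference, DataFrame.dropna is an alias for the na.drop method, so df.na.drop(subset=["dt_mvmt"]) gives the same result.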