Filter PySpark dataframe column with None value

小鲜肉 2020-11-29 18:10

I'm trying to filter a PySpark dataframe that has None as a row value:

df.select('dt_mvmt').distinct().collect()

[Row(dt_mvmt=u'2016-03-27'), ..., Row(dt_mvmt=None), ...]

10 answers
  • 2020-11-29 18:28

    None/null is an instance of the NoneType class in Python, so the comparisons below will not work: under SQL semantics, comparing a column to NULL with == or != evaluates to NULL for every row, which is never true.

    Wrong way of filtering:

    df[df.dt_mvmt == None].count()  # 0
    df[df.dt_mvmt != None].count()  # 0

    Correct way:

    df=df.where(col("dt_mvmt").isNotNull()) returns all records with dt_mvmt as None/Null

  • 2020-11-29 18:31

    If the column contains None values like this:

    COLUMN_OLD_VALUE
    ----------------
    None
    1
    None
    100
    20
    ----------------
    

    Register a temp table on the data frame, then query it with SQL:

    df.registerTempTable("tempTable")
    sqlContext.sql("select * from tempTable where column_old_value = 'None' ").show()

    So use: column_old_value = 'None'. Note that this matches the literal string 'None'; if the column holds true NULLs, use column_old_value IS NULL instead.
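
    A short sketch of that distinction (the table name tempTable and the column column_old_value come from this answer; the sample rows are invented, and the modern SparkSession API is used in place of sqlContext):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # One literal string 'None' and one true NULL
    df = spark.createDataFrame([("None",), (None,), ("100",)], ["column_old_value"])
    df.createOrReplaceTempView("tempTable")

    spark.sql("select * from tempTable where column_old_value = 'None'").show()  # matches only the string row
    spark.sql("select * from tempTable where column_old_value is null").show()   # matches only the true NULL row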

  • 2020-11-29 18:31

    PySpark provides various filtering options based on arithmetic, logical, and other conditions. The presence of NULL values can hamper further processing, so removing them or statistically imputing them are both reasonable choices.

    The following code can be considered:

    # Dataset is df
    # Column name is dt_mvmt

    # Before filtering, note the row count of the dataset
    df.count()  # Some number

    # Filter out the NULL rows
    df = df.filter(df.dt_mvmt.isNotNull())

    # Re-check the count: it will be lower if NULL values were present
    # (a useful sanity check when dealing with large datasets)
    df.count()
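
    To see how many rows would be dropped without comparing two full counts, a single pass with a conditional aggregate also works; a minimal sketch (the F alias is my own naming):

    from pyspark.sql import functions as F

    # Count NULL and non-NULL dt_mvmt values in one pass
    df.select(
        F.count(F.when(F.col("dt_mvmt").isNull(), 1)).alias("null_rows"),
        F.count(F.col("dt_mvmt")).alias("non_null_rows"),
    ).show()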
    
  • 2020-11-29 18:38

    I would also try:

    df = df.dropna(subset=["dt_mvmt"])
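
    For context, dropna is an alias for na.drop, so the following is equivalent; a minimal sketch (how and thresh are standard parameters of the same API):

    # Drop rows where dt_mvmt is null -- same effect as dropna(subset=["dt_mvmt"])
    df = df.na.drop(subset=["dt_mvmt"])

    # how="all" drops a row only if every listed column is null;
    # thresh=n keeps rows with at least n non-null values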
