How to handle null entries in SparkR

滥情空心 2020-12-20 17:05

I have a SparkSQL DataFrame.

Some entries in this data are empty, but they don't behave like NULL or NA. How can I remove them? Any ideas?

In R I can easi

2 Answers
  • 2020-12-20 17:38

    It is not the nicest workaround, but if you cast the column to string, the missing values are stored as "NaN" and you can then filter them out. A short example:

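    # Build a SparkR DataFrame with one missing value in column b.
    testFrame   <- createDataFrame(sqlContext, data.frame(a=c(1,2,3),b=c(1,NA,3)))
    # Cast b to string; the missing value then shows up as the string "NaN".
    testFrame$c <- cast(testFrame$b,"string")
    
    # Keep rows whose string form is not "NaN", then collect to a local data.frame.
    resultFrame <- collect(filter(testFrame, testFrame$c!="NaN"))
    # Drop the helper column.
    resultFrame$c <- NULL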
    

    This omits the entire row where the element in column b is missing.

  • 2020-12-20 17:59

    SparkR Column provides a long list of useful methods including isNull and isNotNull:

    > library(magrittr)  # provides the %>% pipe used below
    > people_local <- data.frame(Id=1:4, Age=c(21, 18, 30, NA))
    > people <- createDataFrame(sqlContext, people_local)
    > head(people)
    
      Id Age
    1  1  21
    2  2  18
    3  3  30
    4  4  NA
    
    > filter(people, isNotNull(people$Age)) %>% head()
      Id Age
    1  1  21
    2  2  18
    3  3  30
    
    > filter(people, isNull(people$Age)) %>% head()
      Id Age
    1  4  NA
    

    Please keep in mind that there is no distinction between NA and NaN in SparkR.

    If you prefer to operate on a whole data frame, there is a set of NA functions, including fillna and dropna:

    > fillna(people, 99) %>% head()
      Id Age
    1  1  21
    2  2  18
    3  3  30
    4  4  99
    
    > dropna(people) %>% head()
      Id Age
    1  1  21
    2  2  18
    3  3  30
    

    Both can be restricted to a subset of columns (cols), and dropna has some additional useful parameters. For example, you can specify the minimum number of non-null values a row must contain to be kept:

    > people_with_names_local <- data.frame(
        Id=1:4, Age=c(21, 18, 30, NA), Name=c("Alice", NA, "Bob", NA))
    > people_with_names <- createDataFrame(sqlContext, people_with_names_local)
    > people_with_names %>% head()
      Id Age  Name
    1  1  21 Alice
    2  2  18  <NA>
    3  3  30   Bob
    4  4  NA  <NA>
    
    > dropna(people_with_names, minNonNulls=2) %>% head()
      Id Age  Name
    1  1  21 Alice
    2  2  18  <NA>
    3  3  30   Bob
    