How to convert empty arrays to nulls?

Happy的楠姐 2021-01-13 18:28

I have the dataframe below and I need to convert the empty arrays to null.

+----+---------+-----------+
|  id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]|   [50, 55]|
|1111|       []|         []|
|1112| [45, 46]|   [50, 50]|
|1113|       []|         []|
+----+---------+-----------+
7 Answers
  • 2021-01-13 18:57

    For your given dataframe, you can simply do the following

    from pyspark.sql import functions as F
    df.withColumn("count(AS)", F.when((F.size(F.col("count(AS)")) == 0), F.lit(None)).otherwise(F.col("count(AS)"))) \
        .withColumn("count(asdr)", F.when((F.size(F.col("count(asdr)")) == 0), F.lit(None)).otherwise(F.col("count(asdr)"))).show()
    

    You should get the output dataframe as

    +----+---------+-----------+
    |  id|count(AS)|count(asdr)|
    +----+---------+-----------+
    |1110| [12, 45]|   [50, 55]|
    |1111|     null|       null|
    |1112| [45, 46]|   [50, 50]|
    |1113|     null|       null|
    +----+---------+-----------+
    

    Updated

    If you have more than two array columns and want to apply the above logic to all of them dynamically, you can use the following loop

    from pyspark.sql import functions as F
    for c in df.dtypes:
        if "array" in c[1]:
            df = df.withColumn(c[0], F.when((F.size(F.col(c[0])) == 0), F.lit(None)).otherwise(F.col(c[0])))
    df.show()
    

    Here, df.dtypes gives you a list of tuples of column name and datatype. For the dataframe in the question it would be

    [('id', 'bigint'), ('count(AS)', 'array<bigint>'), ('count(asdr)', 'array<bigint>')]
    

    withColumn is applied only to the array columns ("array" in c[1]). F.size(F.col(c[0])) == 0 is the condition passed to the when function, which checks the size of the array: if the condition is true, i.e. the array is empty, None is populated, otherwise the original value is kept. The loop applies this to every array column.
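
    If you'd rather avoid the loop, roughly the same transformation can be written as a single select. This is just a sketch of an alternative formulation (assuming the same df as above), not part of the original answer:

    from pyspark.sql import functions as F

    # one expression per column: array columns get the empty-array-to-null
    # treatment, all other columns are passed through unchanged
    exprs = [
        F.when(F.size(F.col(name)) == 0, F.lit(None)).otherwise(F.col(name)).alias(name)
        if dtype.startswith("array")
        else F.col(name)
        for name, dtype in df.dtypes
    ]
    df.select(*exprs).show()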

  • 2021-01-13 18:59

    I don't think that's possible with na.fill, but this should work for you. The code converts all empty ArrayType columns to null and keeps the other columns as they are:

    import spark.implicits._
    import org.apache.spark.sql.types.ArrayType
    import org.apache.spark.sql.functions._
    
    val df = Seq(
      (110, Seq.empty[Int]),
      (111, Seq(1,2,3))
    ).toDF("id","arr")
    
    // get names of array-type columns
    val arrColsNames = df.schema.fields.filter(f => f.dataType.isInstanceOf[ArrayType]).map(_.name)
    
    // map all empty arrays to nulls
    val emptyArraysAsNulls = arrColsNames.map(n => when(size(col(n))>0,col(n)).as(n))
    
    // non-array-type columns, keep them as they are
    val keepCols = df.columns.filterNot(arrColsNames.contains).map(col)
    
    df
      .select((keepCols ++ emptyArraysAsNulls):_*)
      .show()
    
    +---+---------+
    | id|      arr|
    +---+---------+
    |110|     null|
    |111|[1, 2, 3]|
    +---+---------+
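
    Since the question itself is PySpark, the same schema-based idea (checking for ArrayType instead of matching a datatype string) might look roughly like this; a sketch only, assuming the same df:

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType

    # pick the array-type columns from the schema, as the Scala code does
    arr_col_names = [f.name for f in df.schema.fields if isinstance(f.dataType, ArrayType)]

    # when() without otherwise() leaves non-matching rows (empty arrays) as null,
    # mirroring when(size(col(n)) > 0, col(n)) above
    for n in arr_col_names:
        df = df.withColumn(n, F.when(F.size(F.col(n)) > 0, F.col(n)))
    df.show()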
    
  • 2021-01-13 18:59

    There is no easy solution like df.na.fill here. One way would be to loop over all relevant columns and replace values where appropriate. Example using foldLeft in Scala:

    val columns = df.schema.filter(_.dataType.typeName == "array").map(_.name)
    
    val df2 = columns.foldLeft(df)((acc, colname) => acc.withColumn(colname, 
        when(size(col(colname)) === 0, null).otherwise(col(colname))))
    

    First, all columns of array type are extracted and then iterated over. Since the size function is only defined for columns of array type, this filtering is a necessary step (and avoids looping over all columns).

    Using the dataframe:

    +----+--------+-----+
    |  id|    col1| col2|
    +----+--------+-----+
    |1110|[12, 11]|   []|
    |1111|      []| [11]|
    |1112|   [123]|[321]|
    +----+--------+-----+
    

    The result is as follows:

    +----+--------+-----+
    |  id|    col1| col2|
    +----+--------+-----+
    |1110|[12, 11]| null|
    |1111|    null| [11]|
    |1112|   [123]|[321]|
    +----+--------+-----+
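
    For completeness, roughly the same foldLeft pattern can be written in PySpark with functools.reduce. This is only a sketch, assuming the array columns are detected via df.dtypes as in the other answers:

    from functools import reduce
    from pyspark.sql import functions as F

    # columns whose datatype string starts with "array"
    array_cols = [name for name, dtype in df.dtypes if dtype.startswith("array")]

    # fold the withColumn calls over the array columns, starting from df
    df2 = reduce(
        lambda acc, c: acc.withColumn(
            c, F.when(F.size(F.col(c)) == 0, F.lit(None)).otherwise(F.col(c))
        ),
        array_cols,
        df,
    )
    df2.show()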
    
  • 2021-01-13 19:03

    Taking Ramesh Maharajan's solution above as a reference, I found another way to solve this using UDFs. Hope this helps if you have multiple rules to apply to your dataframe.

    df

    +-----+----+----+----+
    |store|   1|   2|   3|
    +-----+----+----+----+
    |  103|[90]|  []|  []|
    |  104|  []|[67]|[90]|
    |  101|[34]|  []|  []|
    |  102|[35]|  []|  []|
    +-----+----+----+----+
    

    Use the code below; it works in PySpark. Import pyspark.sql.functions as psf:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType

    def udf1(x: list):
        # return None (not the string "null") so Spark stores a real null
        return None if x == [] else x
    udf2 = udf(udf1, ArrayType(IntegerType()))
    
    for c in df.dtypes:
        if "array" in c[1]:
            df=df.withColumn(c[0],udf2(psf.col(c[0])))
    df.show()
    

    output

    +-----+----+----+----+
    |store|   1|   2|   3|
    +-----+----+----+----+
    |  103|[90]|null|null|
    |  104|null|[67]|[90]|
    |  101|[34]|null|null|
    |  102|[35]|null|null|
    +-----+----+----+----+
    
  • 2021-01-13 19:05

    You need to check the size of the array type column. For example:

    df.show()
    +----+---+
    |  id|arr|
    +----+---+
    |1110| []|
    +----+---+
    
    df.withColumn("arr", when(size(col("arr")) == 0 , lit(None)).otherwise(col("arr") ) ).show()
    
    +----+----+
    |  id| arr|
    +----+----+
    |1110|null|
    +----+----+
    
  • 2021-01-13 19:09
    df.withColumn("arr", when(size(col("arr")) == 0, lit(None)).otherwise(col("arr") ) ).show()
    

    Please keep in mind that this also works in PySpark.
