I have the dataframe below and I need to convert empty arrays to null.
+----+---------+-----------+
| id|count(AS)|count(asdr)|
+----+---------+-----------+
|11
For your given dataframe, you can simply do the following:
from pyspark.sql import functions as F
df.withColumn("count(AS)", F.when((F.size(F.col("count(AS)")) == 0), F.lit(None)).otherwise(F.col("count(AS)"))) \
.withColumn("count(asdr)", F.when((F.size(F.col("count(asdr)")) == 0), F.lit(None)).otherwise(F.col("count(asdr)"))).show()
You should get the following output dataframe:
+----+---------+-----------+
| id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]| [50, 55]|
|1111| null| null|
|1112| [45, 46]| [50, 50]|
|1113| null| null|
+----+---------+-----------+
Updated
If you have more than two array columns and want to apply the above logic dynamically, you can loop over the columns:
from pyspark.sql import functions as F
for c in df.dtypes:
    if "array" in c[1]:
        df = df.withColumn(c[0], F.when((F.size(F.col(c[0])) == 0), F.lit(None)).otherwise(F.col(c[0])))
df.show()
Here, df.dtypes gives you a list of tuples with column name and datatype. For the dataframe in the question it would be
[('id', 'bigint'), ('count(AS)', 'array<bigint>'), ('count(asdr)', 'array<bigint>')]
withColumn is applied only to array columns ("array" in c[1]), where F.size(F.col(c[0])) == 0 is the condition passed to the when function, which checks the size of the array. If the condition is true, i.e. the array is empty, None is populated; otherwise the original value is kept. The loop is applied to all the array columns.
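The column-selection step can be tried without a Spark session, because df.dtypes is just a list of (name, type string) tuples. The sketch below uses the list shown above plus one hypothetical struct column ("meta", not part of the question) to show a possible pitfall: the substring test "array" in c[1] would also match a type that merely contains an array, whereas a prefix test matches only top-level arrays. Treat it as an illustration under those assumptions, not as part of the original answer.

```python
# dtypes as listed for the dataframe in the question, plus one
# hypothetical struct column ("meta") added purely for illustration
dtypes = [
    ("id", "bigint"),
    ("count(AS)", "array<bigint>"),
    ("count(asdr)", "array<bigint>"),
    ("meta", "struct<tags:array<string>>"),  # hypothetical, not in the question
]

# Substring test, as in the loop above: also matches the struct column
by_substring = [name for name, dtype in dtypes if "array" in dtype]

# Prefix test: matches only columns whose top-level type is an array
by_prefix = [name for name, dtype in dtypes if dtype.startswith("array")]

print(by_substring)  # ['count(AS)', 'count(asdr)', 'meta']
print(by_prefix)     # ['count(AS)', 'count(asdr)']
```

For the dataframe in the question both tests select the same two columns, so the difference only matters once nested types are present.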
I don't think that's possible with na.fill, but this should work for you. The code converts all empty ArrayType columns to null and keeps the other columns as they are:
import spark.implicits._
import org.apache.spark.sql.types.ArrayType
import org.apache.spark.sql.functions._
val df = Seq(
(110, Seq.empty[Int]),
(111, Seq(1,2,3))
).toDF("id","arr")
// get names of array-type columns
val arrColsNames = df.schema.fields.filter(f => f.dataType.isInstanceOf[ArrayType]).map(_.name)
// map all empty arrays to nulls
val emptyArraysAsNulls = arrColsNames.map(n => when(size(col(n))>0,col(n)).as(n))
// non-array-type columns, keep them as they are
val keepCols = df.columns.filterNot(arrColsNames.contains).map(col)
df
.select((keepCols ++ emptyArraysAsNulls):_*)
.show()
+---+---------+
| id| arr|
+---+---------+
|110| null|
|111|[1, 2, 3]|
+---+---------+
There is no easy solution like df.na.fill here. One way would be to loop over all relevant columns and replace values where appropriate. Example using foldLeft in Scala:
val columns = df.schema.filter(_.dataType.typeName == "array").map(_.name)
val df2 = columns.foldLeft(df)((acc, colname) => acc.withColumn(colname,
when(size(col(colname)) === 0, null).otherwise(col(colname))))
First, all columns of array type are extracted, and then these are iterated over. Since the size function is only defined for array (and map) columns, this is a necessary step (and avoids looping over all columns).
Using the dataframe:
+----+--------+-----+
| id| col1| col2|
+----+--------+-----+
|1110|[12, 11]| []|
|1111| []| [11]|
|1112| [123]|[321]|
+----+--------+-----+
The result is as follows:
+----+--------+-----+
| id| col1| col2|
+----+--------+-----+
|1110|[12, 11]| null|
|1111| null| [11]|
|1112| [123]|[321]|
+----+--------+-----+
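To make the fold concrete without Spark, here is a plain-Python model of the same idea. It is only a sketch: rows are represented as dicts, functools.reduce stands in for foldLeft, and the helper name null_empty is made up for this illustration. The data is the example dataframe shown above.

```python
from functools import reduce

# Rows of the example dataframe, modelled as plain dicts (no Spark needed)
rows = [
    {"id": 1110, "col1": [12, 11], "col2": []},
    {"id": 1111, "col1": [], "col2": [11]},
    {"id": 1112, "col1": [123], "col2": [321]},
]
array_columns = ["col1", "col2"]


def null_empty(acc, colname):
    """Return new rows in which an empty list in `colname` becomes None."""
    return [
        {**row, colname: (row[colname] if row[colname] else None)}
        for row in acc
    ]


# reduce plays the role of foldLeft: start from `rows` and apply the
# replacement for one column at a time, threading the result through
result = reduce(null_empty, array_columns, rows)

for row in result:
    print(row)
```

Each step returns a new list of rows rather than modifying the input, which mirrors how withColumn returns a new dataframe on every iteration of the fold.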
Taking Ramesh Maharajan's solution above as a reference, I found another way to solve this using UDFs. I hope this helps if you need to apply multiple rules to your dataframe.
df
+-----+----+----+----+
|store| 1| 2| 3|
+-----+----+----+----+
| 103|[90]| []| []|
| 104| []|[67]|[90]|
| 101|[34]| []| []|
| 102|[35]| []| []|
+-----+----+----+----+
Use the code below, with import pyspark.sql.functions as psf. This code works in PySpark:
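import pyspark.sql.functions as psf
from pyspark.sql.types import ArrayType, IntegerType

def udf1(x: list):
    # Return None (a real null) for an empty array, not the string "null"
    if x == []:
        return None
    return x

udf2 = psf.udf(udf1, ArrayType(IntegerType()))
for c in df.dtypes:
    if "array" in c[1]:
        df = df.withColumn(c[0], udf2(psf.col(c[0])))
df.show()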
Output:
+-----+----+----+----+
|store| 1| 2| 3|
+-----+----+----+----+
| 103|[90]|null|null|
| 104|null|[67]|[90]|
| 101|[34]|null|null|
| 102|[35]|null|null|
+-----+----+----+----+
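Because a UDF wraps an ordinary Python function, its logic can be checked on its own before it is registered with Spark. The sketch below uses a standalone function, here called empty_to_none (a hypothetical name), that returns None for an empty list so that the column holds a real null instead of a string. It also shows that a None input passes through unchanged, which is what the function would receive for a cell that is already null.

```python
def empty_to_none(x):
    """Return None for an empty list, otherwise return the value unchanged."""
    if x == []:
        return None
    return x


print(empty_to_none([]))    # None
print(empty_to_none([90]))  # [90]
print(empty_to_none(None))  # None
```

Note that a Python UDF is generally slower than the built-in when and size functions, since each value has to be passed to the Python worker, so this approach is mainly useful when the rule cannot be expressed with built-in functions.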
You need to check the size of the array-type column, like this:
df.show()
+----+---+
| id|arr|
+----+---+
|1110| []|
+----+---+
df.withColumn("arr", when(size(col("arr")) == 0 , lit(None)).otherwise(col("arr") ) ).show()
+----+----+
| id| arr|
+----+----+
|1110|null|
+----+----+
df.withColumn("arr", when(size(col("arr")) == 0, lit(None)).otherwise(col("arr") ) ).show()
Please keep in mind that this also does not work in PySpark.