I have a dataframe df that contains one column of type array. df.show() looks like:
|ID|ArrayOfString|Age|Gender|
+--+-------
No need for a UDF if you already know which fields contain arrays. You can simply use Spark's cast function:
import org.apache.spark.sql.functions._
val dumpCSV = df.withColumn("ArrayOfString", col("ArrayOfString").cast("string"))
  .write
  .csv(path="/home/me/saveDF")
Hope that helps.
CSV is not the ideal export format, but if you just want to visually inspect your data, this will work [Scala]. Quick and dirty solution.
// toDF needs import spark.implicits._ in scope (spark being the SparkSession)
case class Example(id: String, ArrayOfString: String, Age: String, Gender: String)
df.rdd.map { line => Example(line(0).toString, line(1).toString, line(2).toString, line(3).toString) }
  .toDF.write.csv("/tmp/example.csv")
PySpark implementation.
In this example, the array field column_as_array is converted to a string field column_as_str before saving.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def array_to_string(my_list):
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'
array_to_string_udf = udf(array_to_string, StringType())
df = df.withColumn('column_as_str', array_to_string_udf(df["column_as_array"]))
Then you can drop the old column (array type) before saving.
df.drop("column_as_array").write.csv(...)
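To see what the converted value looks like once it lands in a CSV row, here is a minimal sketch in plain Python that runs without Spark. It assumes a None-safe variant of array_to_string (the None check is an addition, not part of the answer above) and uses Python's csv module only as a stand-in for Spark's writer, which has its own quoting options.

```python
import csv
import io

def array_to_string(my_list):
    # Assumption: null array cells reach a Python UDF as None, so pass them through.
    if my_list is None:
        return None
    return '[' + ','.join(str(elem) for elem in my_list) + ']'

# Write one row resembling the question's layout (ID, ArrayOfString, Age, Gender).
# The values are made up for illustration.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["1", array_to_string(["a", "b", "c"]), "30", "M"])
line = buffer.getvalue()

print(line, end="")  # prints: 1,"[a,b,c]",30,M
```

The stringified array contains the field delimiter, so the writer quotes it; the cell is then read back as one value rather than three.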
Here is a method for converting all ArrayType columns (of any underlying type) of a DataFrame to StringType columns:
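import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

def stringifyArrays(dataFrame: DataFrame): DataFrame = {
  // Names of all columns whose type is ArrayType, whatever the element type
  val colsToStringify = dataFrame.schema.filter(p => p.dataType.typeName == "array").map(p => p.name)
  colsToStringify.foldLeft(dataFrame)((df, c) => {
    df.withColumn(c, concat(lit("["), concat_ws(", ", col(c).cast("array<string>")), lit("]")))
  })
}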
Also, it doesn't use a UDF.
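For PySpark users, the same no-UDF idea might be sketched roughly as below. The function name stringify_arrays and the sep parameter are hypothetical; the Spark functions it calls (concat, concat_ws, lit, col) are the same ones used in the Scala version. Treat it as a sketch rather than a tested drop-in.

```python
def stringify_arrays(data_frame, sep=", "):
    """Return a copy of data_frame with every ArrayType column rendered as '[a, b, c]'."""
    # Imported here so the sketch can be loaded without a Spark installation.
    from pyspark.sql.functions import col, concat, concat_ws, lit
    from pyspark.sql.types import ArrayType

    # Collect the names of all array-typed columns, whatever the element type.
    array_cols = [
        field.name
        for field in data_frame.schema.fields
        if isinstance(field.dataType, ArrayType)
    ]
    for name in array_cols:
        # Cast the elements to strings, join them, and wrap the result in brackets.
        data_frame = data_frame.withColumn(
            name,
            concat(lit("["), concat_ws(sep, col(name).cast("array<string>")), lit("]")),
        )
    return data_frame
```

Usage would then look like stringify_arrays(df).write.csv(...), with the output path left to you.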
To answer DreamerP's question (from one of the comments):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def array_to_string(my_list):
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'
array_to_string_udf = udf(array_to_string, StringType())
df = df.withColumn('Antecedent_as_str', array_to_string_udf(df["Antecedent"]))
df = df.withColumn('Consequent_as_str', array_to_string_udf(df["Consequent"]))
df = df.drop("Consequent")
df = df.drop("Antecedent")
df.write.csv("foldername")
You are getting this error because the CSV file format doesn't support array types; you need to express the array as a string to be able to save it.
Try the following:
import org.apache.spark.sql.functions._
// $"..." requires import spark.implicits._ (spark being the SparkSession)
val stringify = udf((vs: Seq[String]) => vs match {
  case null => null
  case _    => s"""[${vs.mkString(",")}]"""
})
df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
or
import org.apache.spark.sql.Column
def stringify(c: Column) = concat(lit("["), concat_ws(",", c), lit("]"))
df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
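One caveat with joining elements on a comma: if an element itself contains a comma, the original array cannot be recovered from the string. If you need to read the array back later, encoding it as JSON text is one option. The sketch below is plain Python, and the helper name array_to_json_string is hypothetical; it could be wrapped in a UDF the same way as array_to_string in the PySpark answers.

```python
import json

def array_to_json_string(values):
    # Hypothetical helper: encode a list as JSON text so it can be parsed back later.
    if values is None:
        return None
    return json.dumps(list(values))

# An element containing a comma survives the round trip.
cell = array_to_json_string(["a,b", "c"])
restored = json.loads(cell)

print(cell)      # prints: ["a,b", "c"]
print(restored)  # prints: ['a,b', 'c']
```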