Spark 2.0.x: dump a CSV file from a DataFrame containing one array-of-string column

难免孤独 2020-11-29 07:07

I have a DataFrame df that contains one column of type array.

df.show() looks like this:

|ID|ArrayOfString|Age|Gender|
+--+-------------+---+------+

When I try to save df as a CSV file with df.write.csv(...), the write fails with an error about the array column.
6 Answers
  • 2020-11-29 07:41

    No need for a UDF if you already know which fields contain arrays. You can simply use Spark's cast function:

    import org.apache.spark.sql.functions._

    // Cast the array column to its string representation before writing
    df.withColumn("ArrayOfString", col("ArrayOfString").cast("string"))
      .write
      .csv(path = "/home/me/saveDF")
    

    Hope that helps.
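    To verify the dump, one can read the files back as plain CSV. A minimal sketch, assuming spark is the active SparkSession and the path above was used:

    // Sanity check: reload the written CSV and inspect a few rows
    val reloaded = spark.read.csv("/home/me/saveDF")
    reloaded.show(5, truncate = false)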

  • 2020-11-29 07:44

    CSV is not an ideal export format, but if you just want to eyeball your data, this quick-and-dirty Scala solution works.

    import spark.implicits._  // needed for .toDF; `spark` is the SparkSession

    case class Example(id: String, arrayOfString: String, age: String, gender: String)

    // Note: the array column is rendered via toString, e.g. "WrappedArray(A, B)"
    df.rdd.map(line => Example(line(0).toString, line(1).toString, line(2).toString, line(3).toString))
      .toDF.write.csv("/tmp/example.csv")
    
  • 2020-11-29 07:54

    PySpark implementation.

    In this example we convert the array column column_as_array to a string column column_as_str before saving.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def array_to_string(my_list):
        # Guard against null arrays so the UDF doesn't raise a TypeError
        if my_list is None:
            return None
        return '[' + ','.join([str(elem) for elem in my_list]) + ']'

    array_to_string_udf = udf(array_to_string, StringType())

    df = df.withColumn('column_as_str', array_to_string_udf(df["column_as_array"]))
    

    Then you can drop the old column (array type) before saving.

    df.drop("column_as_array").write.csv(...)
    
  • 2020-11-29 07:55

    Here is a method for converting all ArrayType (of any underlying type) columns of a DataFrame to StringType columns:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

    def stringifyArrays(dataFrame: DataFrame): DataFrame = {
      // Find every column whose data type is an array, regardless of element type
      val colsToStringify = dataFrame.schema.filter(p => p.dataType.typeName == "array").map(p => p.name)

      // Replace each array column with its "[elem1, elem2, ...]" string rendering
      colsToStringify.foldLeft(dataFrame)((df, c) => {
        df.withColumn(c, concat(lit("["), concat_ws(", ", col(c).cast("array<string>")), lit("]")))
      })
    }
    

    Also, it doesn't use a UDF.
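    For example, applied just before writing (a sketch; the output path is an assumption):

    // Stringify every array column, then dump the result as CSV
    stringifyArrays(df).write.csv("/tmp/stringified")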

  • 2020-11-29 07:59

    To answer DreamerP's question (from one of the comments):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def array_to_string(my_list):
        if my_list is None:  # guard against null arrays
            return None
        return '[' + ','.join([str(elem) for elem in my_list]) + ']'

    array_to_string_udf = udf(array_to_string, StringType())

    df = df.withColumn('Antecedent_as_str', array_to_string_udf(df["Antecedent"]))
    df = df.withColumn('Consequent_as_str', array_to_string_udf(df["Consequent"]))
    df = df.drop("Consequent")
    df = df.drop("Antecedent")
    df.write.csv("foldername")
    
  • 2020-11-29 08:01

    The reason you are getting this error is that the CSV file format doesn't support array types; you'll need to express the column as a string to be able to save it.
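    For illustration, a minimal sketch that reproduces the problem (the sample data and path are assumptions):

    import spark.implicits._

    // A tiny DataFrame with an array<string> column, mirroring the question
    val demo = Seq((1, Seq("A", "B"), 22, "F")).toDF("ID", "ArrayOfString", "Age", "Gender")

    // Fails with something like: "CSV data source does not support array<string> data type."
    demo.write.csv("/tmp/will_fail")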

    Try the following:

    import org.apache.spark.sql.functions._
    import spark.implicits._  // for the $"colName" syntax

    // UDF that renders an array column as "[a,b,c]", passing nulls through
    val stringify = udf((vs: Seq[String]) => vs match {
      case null => null
      case _    => s"""[${vs.mkString(",")}]"""
    })

    df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
    

    or

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{concat, concat_ws, lit}
    import spark.implicits._  // for the $"colName" syntax

    // Same rendering built from Spark SQL functions, with no UDF involved
    def stringify(c: Column): Column = concat(lit("["), concat_ws(",", c), lit("]"))

    df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
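    A hedged aside on the difference between the two: the UDF maps a null array to null, whereas the concat_ws variant should yield "[]", because concat_ws skips null inputs. A quick way to check (the sample data is an assumption):

    // One nullable array column: a populated row and a null row
    val sample = Seq(Some(Seq("A", "B")), None).toDF("ArrayOfString")
    sample.withColumn("asString", stringify($"ArrayOfString")).show()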
    