Spark 2.0.x: dump a CSV file from a DataFrame containing one array-of-string column

难免孤独 2020-11-29 07:07

I have a DataFrame df that contains one column of type array.

df.show() looks like this:

|ID|ArrayOfString|Age|Gender|
+--+-------------+---+------+

When I try to save df as a CSV file with df.write.csv(...), the write fails with an error about the array column.
6 Answers
  • 2020-11-29 07:41

    No need for a UDF if you already know which fields contain arrays. You can simply use Spark's cast function:

    import org.apache.spark.sql.functions._

    // Cast the array column to its string representation before writing
    df.withColumn("ArrayOfString", col("ArrayOfString").cast("string"))
      .write
      .csv(path = "/home/me/saveDF")
    

    Hope that helps.
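    To verify the dump, one can read the files back as plain CSV. A minimal sketch, assuming spark is the active SparkSession and the path above was used:

    // Sanity check: reload the written CSV and inspect a few rows
    val reloaded = spark.read.csv("/home/me/saveDF")
    reloaded.show(5, truncate = false)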

  • 2020-11-29 07:44

    CSV is not an ideal export format, but if you just want to eyeball your data, this quick-and-dirty Scala solution works.

    import spark.implicits._  // needed for .toDF; `spark` is the SparkSession

    case class Example(id: String, arrayOfString: String, age: String, gender: String)

    // Note: the array column is rendered via toString, e.g. "WrappedArray(A, B)"
    df.rdd.map(line => Example(line(0).toString, line(1).toString, line(2).toString, line(3).toString))
      .toDF.write.csv("/tmp/example.csv")
    
  • 2020-11-29 07:54

    PySpark implementation.

    In this example we convert the array column column_as_array to a string column column_as_str before saving.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def array_to_string(my_list):
        # Guard against null arrays so the UDF doesn't raise a TypeError
        if my_list is None:
            return None
        return '[' + ','.join([str(elem) for elem in my_list]) + ']'

    array_to_string_udf = udf(array_to_string, StringType())

    df = df.withColumn('column_as_str', array_to_string_udf(df["column_as_array"]))
    

    Then you can drop the old column (array type) before saving.

    df.drop("column_as_array").write.csv(...)
    
  • 2020-11-29 07:55

    Here is a method for converting all ArrayType (of any underlying type) columns of a DataFrame to StringType columns:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

    def stringifyArrays(dataFrame: DataFrame): DataFrame = {
      // Find every column whose data type is an array, regardless of element type
      val colsToStringify = dataFrame.schema.filter(p => p.dataType.typeName == "array").map(p => p.name)

      // Replace each array column with its "[elem1, elem2, ...]" string rendering
      colsToStringify.foldLeft(dataFrame)((df, c) => {
        df.withColumn(c, concat(lit("["), concat_ws(", ", col(c).cast("array<string>")), lit("]")))
      })
    }
    

    Also, it doesn't use a UDF.
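    For example, applied just before writing (a sketch; the output path is an assumption):

    // Stringify every array column, then dump the result as CSV
    stringifyArrays(df).write.csv("/tmp/stringified")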

  • 2020-11-29 07:59

    To answer DreamerP's question (from one of the comments):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def array_to_string(my_list):
        if my_list is None:  # guard against null arrays
            return None
        return '[' + ','.join([str(elem) for elem in my_list]) + ']'

    array_to_string_udf = udf(array_to_string, StringType())

    df = df.withColumn('Antecedent_as_str', array_to_string_udf(df["Antecedent"]))
    df = df.withColumn('Consequent_as_str', array_to_string_udf(df["Consequent"]))
    df = df.drop("Consequent")
    df = df.drop("Antecedent")
    df.write.csv("foldername")
    
  • 2020-11-29 08:01

    The reason you are getting this error is that the CSV file format doesn't support array types; you'll need to express the column as a string to be able to save it.
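    For illustration, a minimal sketch that reproduces the problem (the sample data and path are assumptions):

    import spark.implicits._

    // A tiny DataFrame with an array<string> column, mirroring the question
    val demo = Seq((1, Seq("A", "B"), 22, "F")).toDF("ID", "ArrayOfString", "Age", "Gender")

    // Fails with something like: "CSV data source does not support array<string> data type."
    demo.write.csv("/tmp/will_fail")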

    Try the following:

    import org.apache.spark.sql.functions._
    import spark.implicits._  // for the $"colName" syntax

    // UDF that renders an array column as "[a,b,c]", passing nulls through
    val stringify = udf((vs: Seq[String]) => vs match {
      case null => null
      case _    => s"""[${vs.mkString(",")}]"""
    })

    df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
    

    or

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{concat, concat_ws, lit}
    import spark.implicits._  // for the $"colName" syntax

    // Same rendering built from Spark SQL functions, with no UDF involved
    def stringify(c: Column): Column = concat(lit("["), concat_ws(",", c), lit("]"))

    df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
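    A hedged aside on the difference between the two: the UDF maps a null array to null, whereas the concat_ws variant should yield "[]", because concat_ws skips null inputs. A quick way to check (the sample data is an assumption):

    // One nullable array column: a populated row and a null row
    val sample = Seq(Some(Seq("A", "B")), None).toDF("ArrayOfString")
    sample.withColumn("asString", stringify($"ArrayOfString")).show()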
    