Spark: Convert column of string to an array

南笙 2020-12-24 04:46

How can I convert a column that has been read as a string into a column of arrays? That is, convert from the schema below:

scala> test.printSchema
root
 |-- a: long (         


        
3 Answers
  • 2020-12-24 04:53

    Using a UDF would give you exactly the required schema, like this:

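    import org.apache.spark.sql.functions.{col, udf}

    // Split the comma-separated string and parse each piece as a Long
    val toArray = udf((b: String) => b.split(",").map(_.toLong))

    val test1 = test.withColumn("b", toArray(col("b")))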
    

    It would give you the following schema:

    scala> test1.printSchema
    root
     |-- a: long (nullable = true)
     |-- b: array (nullable = true)
     |    |-- element: long (containsNull = true)
    
    +---+-----+
    |  a|  b  |
    +---+-----+
    |  1|[2,3]|
    +---+-----+
    |  2|[4,5]|
    +---+-----+
    

    As for applying this schema while reading the file itself, I think that is a tough task. So, for now, you can apply the transformation after the file has been read into the DataFrame test.
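
    A sketch of that post-read transformation, with one caveat handled: b.split throws a NullPointerException when a row's b is null, and _.toLong fails on stray whitespace such as "2, 3". The example assumes a local SparkSession and a small in-memory DataFrame standing in for test; the names NullSafeToArray and parseLongs are hypothetical, not part of the question.

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}

    object NullSafeToArray {
      // Hypothetical helper: "2, 3" becomes Seq(2L, 3L); a null input becomes None.
      // Empty tokens are dropped; a non-numeric token still throws NumberFormatException.
      def parseLongs(b: String): Option[Seq[Long]] =
        Option(b).map(_.split(",").map(_.trim).filter(_.nonEmpty).map(_.toLong).toSeq)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]")
          .appName("null-safe-to-array")
          .getOrCreate()
        import spark.implicits._

        // Assumed stand-in for the DataFrame that was read with b as a string.
        val test = Seq((1L, "2,3"), (2L, "4, 5"), (3L, null: String)).toDF("a", "b")

        // orNull turns None back into a SQL null in the result column.
        val toArray = udf((b: String) => parseLongs(b).orNull)

        val test1 = test.withColumn("b", toArray(col("b")))
        test1.printSchema()
        test1.show()

        spark.stop()
      }
    }
    ```

    Keeping the parsing in a plain function means its edge cases can be checked without starting Spark.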

    I hope this helps!

  • 2020-12-24 04:55

    There are various methods.

    The best way is to use the split function and cast the result to array<long>:

    import org.apache.spark.sql.functions.{col, split}
    data.withColumn("b", split(col("b"), ",").cast("array<long>"))
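
    For a fuller picture of the split-and-cast approach, here is a minimal end-to-end sketch. It assumes a local SparkSession, hypothetical sample rows, and ANSI mode switched off (so a failed cast yields null rather than an error); the object name SplitAndCast and the method toLongArray are made up for illustration.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.{col, split}

    object SplitAndCast {
      // split takes a regular expression, so ",\\s*" also consumes spaces after each comma.
      def toLongArray(df: DataFrame): DataFrame =
        df.withColumn("b", split(col("b"), ",\\s*").cast("array<long>"))

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]")
          .appName("split-and-cast")
          // Assumption: with ANSI mode off, an element that cannot be cast becomes null.
          .config("spark.sql.ansi.enabled", "false")
          .getOrCreate()
        import spark.implicits._

        // Hypothetical sample data; the third row contains a non-numeric token.
        val data = Seq((1L, "2,3"), (2L, "4, 5"), (3L, "6,x")).toDF("a", "b")

        val converted = toLongArray(data)
        converted.printSchema()
        converted.show()

        spark.stop()
      }
    }
    ```

    Unlike a UDF that calls _.toLong, which throws on a token such as "x", the cast here leaves a null element in that position when ANSI mode is off. With ANSI mode on, the cast should raise an error instead.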
    

    You can also create a simple UDF to convert the values:

    val tolong = udf((value : String) => value.split(",").map(_.toLong))
    
    data.withColumn("newB", tolong(data("b"))).show
    

    Hope this helps!

  • 2020-12-24 04:57

    In Python (PySpark) it would be:

    from pyspark.sql.functions import col, split

    # Raw string so that \s reaches the regex engine as written
    test = test.withColumn(
        "b",
        split(col("b"), r",\s*").cast("array<int>")
    )
    