Create a dataframe in pyspark that contains a single column of tuples

小鲜肉 2021-01-17 02:15

I have an RDD that contains the following [('column 1',value), ('column 2',value), ('column 3',value), ... , ('column 100',value)]. I want to create a dataframe that

1 Answer
  • 2021-01-17 02:55

    struct is the correct way to represent product types, such as tuples, in Spark SQL, and this is exactly what you get using your code:

    df = (sc.parallelize([("a", 1)]).toDF(["char", "int"])
        .select(my_udf("char", "int").alias("pair")))
    df.printSchema()
    
    ## root
    ##  |-- pair: struct (nullable = true)
    ##  |    |-- char: string (nullable = false)
    ##  |    |-- count: integer (nullable = false)
    

    There is no other way to represent a tuple unless you want to create a user-defined type (UDT, no longer supported in 2.0.0) or store pickled objects as BinaryType.

    Moreover, struct values are represented locally as tuples:

    isinstance(df.first().pair, tuple)
    ## True
    

    I guess you may be confused by the square brackets you see when you call show:

    df.show()
    
    ## +-----+
    ## | pair|
    ## +-----+
    ## |[a,1]|
    ## +-----+
    

    which are simply the string representation chosen by the JVM side for rendering and do not indicate Python types.
