I have an RDD that contains the following: [('column 1', value), ('column 2', value), ('column 3', value), ..., ('column 100', value)]. I want to create a dataframe that
struct is the correct way to represent product types, like tuple, in Spark SQL, and this is exactly what you get with your code:
df = (sc.parallelize([("a", 1)]).toDF(["char", "int"])
.select(my_udf("char", "int").alias("pair")))
df.printSchema()
## root
## |-- pair: struct (nullable = true)
## | |-- char: string (nullable = false)
## | |-- count: integer (nullable = false)
There is no other way to represent a tuple unless you want to create a UDT (no longer supported in 2.0.0) or store pickled objects as BinaryType.
Moreover, struct fields are locally represented as tuple:
isinstance(df.first().pair, tuple)
## True
I guess you may be confused by the square brackets you see when you call show:
df.show()
## +-----+
## | pair|
## +-----+
## |[a,1]|
## +-----+
which are simply a rendering choice of the JVM counterpart and don't indicate Python types.