Create DataFrame from list of tuples using pyspark

抹茶落季 2020-12-29 09:30

I am working with data extracted from SFDC using the simple-salesforce package. I am using Python 3 for scripting and Spark 1.5.2.

I created an rdd containing the followi

1 Answer
  • 2020-12-29 09:50

    Hey, next time could you provide a working example? That would make this easier to answer.

    The way your RDD is structured makes it awkward to create a DataFrame from it directly. This is how you create a DataFrame according to the Spark documentation:

    >>> l = [('Alice', 1)]
    >>> sqlContext.createDataFrame(l).collect()
    [Row(_1=u'Alice', _2=1)]
    >>> sqlContext.createDataFrame(l, ['name', 'age']).collect()
    [Row(name=u'Alice', age=1)]
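
    As a side note to the documentation example above: if you would rather not pass a separate list of column names, a namedtuple can carry them, since schema inference picks up its field names. This is only a minimal sketch. `Person` and `to_dataframe` are made-up names for illustration, and `sql_context` is assumed to be an existing SQLContext like the `sqlContext` above.

```python
from collections import namedtuple

# Hypothetical record type; its field names are meant to become the column names.
Person = namedtuple("Person", ["name", "age"])

# Same data as the documentation example, but each tuple knows its field names.
people = [Person("Alice", 1)]


def to_dataframe(sql_context, records):
    """Build a DataFrame from a list of namedtuples.

    `sql_context` is assumed to be an existing SQLContext; no column
    names are passed because they are read from the namedtuple fields.
    """
    return sql_context.createDataFrame(records)
```

    With a live context you would call `to_dataframe(sqlContext, people)`, which should give the same columns as passing `['name', 'age']` explicitly.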
    

    For your example, you can produce the desired output like this:

    # Your data at the moment
    data = sc.parallelize([ 
    [('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')],
    [('Id', 'a0w1a0000003xAAI'), ('PackSize', 1.0), ('Name', 'B')],
    [('Id', 'a0w1a00000xB3AAI'), ('PackSize', 30.0), ('Name', 'C')]
        ])
    # Convert to tuple
    data_converted = data.map(lambda x: (x[0][1], x[1][1], x[2][1]))
    
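    # Define schema (PackSize holds floats, so use DoubleType rather than StringType)
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    schema = StructType([
        StructField("Id", StringType(), True),
        StructField("Packsize", DoubleType(), True),
        StructField("Name", StringType(), True)
    ])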
    
    # Create dataframe
    DF = sqlContext.createDataFrame(data_converted, schema)
    
    # Output
    DF.show()
    +----------------+--------+----+
    |              Id|Packsize|Name|
    +----------------+--------+----+
    |a0w1a0000003xB1A|     1.0|   A|
    |a0w1a0000003xAAI|     1.0|   B|
    |a0w1a00000xB3AAI|    30.0|   C|
    +----------------+--------+----+
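
    One caveat about the conversion step above: `x[0][1], x[1][1], x[2][1]` relies on every record listing its fields in the same order. If that is not guaranteed for your data, a lookup by key might be safer. Here is a rough sketch in plain Python. `FIELDS` and `record_to_tuple` are names I made up, and the field list is just taken from the sample records.

```python
# Output order for each tuple; assumed from the sample records in this answer.
FIELDS = ["Id", "PackSize", "Name"]


def record_to_tuple(record, fields=FIELDS):
    """Turn [('Id', ...), ('PackSize', ...), ...] into a tuple ordered by `fields`."""
    lookup = dict(record)
    # A key that is absent from the record yields None for that position.
    return tuple(lookup.get(name) for name in fields)


# The pairs are deliberately out of order here.
sample = [("PackSize", 1.0), ("Id", "a0w1a0000003xB1A"), ("Name", "A")]
print(record_to_tuple(sample))
```

    You would then use `data.map(record_to_tuple)` in place of the positional lambda. The `None` for a missing key only works because the schema fields are declared nullable.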
    

    Hope this helps
