How to create a DataFrame out of rows while retaining existing schema?

Question

If I call map or mapPartitions and my function receives rows from PySpark, what is the natural way to create either a local PySpark or Pandas DataFrame, something that combines the rows and retains the schema?

Currently I do something like:

def combine(partition):
    rows = [x for x in partition]
    dfpart = pd.DataFrame(rows,columns=rows[0].keys())
    pandafunc(dfpart)

mydf.rdd.mapPartitions(combine)

Answer 1:

Spark >= 2.3.0

Since Spark 2.3.0 it is possible to use Pandas Series or DataFrame by partition or group. See for example:

Applying UDFs on GroupedData in PySpark (with functioning python example)
Efficient string suffix detection
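
As a minimal sketch of the per-group approach, the function below takes one whole group as a Pandas DataFrame and returns a Pandas DataFrame. The names center_y, g and y are made up for illustration, and the Spark calls are shown only as comments, assuming a Spark DataFrame df with those two columns.

```python
import pandas as pd

def center_y(pdf: pd.DataFrame) -> pd.DataFrame:
    """Receive one whole group as a Pandas DataFrame and subtract the group mean of y."""
    return pdf.assign(y=pdf["y"] - pdf["y"].mean())

# Local check with plain Pandas, mimicking what Spark does per group
sample = pd.DataFrame({"g": ["a", "a", "b"], "y": [1.0, 3.0, 5.0]})
local = pd.concat([center_y(group) for _, group in sample.groupby("g")])

# With Spark (not executed here; assumes a DataFrame `df` with columns g and y):
#   Spark 2.3+:
#     from pyspark.sql.functions import pandas_udf, PandasUDFType
#     grouped = pandas_udf(center_y, "g string, y double", PandasUDFType.GROUPED_MAP)
#     df.groupby("g").apply(grouped)
#   Spark 3.0+:
#     df.groupby("g").applyInPandas(center_y, schema="g string, y double")
```

The output schema has to be declared up front, so the function should return the same columns and types for every group.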

Spark < 2.3.0

"what is the natural way to create either a local PySpark"

There is no such thing. Spark distributed data structures cannot be nested or, if you prefer another perspective, you cannot nest actions or transformations.

"or Pandas DataFrame"

It is relatively easy, but you have to remember at least a few things:

Pandas and Spark DataFrames are not even remotely equivalent. These are different structures with different properties, and in general you cannot replace one with the other.
Partitions can be empty.
It looks like you're passing dictionaries. Remember that a plain Python dictionary is unordered before Python 3.7 (unlike collections.OrderedDict, for example), so the column order you pass may not match what you expect.

import pandas as pd

rdd = sc.parallelize([
    {"x": 1, "y": -1}, 
    {"x": -3, "y": 0},
    {"x": -0, "y": 4}
])

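def combine(partition):
    # Materialize the partition; it may be empty
    rows = list(partition)
    # Emit one Pandas DataFrame per non-empty partition, nothing otherwise
    return [pd.DataFrame(rows)] if rows else []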

rdd.mapPartitions(combine).first()
##    x  y
## 0  1 -1
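
The two caveats about empty partitions and column order can be handled together. The following is only a sketch: combine_ordered and the columns list are hypothetical names, and the partitions are simulated with plain Python iterators rather than a real RDD.

```python
import pandas as pd

def combine_ordered(partition, columns):
    """Build at most one Pandas DataFrame per partition, with a fixed column order."""
    rows = list(partition)
    if not rows:
        # Empty partition: emit nothing instead of an empty frame without columns
        return []
    # Passing columns explicitly pins the order regardless of dict ordering
    return [pd.DataFrame(rows, columns=columns)]

columns = ["y", "x"]  # hypothetical target order
full = combine_ordered(iter([{"x": 1, "y": -1}, {"x": -3, "y": 0}]), columns)
empty = combine_ordered(iter([]), columns)
```

On an RDD of dictionaries this would be used as rdd.mapPartitions(lambda part: combine_ordered(part, columns)).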

Answer 2:

You could use toPandas(), which collects the whole Spark DataFrame to the driver as a single Pandas DataFrame:

pandasdf = mydf.toPandas()

Answer 3:

To create a Spark SQL DataFrame you need a Hive context:

from pyspark.sql import HiveContext

hc = HiveContext(sparkContext)

With the HiveContext you can create a SQL DataFrame via the inferSchema function (deprecated since Spark 1.3 in favor of createDataFrame):

sparkSQLdataframe = hc.inferSchema(rows)

Answer 4:

It's actually possible to convert Spark rows to Pandas inside the executors and then create a Spark DataFrame from the output using mapPartitions. See my gist on GitHub.

import pandas as pd

# Convert function to use in mapPartitions
def rdd_to_pandas(rdd_):
    # convert rows to dict
    rows = (row_.asDict() for row_ in rdd_)
    # create pandas dataframe
    pdf = pd.DataFrame(rows)

    # Rows/Pandas DF can be empty depending on the partitioning logic.
    # Make sure to check it here, otherwise it will throw a hard-to-trace error
    if len(pdf) > 0:
        #
        # Do something with pandas DataFrame 
        #
        pass

    return pdf.to_dict(orient='records')

# Create Spark DataFrame from resulting RDD
rdf = spark.createDataFrame(df.rdd.mapPartitions(rdd_to_pandas))
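
To retain the original schema instead of having Spark infer it again from dictionaries, one option is to emit tuples in the schema's field order and pass df.schema back to createDataFrame. This is a sketch under assumptions: make_partition_func and pandas_func are hypothetical names, and pandas_func is assumed to keep the same columns and compatible types.

```python
import pandas as pd

def make_partition_func(columns, pandas_func):
    """Return a function for mapPartitions that yields tuples in `columns` order."""
    def process(partition):
        # Row.asDict() turns each Spark Row into a plain dict
        records = [row.asDict() for row in partition]
        if not records:
            return iter([])  # empty partition: nothing to emit
        pdf = pandas_func(pd.DataFrame(records, columns=columns))
        # Re-select the columns so the output lines up with the schema's field order
        return pdf[columns].itertuples(index=False, name=None)
    return process

# Intended use (not executed here; assumes a SparkSession `spark` and a DataFrame `df`):
#   func = make_partition_func(df.columns, lambda pdf: pdf)
#   rdf = spark.createDataFrame(df.rdd.mapPartitions(func), schema=df.schema)
```

If pandas_func changes a column's type (for example integers turning into floats once missing values appear), the reused schema no longer matches and Spark's schema verification is likely to reject the rows.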

Source: https://stackoverflow.com/questions/34438829/how-to-create-a-dataframe-out-of-rows-while-retaining-existing-schema

Tags

python

pandas

apache-spark

pyspark

pyspark-sql