How to yield pandas dataframe rows to spark dataframe

Posted by 泄露秘密 on 2021-01-01 08:10:36

Question


Hi, I'm working on a transformation. I created a generator some_function(iter) that yields Row(id=index, api=row['api'], A=row['A'], B=row['B']) in order to turn the rows of a pandas DataFrame into an RDD and then into a Spark DataFrame, but I'm getting errors. (I must use pandas to transform the data because there is a large amount of legacy code.)

Input Spark DataFrame

respond_sdf.show()
    +-------------------------------------------------------------------+
    |content                                                            |
    +-------------------------------------------------------------------+
    |{'api': ['api_1', 'api_1', 'api_1'],'A': [1,2,3], 'B': [4,5,6] }   |
    |{'api': ['api_2', 'api_2', 'api_2'],'A': [7,8,9], 'B': [10,11,12] }|
    +-------------------------------------------------------------------+

Expected Spark DataFrame after transformation

transform_df.show()
    +------+----+----+
    | api  |  A |  B |
    +------+----+----+
    | api_1|  1 |  4 |
    | api_1|  2 |  5 |
    | api_1|  3 |  6 |
    | api_2|  7 | 10 |
    | api_2|  8 | 11 |
    | api_2|  9 | 12 |
    +------+----+----+

Minimal example code

#### IMPORT PYSPARK ###

import pandas as pd
import pyspark
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType,StringType
spark = pyspark.sql.SparkSession.builder.appName("test") \
    .master('local[*]') \
    .getOrCreate()
sc = spark.sparkContext


####### INPUT DATAFRAME WITH LIST OF JSONS ########################

rdd_list = [["{'api': ['api_1', 'api_1', 'api_1'],'A': [1,2,3], 'B': [4,5,6] }"],
            ["{'api': ['api_2', 'api_2', 'api_2'],'A': [7,8,9], 'B': [10,11,12] }"]]

schema = StructType([StructField('content', StringType(), True)])

jsons = sc.parallelize(rdd_list)
respond_sdf = spark.createDataFrame(jsons, schema)
respond_sdf.show(truncate=False)


####### TRANSFORMATION DATAFRAME ########################

# Pandas transformation function returning pandas dataframe
def pandas_function(url_json):
    # Complex Pandas transformation
    url = url_json[0]
    json = url_json[1]
    df = pd.DataFrame(eval(json))
    return df

# Generator returning Rows from a pandas dataframe
def some_function(iter):
  # Pandas generator
  pandas_df = pandas_function(iter)
  for index, row in pandas_df.iterrows():
      ## ERROR COMES FROM THIS ROW
      yield Row(id=index, api=row['api'], A=row['A'], B=row['B'])

# Creating transformation spark dataframe
schema = StructType([
  StructField('API', StringType(), True),
  StructField('A', IntegerType(), True),
  StructField('B', IntegerType(), True)
  ])


rdd = respond_sdf.rdd.map(lambda x: some_function(x))
transform_df = spark.createDataFrame(rdd,schema)
transform_df.show()

I'm getting the error below:

raise TypeError(new_msg("StructType can not accept object %r in type %s"
TypeError: StructType can not accept object <generator object some_function at 0x7f69b43def90> in type <class 'generator'>

Full error:

Py4JJavaError: An error occurred while calling o462.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 37.0 failed 1 times, most recent failure: Lost task 2.0 in stage 37.0 (TID 97, dpc, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 271, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 612, in prepare
    verify_func(obj)
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 1408, in verify
    verify_value(obj)
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 1395, in verify_struct
    raise TypeError(new_msg("StructType can not accept object %r in type %s"
TypeError: StructType can not accept object <generator object some_function at 0x7f69b43def90> in type <class 'generator'>

I'm following the advice from this question: pySpark convert result of mapPartitions to spark DataFrame
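
For context, here is a minimal sketch of what a flatMap-based variant of the code above might look like. The traceback occurs because rdd.map() produces an RDD whose elements are generator objects, while flatMap() unpacks the yielded records instead. The helper name row_generator, the tuple-based rows, and the DDL schema string are illustrative, not taken from the linked question.

import pandas as pd

def row_generator(spark_row):
    # parse the dict-like string in the 'content' column into a pandas frame
    pandas_df = pd.DataFrame(eval(spark_row['content']))
    for _, r in pandas_df.iterrows():
        # yield plain tuples in the same order as the schema string below
        yield (r['api'], int(r['A']), int(r['B']))

# flatMap flattens each generator into individual records
flat_rdd = respond_sdf.rdd.flatMap(row_generator)
transform_df = spark.createDataFrame(flat_rdd, "api string, A int, B int")
transform_df.show()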


Answer 1:


EDIT: In Spark 3.0+ there is also a mapInPandas function, which should be more efficient because there is no need to group by.

import pandas as pd
import pyspark.sql.functions as F

def pandas_function(iterator):
    # mapInPandas passes an iterator of pandas DataFrames (one per Arrow batch)
    for df in iterator:
        # parse each dict-like string in 'content' and stack the resulting frames
        yield pd.concat(pd.DataFrame(x) for x in df['content'].map(eval))

transformed_df = respond_sdf.mapInPandas(pandas_function, "api string, A int, B int")
transformed_df.show()
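
(Note: mapInPandas streams the input batch by batch as pandas DataFrames, so no shuffle into groups is needed. Because eval is used to parse the 'content' strings, they are assumed to be trusted; ast.literal_eval is a safer alternative for untrusted input.)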

Another way: using pandas_udf and apply:

import pyspark.sql.functions as F

@F.pandas_udf("api string, A int, B int", F.PandasUDFType.GROUPED_MAP)
def pandas_function(url_json):
    # each group holds exactly one input row, so parse its single 'content' string
    df = pd.DataFrame(eval(url_json['content'][0]))
    return df

# grouping by a unique id means every input row is passed to pandas_function on its own
transformed_df = respond_sdf.groupBy(F.monotonically_increasing_id()).apply(pandas_function)
transformed_df.show()

+-----+---+---+
|  api|  A|  B|
+-----+---+---+
|api_2|  7| 10|
|api_2|  8| 11|
|api_2|  9| 12|
|api_1|  1|  4|
|api_1|  2|  5|
|api_1|  3|  6|
+-----+---+---+

Old answer (not very scalable, since everything is collected to the driver):

def pandas_function(url_json):
    df = pd.DataFrame(eval(url_json))
    return df

# collect all parsed pandas frames on the driver, then build a single Spark DataFrame
transformed_df = spark.createDataFrame(pd.concat(respond_sdf.rdd.map(lambda r: pandas_function(r[0])).collect()))
transformed_df.show()
+-----+---+---+
|  api|  A|  B|
+-----+---+---+
|api_1|  1|  4|
|api_1|  2|  5|
|api_1|  3|  6|
|api_2|  7| 10|
|api_2|  8| 11|
|api_2|  9| 12|
+-----+---+---+



Answer 2:


Thanks to @mck's examples, I found that from Spark 3.0 there is also an applyInPandas function, which returns a Spark DataFrame.

import pandas as pd
import pyspark.sql.functions as F

def pandas_function(url_json):
    # each group contains a single input row; parse its 'content' string into a pandas frame
    df = pd.DataFrame(eval(url_json['content'][0]))
    return df

respond_sdf.groupby(F.monotonically_increasing_id()).applyInPandas(pandas_function, schema="api string, A int, B int").show()

+-----+---+---+
|  api|  A|  B|
+-----+---+---+
|api_2|  7| 10|
|api_2|  8| 11|
|api_2|  9| 12|
|api_1|  1|  4|
|api_1|  2|  5|
|api_1|  3|  6|
+-----+---+---+
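
(As a design note: applyInPandas is the Spark 3.x replacement for the deprecated GROUPED_MAP pandas_udf used in the first answer, and grouping by monotonically_increasing_id() again means each call of pandas_function receives a one-row pandas DataFrame. The row order in the output differs from the input because grouping does not preserve order.)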


Source: https://stackoverflow.com/questions/65412532/how-to-yield-pandas-dataframe-rows-to-spark-dataframe
