Question
Hi, I'm writing a transformation. I have created a some_function(iter) generator that yields Row(id=index, api=row['api'], A=row['A'], B=row['B']) in order to turn the rows of a pandas DataFrame into an RDD and then into a Spark DataFrame, but I'm getting errors. (I must use pandas to transform the data, as there is a large amount of legacy code.)
Input Spark DataFrame
respond_sdf.show()
+-------------------------------------------------------------------+
|content |
+-------------------------------------------------------------------+
|{'api': ['api_1', 'api_1', 'api_1'],'A': [1,2,3], 'B': [4,5,6] } |
|{'api': ['api_2', 'api_2', 'api_2'],'A': [7,8,9], 'B': [10,11,12] }|
+-------------------------------------------------------------------+
Expected Spark DataFrame after transformation
transform_df.show()
+-------------------+
| api | A | B |
+-------------------+
| api_1 | 1 | 4 |
| api_1 | 2 | 5 |
| api_1 | 3 | 6 |
| api_2 | 7 | 10 |
| api_2 | 8 | 11 |
| api_2 | 9 | 12 |
+-------------------+
Minimum example code
#### IMPORT PYSPARK ####
import pandas as pd
import pyspark
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType,StringType
spark = pyspark.sql.SparkSession.builder.appName("test") \
    .master('local[*]') \
    .getOrCreate()
sc = spark.sparkContext
####### INPUT DATAFRAME WITH LIST OF JSONS ########################
rdd_list = [["{'api': ['api_1', 'api_1', 'api_1'],'A': [1,2,3], 'B': [4,5,6] }"],
            ["{'api': ['api_2', 'api_2', 'api_2'],'A': [7,8,9], 'B': [10,11,12] }"]]
schema = StructType([StructField('content', StringType(), True)])
jsons = sc.parallelize(rdd_list)
respond_sdf = spark.createDataFrame(jsons, schema)
respond_sdf.show(truncate=False)
####### TRANSFORMATION DATAFRAME ########################
# Pandas transformation function returning a pandas DataFrame
def pandas_function(url_json):
    # Complex pandas transformation
    url = url_json[0]
    json = url_json[1]
    df = pd.DataFrame(eval(json))
    return df
# Generator returning Rows from the pandas DataFrame
def some_function(iter):
    # Pandas generator
    pandas_df = pandas_function(iter)
    for index, row in pandas_df.iterrows():
        ## ERROR COMES FROM THIS ROW
        yield Row(id=index, api=row['api'], A=row['A'], B=row['B'])
# Creating the transformed Spark DataFrame
schema = StructType([
    StructField('API', StringType(), True),
    StructField('A', IntegerType(), True),
    StructField('B', IntegerType(), True)
])
rdd = respond_sdf.rdd.map(lambda x: some_function(x))
transform_df = spark.createDataFrame(rdd,schema)
transform_df.show()
I'm getting the error below:
raise TypeError(new_msg("StructType can not accept object %r in type %s"
TypeError: StructType can not accept object <generator object some_function at 0x7f69b43def90> in type <class 'generator'>
Full error:
Py4JJavaError: An error occurred while calling o462.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 37.0 failed 1 times, most recent failure: Lost task 2.0 in stage 37.0 (TID 97, dpc, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
process()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
serializer.dump_stream(out_iter, outfile)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 271, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
return f(*args, **kwargs)
File "/usr/lib/spark/python/pyspark/sql/session.py", line 612, in prepare
verify_func(obj)
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1408, in verify
verify_value(obj)
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1395, in verify_struct
raise TypeError(new_msg("StructType can not accept object %r in type %s"
TypeError: StructType can not accept object <generator object some_function at 0x7f69b43def90> in type <class 'generator'>
I'm following the advice from the link below: pySpark convert result of mapPartitions to spark DataFrame
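From the traceback, my understanding is that rdd.map(lambda x: some_function(x)) builds an RDD of generator objects (calling a generator function only creates the generator, it does not run it), so createDataFrame receives a generator instead of Rows. As a sketch of how I read that advice (my own untested adaptation of the code above), the generator output would need to be flattened, e.g. with mapPartitions:
def some_function_partition(rows):
    # mapPartitions passes an iterator over the Rows of each partition
    # and flattens everything the function yields into the result RDD
    for row in rows:
        pandas_df = pd.DataFrame(eval(row['content']))
        for _, r in pandas_df.iterrows():
            # cast numpy scalars to plain Python ints so the
            # IntegerType fields pass schema verification
            yield r['api'], int(r['A']), int(r['B'])
rdd = respond_sdf.rdd.mapPartitions(some_function_partition)
transform_df = spark.createDataFrame(rdd, schema)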
Answer 1:
EDIT: In Spark 3.0 there is also a mapInPandas function, which should be more efficient because there is no need to group by.
import pyspark.sql.functions as F
def pandas_function(iterator):
    for df in iterator:
        yield pd.concat(pd.DataFrame(x) for x in df['content'].map(eval))
transformed_df = respond_sdf.mapInPandas(pandas_function, "api string, A int, B int")
transformed_df.show()
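mapInPandas hands the function an iterator of pandas DataFrames (one per batch of input rows) and expects it to yield pandas DataFrames matching the schema string, so no grouping key is needed.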
Another way: using pandas_udf and apply:
import pyspark.sql.functions as F
@F.pandas_udf("api string, A int, B int", F.PandasUDFType.GROUPED_MAP)
def pandas_function(url_json):
    df = pd.DataFrame(eval(url_json['content'][0]))
    return df
transformed_df = respond_sdf.groupBy(F.monotonically_increasing_id()).apply(pandas_function)
transformed_df.show()
+-----+---+---+
| api| A| B|
+-----+---+---+
|api_2| 7| 10|
|api_2| 8| 11|
|api_2| 9| 12|
|api_1| 1| 4|
|api_1| 2| 5|
|api_1| 3| 6|
+-----+---+---+
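Note that groupBy(F.monotonically_increasing_id()) places each input row in its own group, so the grouped-map function runs once per row; group order is not guaranteed, which is why the api_2 rows come out before the api_1 rows here.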
Old answer (not very scalable...):
def pandas_function(url_json):
    df = pd.DataFrame(eval(url_json))
    return df
transformed_df = spark.createDataFrame(pd.concat(respond_sdf.rdd.map(lambda r: pandas_function(r[0])).collect()))
transformed_df.show()
+-----+---+---+
| api| A| B|
+-----+---+---+
|api_1| 1| 4|
|api_1| 2| 5|
|api_1| 3| 6|
|api_2| 7| 10|
|api_2| 8| 11|
|api_2| 9| 12|
+-----+---+---+
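This version collect()s each per-row pandas DataFrame back to the driver and concatenates them there, which is why it does not scale.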
Answer 2:
Thanks to @mck's examples, I found that from Spark 3.0 there is also an applyInPandas function, which returns a Spark DataFrame.
import pandas as pd
import pyspark.sql.functions as F

def pandas_function(url_json):
    df = pd.DataFrame(eval(url_json['content'][0]))
    return df
respond_sdf.groupby(F.monotonically_increasing_id()).applyInPandas(pandas_function, schema="api string, A int, B int").show()
+-----+---+---+
| api| A| B|
+-----+---+---+
|api_2| 7| 10|
|api_2| 8| 11|
|api_2| 9| 12|
|api_1| 1| 4|
|api_1| 2| 5|
|api_1| 3| 6|
+-----+---+---+
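applyInPandas takes the same function and schema string as the GROUPED_MAP pandas_udf above; in Spark 3.0+ it is the recommended replacement for that (now deprecated) decorator style.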
Source: https://stackoverflow.com/questions/65412532/how-to-yield-pandas-dataframe-rows-to-spark-dataframe