Spark: Parallelizing creation of multiple DataFrames


Question


I'm currently generating DataFrames based on a list of IDs - each query based on one ID gives back a manageable subset of a very large PostgreSQL table. I then partition that output based on the file structure I need to write out. The problem is that I'm hitting a speed limit and majorly under-utilizing my executor resources.

I'm not sure whether this calls for rethinking my architecture or whether there is some simple way around it, but basically I want more task-level parallelism: I'm failing to keep all 16 of my executors busy while trying to get this ETL job done as quickly as possible.

So...here’s what I thought I could do to speed this up:

  1. Parallelize a list.
  2. Have each element of that list, out on an executor, select a (relatively small) DataFrame via JDBC.
  3. Then, via foreachPartition (of which there are necessarily few), do some action (which includes atomically writing out each partition's data), and let those per-partition actions fan out to worker nodes/executors as well.

Current code looks something like this, but of course it throws "py4j.Py4JException: Method __getnewargs__([]) does not exist", because the SparkSession can't be passed into the foreach closure, which is what would let this work stay out on the executors:

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession \
    .builder \
    .appName(...) \
    # ... etc., then .getOrCreate()
# the list, distributed to workers
sc = spark.sparkContext
idsAndRegionsToProcess = sc.parallelize(idList)

# the final thing that needs to be done with the data
# (each partition written to a file and sent somewhere)
def transformAndLoad(iterator, someField, someOtherField):
    for row in iterator:
        ...  # do stuff
    ...  # write a file to S3

# !! The issue is here (well, at least with my current approach)!!
# In theory these are the first operations that really need to be
# running on various nodes.
def dataMove(idAndRegion, spark):
    # now pull a DataFrame from Postgres for each id
    postgresDF = spark.read \
        .format("jdbc") \
        .option("url" …
        .option("dbtable",
                "(select id, someField, someOtherField from table_region_" + idAndRegion[1] +
                " where id = '" + idAndRegion[0] + "') as history")
        # … more connection options, then .load()

    # repartition returns a new DataFrame, so the result has to be reassigned
    postgresDF = postgresDF.repartition('someOtherField')
    postgresDF.persist(StorageLevel.MEMORY_AND_DISK)
    postgresDF.foreachPartition(
        lambda iterator: transformAndLoad(iterator, someField, someOtherField))

# invoking the problematic code on the parallelized list
idsAndRegionsToProcess.foreach(lambda idAndRegion: dataMove(idAndRegion, spark))

I get that this isn't quite possible as written, but maybe I'm missing a subtlety that would make it work? This seems a lot more efficient than selecting 1TB of data and then processing it, but maybe there is some underlying pagination that I don't know about.
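For example, if Spark's JDBC reader can do that partitioning itself, something along these lines is roughly what I'm after (just a rough sketch, not my real code: jdbc_url and the connection properties are placeholders, it assumes idList holds (id, region) pairs, and it pretends the data lives in a single table rather than one table per region):

# One predicate per id -> one partition per id, read out on the executors.
predicates = ["id = '{}'".format(i) for (i, region) in idList]

postgresDF = spark.read.jdbc(
    url=jdbc_url,
    table="(select id, someField, someOtherField from some_table) as history",
    predicates=predicates,
    properties={"user": "...", "password": "...",
                "driver": "org.postgresql.Driver"})

# The driver only plans the job; each partition is read and processed on an executor,
# so nothing Spark-specific has to be shipped inside a foreach closure.
postgresDF.foreachPartition(
    lambda iterator: transformAndLoad(iterator, someField, someOtherField))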

I have working code that does almost exactly this with a regular loop over a collected list, but it was painfully slow and doesn't come close to utilizing the executors.

For some extra context: I'm on EMR with YARN, and my spark-submit (from the master node) looks like this:

spark-submit --packages org.postgresql:postgresql:9.4.1207.jre7 --deploy-mode cluster --num-executors 16 --executor-memory 3g --master yarn DataMove.py

Also, selecting these DataFrames isn't the problem: each result is a small subset of the data and the database is indexed appropriately. Selecting each entire table, on the other hand, seems like it would be absolutely impossible, since some of them hold up to a TB of data. Also, the repartition divides the data out by what needs to be written into each (individual and specifically-named) file going to S3.

But first and foremost, can my approach here work? I would be open to any suggestions, even if it just means keeping my working code and somehow getting it to kick off as many jobs as possible while earlier ones are still running.
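To make that last idea concrete, this is roughly what I have in mind: keep the driver-side loop, but run it from a thread pool so several Spark jobs are in flight at once (as far as I know Spark can schedule jobs submitted from multiple driver threads concurrently; the pool size of 8 is just a guess):

from concurrent.futures import ThreadPoolExecutor

# Submit several dataMove calls at once from driver threads so the executors stay
# busy across jobs; dataMove and idList are the same ones as above.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(dataMove, idAndRegion, spark) for idAndRegion in idList]
    for f in futures:
        f.result()  # re-raise any exception from a worker thread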


Answer 1:


You could look into running your data workload as separate jobs / applications on your Spark cluster as described here:

https://spark.apache.org/docs/latest/submitting-applications.html
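For example (just a sketch of the idea; it assumes your DataMove.py can take its ids as a command-line argument), a small driver script on the master node could launch one application per batch of ids and let YARN run them side by side:

import subprocess

# Made-up batches; in practice these would come from wherever idList comes from.
id_batches = [["id1", "id2"], ["id3", "id4"]]

procs = []
for batch in id_batches:
    cmd = ["spark-submit",
           "--packages", "org.postgresql:postgresql:9.4.1207.jre7",
           "--deploy-mode", "cluster",
           "--master", "yarn",
           "--num-executors", "4",           # give each application a slice of the cluster
           "--executor-memory", "3g",
           "DataMove.py", ",".join(batch)]   # assumes DataMove.py reads its ids from argv
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()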

But storing the data in multiple partitions, as you mention in the comments, should also greatly help to reduce the memory needed to process it. You may be able to avoid splitting the work into separate jobs that way.

The Spark UI at http://localhost:4040 (it runs on the driver, so with YARN in cluster mode you typically reach it through the ResourceManager's application proxy) is your friend for figuring out what jobs and stages Spark creates internally and what resources they consume. Based on those insights you can optimize the job to reduce the memory needed or improve the processing speed.



Source: https://stackoverflow.com/questions/41111108/spark-parallelizing-creation-of-multiple-dataframes
