How to run multiple jobs in one SparkContext from separate threads in PySpark?

太阳男子 2020-12-01 14:31

It is understood from the Spark documentation on Scheduling Within an Application:

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads.

How can this be done from separate threads in PySpark?

2 Answers
  •  有刺的猬
    2020-12-01 14:57

    I was running into the same issue, so I created a tiny self-contained example. I create multiple threads using Python's threading module and submit multiple Spark jobs simultaneously.

    Note that by default, Spark runs the jobs in First-In First-Out (FIFO) order: http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application. In the example below, I change it to FAIR scheduling.

    # Prereqs:
    # set
    # spark.dynamicAllocation.enabled         true
    # spark.shuffle.service.enabled           true
    # spark.scheduler.mode                    FAIR
    # in spark-defaults.conf
    
    import threading
    from pyspark import SparkContext, SparkConf
    
    def task(sc, i):
      # Each call triggers one Spark job (a count) from the calling thread
      print(sc.parallelize(range(i * 10000)).count())
    
    def run_multiple_jobs():
      conf = SparkConf().setMaster('local[*]').setAppName('appname')
      # Set scheduler to FAIR: http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
      conf.set('spark.scheduler.mode', 'FAIR')
      sc = SparkContext(conf=conf)
      for i in range(4):
        t = threading.Thread(target=task, args=(sc, i))
        t.start()
        print('spark task', i, 'has started')
    
    
    run_multiple_jobs()
    

    Output:

    spark task 0 has started
    spark task 1 has started
    spark task 2 has started
    spark task 3 has started
    30000
    0 
    10000
    20000
    
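    The linked docs page also describes fair scheduler pools: with FAIR scheduling enabled, each submitting thread can set the spark.scheduler.pool local property before triggering an action, so that its jobs are grouped into a named pool. Below is a minimal sketch of that variation; the pool names ('pool_0', 'pool_1', ...) and the function names are made up for illustration, and the threads are joined so the driver waits for every job before stopping the context.

    import threading
    from pyspark import SparkContext, SparkConf
    
    def task_in_pool(sc, i):
      # Jobs triggered from this thread go into a named pool (the name is arbitrary)
      sc.setLocalProperty('spark.scheduler.pool', 'pool_%d' % i)
      print(sc.parallelize(range(i * 10000)).count())
      # Clear the property so later work from this thread falls back to the default pool
      sc.setLocalProperty('spark.scheduler.pool', None)
    
    def run_pooled_jobs():
      conf = SparkConf().setMaster('local[*]').setAppName('pooled-appname')
      conf.set('spark.scheduler.mode', 'FAIR')
      sc = SparkContext(conf=conf)
      threads = [threading.Thread(target=task_in_pool, args=(sc, i)) for i in range(4)]
      for t in threads:
        t.start()
      for t in threads:
        t.join()  # wait for all four jobs to finish
      sc.stop()
    
    run_pooled_jobs()

    The pools are optional; without setLocalProperty, all jobs from all threads simply share the default pool under FAIR scheduling.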
