Question
Hello everyone. First and foremost, I'm aware of the existence of this thread: Task is running on only one executor in spark.
However, this is not my case, as I'm using repartition(n) on my DataFrame.
Basically, I'm loading a DataFrame by fetching data from an Elasticsearch index through Spark as follows:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("elastic") \
    .master("yarn") \
    .config('spark.submit.deployMode', 'client') \
    .config("spark.jars", pathElkJar) \
    .enableHiveSupport() \
    .getOrCreate()
es_reader = (spark.read
    .format("org.elasticsearch.spark.sql")
    .option("es.read.field.include", includeFieldsString)
    .option("es.query", q)
    .option("es.nodes", elasticClusterIP)
    .option("es.port", port)
    .option("es.resource", indexNameTable)
    .option("es.nodes.wan.only", 'true')
    .option("es.net.ssl", 'true')
    .option("es.net.ssl.cert.allow.self.signed", "true")
    .option("es.net.http.auth.user", elkUser)
    .option("es.net.http.auth.pass", elkPassword)
    .option("es.read.metadata", "false")
    .option("es.read.field.as.array.include", "system_auth_hostname")
    #.option("es.mapping.exclude", "index")
    #.option("es.mapping.id", "_id")
    #.option("es.read.metadata._id","_id")
    #.option("delimiter", ",")
    #.option("inferSchema","true")
    #.option("first_row_is_header","true")
    )
df = es_reader.load()
YARN correctly adds 2 executors to my application by default, as I did not specify otherwise. The DataFrame is not partitioned when loading data from Elasticsearch, so I ran the following to check executor behavior:
df = df.repartition(2)
print('Number of partitions: {}'.format(df.rdd.getNumPartitions()))
# >> Number of partitions: 2
df.count()
I expected to see in the Spark UI both executors working on the count() task. However, I get a strange behavior: three tasks are completed (without repartitioning it is two tasks per action), and the first and longest one is run by only one executor, as you can see in the following image:
[Image: Count()-twoExecutor-twoPartitions]
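For what it's worth, here is a small diagnostic I could run to see which executor host actually processes each partition. This is only a debugging sketch; the tag_partition helper and the socket.gethostname() call are my own additions and not part of the original job:

import socket

# Diagnostic sketch: for every partition, report its index, the hostname of the
# executor that processed it, and the number of rows it contained
def tag_partition(index, rows):
    yield (index, socket.gethostname(), sum(1 for _ in rows))

print(df.rdd.mapPartitionsWithIndex(tag_partition).collect())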
Things work as expected if I save and reload from a Hive table already partitioned in two (OS: linux/windows):
df.write.mode('overwrite').format("parquet").partitionBy('OS').saveAsTable('test_executors')
df2 = spark.read.load("path")
df2.count()
In this case I get the following: [Image: Count()-twoExecutor-twoPartitions_loadFromHiveTable], where I get the desired behavior of two executors concurrently working on the count() task.
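As a sanity check, the same partition count can be printed for the reloaded DataFrame (a minimal diagnostic sketch, not part of the original run):

# Diagnostic only: the DataFrame reloaded from the partitioned table should
# report at least two partitions, one (or more) per 'OS' value written above
print('Number of partitions after reload: {}'.format(df2.rdd.getNumPartitions()))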
The problem seems to lie in repartition(n), which I think correctly partitions the DataFrame (I checked with df.rdd.getNumPartitions()) but does not parallelize the work amongst the executors (see the quick check sketched below).
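A quick way to verify that repartition(2) really spreads the rows, and not just the partition count, is to count the rows per partition. This is a minimal sketch using glom(); note that it materializes each partition as a list on the executors, so it is only meant for small test runs:

# Quick check: two similar, non-zero counts mean the data really is split in two
print('Rows per partition: {}'.format(df.rdd.glom().map(len).collect()))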
If necessary, I can provide the details of the tasks in the attached images. Thanks in advance!
Source: https://stackoverflow.com/questions/64972467/spark-task-runs-on-only-one-executor