Question
I am running the code below in Spark using Java.
Code
Test.java
package com.sample;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.storage.StorageLevel;

import com.addition.AddTwoNumbers;

public class Test {

    private static final String APP_NAME = "Test";
    private static final String LOCAL = "local";
    private static final String MASTER_IP = "spark://10.180.181.26:7077";

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName(APP_NAME).setMaster(MASTER_IP);
        String connection = "jdbc:oracle:thin:test/test@//xyz00aie.in.oracle.com:1521/PDX2600N";

        // Create Spark Context
        SparkContext context = new SparkContext(conf);

        // Create Spark Session
        SparkSession sparkSession = new SparkSession(context);

        long startTime = System.currentTimeMillis();
        System.out.println("Start time is : " + startTime);

        // Read the table over JDBC (no partitioning options specified)
        Dataset<Row> txnDf = sparkSession.read().format("jdbc")
                .option("url", connection)
                .option("dbtable", "CI_TXN_DETAIL_STG_100M")
                .load();

        // Count rows whose TXN_DETAIL_ID falls within the given range
        System.out.println(txnDf.filter(txnDf.col("TXN_DETAIL_ID").gt(1286001510)
                .and(txnDf.col("TXN_DETAIL_ID").lt(1303001510))).count());

        sparkSession.stop();
    }
}
I am simply trying to count the rows in a range. The range is 20 million rows.
Below is a snapshot of the Spark dashboard.
Here I can see active tasks on only one executor, even though I have a total of 10 executors running.
My question
Why is my application showing active tasks on only one executor instead of distributing them across all 10 executors?
Below is my spark-submit command:
./spark-submit --class com.sample.Test --conf spark.sql.shuffle.partitions=5001 --conf spark.yarn.executor.memoryOverhead=11264 --executor-memory=91GB --conf spark.yarn.driver.memoryOverhead=11264 --driver-memory=91G --executor-cores=17 --driver-cores=17 --conf spark.default.parallelism=306 --jars /scratch/rmbbuild/spark_ormb/drools-jars/ojdbc6.jar,/scratch/rmbbuild/spark_ormb/drools-jars/Addition-1.0.jar --driver-class-path /scratch/rmbbuild/spark_ormb/drools-jars/ojdbc6.jar --master spark://10.180.181.26:7077 "/scratch/rmbbuild/spark_ormb/POC-jar/Test-0.0.1-SNAPSHOT.jar" > /scratch/rmbbuild/spark_ormb/POC-jar/logs/log18.txt
Answer 1:
It looks like all the data is read into a single partition, which goes to one executor. To use more executors, more partitions have to be created. The numPartitions option can be used, along with a partition column, lowerBound, and upperBound, as described here:
https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#jdbc-reads
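You can confirm this by printing the partition count of the DataFrame, reusing txnDf from the code above:

// With no partitioning options, Spark's JDBC source reads the whole
// table into one partition, which explains a single busy executor
System.out.println("Partitions: " + txnDf.rdd().getNumPartitions());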
This related question may also be useful:
Spark: Difference between numPartitions in read.jdbc(..numPartitions..) and repartition(..numPartitions..)
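For illustration, here is a minimal sketch of a partitioned JDBC read. It assumes TXN_DETAIL_ID is numeric and roughly evenly distributed; the bounds and partition count below are placeholders, not tuned values:

// Split the scan on TXN_DETAIL_ID into 10 parallel range queries
Dataset<Row> txnDf = sparkSession.read().format("jdbc")
        .option("url", connection)
        .option("dbtable", "CI_TXN_DETAIL_STG_100M")
        .option("partitionColumn", "TXN_DETAIL_ID")
        // lowerBound/upperBound only set the partition stride;
        // they do NOT filter out rows outside the range
        .option("lowerBound", "1286001510")
        .option("upperBound", "1303001510")
        // e.g. one partition per executor (placeholder value)
        .option("numPartitions", "10")
        .load();

With this in place, each executor issues its own range query against Oracle, so the count should run in parallel across all 10 executors instead of one.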
Source: https://stackoverflow.com/questions/53425983/task-is-running-on-only-one-executor-in-spark