Spark Streaming - Same processing time for 4 cores and 16 cores. Why?

问题

Scenario: I am doing some testing with spark streaming. The files with around 100 records comes in every 25 seconds.

Problem: The processing is taking on average 23 seconds for 4 core pc using local[*] in the program. When i deploy the same app to server with 16 cores i was expecting an improvement in processing time. However, i see it is still taking same time in 16 cores as well (also checked cpu usages in ubuntu and cpu is being fully utilized). All the configurations are default provided by spark.

Question: Should not processing time decrease with increase in number of cores available for the streaming job?

Code:

  val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName(this.getClass.getCanonicalName)
  .set("spark.hadoop.validateOutputSpecs", "false")

val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(25))
val sqc = new SQLContext(sc)

val gpsLookUpTable = MapInput.cacheMappingTables(sc, sqc).persist(StorageLevel.MEMORY_AND_DISK_SER_2)
val broadcastTable = sc.broadcast(gpsLookUpTable)
jsonBuilder.append("[")
ssc.textFileStream("hdfs://localhost:9000/inputDirectory/")
  .foreachRDD { rdd =>
  if (!rdd.partitions.isEmpty) {

    val header = rdd.first().split(",")
    val rowsWithoutHeader = Utils.dropHeader(rdd)
rowsWithoutHeader.foreach { row =>
      jsonBuilder.append("{")
      val singleRowArray = row.split(",")
      (header, singleRowArray).zipped
        .foreach { (x, y) =>
        jsonBuilder.append(convertToStringBasedOnDataType(x, y))
        // GEO Hash logic here
        if (x.equals("GPSLat") || x.equals("Lat")) {
          lattitude = y.toDouble
        }
        else if (x.equals("GPSLon") || x.equals("Lon")) {
          longitude = y.toDouble
          if (x.equals("Lon")) {
            // This section is used to convert GPS Look Up to GPS LookUP with Hash
            jsonBuilder.append(convertToStringBasedOnDataType("geoCode", GeoHash.encode(lattitude, longitude)))
          }
          else {
            val selectedRow = broadcastTable.value
              .filter("geoCode LIKE '" + GeoHash.subString(lattitude, longitude) + "%'")
              .withColumn("Distance", calculateDistance(col("Lat"), col("Lon")))
              .orderBy("Distance")
              .select("TrackKM", "TrackName").take(1)
            if (selectedRow.length != 0) {
              jsonBuilder.append(convertToStringBasedOnDataType("TrackKm", selectedRow(0).get(0)))
              jsonBuilder.append(convertToStringBasedOnDataType("TrackName", selectedRow(0).get(1)))
            }
            else {
              jsonBuilder.append(convertToStringBasedOnDataType("TrackKm", "NULL"))
              jsonBuilder.append(convertToStringBasedOnDataType("TrackName", "NULL"))
            }
          }
        }
      }
      jsonBuilder.setLength(jsonBuilder.length - 1)
      jsonBuilder.append("},")
    }
  sc.parallelize(Seq(jsonBuilder.toString)).repartition(1).saveAsTextFile("hdfs://localhost:9000/outputDirectory")

回答1:

It sounds like you are using only one thread, whether the application runs on a machine with 4 or 16 cores won't matter if that is the case.

It sounds like 1 file comes in, that 1 file is 1 RDD partition with 100 rows. You iterate over the rows in that RDD and append the jsonBuilder. At the end you call repartition(1) which will make the writing of the file single threaded.

You could reparation your data-set to 12 RDD partitions after you pick up the file, to ensure that other threads work on the rows. But unless I am missing something you are lucky this isn't happening. What happens if two threads are calling jsonBuilder.append("{") at the same time? Won't they create invalid JSON. I could be missing something here.

You could test to see if I am correct about the single threaded-ness of your application by adding logging like this:

scala> val rdd1 = sc.parallelize(1 to 10).repartition(1)
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at repartition at <console>:21

scala> rdd1.foreach{ r => {println(s"${Thread.currentThread.getName()} => $r")} }
Executor task launch worker-40 => 1
Executor task launch worker-40 => 2
Executor task launch worker-40 => 3
Executor task launch worker-40 => 4
Executor task launch worker-40 => 5
Executor task launch worker-40 => 6
Executor task launch worker-40 => 7
Executor task launch worker-40 => 8
Executor task launch worker-40 => 9
Executor task launch worker-40 => 10

scala> val rdd3 = sc.parallelize(1 to 10).repartition(3)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[40] at repartition at <console>:21

scala> rdd3.foreach{ r => {println(s"${Thread.currentThread.getName()} => $r")} }
Executor task launch worker-109 => 1
Executor task launch worker-108 => 2
Executor task launch worker-95 => 3
Executor task launch worker-95 => 4
Executor task launch worker-109 => 5
Executor task launch worker-108 => 6
Executor task launch worker-108 => 7
Executor task launch worker-95 => 8
Executor task launch worker-109 => 9
Executor task launch worker-108 => 10

来源：https://stackoverflow.com/questions/32556654/spark-streaming-same-processing-time-for-4-cores-and-16-cores-why

标签

apache-spark

apache-spark-sql

spark-streaming