How to avoid using collect() on a Spark RDD in Scala?

Submitted by 会有一股神秘感 on 2020-05-15 09:35:06

Question


I have a List and need to create a Map from it for further use. I am using an RDD, but when I call collect(), the job fails on the cluster. Any help is appreciated.

Please help. Below is the sample code going from the List to rdd.collect(). I need to use the resulting Map data afterwards, but how can I use it without collect?

This code creates a Map from the RDD (List) data. List format: (asdfg/1234/wert, asdf)

//List data from which to build the Map
val listData = methodToGetListData(ListData).toList

//Creating an RDD from the List above
val rdd = sparkContext.makeRDD(listData)

implicit val formats = Serialization.formats(NoTypeHints)
val res = rdd
  .map(map => (getRPath(map._1), getAttribute(map._1), map._2))
  .groupBy(_._1)
  .map(tuple => {
    Map(
      "P_Id" -> "1234",
      "R_Time" -> "27-04-2020",
      "S_Time" -> "27-04-2020",
      "r_path" -> tuple._1,
      "S_Tag" -> "12345", // fixed: the closing quote was missing here
      tuple._1 -> tuple._2.map(a => (a._2, a._3)).toMap
    )
  })

res.collect()

Answer 1:



Q: how to use without collect?

Answer: collect() moves all of the data to the driver node. If the data is huge, that can overwhelm the driver and fail the job. Never do that with large datasets.
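If the data really must reach the driver, one way to soften the blow is to stream it instead of materializing it all at once. A minimal sketch, assuming res is the RDD of maps built in the question (the output path is hypothetical):

// A minimal sketch of collect() alternatives, assuming `res` is the
// RDD[Map[String, Any]] built in the question.

// Option 1: stream one partition at a time to the driver, so only a
// single partition is ever held in driver memory at once.
res.toLocalIterator.foreach { m =>
  println(m) // process each Map here instead of printing it
}

// Option 2: never bring the data to the driver at all; write it out
// from the executors and read it back wherever it is needed.
res.map(_.toString).saveAsTextFile("hdfs:///tmp/res-maps") // hypothetical path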


I don't know exactly what the use case for preparing the map is, but it can be achieved with the built-in Spark API collectionAccumulator. In detail:

collectionAccumulator[scala.collection.mutable.Map[String, String]]


Let's suppose this is your sample DataFrame and you want to build a map from it.

+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree                  |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909  |1234     |Cables-1             |23-12-2020   |LC        |Installed   |ABCD1234     |0          |Cables      |ASDF123   |12345    |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111  |Cables-11            |23-12-2022   |LC1       |Installed1  |ABCD12341    |0          |Cables1     |ASDF1231  |123451   |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+

From this DataFrame you want to build a map (a nested map; I kept the key naming from your example).

Below is the full example; have a look and modify it accordingly.

package examples

import org.apache.log4j.Level

object GrabMapbetweenClosure extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)


  import org.apache.spark.sql.SparkSession

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName(this.getClass.getName)
    .getOrCreate()

  import spark.implicits._

  // CollectionAccumulator: a built-in, thread-safe way to gather values on the driver
  val mutableMapAcc = spark.sparkContext
    .collectionAccumulator[scala.collection.mutable.Map[String, String]]("mutableMap")

  val df = Seq(
    ("-0909", "1234", "Cables-1", "23-12-2020", "LC", "Installed", "ABCD1234"
      , "0", "Cables", "ASDF123", "12345", "Start~>HInfo->Cables->Cables-1")
    , ("-09091", "1234111", "Cables-11", "23-12-2022", "LC1", "Installed1", "ABCD12341"
      , "0", "Cables1", "ASDF1231", "123451", "Start~>HInfo->Cables->Cables-11")

  ).toDF("Item_Id", "Parent_Id", "object_class_instance", "Received_Time", "CablesName", "CablesStatus", "CablesHInfoID",
    "CablesIndex", "object_class", "ServiceTag", "Scan_Time", "relation_tree"
  )

  df.show(false)
  df.foreachPartition { partition => // for performance's sake, I used foreachPartition
    partition.foreach {
      record => {
        mutableMapAcc.add(scala.collection.mutable.Map(
          "Item_Id" -> record.getAs[String]("Item_Id")
          , "CablesStatus" -> record.getAs[String]("CablesStatus")
          , "CablesHInfoID" -> record.getAs[String]("CablesHInfoID")
          , "Parent_Id" -> record.getAs[String]("Parent_Id")
          , "CablesIndex" -> record.getAs[String]("CablesIndex")
          , "object_class_instance" -> record.getAs[String]("object_class_instance")
          , "Received_Time" -> record.getAs[String]("Received_Time")
          , "object_class" -> record.getAs[String]("object_class")
          , "CablesName" -> record.getAs[String]("CablesName")
          , "ServiceTag" -> record.getAs[String]("ServiceTag")
          , "Scan_Time" -> record.getAs[String]("Scan_Time")
          , "relation_tree" -> record.getAs[String]("relation_tree")

        )
        )
      }
    }
  }
  println("FinalMap : " + mutableMapAcc.value.toString)

}


Result :

+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree                  |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909  |1234     |Cables-1             |23-12-2020   |LC        |Installed   |ABCD1234     |0          |Cables      |ASDF123   |12345    |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111  |Cables-11            |23-12-2022   |LC1       |Installed1  |ABCD12341    |0          |Cables1     |ASDF1231  |123451   |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+

FinalMap : [Map(Scan_Time -> 123451, ServiceTag -> ASDF1231, Received_Time -> 23-12-2022, object_class_instance -> Cables-11, CablesHInfoID -> ABCD12341, Parent_Id -> 1234111, Item_Id -> -09091, CablesIndex -> 0, object_class -> Cables1, relation_tree -> Start~>HInfo->Cables->Cables-11, CablesName -> LC1, CablesStatus -> Installed1), Map(Scan_Time -> 12345, ServiceTag -> ASDF123, Received_Time -> 23-12-2020, object_class_instance -> Cables-1, CablesHInfoID -> ABCD1234, Parent_Id -> 1234, Item_Id -> -0909, CablesIndex -> 0, object_class -> Cables, relation_tree -> Start~>HInfo->Cables->Cables-1, CablesName -> LC, CablesStatus -> Installed)]
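One caveat worth adding (my own note, not from the original answer): accumulator updates are only guaranteed to be applied exactly once when they happen inside an action such as foreachPartition, as above; inside a transformation like map, task retries can apply the same update twice. Also, the accumulator's value still materializes on the driver, so this approach still assumes the final map fits in driver memory.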

A similar problem was solved here.



Source: https://stackoverflow.com/questions/61457624/how-to-avoid-using-of-collect-in-spark-rdd-in-scala
