Access Spark broadcast variable in different classes

Submitted by 懵懂的女人 on 2020-02-27 08:23:06

Question


I am broadcasting a value in a Spark Streaming application, but I am not sure how to access that variable from a class other than the one where it was broadcast.

My code looks as follows:

object AppMain {
  def main(args: Array[String]) {
    //...
    val broadcastA = sc.broadcast(a)
    //..
    lines.foreachRDD(rdd => {
      val obj = AppObject1
      rdd.filter(p => obj.apply(p))
      rdd.count
    })
  }
}

object AppObject1 {
  def apply(str: String): Boolean = {
    AnotherObject.process(str)
  }
}

object AnotherObject {
  // I want to use the broadcast variable in this object
  val B = broadcastA.value // compilation error here: broadcastA is not in scope
  def process(str: String): Boolean = {
    // need to use B inside this method
  }
}

Can anyone suggest how to access the broadcast variable in this case?


Answer 1:


Ignoring possible serialization issues, there is nothing particularly Spark-specific here. If you want to use some object, it has to be available in the current scope, and you can achieve this the same way as usual:

  • you can define your helpers in a scope where broadcast is already defined:

    {
        ...
        val x = sc.broadcast(1)
        object Foo {
          def foo = x.value
        }
        ...
    }
    
  • you can use it as a constructor argument:

    case class Foo(x: org.apache.spark.broadcast.Broadcast[Int]) {
      def foo = x.value
    }
    
    ...
    
    Foo(sc.broadcast(1)).foo
    
  • you can pass it as a method argument:

    case class Foo() {
      def foo(x: org.apache.spark.broadcast.Broadcast[Int]) = x.value
    }
    
    ...
    
    Foo().foo(sc.broadcast(1))
    
  • or you can even mix it into your helpers like this:

    trait Foo {
      val x: org.apache.spark.broadcast.Broadcast[Int]
      def foo = x.value
    }
    
    object Main extends Foo {
      val sc = new SparkContext("local",  "test", new SparkConf())
      val x = sc.broadcast(1)
    
      def main(args: Array[String]) {
        sc.parallelize(Seq(None)).map(_ => foo).first
        sc.stop
      }
    }
    



Answer 2:


Just a short note on the performance considerations that were mentioned earlier.

The options proposed by zero233 are indeed a very elegant way of doing this kind of thing in Scala. At the same time, it is important to understand the implications of using certain patterns in a distributed system.

It is not the best idea to use the mixin approach, or any logic that relies on the state of an enclosing class. Whenever you use the state of an enclosing class inside a lambda, Spark has to serialize the outer object. This is not always true, but you are better off writing safer code than one day accidentally blowing up the whole cluster.
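To make the risk concrete, here is a minimal sketch (not from the original post; the class name RiskyFilter and the Set[String] broadcast are made up for illustration) of how referencing a field of the enclosing class inside an RDD lambda pulls the whole instance into the serialized closure:

    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.rdd.RDD

    // Hypothetical example: inside `keep`, the field access `bc` is really
    // `this.bc`, so the filter closure captures the whole RiskyFilter instance.
    // Since RiskyFilter does not extend Serializable, Spark will typically fail
    // the job with "Task not serializable" when the filter runs.
    class RiskyFilter(val bc: Broadcast[Set[String]]) {
      def keep(rdd: RDD[String]): RDD[String] =
        rdd.filter(s => bc.value.contains(s)) // captures `this`, not just `bc`
    }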

With this in mind, I would personally go for passing the broadcast variable explicitly as a method argument, since that does not force serialization of the outer class (the method-argument approach).
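For contrast, a minimal sketch of the safer variant (again with made-up names): the broadcast arrives as a method argument, so the closure captures only the broadcast handle and nothing from the enclosing instance:

    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.rdd.RDD

    // Hypothetical example: `bc` is a method parameter (a local), so the filter
    // closure serializes only the Broadcast handle, never `this`.
    class SafeFilter {
      def keep(rdd: RDD[String], bc: Broadcast[Set[String]]): RDD[String] = {
        val allowed = bc // local copy; the lambda never touches SafeFilter's state
        rdd.filter(s => allowed.value.contains(s))
      }
    }

This is essentially the method-argument option from the first answer, with the serialization rationale spelled out.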




Answer 3:


You can use classes and pass the broadcast variable to them as a constructor parameter.

Your pseudocode should look like this:

import org.apache.spark.broadcast.Broadcast

object AppMain {
  def main(args: Array[String]) {
    //...
    val broadcastA = sc.broadcast(a)
    //..
    lines.foreachRDD(rdd => {
      val obj = new AppObject1(broadcastA)
      rdd.filter(p => obj.apply(p))
      rdd.count
    })
  }
}

class AppObject1(bc: Broadcast[String]) {
  val anotherObject = new AnotherObject(bc)
  def apply(str: String): Boolean = {
    anotherObject.process(str)
  }
}

class AnotherObject(bc: Broadcast[String]) {
  // the broadcast variable is now available through the constructor
  def process(str: String): Boolean = {
    val a = bc.value // use the broadcast value inside this method
    true
  }
}


Source: https://stackoverflow.com/questions/36642943/access-spark-broadcast-variable-in-different-classes
