Transforming PySpark RDD with Scala

Asked by 挽巷 on 2021-01-03 02:39

TL;DR - I have what looks like a DStream of Strings in a PySpark application. I want to send it as a DStream[String] to a Scala library. Strings are not …

1 Answer
  • Answered 2021-01-03 03:13

    Long story short, there is no supported way to do something like this. Don't try this in production. You've been warned.

    In general, Spark uses Py4j only for basic RPC calls on the driver and does not start a Py4j gateway on any other machine. When object exchange is actually required (mostly in MLlib and in some parts of SQL), Spark uses Pyrolite to serialize objects passed between the JVM and Python.
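
    To make that exchange concrete, here is a minimal, hedged sketch of what unpickling Python-produced data on the JVM side with Pyrolite roughly looks like. It is an illustration only: the helper name, the batching assumption and the exact types are mine, and Spark's real implementation lives in internal classes such as org.apache.spark.api.python.SerDeUtil.

    package dummy

    import net.razorvine.pickle.Unpickler  // Pyrolite, bundled with Spark
    import org.apache.spark.api.java.JavaRDD
    import scala.collection.JavaConverters._

    object PythonUnpickleHelper {
      // Turn an RDD of pickled byte arrays (as produced by PySpark's pickle-based
      // serializers) into plain JVM objects. With batched serializers a single
      // payload holds a java.util.List of objects, so flatten it.
      def unpickle(rdd: JavaRDD[Array[Byte]]): JavaRDD[Any] = {
        rdd.rdd.mapPartitions[Any] { iter =>
          val unpickler = new Unpickler()  // stateful, so create one per partition
          iter.flatMap { bytes =>
            unpickler.loads(bytes) match {
              case batch: java.util.List[_] => batch.asScala.toSeq
              case single                   => Seq(single)
            }
          }
        }.toJavaRDD()
      }
    }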

    This part of the API is either private (Scala) or internal (Python), and as such it is not intended for general use. In theory you can access it anyway, either per batch:

    package dummy
    
    import org.apache.spark.api.java.JavaRDD
    import org.apache.spark.streaming.api.java.JavaDStream
    import org.apache.spark.sql.DataFrame
    
    object PythonRDDHelper {
      // Output-only sink: unwrap the Java RDD, keep the Strings and print a small sample.
      def go(rdd: JavaRDD[Any]): Unit = {
        rdd.rdd.collect {
          case s: String => s
        }.take(5).foreach(println)
      }
    }
    

    complete stream:

    object PythonDStreamHelper {
      // Unwrap the Java DStream, keep the Strings and print each batch as it arrives.
      def go(stream: JavaDStream[Any]): Unit = {
        stream.dstream.transform(_.collect {
          case s: String => s
        }).print()
      }
    }
    

    or exposing individual batches as DataFrames (probably the least evil option):

    object PythonDataFrameHelper {
      // The batch arrives as a regular DataFrame, so the supported Dataset API is available here.
      def go(df: DataFrame): Unit = {
        df.show()
      }
    }
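
    Since the batch shows up as an ordinary DataFrame, the sink is not limited to show; the whole supported Dataset API is available at that point. A hedged sketch of a slightly more useful variant (my addition; the column name _1 is the default produced by the toDF() call further down, and the output path is hypothetical):

    package dummy

    import org.apache.spark.sql.DataFrame

    object PythonDataFrameSink {
      // Hypothetical variant of PythonDataFrameHelper: transform the batch with
      // the regular Dataset API and write it out instead of just printing it.
      def go(df: DataFrame): Unit = {
        df.selectExpr("upper(_1) AS value")
          .write
          .mode("append")
          .parquet("/tmp/python-dstream-batches")  // hypothetical output location
      }
    }

    From Python it would be called exactly like PythonDataFrameHelper, just with the new object name.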
    

    and use these wrappers from PySpark as follows (assuming the compiled helpers have been packaged into a jar that is on the driver's classpath, for example via spark-submit --jars):

    from pyspark.streaming import StreamingContext
    from pyspark.mllib.common import _to_java_object_rdd
    from pyspark.rdd import RDD
    
    ssc = StreamingContext(spark.sparkContext, 10)
    spark.catalog.listTables()  # presumably forces SparkSession / SQL initialization before it is needed inside foreachRDD
    
    q = ssc.queueStream([sc.parallelize(["foo", "bar"]) for _ in range(10)]) 
    
    # Reserialize RDD as Java RDD<Object> and pass 
    # to Scala sink (only for output)
    q.foreachRDD(lambda rdd: ssc._jvm.dummy.PythonRDDHelper.go(
        _to_java_object_rdd(rdd)
    ))
    
    # Reserialize and convert to JavaDStream<Object>
    # This is the only option which allows further transformations
    # on DStream
    ssc._jvm.dummy.PythonDStreamHelper.go(
        q.transform(lambda rdd: RDD(  # Reserialize but keep as Python RDD
            _to_java_object_rdd(rdd), ssc.sparkContext
        ))._jdstream
    )
    
    # Convert to DataFrame and pass to Scala sink.
    # Arguably there are relatively few moving parts here. 
    q.foreachRDD(lambda rdd: 
        ssc._jvm.dummy.PythonDataFrameHelper.go(
            rdd.map(lambda x: (x, )).toDF()._jdf
        )
    )
    
    ssc.start()
    ssc.awaitTerminationOrTimeout(30)
    ssc.stop()
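
    As the comment in the snippet above points out, the JavaDStream route is the only one that allows further transformations before output. A hedged sketch of what such a downstream helper could look like (my addition; PythonDStreamCounter is hypothetical and assumes the same setup as the helpers above):

    package dummy

    import org.apache.spark.streaming.api.java.JavaDStream

    object PythonDStreamCounter {
      // Hypothetical extension of PythonDStreamHelper: after recovering the
      // strings, run ordinary DStream transformations (here a per-batch
      // word count) before printing the result.
      def go(stream: JavaDStream[Any]): Unit = {
        stream.dstream
          .transform(_.collect { case s: String => s })
          .map(s => (s, 1L))
          .reduceByKey(_ + _)
          .print()
      }
    }

    It would be invoked from PySpark in exactly the same way as PythonDStreamHelper in the snippet above.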
    

    Once again: this is unsupported and untested, and as such rather useless for anything other than experimenting with the Spark API.
