How to get the output from console streaming sink in Zeppelin?


I'm struggling to get the console sink working with PySpark Structured Streaming when run from Zeppelin. Basically, I'm not seeing any results printed to the notebook.

2 Answers
  • 2020-12-09 12:45

    Console sink is not a good choice for an interactive notebook-based workflow. Even in Scala, where the output can be captured, it requires an awaitTermination call (or equivalent) in the same paragraph, effectively blocking the note.

    %spark
    
    spark
      .readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .option("includeTimestamp", "true")
      .load()
      .writeStream
      .outputMode("append")
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination() // Block execution, to force Zeppelin to capture the output
    

    The chained awaitTermination could also be replaced with a standalone call in the same paragraph:

    %spark
    
    val query = df
      .writeStream
      ...
      .start()
    
    query.awaitTermination()
    

    Without it, Zeppelin has no reason to wait for any output. PySpark just adds another problem on top of that - indirect execution. Because of that, even blocking the query won't help you here.

    Moreover, continuous output from the stream can cause rendering issues and memory problems when browsing the note (it might be possible to use the Zeppelin display system via InterpreterContext or the REST API to achieve somewhat more sensible behavior, where the output is overwritten or periodically cleared).

    A much better choice for testing with Zeppelin is the memory sink. This way you can start a query without blocking:

    %pyspark
    
    query = (windowedCounts
      .writeStream
      .outputMode("complete")
      .format("memory")
      .queryName("some_name")
      .start())
    

    and query the result on demand in another paragraph:

    %pyspark
    
    spark.table("some_name").show()
    

    It can be coupled with reactive streams or a similar solution to provide interval-based updates, for example:
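    Here is a minimal sketch of such interval-based updates, assuming RxPY 1.x (the same rx package whose Subject is used further below) and the memory sink query named some_name from the previous paragraphs:

    %pyspark
    
    from rx import Observable
    
    # Re-render the in-memory table every 5 seconds while this paragraph blocks.
    subscription = (Observable
        .interval(5000)  # period in milliseconds
        .subscribe(lambda _: spark.table("some_name").show()))
    
    input()  # Block execution so Zeppelin keeps capturing the output
    subscription.dispose()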

    It is also possible to use StreamingQueryListener with Py4j callbacks to couple rx with onQueryProgress events, although query listeners are not supported in PySpark and it takes a bit of code to glue things together. Scala interface:

    package com.example.spark.observer
    
    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._
    
    trait PythonObserver {
      def on_next(o: Object): Unit
    }
    
    class PythonStreamingQueryListener(observer: PythonObserver) 
        extends StreamingQueryListener {
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        observer.on_next(event)
      }
      override def onQueryStarted(event: QueryStartedEvent): Unit = {}
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
    }
    

    build a jar, adjusting the build definition to reflect the desired Scala and Spark versions:

    scalaVersion := "2.11.8"  
    
    val sparkVersion = "2.2.0"
    
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql" % sparkVersion,
      "org.apache.spark" %% "spark-streaming" % sparkVersion
    )
    

    put it on the Spark classpath and patch StreamingQueryManager:

    %pyspark
    
    from pyspark.sql.streaming import StreamingQueryManager
    from pyspark import SparkContext
    
    def addListener(self, listener):
        jvm = SparkContext._active_spark_context._jvm
        jlistener = jvm.com.example.spark.observer.PythonStreamingQueryListener(
            listener
        )
        self._jsqm.addListener(jlistener)
        return jlistener
    
    
    StreamingQueryManager.addListener = addListener
    

    start callback server:

    %pyspark
    
    sc._gateway.start_callback_server()
    

    and add listener:

    %pyspark
    
    from rx.subjects import Subject
    
    class StreamingObserver(Subject):
        class Java:
            implements = ["com.example.spark.observer.PythonObserver"]
    
    observer = StreamingObserver()
    spark.streams.addListener(observer)
    

    Finally, you can use subscribe and block execution:

    %pyspark
    
    (observer
        .map(lambda p: p.progress().name())
        # .filter() can be used to print only for a specific query
        .subscribe(lambda n: spark.table(n).show() if n else None))
    input()  # Block execution to capture the output 
    

    The last step should be executed after you have started the streaming query.

    It is also possible to skip rx and use minimal observer like this:

    class StreamingObserver(object):
        class Java:
            implements = ["com.example.spark.observer.PythonObserver"]
    
        def on_next(self, value):
            # value is the Java QueryProgressEvent delivered through the Py4j callback
            try:
                name = value.progress().name()
                if name:
                    spark.table(name).show()
            except Exception:
                pass
    

    It gives a bit less control than the Subject (one caveat is that it can interfere with other code printing to stdout, and it can be stopped only by removing the listener; with a Subject you can easily dispose of the subscribed observer once you're done, as sketched below), but otherwise it should work more or less the same.
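    A small sketch of that disposal point, assuming the rx-based StreamingObserver defined earlier; keep the subscription handle instead of subscribing anonymously:

    %pyspark
    
    # Keep the handle returned by subscribe so the observer can be detached later.
    subscription = (observer
        .map(lambda p: p.progress().name())
        .subscribe(lambda n: spark.table(n).show() if n else None))
    
    # ... later, in another paragraph, once you are done:
    subscription.dispose()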

    Note that any blocking action will be sufficient to capture the output from the listener, and it doesn't have to be executed in the same paragraph. For example:

    %pyspark
    
    observer = StreamingObserver()
    spark.streams.addListener(observer)
    

    and

    %pyspark
    
    import time
    
    time.sleep(42)
    

    would work in a similar way, printing the table for the defined time interval.

    For completeness, you can implement StreamingQueryManager.removeListener as well, for example:
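    A minimal sketch, mirroring the addListener patch above; it assumes you kept the JVM-side listener object returned by addListener (jlistener), which is what gets passed back here:

    %pyspark
    
    from pyspark.sql.streaming import StreamingQueryManager
    
    def removeListener(self, jlistener):
        # jlistener is the Java listener returned by the patched addListener above
        self._jsqm.removeListener(jlistener)
    
    StreamingQueryManager.removeListener = removeListener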

  • 2020-12-09 12:45

    zeppelin-0.7.3-bin-all uses Spark 2.1.0 (so, unfortunately, there is no rate format to test Structured Streaming with).


    Make sure that when you start a streaming query with the socket source, nc -lk 9999 has already been started (the query simply stops otherwise).

    Also make sure that the query is indeed up and running (a quick status check is sketched after the snippet below).

    val lines = spark
      .readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load
    val q = lines.writeStream.format("console").start
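
    A quick way to verify that the query is actually running is to inspect its handle; here is a minimal PySpark sketch (assuming query is the handle returned by writeStream...start(), analogous to q above):

    %pyspark
    
    print(query.isActive)            # True while the query is running
    print(query.status)              # current status, e.g. waiting for new data
    for q in spark.streams.active:   # or inspect all active streaming queries
        print(q.id, q.name)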
    

    It's indeed true that you won't be able to see the output in a Zeppelin notebook, possibly because:

    1. Streaming queries run on their own threads (which seem to be outside Zeppelin's reach).

    2. The console sink writes to standard output (it uses the Dataset.show operator on that separate thread).

    All this means there is no way to "intercept" the output in Zeppelin.

    So we come to answer the real question:

    Where is the standard output written to in Zeppelin?

    Well, with a very limited understanding of the internals of Zeppelin, I thought it could be logs/zeppelin-interpreter-spark-[hostname].log, but unfortunately I could not find the output from the console sink there. That is where you can find the logs from Spark (and Structured Streaming in particular) that go through log4j, but the console sink does not use log4j.

    It looks as if your only long-term solution is to write your own console-like custom sink that uses a log4j logger. Honestly, that is not as hard as it sounds. Follow the sources of the console sink.
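    As a rough illustration of the log4j part only (not a full custom sink), here is a hedged PySpark sketch that routes a snapshot of a table to the JVM's log4j logger, so it ends up in the interpreter log rather than the invisible standard output; the logger name my.console.sink and the table name some_table are just placeholders:

    %pyspark
    
    # Grab the JVM-side log4j logger through Py4j and log instead of printing.
    log4j = sc._jvm.org.apache.log4j
    logger = log4j.LogManager.getLogger("my.console.sink")  # placeholder name
    
    rows = spark.table("some_table").limit(5).collect()  # any table you want to inspect
    logger.info("Streaming snapshot: {}".format(rows))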
