How to access statistics endpoint for a Spark Streaming application?

寵の児 submitted on 2020-01-03 03:09:08

Question


As of Spark 2.2.0, there are new endpoints in the API for getting information about streaming jobs.

I run Spark on EMR clusters, using Spark 2.2.0 in cluster mode.

When I hit the endpoint for my streaming jobs, all it gives me is the error message:

    no streaming listener attached to <stream name>

I've dug through the Spark codebase a bit, but this feature is not very well documented. So I'm curious: is this a bug? Is there some configuration I need to do to get this endpoint working?


This appears to be an issue specifically when running on the cluster. The same code running on Spark 2.2.0 on my local machine shows the statistics as expected, but gives that error message when run on the cluster.


Answer 1:


I'm using the very latest Spark 2.3.0-SNAPSHOT, built today from master, so YMMV. It worked fine for me.

Is there some configuration I need to do to get this endpoint working?

No. It's supposed to work fine with no changes to the default configuration.

Make sure you use the host and port of the driver (note that you can also access port 18080 of the Spark History Server, which shows all the same endpoints and the same jobs, but has no streaming listener attached).


As you can see in the source code where the error message lives, it can only happen when ui.getStreamingJobProgressListener has not been registered (which ends up in case None).

So the question now is why would that SparkListener not be registered?

That leads us to the streamingJobProgressListener var, which is set using the setStreamingJobProgressListener method exclusively while StreamingTab is being instantiated (which is why I asked whether you can see the Streaming tab).

In other words, if you see the Streaming tab in web UI, you have the streaming metric endpoint(s) available. Check the URL to the endpoint which should be in the format:

http://[driverHost]:[port]/api/v1/applications/[appId]/streaming/statistics
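As a quick illustration (my addition, not part of the original answer), the URL can be assembled from the driver host, port, and application id; the helper name below is hypothetical:

```python
def streaming_statistics_url(driver_host: str, port: int, app_id: str) -> str:
    """Build the Spark Streaming statistics endpoint URL.

    Follows the REST API path format:
    /api/v1/applications/[appId]/streaming/statistics
    """
    return (
        f"http://{driver_host}:{port}"
        f"/api/v1/applications/{app_id}/streaming/statistics"
    )


# Example with placeholder values:
# streaming_statistics_url("localhost", 4040, "local-1513151634282")
```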

I tried to reproduce your case; the following steps led me to a working setup.

  1. Started one of the official examples of Spark Streaming applications.

    $ ./bin/run-example streaming.StatefulNetworkWordCount localhost 9999
    

    I did run nc -lk 9999 first.

  2. Opened the web UI @ http://localhost:4040/streaming to make sure the Streaming tab is there.

  3. Made sure http://localhost:4040/api/v1/applications/ responds with application ids.

    $ http http://localhost:4040/api/v1/applications/
    HTTP/1.1 200 OK
    Content-Encoding: gzip
    Content-Length: 266
    Content-Type: application/json
    Date: Wed, 13 Dec 2017 07:58:04 GMT
    Server: Jetty(9.3.z-SNAPSHOT)
    Vary: Accept-Encoding, User-Agent
    
    [
        {
            "attempts": [
                {
                    "appSparkVersion": "2.3.0-SNAPSHOT",
                    "completed": false,
                    "duration": 0,
                    "endTime": "1969-12-31T23:59:59.999GMT",
                    "endTimeEpoch": -1,
                    "lastUpdated": "2017-12-13T07:53:53.751GMT",
                    "lastUpdatedEpoch": 1513151633751,
                    "sparkUser": "jacek",
                    "startTime": "2017-12-13T07:53:53.751GMT",
                    "startTimeEpoch": 1513151633751
                }
            ],
            "id": "local-1513151634282",
            "name": "StatefulNetworkWordCount"
        }
    ]
    
  4. Accessed the endpoint for the Spark Streaming application @ http://localhost:4040/api/v1/applications/local-1513151634282/streaming/statistics.

    $ http http://localhost:4040/api/v1/applications/local-1513151634282/streaming/statistics
    HTTP/1.1 200 OK
    Content-Encoding: gzip
    Content-Length: 219
    Content-Type: application/json
    Date: Wed, 13 Dec 2017 08:00:10 GMT
    Server: Jetty(9.3.z-SNAPSHOT)
    Vary: Accept-Encoding, User-Agent
    
    {
        "avgInputRate": 0.0,
        "avgProcessingTime": 30,
        "avgSchedulingDelay": 0,
        "avgTotalDelay": 30,
        "batchDuration": 1000,
        "numActiveBatches": 0,
        "numActiveReceivers": 1,
        "numInactiveReceivers": 0,
        "numProcessedRecords": 0,
        "numReceivedRecords": 0,
        "numReceivers": 1,
        "numRetainedCompletedBatches": 376,
        "numTotalCompletedBatches": 376,
        "startTime": "2017-12-13T07:53:54.921GMT"
    }
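To sanity-check the responses programmatically, here is a small sketch (my addition, not part of the original answer) that pulls the first application id out of the /api/v1/applications payload and checks receiver health in the statistics payload; the field names match the JSON shown above:

```python
import json


def first_app_id(applications_json: str) -> str:
    """Extract the first application id from the /api/v1/applications response."""
    apps = json.loads(applications_json)
    return apps[0]["id"]


def receivers_healthy(stats_json: str) -> bool:
    """True when every registered receiver is currently active."""
    stats = json.loads(stats_json)
    return stats["numActiveReceivers"] == stats["numReceivers"]
```

Feeding these functions the payloads above would yield "local-1513151634282" and True (1 active receiver of 1 total), which is a cheap way to wire the endpoint into monitoring.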
    


Source: https://stackoverflow.com/questions/47780148/how-to-access-statistics-endpoint-for-a-spark-streaming-application
