How to access statistics endpoint for a Spark Streaming application?

寵の児 submitted on 2020-01-03 03:09:08

Question


As of Spark 2.2.0, there are new endpoints in the API for getting information about streaming jobs.

I run Spark on EMR clusters, using Spark 2.2.0 in cluster mode.

When I hit the endpoint for my streaming jobs, all it gives me is the error message:

    no streaming listener attached to <stream name>

I've dug through the Spark codebase a bit, but this feature is not very well documented. So I'm curious: is this a bug? Is there some configuration I need to do to get this endpoint working?


This appears to be an issue specifically when running on the cluster. The same code running on Spark 2.2.0 on my local machine shows the statistics as expected, but gives that error message when run on the cluster.


Answer 1:


I'm using the very latest Spark 2.3.0-SNAPSHOT, built today from master, so YMMV. It worked fine for me.

Is there some configuration I need to do to get this endpoint working?

No. It's supposed to work fine with no changes to the default configuration.

Make sure you use the host and port of the driver (note that you can also access port 18080 of the Spark History Server, which shows all the same endpoints and the same jobs, but has no streaming listener attached).


As you can see in the source code where the error message lives, it can only happen when ui.getStreamingJobProgressListener has not been registered (which ends up in case None).

So the question now is why would that SparkListener not be registered?

That leads us to the streamingJobProgressListener var, which is set using the setStreamingJobProgressListener method exclusively while StreamingTab is being instantiated (which is why I asked whether you can see the Streaming tab).

In other words, if you see the Streaming tab in web UI, you have the streaming metric endpoint(s) available. Check the URL to the endpoint which should be in the format:

http://[driverHost]:[port]/api/v1/applications/[appId]/streaming/statistics
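As a quick illustration (my addition, not part of the original answer), the URL can be assembled from the driver host, port, and application id; the helper name below is hypothetical:

```python
def streaming_statistics_url(driver_host: str, port: int, app_id: str) -> str:
    """Build the Spark Streaming statistics endpoint URL.

    Follows the REST API path format:
    /api/v1/applications/[appId]/streaming/statistics
    """
    return (
        f"http://{driver_host}:{port}"
        f"/api/v1/applications/{app_id}/streaming/statistics"
    )


# Example with placeholder values:
# streaming_statistics_url("localhost", 4040, "local-1513151634282")
```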

I tried to reproduce your case; the following steps led me to a working setup.

  1. Started one of the official examples of Spark Streaming applications.

    $ ./bin/run-example streaming.StatefulNetworkWordCount localhost 9999
    

    I did run nc -lk 9999 first.

  2. Opened the web UI @ http://localhost:4040/streaming to make sure the Streaming tab is there.

  3. Made sure http://localhost:4040/api/v1/applications/ responds with application ids.

    $ http http://localhost:4040/api/v1/applications/
    HTTP/1.1 200 OK
    Content-Encoding: gzip
    Content-Length: 266
    Content-Type: application/json
    Date: Wed, 13 Dec 2017 07:58:04 GMT
    Server: Jetty(9.3.z-SNAPSHOT)
    Vary: Accept-Encoding, User-Agent
    
    [
        {
            "attempts": [
                {
                    "appSparkVersion": "2.3.0-SNAPSHOT",
                    "completed": false,
                    "duration": 0,
                    "endTime": "1969-12-31T23:59:59.999GMT",
                    "endTimeEpoch": -1,
                    "lastUpdated": "2017-12-13T07:53:53.751GMT",
                    "lastUpdatedEpoch": 1513151633751,
                    "sparkUser": "jacek",
                    "startTime": "2017-12-13T07:53:53.751GMT",
                    "startTimeEpoch": 1513151633751
                }
            ],
            "id": "local-1513151634282",
            "name": "StatefulNetworkWordCount"
        }
    ]
    
  4. Accessed the endpoint for the Spark Streaming application @ http://localhost:4040/api/v1/applications/local-1513151634282/streaming/statistics.

    $ http http://localhost:4040/api/v1/applications/local-1513151634282/streaming/statistics
    HTTP/1.1 200 OK
    Content-Encoding: gzip
    Content-Length: 219
    Content-Type: application/json
    Date: Wed, 13 Dec 2017 08:00:10 GMT
    Server: Jetty(9.3.z-SNAPSHOT)
    Vary: Accept-Encoding, User-Agent
    
    {
        "avgInputRate": 0.0,
        "avgProcessingTime": 30,
        "avgSchedulingDelay": 0,
        "avgTotalDelay": 30,
        "batchDuration": 1000,
        "numActiveBatches": 0,
        "numActiveReceivers": 1,
        "numInactiveReceivers": 0,
        "numProcessedRecords": 0,
        "numReceivedRecords": 0,
        "numReceivers": 1,
        "numRetainedCompletedBatches": 376,
        "numTotalCompletedBatches": 376,
        "startTime": "2017-12-13T07:53:54.921GMT"
    }
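To sanity-check the responses programmatically, here is a small sketch (my addition, not part of the original answer) that pulls the first application id out of the /api/v1/applications payload and checks receiver health in the statistics payload; the field names match the JSON shown above:

```python
import json


def first_app_id(applications_json: str) -> str:
    """Extract the first application id from the /api/v1/applications response."""
    apps = json.loads(applications_json)
    return apps[0]["id"]


def receivers_healthy(stats_json: str) -> bool:
    """True when every registered receiver is currently active."""
    stats = json.loads(stats_json)
    return stats["numActiveReceivers"] == stats["numReceivers"]
```

Feeding these functions the payloads above would yield "local-1513151634282" and True (1 active receiver of 1 total), which is a cheap way to wire the endpoint into monitoring.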
    


Source: https://stackoverflow.com/questions/47780148/how-to-access-statistics-endpoint-for-a-spark-streaming-application
