Question
As of Spark 2.2.0, there are new endpoints in the API for getting information about streaming jobs.
I run Spark on EMR clusters, using Spark 2.2.0 in cluster mode.
When I hit the endpoint for my streaming jobs, all it gives me is the error message:
no streaming listener attached to <stream name>
I've dug through the Spark codebase a bit, but this feature is not very well documented. So I'm curious: is this a bug? Is there some configuration I need to do to get this endpoint working?
This appears to be an issue specifically when running on the cluster. The same code running on Spark 2.2.0 on my local machine shows the statistics as expected, but gives that error message when run on the cluster.
Answer 1:
I'm using the very latest Spark 2.3.0-SNAPSHOT, built today from master, so YMMV. It worked fine for me.
Is there some configuration I need to do to get this endpoint working?
No. It's supposed to work fine with no changes to the default configuration.
Make sure you use the host and port of the driver (rumor has it you could also access port 18080 of the Spark History Server, which shows all the same endpoints and the same jobs running, but with no streaming listener attached).
As you can see in the source code where the error message lives, it can only happen when ui.getStreamingJobProgressListener has not been registered (i.e. it ends up in case None).
So the question now is: why would that SparkListener not be registered?
That leads us to the streamingJobProgressListener var, which is set exclusively via the setStreamingJobProgressListener method while StreamingTab is being instantiated (which was why I asked whether you can see the Streaming tab).
In other words, if you see the Streaming tab in web UI, you have the streaming metric endpoint(s) available. Check the URL to the endpoint which should be in the format:
http://[driverHost]:[port]/api/v1/applications/[appId]/streaming/statistics
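Getting any one of the three placeholders wrong yields a 404 or the listener error, so it can help to assemble the URL programmatically. A minimal sketch (the helper name and the sample values are illustrative; substitute your own driver host, port, and application id):

```python
def streaming_statistics_url(driver_host: str, port: int, app_id: str) -> str:
    """Build the Spark REST API URL for a streaming application's statistics."""
    return (f"http://{driver_host}:{port}"
            f"/api/v1/applications/{app_id}/streaming/statistics")

# Placeholder values -- replace with your own deployment's details:
url = streaming_statistics_url("localhost", 4040, "local-1513151634282")
print(url)
```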
I tried to reproduce your case, and the following steps led me to a working setup.
Started one of the official Spark Streaming examples (after first running nc -lk 9999 in another terminal):
$ ./bin/run-example streaming.StatefulNetworkWordCount localhost 9999
Opened the web UI at http://localhost:4040/streaming to make sure the Streaming tab is there.
Made sure http://localhost:4040/api/v1/applications/ responds with application ids.
$ http http://localhost:4040/api/v1/applications/
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 266
Content-Type: application/json
Date: Wed, 13 Dec 2017 07:58:04 GMT
Server: Jetty(9.3.z-SNAPSHOT)
Vary: Accept-Encoding, User-Agent

[
    {
        "attempts": [
            {
                "appSparkVersion": "2.3.0-SNAPSHOT",
                "completed": false,
                "duration": 0,
                "endTime": "1969-12-31T23:59:59.999GMT",
                "endTimeEpoch": -1,
                "lastUpdated": "2017-12-13T07:53:53.751GMT",
                "lastUpdatedEpoch": 1513151633751,
                "sparkUser": "jacek",
                "startTime": "2017-12-13T07:53:53.751GMT",
                "startTimeEpoch": 1513151633751
            }
        ],
        "id": "local-1513151634282",
        "name": "StatefulNetworkWordCount"
    }
]
Accessed the endpoint for the Spark Streaming application @ http://localhost:4040/api/v1/applications/local-1513151634282/streaming/statistics.
$ http http://localhost:4040/api/v1/applications/local-1513151634282/streaming/statistics
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 219
Content-Type: application/json
Date: Wed, 13 Dec 2017 08:00:10 GMT
Server: Jetty(9.3.z-SNAPSHOT)
Vary: Accept-Encoding, User-Agent

{
    "avgInputRate": 0.0,
    "avgProcessingTime": 30,
    "avgSchedulingDelay": 0,
    "avgTotalDelay": 30,
    "batchDuration": 1000,
    "numActiveBatches": 0,
    "numActiveReceivers": 1,
    "numInactiveReceivers": 0,
    "numProcessedRecords": 0,
    "numReceivedRecords": 0,
    "numReceivers": 1,
    "numRetainedCompletedBatches": 376,
    "numTotalCompletedBatches": 376,
    "startTime": "2017-12-13T07:53:54.921GMT"
}
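Once the endpoint responds, the JSON is easy to consume programmatically, e.g. for a monitoring check. A hedged sketch (the field values below are copied from the response above; in practice you would fetch the payload with an HTTP client, and the "keeping up" criterion is just one plausible health heuristic, not part of the Spark API):

```python
import json

# Abbreviated statistics payload, values taken from the response above.
stats = json.loads("""
{
  "avgProcessingTime": 30,
  "avgSchedulingDelay": 0,
  "avgTotalDelay": 30,
  "batchDuration": 1000,
  "numActiveReceivers": 1,
  "numTotalCompletedBatches": 376
}
""")

# Heuristic health check: at least one active receiver, and the average
# end-to-end delay is below the batch interval, so the stream keeps up.
keeping_up = (stats["numActiveReceivers"] > 0
              and stats["avgTotalDelay"] < stats["batchDuration"])
print(keeping_up)
```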
Source: https://stackoverflow.com/questions/47780148/how-to-access-statistics-endpoint-for-a-spark-streaming-application