Question
I have a Google Dataproc Spark cluster set up with one master node and 16 worker nodes. The master has 2 CPUs and 13 GB of memory, and each worker has 2 CPUs and 3.5 GB of memory. I am running a rather network-intensive job: I have an array of 16 objects and I partition it into 16 partitions so that each worker gets one object. The objects make about 2.5 million web requests in total and aggregate the results to send back to the master. Each request returns a Solr response of less than 50 KB; one field (an ID, as a string) is extracted from each response and added to the list sent back to the master. The whole job finishes in about 1-2 hours.
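For concreteness, here is a minimal PySpark sketch of the pattern described above (one object per partition, extracted Solr IDs collected back to the driver). The Solr endpoint, core name, query parameters, and the `fetch_ids` helper are placeholders for illustration, not the asker's actual code.

```python
from pyspark.sql import SparkSession
import requests

spark = SparkSession.builder.appName("solr-id-harvest").getOrCreate()
sc = spark.sparkContext

# 16 work items, intended to map to one per worker node.
work_items = list(range(16))

def fetch_ids(item):
    """Issue Solr queries for one work item and return only the extracted ID strings."""
    ids = []
    for page in range(10):  # stand-in for the ~2.5M requests in the real job
        resp = requests.get(
            "http://solr-host:8983/solr/mycore/select",  # placeholder Solr endpoint
            params={"q": f"item:{item}", "start": page * 100, "rows": 100, "wt": "json"},
        ).json()
        ids.extend(doc["id"] for doc in resp["response"]["docs"])
    return ids

# numSlices=16 puts exactly one item in each partition, so each worker handles one object.
collected = sc.parallelize(work_items, numSlices=16).flatMap(fetch_ids).collect()
print(len(collected))
```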
However, at some point in the execution I keep getting an error where the master loses an executor's heartbeat and kills it. The master receives no details beyond the timeout, and the worker's log shows it was running normally. I tried installing Stackdriver monitoring to see whether this is a RAM problem, but the agent latency is over an hour when it should be at most 2 minutes, so I do not have any up-to-date memory information.
Does anyone have an idea why this is happening? My guesses are that the network ports are being flooded by the job so the instance can't send out the heartbeat or instance metrics, that it is a RAM issue (I get the same error for pretty much every RAM value I try), or that there is some issue on Google's side.
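For reference, the executor-lost behaviour described here is governed by two standard Spark properties, `spark.executor.heartbeatInterval` (default 10s) and `spark.network.timeout` (default 120s). The sketch below only shows where they would be set; the values are illustrative, not a recommendation.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("solr-id-harvest")
    # How often each executor pings the driver; must stay well below spark.network.timeout.
    .config("spark.executor.heartbeatInterval", "30s")
    # How long the driver waits without a heartbeat before declaring the executor lost.
    .config("spark.network.timeout", "600s")
    .getOrCreate()
)
```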
Answer 1:
Thanks to the comment by @Dennis, I found that an OOM exception was being thrown by the executor that was killed. I never saw it before because the error was only written to standard out, rather than to any of the error logs as one would expect.
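For anyone debugging the same symptom: on a YARN-backed Dataproc cluster, the aggregated container logs (including executor stdout, where this OOM surfaced) can be pulled with the standard `yarn logs` command. A minimal sketch, run from the master node with a placeholder application ID:

```python
import subprocess

# Placeholder ID; list running/finished apps with `yarn application -list -appStates ALL`.
app_id = "application_1234567890123_0001"

# Fetch the aggregated YARN container logs for the application, which include
# each executor's stdout alongside its stderr.
result = subprocess.run(
    ["yarn", "logs", "-applicationId", app_id],
    capture_output=True, text=True,
)
print(result.stdout)
```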
Source: https://stackoverflow.com/questions/39168693/google-dataproc-timing-out-and-killing-excutors