Can someone help me understand the differences between running Spark on Kubernetes and running it on the Hadoop ecosystem?
Be forewarned: this is a theoretical answer, because I no longer run Spark and thus have not run Spark on Kubernetes, but I have maintained both a Hadoop cluster and now a Kubernetes cluster, so I can speak to some of their differences.
Kubernetes is as battle-hardened a resource manager, with API access to all its components, as a reasonable person could wish for. It provides very painless declarative resource limits (both CPU and RAM, and even syscall capabilities), very, very painless log egress (both back to the user via kubectl and out of the cluster using multiple flavors of log-management approaches), and an unprecedented level of metrics gathering and egress that lets one keep an eye on the health of the cluster and the jobs running on it; the list goes on and on.
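To make that concrete, here is a minimal sketch of those declarative limits on a single pod; the pod name and image are placeholders I've made up for illustration, but `resources` and `securityContext` are the stock Kubernetes fields:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-executor-demo                    # hypothetical name
spec:
  containers:
    - name: executor
      image: registry.example.com/spark:latest # placeholder image
      resources:
        requests:          # what the scheduler reserves on a Node
          cpu: "1"
          memory: 2Gi
        limits:            # hard caps enforced via cgroups at runtime
          cpu: "2"
          memory: 4Gi
      securityContext:
        capabilities:
          drop: ["ALL"]    # strip Linux capabilities, restricting syscalls
```

Once that pod is running, `kubectl logs spark-executor-demo` streams its output straight back to you, which is the painless log egress mentioned above.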
But perhaps the biggest reason one would choose to run Spark on Kubernetes is the same reason one would choose to run Kubernetes at all: shared resources, rather than having to create new machines for different workloads (well, plus all of the benefits above). So if you have a dedicated Spark cluster, it is very, very likely to burn $$$ while no job is actively running on it, whereas Kubernetes will cheerfully schedule other jobs onto those Nodes while they aren't running Spark jobs. Yes, I am aware that Mesos and YARN are "generic" cluster resource managers, but it has not been my experience that they are as painless or as ubiquitous as Kubernetes.
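For reference, this is roughly what submission looks like: since Spark 2.3 the stock spark-submit accepts a k8s:// master URL and runs the driver and executors as pods. The API-server address, image name, and jar version below are placeholders, not something specific to any cluster:

```bash
spark-submit \
  --master k8s://https://<api-server-host>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.12-<version>.jar
```

When the job finishes, the executor pods go away and those Nodes immediately become available for other workloads, which is exactly the shared-resources win described above.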
I would welcome someone posting the counter-narrative, or contributing more hands-on experience of Spark on Kubernetes, but those are my thoughts.