To complement Matthew L Daniel's opinion, mine focuses on 2 interesting concepts that Kubernetes can bring to data pipelines:
- namespaces + resource quotas make it easier to separate and share resources, for instance by reserving many more resources for the data-intensive, more unpredictable or business-critical parts of a pipeline without necessarily provisioning a new node every time (see the kubectl sketch after this list)
- horizontal scaling - basically, when the Kubernetes scheduler doesn't succeed in allocating the new pods that Spark's dynamic resource allocation may create in the future (it's not implemented yet in the Kubernetes backend), the cluster is able to provision the necessary nodes dynamically (e.g. through https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#introduction). That said, horizontal scaling is currently difficult to achieve in Apache Spark since dynamic allocation requires keeping the external shuffle service running even for a shut down executor, in order to preserve its shuffle files. So even if our load decreases, we'll still keep the nodes created to handle its increase. But once this problem is solved, Kubernetes autoscaling will be an interesting option to reduce costs, improve processing performance and make pipelines elastic (see the spark-submit sketch below).
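
To illustrate the first point, here is a minimal sketch of such an isolation; the namespace name (batch-critical), the quota name and the resource values are purely hypothetical:

```bash
# Hypothetical namespace dedicated to business-critical, data-intensive jobs
kubectl create namespace batch-critical

# Reserve a fixed share of the cluster for this namespace: the aggregated
# requests/limits of all its pods can't exceed these hypothetical values
kubectl create quota batch-critical-quota --namespace=batch-critical \
  --hard=requests.cpu=16,requests.memory=64Gi,limits.cpu=24,limits.memory=96Gi
```

Note that once such a compute quota is active, every pod created in the namespace has to declare its resource requests and limits - something Spark's Kubernetes backend handles through settings like spark.executor.memory or spark.kubernetes.executor.limit.cores.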
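
And to see how a Spark 2.3.0 job lands in that namespace (and therefore counts against its quota), a sketch of a submission; the API server address, image name and instance count are placeholders:

```bash
# The driver and executor pods are created in the given namespace. If the
# cluster lacks capacity to place them they stay Pending - exactly the signal
# the cluster-autoscaler reacts to by provisioning new nodes; if the namespace
# quota is exceeded, their creation is rejected outright instead.
bin/spark-submit \
  --master k8s://https://<api-server-host>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.namespace=batch-critical \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```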
However, please note that all these remarks are based only on personal observations and some local tests of the early Spark on Kubernetes feature (2.3.0).