Spark Java Code Structure


Question


All the Spark Java examples I can find online use a single static class that contains the entire program's functionality. How would I structure an ordinary Java program containing several non-static classes so that I can make calls to Spark, preferably from several Java objects?

There are several reasons why this is not entirely straightforward:

  1. JavaSparkContext needs to be available everywhere new RDDs are created, and it is not serializable. At the time of writing, only a single SparkContext can run reliably in a single JVM. For now I am using one static class in my program just for the JavaSparkContext, HiveContext and SparkConf, so that they are available everywhere (see the first sketch after this list).
  2. Anonymous classes are not practical: almost all examples online exclusively pass anonymous classes to Spark operations. But using an anonymous class requires the enclosing class to be serializable and causes the entire enclosing class to be sent to the worker nodes, which is rarely what you want. To prevent this, you have to define a separate class outside the enclosing class that implements the interface for the call; then only the contents of the new class are serialized. (By implementing the call-containing interface, a class also implements Serializable.) Alternatively, if you want the code inside the enclosing class, you can use a static nested class (see the second sketch after this list).
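
For the first point, a minimal sketch of such a static holder class might look like the following (the class name, app name, and master setting are illustrative assumptions, not part of the original question):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

// Hypothetical holder: creates the per-JVM singleton contexts lazily and
// exposes them to every class that needs to create RDDs or run Hive queries.
public final class SparkContexts {
    private static JavaSparkContext sc;
    private static HiveContext hiveContext;

    private SparkContexts() {} // no instances; everything is static

    public static synchronized JavaSparkContext getContext() {
        if (sc == null) {
            SparkConf conf = new SparkConf()
                    .setAppName("MyApp")      // illustrative app name
                    .setMaster("local[*]");   // illustrative; usually supplied via spark-submit
            sc = new JavaSparkContext(conf);
        }
        return sc;
    }

    public static synchronized HiveContext getHiveContext() {
        if (hiveContext == null) {
            // HiveContext wraps the underlying Scala SparkContext
            hiveContext = new HiveContext(getContext().sc());
        }
        return hiveContext;
    }
}
```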

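For the second point, here is a hedged sketch of a separately defined, named function class (the word-length example and all names are made up for illustration). Because Spark's `Function` interface already extends `Serializable`, only this small class, and not any enclosing class, is shipped to the workers:

```java
import org.apache.spark.api.java.function.Function;

// Hypothetical top-level function class: defined outside any business class,
// so serializing it does not drag an enclosing instance along with it.
public class LineLength implements Function<String, Integer> {
    @Override
    public Integer call(String line) {
        return line.length();
    }
}
```

It can then be used from any object in the program via the static holder above:

```java
JavaRDD<Integer> lengths = SparkContexts.getContext()
        .textFile("hdfs:///path/to/input")  // illustrative path
        .map(new LineLength());
```

A static nested class achieves the same effect when you prefer to keep the function next to the code that uses it, since a static nested class holds no reference to an instance of its enclosing class.
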
There are probably even more things that demand a special structure when you use Spark. Can the structure I used to solve the two issues above be improved?

Source: https://stackoverflow.com/questions/34750854/spark-java-code-structure
