I am new to Scala and Spark and am trying to understand a few basic things here.
Spark version used: 1.5.
Why does the value of sum not get updated?
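A simplified version of what I am running (df, column1, and the input path stand in for my real code; I am using the Spark 1.5 shell, where sqlContext is already defined):

val df = sqlContext.read.json("data.json")
var sum = 0
df.select("column1").distinct.foreach(row => sum += 1)
println(sum) // still prints 0 after the job finishes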
The way you reason about the program is wrong: foreach is executed independently on each executor, and every task modifies its own local copy of sum. There is no global shared state here. Just count the values directly:
df.select("column1").distinct.count
If you really want to handle this manually, you'll need some type of reduce:
df.select("column1").distinct.rdd.map(_ => 1L).reduce(_ + _)
Read the Programming Guide; it has a section devoted to this: Understanding Closures. If you actually need to collect some state, you can use Accumulators (but note that executor code can only add to an accumulator; only the driver can read its value). Try to do without them first, though: think in terms of the available transformations instead of mutating state.
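For completeness, a rough sketch of the accumulator version (Spark 1.x accumulator API, run in the shell where sc is predefined; since foreach is an action, each task's update is applied exactly once even if a task is retried):

val acc = sc.accumulator(0, "distinct values") // driver-side accumulator
df.select("column1").distinct.foreach(_ => acc += 1) // executors can only add to it
println(acc.value) // the value is readable only on the driver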