I am looking for functionality similar to Hadoop's distributed cache in Spark. I need a relatively small data file (with some index values) to be present on all nodes.
Broadcast variables should do what you need, and they remain effective even with larger datasets.
From the Spark documentation: "Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner."
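A minimal sketch of the broadcast approach. The lookup map and its contents here are hypothetical; the key API calls are `SparkContext.broadcast` and `Broadcast.value`:

```scala
import org.apache.spark.sql.SparkSession

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BroadcastExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical small lookup table of index values
    val indexValues: Map[Int, String] = Map(1 -> "a", 2 -> "b", 3 -> "c")

    // Ship the map to each executor once, instead of with every task
    val broadcastIndex = sc.broadcast(indexValues)

    // Tasks read the cached copy via .value
    val result = sc.parallelize(Seq(1, 2, 3))
      .map(id => broadcastIndex.value.getOrElse(id, "unknown"))
      .collect()

    result.foreach(println)
    spark.stop()
  }
}
```

Note that the broadcast value must be read-only; mutating it on an executor will not propagate anywhere.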
Please have a look at the SparkContext.addFile() method. I believe that is what you are looking for: it distributes a file to every node, where tasks can then read it from the local filesystem.
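A short sketch of this approach, assuming an existing SparkContext `sc`; the file path and name are hypothetical. The file is shipped once per node, and `SparkFiles.get` resolves its local path on each executor:

```scala
import org.apache.spark.SparkFiles

// Distribute the file to every node in the cluster
sc.addFile("/path/to/index-values.txt")

// On each executor, locate the node-local copy and read it
val lineCounts = sc.parallelize(1 to 4).map { _ =>
  val localPath = SparkFiles.get("index-values.txt")
  val source = scala.io.Source.fromFile(localPath)
  try source.getLines().size
  finally source.close()
}.collect()
```

Unlike a broadcast variable, this gives you an actual file on disk, which is closer to Hadoop's distributed cache semantics.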