Hadoop DistributedCache functionality in Spark

南方客 · 2021-01-12 15:24

I am looking for functionality in Spark similar to Hadoop's distributed cache. I need a relatively small data file (containing some index values) to be present on all worker nodes.

2 Answers
  • 2021-01-12 15:53

    Use broadcast variables; they remain effective even with larger datasets.

    From the Spark documentation: "Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner."
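
    A minimal Scala sketch, assuming an existing SparkContext named sc; the index map and its contents are hypothetical:

    ```scala
    // Build a small lookup map on the driver (hypothetical index data).
    val indexMap: Map[String, Int] = Map("a" -> 1, "b" -> 2)

    // Broadcast it once; each executor caches a single read-only copy.
    val indexBc = sc.broadcast(indexMap)

    // Tasks read the broadcast value instead of shipping the map with every task.
    val resolved = sc.parallelize(Seq("a", "b", "a"))
      .map(key => (key, indexBc.value.getOrElse(key, -1)))

    resolved.collect().foreach(println)
    ```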

  • 2021-01-12 15:55

    Have a look at the SparkContext.addFile() method; that is probably what you are looking for. Files added this way are downloaded to every node, and tasks can resolve the local copy's path with SparkFiles.get().
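
    A minimal Scala sketch, again assuming an existing SparkContext named sc; the file path and name are hypothetical:

    ```scala
    import org.apache.spark.SparkFiles

    // Ship a small index file to every node (hypothetical path).
    sc.addFile("hdfs:///data/index.txt")

    val matched = sc.parallelize(1 to 4).mapPartitions { iter =>
      // Each executor resolves the local path of its copy of the file.
      val localPath = SparkFiles.get("index.txt")
      val index = scala.io.Source.fromFile(localPath).getLines().toSet
      iter.filter(i => index.contains(i.toString))
    }

    matched.collect().foreach(println)
    ```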
