What does in-memory data storage mean in the context of Apache Spark?


I have read that Apache Spark stores data in-memory. However, Apache Spark is meant for analyzing huge volumes of data (a.k.a. big data analytics). In this context, what does in-memory data storage actually mean?

1 Answer

    In Hadoop, data is persisted to disk between steps, so a typical multi-step job ends up looking something like this:

    hdfs -> read & map -> persist -> read & reduce -> hdfs -> read & map -> persist -> read & reduce -> hdfs
    

    This is a brilliant design, and it makes perfect sense when you're batch-processing files that fit the map-reduce pattern well. But for some workloads this can be extremely slow; iterative algorithms are hit especially hard. You've spent time building some data structure (a graph, for instance), and all you want to do in each step is update a score. Persisting and re-reading the entire graph to and from disk slows your whole job down.

    Spark uses a more general engine that supports cyclic data flows and tries to keep things in memory between job steps. This means that if you can choose a data structure and partitioning strategy such that your data doesn't shuffle around between steps, you can update it efficiently without serialising and writing everything to disk in between. That's why Spark's front page shows a chart with a roughly 100x speedup on logistic regression.
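    As a concrete illustration, here's a minimal sketch in Scala of the kind of iterative job I mean: a PageRank-style score update where the graph is partitioned once, cached, and then reused across iterations instead of being re-read from disk. The input path, the number of partitions and iterations, and the scoring formula are all placeholders made up for the example.

    ```scala
    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object IterativeScores {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("iterative-scores").setMaster("local[*]"))

        // Hypothetical input: one "src dst" edge per line.
        val links = sc.textFile("hdfs:///data/edges.txt")
          .map { line => val Array(src, dst) = line.split("\\s+"); (src, dst) }
          .groupByKey()
          .partitionBy(new HashPartitioner(8)) // fix the partitioning so the graph doesn't shuffle each iteration
          .cache()                             // keep the graph in memory across iterations

        var scores = links.mapValues(_ => 1.0)

        for (_ <- 1 to 10) {
          // Each iteration only recomputes the scores; the cached graph is never re-read from disk.
          val contribs = links.join(scores).values
            .flatMap { case (dsts, score) => dsts.map(dst => (dst, score / dsts.size)) }
          scores = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
        }

        scores.take(5).foreach(println)
        sc.stop()
      }
    }
    ```

    The important bit is the `partitionBy(...).cache()` on `links`: because the partitioning is fixed and the RDD stays in memory, each iteration only moves the (much smaller) score updates around.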

    If you write a Spark job that just computes a value from each input line of your dataset and writes the result back to disk, Hadoop and Spark will be pretty much equal in performance (Spark starts up faster, but that hardly matters when a single step takes hours to process the data).
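    For contrast, a single-pass job like that (again just a sketch, with made-up paths and a made-up per-line computation) has no intermediate state for Spark to keep in memory, so the in-memory engine buys you very little:

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    object SinglePassJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("single-pass"))

        // One read, one map, one write: there is no intermediate result to cache,
        // so the in-memory engine has no real advantage over MapReduce here.
        sc.textFile("hdfs:///data/input")
          .map(line => line.split(",")(2).toDouble * 1.2) // hypothetical per-line computation
          .saveAsTextFile("hdfs:///data/output")

        sc.stop()
      }
    }
    ```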

    If Spark cannot hold an RDD in memory between steps, it will spill it to disk, much like Hadoop does. But remember that Spark isn't a silver bullet: there are corner cases where you'll have to fight Spark's in-memory nature because it causes OutOfMemory problems in situations where Hadoop would simply write everything to disk.
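    The storage level is the knob that controls this spilling. Here's a small spark-shell sketch (assuming the predefined SparkContext `sc`; the input path is made up) that asks Spark to spill cached partitions to local disk rather than drop and recompute them when memory runs out:

    ```scala
    import org.apache.spark.storage.StorageLevel

    // For RDDs, .cache() is shorthand for persist(StorageLevel.MEMORY_ONLY): partitions that
    // don't fit are dropped and recomputed. MEMORY_AND_DISK spills them to local disk instead.
    val bigRdd = sc.textFile("hdfs:///data/huge")
      .map(_.toUpperCase)
      .persist(StorageLevel.MEMORY_AND_DISK)

    bigRdd.count()  // the first action materialises the RDD; overflow partitions go to disk
    ```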

    I personally like to think of it this way: in your cluster of 500 machines with 64 GB of RAM each, Hadoop is designed to batch-process your 500 TB job faster by distributing disk reads and writes, while Spark exploits the fact that 500 * 64 GB = 32 TB of memory can likely solve quite a few of your other problems entirely in memory!
