Question
An RDD is a collection of elements partitioned across the nodes of the cluster. It is Spark's core component and abstraction.
Batches: the Spark Streaming API simply divides the incoming data into batches, and those batches are likewise collections of streaming objects/elements. Depending on the requirement, a set of batches can be defined as a time-based batch window or as a batch window based on intensive online activity.
What is the difference between RDDs and batches, exactly?
Answer 1:
RDDs and batches are essentially different but related things in Spark.
As mentioned in the question, RDDs are a fundamental Spark concept, as they form the base data structure for distributed computations in Spark.
An RDD[T] is a virtual collection of elements of type T distributed over partitions in a cluster.
In Spark Streaming, a "batch" is the result of collecting data during the batchInterval time. The data is collected in 'blocks', and the size of the blocks is determined by the spark.streaming.blockInterval config parameter.
Those blocks are submitted to the Spark Core engine for processing. The set of blocks for each batch becomes one RDD, and each block is one RDD partition.
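A minimal sketch of how the two intervals relate, assuming a socket text stream as the input source (the host, port and application name are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object BatchIntervalSketch {
      def main(args: Array[String]): Unit = {
        // blockInterval controls how often received data is chunked into blocks;
        // each block becomes one partition of the batch's RDD.
        val conf = new SparkConf()
          .setAppName("BatchIntervalSketch")
          .set("spark.streaming.blockInterval", "200ms")

        // The batch interval (here 2 seconds) controls how often a batch, i.e. one RDD, is produced.
        val ssc = new StreamingContext(conf, Seconds(2))

        // Hypothetical input source: a text stream on localhost:9999.
        val lines = ssc.socketTextStream("localhost", 9999)
        lines.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

With a 2-second batch interval and a 200 ms block interval, each batch RDD would have roughly 2000 ms / 200 ms = 10 partitions per receiver.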
It would be incorrect to say that batches and RDDs are the same thing. A Spark Streaming batch of data becomes an RDD when it's submitted to Spark Core for processing.
Answer 2:
One batch is essentially one RDD; however, in Streaming you are typically not operating on RDDs but on DStreams, which offer the mentioned time- and window-based functionality. You have to explicitly dive down to the RDDs using foreachRDD, as sketched below.
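A minimal sketch of diving down to the underlying RDD with foreachRDD, again assuming a hypothetical socket source:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object ForeachRddSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("ForeachRddSketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(2))

        // Hypothetical input source producing lines of text.
        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

        // foreachRDD exposes the RDD behind each batch, so plain Spark Core
        // operations (count, collect, saveAsTextFile, ...) can be applied to it.
        words.foreachRDD { (rdd, time) =>
          println(s"Batch at $time contains ${rdd.count()} words " +
            s"spread over ${rdd.getNumPartitions} partitions")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }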
DStream is the abstraction for describing a streaming job. At runtime, the DStream is translated into RDDs, because Spark Streaming works on top of Spark Core and Spark only knows how to process RDDs. This is why it's not real stream processing but micro-batching.
Answer 3:
The basic difference lies in the architecture of Spark and Spark Streaming (micro-batching). As you may know, for offline processing you don't need Spark Streaming - it was created to process data online, i.e. as it arrives, and it treats the stream as a continuous series of batch computations on small batches of data.
The creators of Spark decided to provide an abstraction called DStreams (discretized streams). Internally, a DStream is represented as a sequence of RDDs, one arriving at each time step (e.g., every 0.5 seconds), and each of them holds one time slice of the data in the stream. At the beginning of each time interval (e.g., 0.5 seconds) a new batch is created; data that arrives during that interval belongs to that batch, until the interval ends and the batch stops growing.
From a high-level perspective, DStreams provide the same operations as RDDs, plus additional time-related methods (such as sliding windows); a short sketch follows.
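A minimal sketch of such a time-related operation, assuming a hypothetical socket source of text lines; reduceByKeyAndWindow aggregates over a sliding window that spans several batches (and therefore several underlying RDDs):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SlidingWindowSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("SlidingWindowSketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(2))

        // Hypothetical input source producing lines of text.
        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

        // Count words over the last 30 seconds, recomputed every 10 seconds.
        // Each window covers several 2-second batches, i.e. several RDDs.
        val windowedCounts = words
          .map(word => (word, 1))
          .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

        windowedCounts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }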
PS: I have just now seen the YouTube link. I guess that's the best answer - it explains thoroughly what you want to know :)
Source: https://stackoverflow.com/questions/33438168/difference-between-rdds-and-batches-in-spark