Difference between RDDs and Batches in Spark?

Submitted by 大城市里の小女人 on 2019-12-12 11:32:00

Question


An RDD is a collection of elements partitioned across the nodes of the cluster. It is Spark's core component and abstraction.

Batches: the Spark Streaming API simply divides incoming data into batches, and those batches are likewise collections of streaming objects/elements. Depending on the requirement, a set of batches can be defined in the form of a time-based batch window or an activity-based batch window.

What exactly is the difference between RDDs and batches?


Answer 1:


RDDs and batches are essentially different but related things in Spark. As mentioned in the question, RDDs are a fundamental Spark concept, as they form the base data structure for distributed computations in Spark.

An RDD[T] is a virtual collection of elements of type T distributed over partitions in a cluster.
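As a minimal sketch (the local master, app name, and the small sample collection are illustrative assumptions, not anything from the question), an RDD[String] distributed over four partitions can be created like this:

    import org.apache.spark.{SparkConf, SparkContext}

    // Build a local SparkContext for the sketch.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // RDD[String]: elements of type String spread over 4 partitions.
    val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 4)

    println(rdd.getNumPartitions) // 4
    println(rdd.count())          // 4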

In Spark Streaming, a "batch" is the result of collecting data during the batchInterval. The data is collected in 'blocks', and the size of those blocks is determined by the spark.streaming.blockInterval config parameter.

Those blocks are submitted to the Spark Core engine for processing. The set of blocks for each batch becomes one RDD and each block is one RDD partition.
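To make the two knobs concrete, here is a hedged sketch; the 2-second batch interval, the 500 ms block interval, and the socket source on localhost:9999 are all illustrative assumptions. With these values, each batch would be assembled from roughly 2000 ms / 500 ms = 4 blocks per receiver, so its RDD would have about 4 partitions per receiver.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("streaming-sketch")
      .setMaster("local[2]") // one core for the receiver, one for processing
      .set("spark.streaming.blockInterval", "500ms")

    // batchInterval = 2 seconds: a new batch (and a new RDD) every 2 s.
    val ssc = new StreamingContext(conf, Seconds(2))

    // Hypothetical source: text lines from a socket on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print()

    ssc.start()
    ssc.awaitTermination()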

It would be incorrect to say that batches and RDDs are the same thing. A Spark Streaming batch of data becomes an RDD when it's submitted for processing to the Spark Core.




Answer 2:


One batch is essentially one RDD; however, in Streaming you typically do not operate on RDDs but on DStreams, which offer the time- and window-based functionality mentioned above. You have to explicitly drop down to the RDDs using foreachRDD.

A DStream is the abstraction for describing a streaming job. At runtime, the DStream is translated into RDDs, because Spark Streaming works on top of Spark Core and Spark only knows how to process RDDs. This is why it is not true stream processing but micro-batching.
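A minimal sketch of dropping down to the per-batch RDD with foreachRDD; the socket source and the 1-second batch interval are assumptions made for the example:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("foreachRDD-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Hypothetical source on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The DStream describes the job; foreachRDD exposes the concrete RDD
    // that Spark Core materializes for each 1-second micro-batch.
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time: RDD ${rdd.id} with ${rdd.getNumPartitions} partitions")
    }

    ssc.start()
    ssc.awaitTermination()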




Answer 3:


The basic difference lies in the architecture of Spark and Spark Streaming (micro-batch). As you may know, for offline processing you don't need Spark Streaming; it was created to process data online, as it arrives, and this is treated as a continuous series of batch computations on small batches of data.

The creators of Spark decided to provide an abstraction called DStreams (discretized streams). Internally, these are represented as a sequence of RDDs arriving at each time step (e.g., every 0.5 seconds), each holding one time slice of the data in the stream. At the beginning of each time interval a new batch is created; any data that arrives during that interval belongs to this batch, and the batch stops growing when the interval ends.

From a high-level perspective, DStreams provide the same operations as RDDs, plus additional methods related to time (such as sliding windows).
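As an illustration of those time-based methods, a sketch of a sliding-window word count; the 1-second batch interval, 30-second window, and 10-second slide are arbitrary example values (window and slide must be multiples of the batch interval):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("window-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // one batch per second

    // Hypothetical source on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Word counts over the last 30 seconds (30 one-second batches),
    // recomputed every 10 seconds.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

    counts.print()
    ssc.start()
    ssc.awaitTermination()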

PS: I have just now seen the YouTube link. I guess that's the best answer - it explains thoroughly what you want to know :)



Source: https://stackoverflow.com/questions/33438168/difference-between-rdds-and-batches-in-spark
