I am new to Spark and trying to understand the difference between a normal RDD and a pair RDD. What are the use-cases where a pair RDD is used as opposed to a normal RDD? If possible, please explain with an example.
The key differences are:
Pair RDD operations (such as mapValues, reduceByKey, etc.) produce key/value pairs, whereas operations on a plain RDD (such as flatMap or reduce) give you a collection of values or a single value.
Pair RDD operations are applied to the values of each key in parallel, whereas operations on a plain RDD (like flatMap) are applied to the whole collection.
Pair RDDs hold KEY/VALUE pairs.
Example: Suppose you have a CSV with details of the airports in a country (columns: Airport ID, Name of airport, Main city served by airport, Country where airport is located). We create a normal RDD by reading that CSV from a path:
JavaRDD<String> airports = sc.textFile("in/airports.text");
If we want an RDD with airport names and the country each airport is located in, we have to create a pair RDD from the RDD above:
JavaPairRDD<String, String> airportsPairRDD = airports.mapToPair(line -> {
    String[] fields = line.split(",");
    // key = airport name (column 2), value = country (column 4)
    return new Tuple2<String, String>(fields[1], fields[3]);
});
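Once you have the pair RDD, you can act on each key in parallel. As a rough sketch building on airportsPairRDD above (the re-keying step and variable names here are my own addition, not part of the original example), you could count how many airports each country has:

// Sketch: re-key by country, then aggregate per key (needs scala.Tuple2 imported).
JavaPairRDD<String, Integer> airportsPerCountry = airportsPairRDD
        .mapToPair(pair -> new Tuple2<String, Integer>(pair._2(), 1)) // (country, 1)
        .reduceByKey((a, b) -> a + b);                                // sum per country
airportsPerCountry.collect()
        .forEach(t -> System.out.println(t._1() + ": " + t._2()));

This is exactly the kind of per-key aggregation that a plain RDD of strings cannot express directly.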
Pair RDD is just a way of referring to an RDD containing key/value pairs, i.e. tuples of data. It's not really a matter of using one as opposed to using the other. For instance, if you want to calculate something based on an ID, you'd group your input together by ID. This example just splits a line of text and returns a Pair RDD using the first word as the key [1]:
val pairs = lines.map(x => (x.split(" ")(0), x))
The Pair RDD that you end up with allows you to reduce values or to sort data based on the key, to name a few examples.
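To make that concrete, here is a Java analog of the Scala snippet above (a sketch only: `sc` is an assumed JavaSparkContext, the input path is hypothetical, and it needs org.apache.spark.api.java.* and scala.Tuple2 imported):

JavaRDD<String> lines = sc.textFile("in/lines.txt");
JavaPairRDD<String, String> pairs =
        lines.mapToPair(line -> new Tuple2<String, String>(line.split(" ")[0], line));

// Reduce values by key: count how many lines start with each first word.
JavaPairRDD<String, Integer> counts = pairs
        .mapToPair(p -> new Tuple2<String, Integer>(p._1(), 1))
        .reduceByKey((a, b) -> a + b);

// Sort the data based on the key.
JavaPairRDD<String, String> sorted = pairs.sortByKey();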
It would probably do you good to read the link at the bottom, from which I shamelessly copied the example, since the understanding of Pair RDDs and how to work with tuples is quite fundamental to many of the things that you will do in Spark. Read up on 'Transformations on Pair RDDs' to get an understanding of what you typically would want to do once you have your pairs.
[1] https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html
Spark provides special operations on RDDs containing key/value pairs. These RDDs are called pair RDDs. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network. For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key. It is common to extract fields from an RDD (representing, for instance, an event time, customer ID, or other identifier) and use those fields as keys in pair RDD operations.
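As a minimal, self-contained sketch of those two operations (the customer data and variable names below are made up for illustration; assumes an existing JavaSparkContext sc plus java.util.Arrays, scala.Tuple2, and org.apache.spark.api.java.* imports):

JavaPairRDD<String, Integer> purchases = sc.parallelizePairs(Arrays.asList(
        new Tuple2<String, Integer>("customer1", 20),
        new Tuple2<String, Integer>("customer2", 35),
        new Tuple2<String, Integer>("customer1", 15)));

JavaPairRDD<String, String> names = sc.parallelizePairs(Arrays.asList(
        new Tuple2<String, String>("customer1", "Alice"),
        new Tuple2<String, String>("customer2", "Bob")));

// reduceByKey(): aggregate purchase amounts separately for each customer ID.
JavaPairRDD<String, Integer> totals = purchases.reduceByKey((a, b) -> a + b);

// join(): merge the two RDDs by grouping elements with the same key.
JavaPairRDD<String, Tuple2<Integer, String>> joined = totals.join(names);
joined.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));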