I'm working with Spark RDDs and created two identical-length arrays, one is the hour of a tweet, and the other is the text of a tweet. I'm looking to combine these into one data structure.
The answer by Ramesh Maharjan can work only under very specific assumptions: zip requires that both RDDs have the same number of partitions and the same number of elements in each partition. This is trivially satisfied for a ParallelCollectionRDD built from two local collections of the same length, but it is hard or impossible to guarantee in general.
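To make that concrete, here is a minimal sketch (using made-up toy RDDs, not the ones from the question) of how the alignment that zip relies on breaks after an ordinary transformation such as filter:

val a = sparkContext.parallelize(1 to 10, 4)
val b = a.filter(_ % 2 == 0)  // same number of partitions, but fewer elements per partition

a.zip(a).count()  // fine: identical partitioning and element counts
a.zip(b).count()  // fails at runtime with a SparkException about unequal elements per partition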
It is much better, though costlier, to join the two RDDs by index:
split_time.zipWithIndex.map(_.swap).join(
  split_text.zipWithIndex.map(_.swap)
).values
or:
val split_time_with_index = split_time.zipWithIndex.map(_.swap)
val split_text_with_index = split_text.zipWithIndex.map(_.swap)

val partitioner = new org.apache.spark.RangePartitioner(
  split_time.getNumPartitions, split_time_with_index
)

split_time_with_index.join(split_text_with_index, partitioner).values
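Either variant yields an RDD[(String, String)] with each hour paired to the text at the same index, so the same hour/keyword filter shown later in this thread applies directly; a quick usage sketch (the val name is just illustrative):

val tweet_tuple_joined = split_time_with_index
  .join(split_text_with_index, partitioner)
  .values

tweet_tuple_joined.filter { case (time, text) => time == "17" && text.contains("colts") }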
You should go with .zip to combine both RDDs into an RDD[(String, String)]. For example, I created two RDDs:
val split_time = sparkContext.parallelize(Array("17", "17", "17", "17", "17", "17", "17", "17", "17", "17"))
val split_text = sparkContext.parallelize(Array("17", "17", "17", "17", "colts", "17", "17", "colts", "17", "17"))
zip combines both RDDs, as I have mentioned above, into an RDD[Tuple2[String, String]]:
val tweet_tuple = split_time.zip(split_text)
After combining, all you need is to apply .filter:
tweet_tuple.filter(line => line._1 == "17" && line._2.toString.matches("colts"))
The output should be
(17,colts)
(17,colts)
Updated
Since your split_text RDD is a collection of sentences, contains should be used instead of matches. So the following logic should work after you've zipped:
tweet_tuple.filter(line => line._1 == "17" && line._2.toString.contains("colts"))