I'm working with Spark RDDs and created two identical-length arrays, one is the hour of a tweet, and the other is the text of a tweet. I'm looking to combine these into one data structure.
The answer by Ramesh Maharjan can work only under very specific assumptions: zip requires that both RDDs have the same number of partitions and the same number of elements in each partition. This is trivially satisfied for a ParallelCollectionRDD built from two local collections of the same length, but it is hard or impossible to guarantee in general.
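To make that concrete, here is a minimal sketch (using made-up toy RDDs, not the ones from the question) of how the alignment that zip relies on breaks after an ordinary transformation such as filter:

val a = sparkContext.parallelize(1 to 10, 4)
val b = a.filter(_ % 2 == 0)  // same number of partitions, but fewer elements per partition

a.zip(a).count()  // fine: identical partitioning and element counts
a.zip(b).count()  // fails at runtime with a SparkException about unequal elements per partition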
It is much better, though costlier, to join the two RDDs by index:
split_time.zipWithIndex.map(_.swap).join(
  split_text.zipWithIndex.map(_.swap)
).values
or:
val split_time_with_index = split_time.zipWithIndex.map(_.swap)
val split_text_with_index = split_text.zipWithIndex.map(_.swap)

val partitioner = new org.apache.spark.RangePartitioner(
  split_time.getNumPartitions, split_time_with_index
)

split_time_with_index.join(split_text_with_index, partitioner).values
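Either variant yields an RDD[(String, String)] with each hour paired to the text at the same index, so the same hour/keyword filter shown later in this thread applies directly; a quick usage sketch (the val name is just illustrative):

val tweet_tuple_joined = split_time_with_index
  .join(split_text_with_index, partitioner)
  .values

tweet_tuple_joined.filter { case (time, text) => time == "17" && text.contains("colts") }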
You should go with .zip to combine both RDDs into an RDD[(String, String)]. For example, I created two RDDs:
val split_time = sparkContext.parallelize(Array("17", "17", "17", "17", "17", "17", "17", "17", "17", "17"))
val split_text = sparkContext.parallelize(Array("17", "17", "17", "17", "colts", "17", "17", "colts", "17", "17"))
zip combines both RDDs, as I have mentioned above, into an RDD[Tuple2[String, String]]:
val tweet_tuple = split_time.zip(split_text)
After combining, all you need is to apply .filter:
tweet_tuple.filter(line => line._1 == "17" && line._2.toString.matches("colts"))
The output should be
(17,colts)
(17,colts)
Updated
Since your split_text RDD is a collection of sentences, contains should be used instead of matches. So the following logic should work after you've zipped:
tweet_tuple.filter(line => line._1 == "17" && line._2.toString.contains("colts"))