How to combine two RDD[String]s index-wise?

清酒与你 2021-01-23 22:10

I'm working with Spark RDDs and created two identical-length arrays: one holds the hour of each tweet, and the other holds the tweet's text. I'm looking to combine these into one data

2 Answers
  • 2021-01-23 22:33

    The answer by Ramesh Maharjan works only under very specific assumptions:

    • Both RDDs have the same number of partitions.
    • Corresponding partitions have the same number of elements.

    This is trivial for a ParallelCollectionRDD (what parallelize creates), but it is hard or impossible to guarantee in general.
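
    To see why these assumptions matter, here is a minimal sketch, assuming a local SparkContext. The names ZipAssumptionsDemo, left, and right are hypothetical and not from the question. Both RDDs hold six elements in two partitions, but the elements are distributed differently across the partitions, so zip is expected to fail once an action runs.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import scala.util.Try

    object ZipAssumptionsDemo {
      // Returns (partition sizes of left, partition sizes of right, whether zip succeeded).
      def run(sc: SparkContext): (List[Int], List[Int], Boolean) = {
        // Six elements spread over two partitions.
        val left = sc.parallelize(Seq("a", "b", "c", "d", "e", "f"), 2)
        // Eight elements over two partitions; the filter removes two elements
        // from the first partition only, leaving six elements unevenly spread.
        val right = sc
          .parallelize(Seq("x", "x", "1", "2", "3", "4", "5", "6"), 2)
          .filter(_ != "x")

        // glom() turns each partition into an array, so we can count per partition.
        val leftSizes = left.glom().map(_.length).collect().toList
        val rightSizes = right.glom().map(_.length).collect().toList

        // zip is lazy; the mismatch only surfaces when collect() runs the job.
        val zipWorked = Try(left.zip(right).collect()).isSuccess
        (leftSizes, rightSizes, zipWorked)
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("zip-assumptions-demo"))
        try println(run(sc)) finally sc.stop()
      }
    }
    ```

    With the way parallelize slices a local collection, this should report partition sizes of 3 and 3 for left versus 2 and 4 for right, and the zip attempt should fail even though the total counts match.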

    A more robust, though costlier, approach is to index both RDDs and join on the index:

    split_time.zipWithIndex.map(_.swap).join(
      split_text.zipWithIndex.map(_.swap)
    ).values
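
    One detail worth spelling out: join does not promise any particular output order, and elements whose index exists on only one side are dropped. The following is a minimal sketch of the same idea wrapped in a helper that also sorts by index before discarding it. The helper name zipByIndex and the sample data are hypothetical, not part of the original answer.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    object ZipByIndexDemo {
      // Hypothetical helper: pairs elements by position, regardless of how
      // the two RDDs are partitioned. Indices present on only one side are dropped.
      def zipByIndex[A: ClassTag, B: ClassTag](left: RDD[A], right: RDD[B]): RDD[(A, B)] = {
        val leftByIndex = left.zipWithIndex().map(_.swap)   // RDD[(Long, A)]
        val rightByIndex = right.zipWithIndex().map(_.swap) // RDD[(Long, B)]
        // sortByKey restores the original element order after the shuffle.
        leftByIndex.join(rightByIndex).sortByKey().values
      }

      // Made-up sample data; the two RDDs deliberately use different partition counts.
      def run(sc: SparkContext): List[(String, String)] = {
        val split_time = sc.parallelize(Seq("17", "18", "17", "19"), 2)
        val split_text = sc.parallelize(
          Seq("go colts", "hello world", "colts win again", "good night"), 3)
        zipByIndex(split_time, split_text).collect().toList
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("zip-by-index-demo"))
        try run(sc).foreach(println) finally sc.stop()
      }
    }
    ```

    The extra sortByKey adds another shuffle, so it is only worth including when the order of the combined RDD matters.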
    

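    or, with an explicit partitioner:

    val split_time_with_index = split_time.zipWithIndex.map(_.swap)
    val split_text_with_index = split_text.zipWithIndex.map(_.swap)

    // RangePartitioner needs a key-value RDD, so build it from the indexed RDD.
    val partitioner = new org.apache.spark.RangePartitioner(
      split_time_with_index.getNumPartitions, split_time_with_index
    )

    // Join the indexed RDDs (not the originals) and drop the index afterwards.
    split_time_with_index.join(split_text_with_index, partitioner).values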
    
  • 2021-01-23 22:54

    Use .zip to combine both RDDs into an RDD[(String, String)].

    For example, I created two RDDs:

    val split_time = sparkContext.parallelize(Array("17", "17", "17", "17", "17", "17", "17", "17", "17", "17"))
    val split_text = sparkContext.parallelize(Array("17", "17", "17", "17", "colts", "17", "17", "colts", "17", "17"))
    

    As mentioned above, zip combines both RDDs into an RDD[Tuple2[String, String]]:

    val tweet_tuple = split_time.zip(split_text)
    

    After combining them, all you need to do is apply .filter:

    tweet_tuple.filter(line => line._1 == "17" && line._2.toString.matches("colts"))
    

    Collecting and printing the result should give:

    (17,colts)
    (17,colts)
    

    Update

    Since your split_text RDD is a collection of sentences, contains should be used instead of matches: matches requires the whole string to match the pattern, while contains checks for a substring. The following logic should work after you've zipped:

    tweet_tuple.filter(line => line._1 == "17" && line._2.toString.contains("colts"))
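
    Putting the steps together, here is a minimal end-to-end sketch, assuming a local SparkContext and made-up tweet texts. The object name ZipAndFilterDemo and the method coltsAtHour17 are hypothetical. Both RDDs come from parallelize with the same number of elements and the same number of slices, which is the case where zip is safe.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    object ZipAndFilterDemo {
      // Zips hours with tweet texts and keeps tweets from hour "17" mentioning "colts".
      def coltsAtHour17(sc: SparkContext): List[(String, String)] = {
        // Made-up sample data; same length and same number of slices on both sides.
        val split_time = sc.parallelize(Seq("17", "18", "17", "19"), 2)
        val split_text = sc.parallelize(
          Seq("go colts", "hello world", "colts win again", "good night"), 2)

        val tweet_tuple = split_time.zip(split_text)

        tweet_tuple
          .filter { case (hour, text) => hour == "17" && text.contains("colts") }
          .collect()
          .toList
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("zip-and-filter-demo"))
        try coltsAtHour17(sc).foreach(println) finally sc.stop()
      }
    }
    ```

    Pattern matching on (hour, text) avoids the _1 and _2 accessors and makes the redundant toString call unnecessary, since both elements are already Strings.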
    