How do I iterate over RDDs in Apache Spark (Scala)?

2020-12-01 01:52

I use the following command to fill an RDD with a bunch of arrays containing 2 strings ["filename", "content"].

Now I want to iterate over each of those occurrences and do something with every filename and its content.

5 Answers
  • 2020-12-01 01:59

    I would try making use of a partition mapping function. The code below shows how an entire RDD dataset can be processed in a loop, so that every input goes through the same function. I'm afraid I have no knowledge of Scala, so all I have to offer is Java code; however, it should not be very difficult to translate it into Scala (a rough Scala equivalent is sketched after the Java code below).

    import java.util.ArrayList;
    import java.util.Iterator;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;

    JavaRDD<String[]> res = file.mapPartitions(new FlatMapFunction<Iterator<String>, String[]>(){
          @Override
          public Iterable<String[]> call(Iterator<String> t) throws Exception {

              // collect one ["filename", "content"] record per input element of this partition
              ArrayList<String[]> tmpRes = new ArrayList<>();

              while(t.hasNext()){
                   t.next(); // consume the element so the loop terminates
                   String[] fillData = new String[2];
                   fillData[0] = "filename";
                   fillData[1] = "content";
                   tmpRes.add(fillData);
              }

              // return the list itself; in Spark 1.x, call() returns an Iterable
              return tmpRes;
          }

    }).cache();
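
    The same idea in Scala, as a minimal sketch -- assuming file is an RDD[String] as in the Java version, and keeping the same placeholder values:

    val res = file.mapPartitions { iter =>
      // build one Array("filename", "content") per input record in this partition
      iter.map(_ => Array("filename", "content"))
    }.cache()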
    
  • 2020-12-01 02:04

    The fundamental operations in Spark are map and filter.

    val txtRDD = someRDD filter { case(id, content) => id.endsWith(".txt") }
    

    txtRDD will now contain only the files whose names end with ".txt".

    And if you want to do a word count on those files, you can say:

    // split the documents into words in one long list
    val words = txtRDD flatMap { case (id, text) => text.split("\\s+") }
    // give each word a count of 1
    val wordsT = words map (x => (x, 1))
    // sum up the counts for each word
    val wordCount = wordsT reduceByKey((a, b) => a + b)
    

    You want to use mapPartitions when you have some expensive initialization to perform -- for example, if you want to do Named Entity Recognition with a library like Stanford CoreNLP.
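
    As a minimal sketch of that pattern (reusing the txtRDD of (id, text) pairs from above; the compiled regex here is just a cheap stand-in for a genuinely expensive object such as an NLP pipeline), the costly setup runs once per partition and is reused for every element:

    val annotated = txtRDD.mapPartitions { iter =>
      // expensive setup done once per partition, not once per element
      val splitter = java.util.regex.Pattern.compile("\\s+")
      iter.map { case (id, text) => (id, splitter.split(text).length) }
    }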

    Master map, filter, flatMap, and reduce, and you are well on your way to mastering Spark.
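
    Of those, only reduce is not shown above; a quick sketch, reusing wordsT from the word-count example, collapses all the per-word counts into a single total:

    // reduce combines all elements pairwise into one value, returned to the driver
    val totalWords = wordsT.map(_._2).reduce(_ + _)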

  • 2020-12-01 02:07

    You call various methods on the RDD that accept functions as parameters.

    // set up an example -- an RDD of arrays
    val sparkConf = new SparkConf().setMaster("local").setAppName("Example")
    val sc = new SparkContext(sparkConf)
    val testData = Array(Array(1,2,3), Array(4,5,6,7,8))
    val testRDD = sc.parallelize(testData, 2)
    
    // Print the RDD of arrays.
    testRDD.collect().foreach(a => println(a.size))
    
    // Use map() to create an RDD with the array sizes.
    val countRDD = testRDD.map(a => a.size)
    
    // Print the elements of this new RDD.
    countRDD.collect().foreach(a => println(a))
    
    // Use filter() to create an RDD with just the longer arrays.
    val bigRDD = testRDD.filter(a => a.size > 3)
    
    // Print each remaining array.
    bigRDD.collect().foreach(a => {
        a.foreach(e => print(e + " "))
        println()
      })
    

    Notice that the functions you write accept a single RDD element as input, and return data of some uniform type, so you create an RDD of the latter type. For example, countRDD is an RDD[Int], while bigRDD is still an RDD[Array[Int]].

    It will probably be tempting at some point to write a foreach that modifies some other data, but you should resist for reasons described in this question and answer.
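
    To make the pitfall concrete, here is a minimal sketch using the countRDD defined above -- the driver-side variable is never updated because each task works on its own serialized copy of the closure, while an accumulator (Spark's mechanism for this kind of side-effecting count) behaves as expected:

    var localTotal = 0
    countRDD.foreach(n => localTotal += n)   // updates a copy on the executors; the driver's localTotal is unchanged

    val acc = sc.longAccumulator("total")    // Spark 2.x+ accumulator registered with the SparkContext from above
    countRDD.foreach(n => acc.add(n))
    println(acc.value)                       // the summed array sizes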

    Edit: Don't try to print large RDDs

    Several readers have asked about using collect() and println() to see their results, as in the example above. Of course, this only works if you're running in an interactive mode like the Spark REPL (read-eval-print loop). It's best to call collect() on the RDD to get a single sequential array for orderly printing. But collect() may bring back too much data, and in any case too much may get printed. Here are some alternative ways to get insight into your RDDs if they're large:

    1. RDD.take(): This gives you fine control over the number of elements you get, but not over where they come from -- they are defined as the "first" ones, a concept dealt with by various other questions and answers here.

      // take() returns an Array so no need to collect()
      myHugeRDD.take(20).foreach(a => println(a))
      
    2. RDD.sample(): This lets you (roughly) control the fraction of results you get, whether sampling uses replacement, and even optionally the random number seed.

      // sample() does return an RDD so you may still want to collect()
      myHugeRDD.sample(true, 0.01).collect().foreach(a => println(a))
      
    3. RDD.takeSample(): This is a hybrid: it uses random sampling that you can control, but lets you specify the exact number of results and returns an Array.

      // takeSample() returns an Array so no need to collect() 
      myHugeRDD.takeSample(true, 20).foreach(a => println(a))
      
    4. RDD.count(): Sometimes the best insight comes from how many elements you ended up with -- I often do this first.

      println(myHugeRDD.count())       
      
  • 2020-12-01 02:16

    What wholeTextFiles returns is a pair RDD:

    def wholeTextFiles(path: String, minPartitions: Int): RDD[(String, String)]

    Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.

    Here is an example of reading the files at a local path and then printing every filename and its content.

    val conf = new SparkConf().setAppName("scala-test").setMaster("local")
    val sc = new SparkContext(conf)
    sc.wholeTextFiles("file:///Users/leon/Documents/test/")
      .collect
      .foreach(t => println(t._1 + ":" + t._2));
    

    the result:

    file:/Users/leon/Documents/test/1.txt:{"name":"tom","age":12}
    
    file:/Users/leon/Documents/test/2.txt:{"name":"john","age":22}
    
    file:/Users/leon/Documents/test/3.txt:{"name":"leon","age":18}
    

    Or convert the pair RDD to a plain RDD of just the contents first:

    sc.wholeTextFiles("file:///Users/leon/Documents/test/")
      .map(t => t._2)
      .collect
      .foreach { x => println(x)}
    

    the result:

    {"name":"tom","age":12}
    
    {"name":"john","age":22}
    
    {"name":"leon","age":18}
    

    And I think wholeTextFiles is better suited to small files.

  • 2020-12-01 02:18
    for (element <- YourRDD) {
      // do what you want with each element in this iteration;
      // if you want the element's index, keep a counter variable starting from 0
      // (see the zipWithIndex sketch below for an alternative)
      println(element._1) // this will print all the filenames
    }
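
    A minimal sketch of that alternative, assuming YourRDD is the (filename, content) RDD from the question: RDD.zipWithIndex pairs every element with its position, so no hand-rolled counter is needed.

    // zipWithIndex() yields an RDD of (element, index) pairs
    YourRDD.zipWithIndex().foreach { case ((filename, content), idx) =>
      // as in the loop above, the println runs on the executors
      println(s"$idx: $filename")
    }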
    