Splitting strings in Apache Spark using Scala

独厮守ぢ 2021-02-03 14:24

I have a dataset whose lines are tab-separated, in the format:

Title<\t>Text

Now for every word in Text, I want to c

4 answers
  •  后悔当初
    2021-02-03 15:21

    In Spark you work with a distributed data structure called an RDD. RDDs provide functionality similar to Scala's collection types.

    val fileRdd = sc.textFile("s3n://file.txt")
    // RDD[ String ]
    
    val splitRdd = fileRdd.map( line => line.split("\t") )
    // RDD[ Array[ String ] ]
    
    val yourRdd = splitRdd.flatMap( arr => {
      val title = arr( 0 )
      val text = arr( 1 )
      val words = text.split( " " )
      words.map( word => ( word, title ) )
    } )
    // RDD[ ( String, String ) ]
    
    // Now, if you want to print this...
    yourRdd.foreach( { case ( word, title ) => println( s"{ $word, $title }" ) } )
    
    // To count the words per title (duplicate words are counted each time):
    val countRdd = yourRdd
      .groupBy( { case ( word, title ) => title } )  // group by title
      .map( { case ( title, iter ) => ( title, iter.size ) } ) // count for every title
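
Since RDDs mirror Scala's collection API, the same split-and-pair logic can be tried on plain Scala collections without a cluster. The following is only a minimal sketch under that assumption, not the answerer's code: the object name `WordTitlePairs`, its two helper methods, and the sample lines are all hypothetical. Unlike the snippet above, it skips lines that have no tab instead of failing on `arr( 1 )`.

```scala
// Plain Scala collections stand-in for the RDD pipeline; no Spark required.
// All names and sample data here are hypothetical, for illustration only.
object WordTitlePairs {

  // Split one "Title<TAB>Text" line into (word, title) pairs.
  // A line without a tab yields no pairs instead of throwing.
  def pairsForLine(line: String): Seq[(String, String)] =
    line.split("\t", 2) match {
      case Array(title, text) =>
        text.split("\\s+").filter(_.nonEmpty).map(word => (word, title)).toSeq
      case _ => Seq.empty
    }

  // Count the words (duplicates included) for each title.
  def wordCountsByTitle(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(pairsForLine)
      .groupBy { case (_, title) => title }
      .map { case (title, pairs) => (title, pairs.size) }

  // Made-up sample input; the last line has no tab and is ignored.
  val sampleLines: Seq[String] = Seq(
    "Spark\tfast cluster computing",
    "Scala\tscalable language",
    "no tab on this line"
  )
}
```

On an actual pair RDD, the per-title count could also be written as `yourRdd.map { case ( _, title ) => ( title, 1 ) }.reduceByKey( _ + _ )`, which combines counts without first building a full group for each title.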
    
