I have a dataset which contains lines in the following format (tab separated):

Title<\t>Text

Now, for every word in Text, I want to create a (word, Title) pair, and then count the words for every Title.
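For example (made-up sample data):

Spark Basics<\t>spark makes distributed collections easy
Scala Intro<\t>scala runs on the jvm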
So... In Spark you work with a distributed data structure called an RDD (Resilient Distributed Dataset). RDDs provide functionality similar to Scala's collection types.
val fileRdd = sc.textFile("s3n://file.txt")
// RDD[ String ]
val splitRdd = fileRdd.map( line => line.split("\t") )
// RDD[ Array[ String ] ]
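// For every line: keep the title, split the text into words, and emit
// one ( word, title ) pair per word ( flatMap flattens the per-line
// arrays into a single RDD )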
val yourRdd = splitRdd.flatMap( arr => {
val title = arr( 0 )
val text = arr( 1 )
val words = text.split( " " )
words.map( word => ( word, title ) )
} )
// RDD[ ( String, String ) ]
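// e.g. with the sample lines from the question, yourRdd holds
// ( "spark", "Spark Basics" ), ( "makes", "Spark Basics" ), ..., ( "jvm", "Scala Intro" )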
// Now, if you want to print this...
yourRdd.foreach( { case ( word, title ) => println( s"{ $word, $title }" ) } )
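// Careful: on a cluster, foreach runs on the executors, so this println
// output lands in the executor logs, not on your driver console.
// To inspect the pairs locally ( fine for small data ), collect first:
yourRdd.collect().foreach( { case ( word, title ) => println( s"{ $word, $title }" ) } )
// or just peek at a few elements:
yourRdd.take( 10 ).foreach( println )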
// If you want to count words per title ( note: this counts non-unique words, i.e. the total number of words in each Text ):
val countRdd = yourRdd
.groupBy( { case ( word, title ) => title } ) // group by title
.map( { case ( title, iter ) => ( title, iter.size ) } ) // count for every title
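
By the way, groupBy materializes the full list of pairs for every title just so we can take its size. If all you need are the counts, a cheaper and more common pattern is map + reduceByKey (both standard RDD operations). A minimal sketch, reusing yourRdd from above:

val countRdd2 = yourRdd
  .map( { case ( word, title ) => ( title, 1 ) } ) // one 1 per word occurrence
  .reduceByKey( _ + _ )                            // sum the 1s per title
// RDD[ ( String, Int ) ] -- same counts as countRdd

And if you want unique words per title instead, call .distinct() on yourRdd before counting.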