I have a dataset which contains lines in the following format (tab separated):
Title<\t>Text
Now, for every word in Text, I want to create a (word, Title) pair.
The answer provided above is not good enough.
.map( line => line.split("\t") )
may cause:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18.0 failed 4 times, most recent failure: Lost task 0.3 in stage 18.0 (TID 1485, ip-172-31-113-181.us-west-2.compute.internal, executor 10): java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 14
in case the last column is empty. The best approach is explained here: Split 1 column into 3 columns in spark scala
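For reference, that exception can be avoided by passing a negative limit to split, which keeps trailing empty strings instead of dropping them. A minimal sketch of that fix (the path is a placeholder):
// split with limit -1 so trailing empty fields are kept;
// a line like "Title\t" then still produces a two-element array
val fileRdd = sc.textFile("s3n://file.txt")
val safeSplitRdd = fileRdd.map( line => line.split("\t", -1) )
// RDD[ Array[ String ] ]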
Another version with the DataFrame API:
// read into DataFrame
val viewsDF = spark.read.text("s3n://file.txt")
// split on whitespace ("\\s+"); for the question's tab-separated data use "\\t" instead
val splitedViewsDF = viewsDF
  .withColumn("col1", split($"value", "\\s+").getItem(0))
  .withColumn("col2", split($"value", "\\s+").getItem(1))
  .drop($"value")
scala> val viewsDF=spark.read.text("spark-labs/data/wiki-pageviews.txt")
viewsDF: org.apache.spark.sql.DataFrame = [value: string]
scala> viewsDF.printSchema
root
|-- value: string (nullable = true)
scala> viewsDF.limit(5).show
+------------------+
| value|
+------------------+
| aa Main_Page 3 0|
| aa Main_page 1 0|
| aa User:Savh 1 0|
| aa Wikipedia 1 0|
|aa.b User:Savh 1 0|
+------------------+
scala> val splitedViewsDF = viewsDF.withColumn("col1", split($"value", "\\s+").getItem(0)).withColumn("col2", split($"value", "\\s+").getItem(1)).withColumn("col3", split($"value", "\\s+").getItem(2)).drop($"value")
splitedViewsDF: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more field]
scala>
scala> splitedViewsDF.printSchema
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: string (nullable = true)
scala> splitedViewsDF.limit(5).show
+----+---------+----+
|col1| col2|col3|
+----+---------+----+
| aa|Main_Page| 3|
| aa|Main_page| 1|
| aa|User:Savh| 1|
| aa|Wikipedia| 1|
|aa.b|User:Savh| 1|
+----+---------+----+
scala>
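The same pattern applies to the tab-separated Title/Text data from the question; only the delimiter changes. A sketch (the imports are only needed outside the spark-shell; the path is a placeholder):
import org.apache.spark.sql.functions.split
import spark.implicits._

val df = spark.read.text("s3n://file.txt")
  .withColumn("title", split($"value", "\\t").getItem(0))
  .withColumn("text", split($"value", "\\t").getItem(1))
  .drop($"value")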
This is how it can be solved using the newer DataFrame API. First, read the data using "\t" as the delimiter:
val df = spark.read
.option("delimiter", "\t")
.option("header", false)
.csv("s3n://file.txt")
.toDF("title", "text")
Then, split the text column on space and explode to get one word per row.
val df2 = df.select($"title", explode(split($"text", " ")).as("words"))
Finally, group on the title column and count the number of words for each.
val countDf = df2.groupBy($"title").agg(count($"words"))
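Putting the three steps together, a self-contained sketch (the imports are only needed outside the spark-shell; the path and the word_count alias are placeholders):
import org.apache.spark.sql.functions.{count, explode, split}
import spark.implicits._

val df = spark.read
  .option("delimiter", "\t")
  .option("header", false)
  .csv("s3n://file.txt")
  .toDF("title", "text")

val countDf = df
  .select($"title", explode(split($"text", " ")).as("words"))
  .groupBy($"title")
  .agg(count($"words").as("word_count"))

countDf.show()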
So... In Spark you work with a distributed data structure called an RDD (Resilient Distributed Dataset). RDDs provide functionality similar to Scala's collection types.
val fileRdd = sc.textFile("s3n://file.txt")
// RDD[ String ]
val splitRdd = fileRdd.map( line => line.split("\t") )
// RDD[ Array[ String ] ]
val yourRdd = splitRdd.flatMap( arr => {
val title = arr( 0 )
val text = arr( 1 )
val words = text.split( " " )
words.map( word => ( word, title ) )
} )
// RDD[ ( String, String ) ]
// Now, if you want to print this...
yourRdd.foreach( { case ( word, title ) => println( s"{ $word, $title }" ) } )
// if you want to count ( this count is for non-unique words),
val countRdd = yourRdd
.groupBy( { case ( word, title ) => title } ) // group by title
.map( { case ( title, iter ) => ( title, iter.size ) } ) // count for every title
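One note on scale: groupBy shuffles every ( word, title ) pair for a title to a single executor before counting. A reduceByKey variant does the counting map-side first; a sketch of that alternative (not part of the original answer):
val countRdd = yourRdd
  .map( { case ( word, title ) => ( title, 1 ) } ) // one count per word occurrence
  .reduceByKey( _ + _ )                            // sum the counts per title
// RDD[ ( String, Int ) ]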