I have a dataset which contains lines in the following format (tab separated):
Title<\t>Text
Now, for every word in Text, I want one row per word so that I can count the words for each Title.
This is how it can be solved using the newer DataFrame API. First read the data using "\t" as the delimiter:
val df = spark.read
.option("delimiter", "\t")
.option("header", false)
.csv("s3n://file.txt")
.toDF("title", "text")
Then, split the text column on spaces and explode to get one word per row:
val df2 = df.select($"title", explode(split($"text", " ")).as("words"))
Finally, group on the title column and count the number of words for each title:
val countDf = df2.groupBy($"title").agg(count($"words").as("word_count"))
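Put together, the steps above can be sketched as a self-contained program. This is a minimal illustration assuming a local SparkSession and an inline sample dataset in place of the file read (the CSV read shown above would produce the same starting DataFrame); the object and column names here are just for the example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, explode, split}

object WordCountPerTitle {
  def main(args: Array[String]): Unit = {
    // Local mode for illustration only; in production, configure as needed
    val spark = SparkSession.builder()
      .appName("word-count-per-title")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample data standing in for the tab-separated file
    val df = Seq(
      ("title1", "some words here"),
      ("title2", "more words")
    ).toDF("title", "text")

    // One row per (title, word)
    val df2 = df.select($"title", explode(split($"text", " ")).as("words"))

    // Word count per title
    val countDf = df2.groupBy($"title").agg(count($"words").as("word_count"))
    countDf.show()

    spark.stop()
  }
}
```

With the sample rows above, title1 has three words and title2 has two, which is what the grouped count returns.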