Splitting strings in Apache Spark using Scala

独厮守ぢ 2021-02-03 14:24

I have a dataset, which contains lines in the format (tab separated):

Title<\t>Text

Now for every word in Text, I want to c…

4 Answers

    说谎 (OP)
    2021-02-03 15:18

    This is how it can be solved with the newer DataFrame API. First, read the data using "\t" as the delimiter:

    // Read the tab-separated file; the two columns become "title" and "text"
    val df = spark.read
      .option("delimiter", "\t")
      .option("header", false)
      .csv("s3n://file.txt")
      .toDF("title", "text")
    

    Then, split the text column on spaces and explode it so that each row holds a single word:

    import org.apache.spark.sql.functions.{explode, split}
    import spark.implicits._ // enables the $"col" syntax

    val df2 = df.select($"title", explode(split($"text", " ")).as("words"))
    

    Finally, group by the title column and count the words for each title:

    import org.apache.spark.sql.functions.count

    val countDf = df2.groupBy($"title").agg(count($"words").as("word_count"))
    
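    Putting the three steps together, the whole job can be sketched as one small Spark application. This is a sketch, not part of the original answer: the `SparkSession` setup, the object name `WordsPerTitle`, and the local path `data.tsv` are assumptions for illustration (the answer itself reads from an s3n:// path):

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{explode, split, count}

    object WordsPerTitle {
      def main(args: Array[String]): Unit = {
        // Local session for illustration; on a cluster the master would be set differently
        val spark = SparkSession.builder()
          .appName("WordsPerTitle")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._ // enables the $"col" syntax

        // "data.tsv" is a hypothetical local stand-in for the s3n:// path in the answer
        val df = spark.read
          .option("delimiter", "\t")
          .option("header", false)
          .csv("data.tsv")
          .toDF("title", "text")

        // split yields an array column; explode fans it out into one row per word
        val words  = df.select($"title", explode(split($"text", " ")).as("words"))
        val counts = words.groupBy($"title").agg(count($"words").as("word_count"))

        counts.show()
        spark.stop()
      }
    }
    ```

    Because `explode` produces one row per array element, counting rows per title after the explode is the same as counting words per title.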
