I have a dataset which contains lines in the following format (tab separated):
Title<\t>Text
Now, for every word in Text, I want one row per word so that I can count the words for each Title.
This is how it can be solved using the newer DataFrame API. First read the data using "\t" as the delimiter:
val df = spark.read
.option("delimiter", "\t")
.option("header", false)
.csv("s3n://file.txt")
.toDF("title", "text")
Then, split the text column on spaces and explode to get one word per row:
val df2 = df.select($"title", explode(split($"text", " ")).as("words"))
Finally, group on the title column and count the number of words for each title:
val countDf = df2.groupBy($"title").agg(count($"words").as("word_count"))
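Put together, the steps above can be sketched as a self-contained program. This is a minimal illustration assuming a local SparkSession and an inline sample dataset in place of the file read (the CSV read shown above would produce the same starting DataFrame); the object and column names here are just for the example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, explode, split}

object WordCountPerTitle {
  def main(args: Array[String]): Unit = {
    // Local mode for illustration only; in production, configure as needed
    val spark = SparkSession.builder()
      .appName("word-count-per-title")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample data standing in for the tab-separated file
    val df = Seq(
      ("title1", "some words here"),
      ("title2", "more words")
    ).toDF("title", "text")

    // One row per (title, word)
    val df2 = df.select($"title", explode(split($"text", " ")).as("words"))

    // Word count per title
    val countDf = df2.groupBy($"title").agg(count($"words").as("word_count"))
    countDf.show()

    spark.stop()
  }
}
```

With the sample rows above, title1 has three words and title2 has two, which is what the grouped count returns.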