How to count unique words in a stream?

我与影子孤独终老i submitted on 2019-12-01 03:07:50

You can solve this by keeping track of every word you have already seen. With that state you can filter out duplicates, so that only the first occurrence of each word passes through. The remaining words can then be counted by a map operator with parallelism 1. The following code snippet does exactly that.

import org.apache.flink.streaming.api.scala._
import scala.collection.immutable.HashSet

val env = StreamExecutionEnvironment.getExecutionEnvironment

val inputStream = env.fromElements("foo", "bar", "foobar", "bar", "barfoo", "foobar", "foo", "fo")

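// Filter out words we have already seen. The stream is keyed by the word itself,
// so the state is kept per word and each set holds at most that one word.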
val uniqueWords = inputStream.keyBy(x => x).filterWithState{
  (word, seenWordsState: Option[Set[String]]) => seenWordsState match {
    case None => (true, Some(HashSet(word)))
    case Some(seenWords) => (!seenWords.contains(word), Some(seenWords + word))
  }
}

// Count the incoming (first seen) words. Keying by the constant 0 sends every
// word to the same key, so a single counter sees all of them.
val numberUniqueWords = uniqueWords.keyBy(x => 0).mapWithState{
  (word, counterState: Option[Int]) =>
    counterState match {
      case None => (1, Some(1))
      case Some(counter) => (counter + 1, Some(counter + 1))
    }
}.setParallelism(1)

numberUniqueWords.print()

env.execute()
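
To check what the job above should produce, the same two stages can be sketched in plain Scala without Flink. This is only an illustration of the logic, not part of the original answer: the object `UniqueWordCount` and the method `runningUniqueCounts` are hypothetical names, and an ordinary in-memory `Set` stands in for Flink's keyed state.

```scala
object UniqueWordCount {
  // Returns the running number of distinct words, one value per first-seen word.
  def runningUniqueCounts(words: Seq[String]): Seq[Int] = {
    // Stage 1: keep only the first occurrence of each word.
    // `seen` plays the role of the "seen words" state in the filter step.
    val (_, firstSeenReversed) =
      words.foldLeft((Set.empty[String], List.empty[String])) {
        case ((seen, acc), word) =>
          if (seen.contains(word)) (seen, acc)
          else (seen + word, word :: acc)
      }
    val firstSeen = firstSeenReversed.reverse

    // Stage 2: one counter, incremented for every word that passed the filter.
    // scanLeft starts with the initial 0, which is dropped by `tail`.
    firstSeen.scanLeft(0)((counter, _) => counter + 1).tail
  }

  def main(args: Array[String]): Unit = {
    val input = Seq("foo", "bar", "foobar", "bar", "barfoo", "foobar", "foo", "fo")
    runningUniqueCounts(input).foreach(println)
  }
}
```

For the sample input, the five distinct words yield the running counts 1 through 5, and the last value is the number of unique words seen so far. In the Flink job the printed lines may carry a subtask prefix and may not appear in this order if the print sink runs with a parallelism greater than 1.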