Find the 10 most frequently used words in a large book [duplicate]

慢半拍i 2021-01-03 11:00

2 answers

时光说笑 (original poster)

2021-01-03 11:19

This is a classic problem in the field of streaming algorithms. Clearly, no method that uses bounded memory can give the exact answer in every case: on degenerate inputs (for example, when many words have nearly equal counts) you'll have to settle for a set of elements that are approximately (in a well-defined sense) the top k words in your stream. I don't know any classic references, but a quick Google search brought me to this. It seems to have a nice survey of various techniques for streaming top-k. You might check the references therein for other ideas.
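To make "approximately, in a well-defined sense" a bit more concrete, here is a minimal sketch of one well-known streaming technique, the Misra-Gries frequent-items summary. The answer above doesn't name a specific algorithm, so this choice is my own, and the function name `misra_gries` and its parameters are made up for illustration. With at most `k - 1` counters, every word occurring more than `n / k` times in a stream of `n` words is guaranteed to survive, and each stored count undershoots the true count by at most `n / k`.

```python
def misra_gries(stream, k):
    """One-pass frequent-items summary using at most k - 1 counters.

    Illustrative sketch only: the stored counts are lower bounds on the
    true counts, each off by at most n / k for a stream of n words.
    """
    counters = {}
    for word in stream:
        if word in counters:
            counters[word] += 1
        elif len(counters) < k - 1:
            counters[word] = 1
        else:
            # No free slot: decrement every counter and drop the ones
            # that reach zero. The incoming word itself is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    words = ["a"] * 5 + ["b"] * 3 + ["c", "d", "e"]
    print(misra_gries(words, k=3))
```

The surviving words are only candidates. If you can afford a second pass, recount just those candidates exactly to rank them; if not, the `n / k` error bound is the "well-defined sense" you get.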

One other idea (and one that doesn't fly in the streaming model, since it needs two passes over the data) is to randomly sample as many words as will fit into memory, sort-and-uniq them, and then make a second pass over the file counting occurrences of the words in your sample. Then you can easily find the top k among the sampled words.
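A rough sketch of that two-pass sampling idea, under a few assumptions of mine: `read_words` is a hypothetical callable that returns a fresh iterator over the book's words each time it is called (so the file can be read twice), `sample_size` stands in for "as many words as will fit into memory", and the uniform sample is drawn with reservoir sampling, which is just one way to do it.

```python
import random
from collections import Counter


def top_k_by_sampling(read_words, k, sample_size, seed=None):
    """Two-pass approximate top-k: sample candidates, then count them exactly.

    read_words is a hypothetical callable returning a new iterator over
    the words on every call.
    """
    rng = random.Random(seed)

    # Pass 1: reservoir-sample sample_size word occurrences uniformly.
    sample = []
    for i, word in enumerate(read_words()):
        if i < sample_size:
            sample.append(word)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < sample_size:
                sample[j] = word

    # "sort-and-uniq": only the distinct sampled words are candidates.
    candidates = set(sample)

    # Pass 2: exact counts, restricted to the candidates.
    counts = Counter(w for w in read_words() if w in candidates)
    return counts.most_common(k)


if __name__ == "__main__":
    words = ["the"] * 50 + ["a"] * 30 + ["cat"] * 20
    print(top_k_by_sampling(lambda: iter(words), k=2, sample_size=100, seed=0))
```

Frequent words are very likely to land in the sample, so they usually show up among the candidates; a word that was never sampled cannot be reported, which is where the approximation comes from. The counts returned for the candidates themselves are exact.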
