Finding the most common three-item sequence in a very large file

耶瑟儿～ 2021-02-02 17:09

I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most often visited) three-page

5 answers

误落风尘 (original poster)

2021-02-02 17:56
If you are using Unix, the sort command can cope with arbitrarily large files. So you could do something like this:
1. sort -k1,1 -s logfile > sorted (note that this is a stable sort (-s) on the first column, so each user's lines stay in their original relative order; this relies on the log already being in timestamp order)
2. Perform some custom processing of sorted that outputs each triplet as a new line to another file, say triplets, using either C++ or a shell script. So in the example given you get a file with three lines: 1-2-3, 2-3-4, 2-3-4. This processing is quick because after Step 1 all of a user's visits are contiguous, so you can work through the sorted file in a single pass, one line at a time, remembering only the current user's last two pages.
3. sort triplets | uniq -c | sort -r -n | head -1 should give the most common triplet and its count (it sorts the triplets, counts the occurrences of each, sorts them in descending order of count and takes the top one).
This approach might not have optimal performance, but it shouldn't run out of memory.
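
Step 2 could be sketched roughly as follows. This is only an illustration in Python rather than C++ or shell, and it assumes a line format the question does not spell out: whitespace-separated fields with the user ID first and the page second. The function name triplets is made up for this sketch.

```python
from collections import deque
from typing import Iterable, Iterator


def triplets(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Yield 'a-b-c' for every three consecutive pages visited by one user.

    Assumption (not from the original post): each line looks like
    'user_id page ...', and lines are already grouped by user in visit
    order, as produced by the stable sort in Step 1.
    """
    current_user = None
    window = deque(maxlen=3)  # the last (up to) three pages of current_user
    for line in sorted_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        user, page = fields[0], fields[1]
        if user != current_user:
            # A new user starts here, so triplets must not span the boundary.
            current_user = user
            window.clear()
        window.append(page)
        if len(window) == 3:
            yield "-".join(window)
```

Writing each yielded string on its own line to the triplets file gives the input for Step 3. Since only one small window is held in memory at a time, the memory use does not grow with the size of the log.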