Finding the most common three-item sequence in a very large file

耶瑟儿~ 2021-02-02 17:09

I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most often visited) three-page

5 answers
  •  误落风尘
    2021-02-02 17:56

    If you are using Unix, the sort command can cope with arbitrarily large files. So you could do something like this:

    1. sort -k1,1 -s logfile > sorted (a stable sort (-s) keyed on the first column only, so each user's lines stay in the order they had in the original file)
    2. Perform some custom processing of sorted that outputs each triplet as a new line to another file, say triplets, using either C++ or a shell script. So in the example given you get a file with three lines: 1-2-3, 2-3-4, 2-3-4. This processing is quick because Step 1 means that you are only dealing with one user's visits at a time, so you can work through the sorted file a line at a time.
    3. sort triplets | uniq -c | sort -r -n | head -n 1 should give the most common triplet and its count (it sorts the triplets so that identical ones are adjacent, counts the occurrences of each, sorts the counts in descending order and takes the top one).

    This approach might not have optimal performance, but it shouldn't run out of memory.
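
Step 2 of the answer above is left as "custom processing" in C++ or a shell script. As a rough sketch only, here is the same idea in Python (chosen for brevity, not because the answer uses it). It assumes each line of sorted is whitespace-separated with the user ID in the first column and the page in the second; the function name triplets and the page_col parameter are hypothetical, so adjust them to the real log layout.

```python
from collections import deque


def triplets(lines, page_col=1):
    """Yield 'a-b-c' for every three consecutive pages visited by one user.

    Assumptions (not from the original answer): fields are separated by
    whitespace, the user ID is field 0, and the page is field `page_col`.
    Lines must already be grouped by user in visit order, as after step 1.
    """
    current_user = None
    window = deque(maxlen=3)  # the last three pages of the current user
    for line in lines:
        fields = line.split()
        if len(fields) <= page_col:
            continue  # skip blank or malformed lines
        user, page = fields[0], fields[page_col]
        if user != current_user:
            # New user: triplets must not span two different users.
            current_user = user
            window.clear()
        window.append(page)
        if len(window) == 3:
            yield "-".join(window)
```

Only three pages are held in memory at any time, so the file size does not matter. Printing each value from triplets(sys.stdin) with sorted as standard input and redirecting the output to triplets would produce the file that step 3 expects.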
