Finding the most common three-item sequence in a very large file

耶瑟儿~ 2021-02-02 17:09

I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most often visited) three-page sequence.

5 Answers
  • 2021-02-02 17:34

    If you want to quickly get an approximate result, use hash tables, as you intended, but add a limited-size queue to each hash table to drop least recently used entries.
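
    Here is a minimal sketch of that approximate counter, assuming the three page IDs are packed into a single 64-bit key; the class name and capacity are illustrative:

    #include <cstddef>
    #include <cstdint>
    #include <list>
    #include <unordered_map>

    class ApproxTripleCounter {
        struct Entry { uint64_t key; uint64_t count; };
        size_t cap_;
        std::list<Entry> lru_;                                            // front = most recently seen
        std::unordered_map<uint64_t, std::list<Entry>::iterator> index_;

    public:
        explicit ApproxTripleCounter(size_t cap) : cap_(cap) {}

        // Record one occurrence of `key` and return its (approximate) count so far.
        uint64_t add(uint64_t key) {
            auto it = index_.find(key);
            if (it != index_.end()) {
                Entry e = *it->second;
                ++e.count;
                lru_.erase(it->second);          // refresh recency: move to the front
                lru_.push_front(e);
                it->second = lru_.begin();
                return e.count;
            }
            if (!lru_.empty() && lru_.size() >= cap_) {   // table full: drop least recently seen
                index_.erase(lru_.back().key);
                lru_.pop_back();
            }
            lru_.push_front({key, 1});
            index_[key] = lru_.begin();
            return 1;
        }
    };

    int main() {
        ApproxTripleCounter counter(1 << 20);    // keep at most ~1M distinct triples
        // Packing assumption: page IDs < 1024, so (p1*1024 + p2)*1024 + p3 is unique.
        uint64_t key = (uint64_t(12) * 1024 + 7) * 1024 + 42;
        counter.add(key);
        return 0;
    }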

    If you want an exact result, use an external sort to sort the logs by user ID, then combine every 3 consecutive entries into triples and sort again, this time by page IDs.

    Update (sort by timestamp)

    Some preprocessing may be needed to properly use logfiles' timestamps:

    • If the log files are already sorted by timestamp, no preprocessing is needed.
    • If there are several log files (possibly coming from independent processes), each already sorted by timestamp, open all of them and use a merge sort to read them (see the sketch after this list).
    • If the files are only almost sorted by timestamp (as when several independent processes append to a single file), use a binary heap to re-emit the entries in the correct order.
    • If the files are not sorted by timestamp at all (which is unlikely in practice), use an external sort by timestamp.
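
    A minimal sketch of that k-way merge, assuming every log line begins with a numeric timestamp and the rest of the line is opaque (the file names are placeholders):

    #include <fstream>
    #include <iostream>
    #include <queue>
    #include <sstream>
    #include <string>
    #include <vector>

    struct Head {
        long long ts;       // timestamp parsed from the front of the line
        std::string line;   // the whole log line
        size_t file;        // which input stream it came from
        bool operator>(const Head& o) const { return ts > o.ts; }
    };

    using MinHeap = std::priority_queue<Head, std::vector<Head>, std::greater<Head>>;

    int main() {
        std::vector<std::ifstream> files;
        for (const char* name : {"log1.log", "log2.log", "log3.log"})  // placeholder names
            files.emplace_back(name);

        MinHeap heap;
        auto push_next = [&](size_t i) {       // read one line from file i into the heap
            std::string line;
            if (std::getline(files[i], line)) {
                long long ts = 0;
                std::istringstream iss(line);
                iss >> ts;                     // assumes the timestamp is the first field
                heap.push({ts, line, i});
            }
        };

        for (size_t i = 0; i < files.size(); ++i) push_next(i);

        while (!heap.empty()) {                // emit lines in global timestamp order
            Head h = heap.top();
            heap.pop();
            std::cout << h.line << '\n';       // ...or feed it straight to the counting code
            push_next(h.file);
        }
        return 0;
    }

    Each input file is read only once, and the heap holds one pending line per file, so memory use is proportional to the number of files.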

    Update2 (Improving approximate method)

    The approximate method with an LRU queue should produce quite good results for randomly distributed data. But webpage visits may follow different patterns at different times of day, or differ on weekends, and the original approach may give poor results for such data. To improve it, a hierarchical LRU queue may be used.

    Partition the LRU queue into log(N) smaller queues with sizes N/2, N/4, ... The largest one may contain any element, the next one only elements seen at least 2 times, the next one only elements seen at least 4 times, and so on. When an element is evicted from some sub-queue, it is added to the next one down, so it passes through every sub-queue lower in the hierarchy before it is completely removed. Such a priority queue still has O(1) complexity per update, but gives a much better approximation of the most popular pages.
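
    A rough sketch of such a hierarchy (the names, the key packing and the total budget are illustrative assumptions, and it needs a budget of at least 2 entries):

    #include <cstddef>
    #include <cstdint>
    #include <list>
    #include <unordered_map>
    #include <vector>

    struct HierarchicalLRU {
        struct Entry { uint64_t key; uint64_t count; };
        struct Queue { size_t cap; std::list<Entry> lru; };      // front = most recently seen
        struct Pos   { size_t level; std::list<Entry>::iterator it; };

        std::vector<Queue> levels;                // levels[0] = largest, no count requirement
        std::unordered_map<uint64_t, Pos> index;  // key -> where it currently lives

        explicit HierarchicalLRU(size_t n) {      // n = total number of entries kept
            for (size_t cap = n / 2; cap >= 1; cap /= 2)
                levels.push_back({cap, {}});
            if (levels.empty()) levels.push_back({1, {}});
        }

        // Minimum count needed to sit in a level: 1, 2, 4, 8, ...
        static uint64_t threshold(size_t level) { return uint64_t(1) << level; }

        // Put an entry at the front of `level`; evicted entries are demoted downward
        // until one falls out of level 0 and is forgotten.
        void place(size_t level, Entry e) {
            for (;;) {
                Queue& q = levels[level];
                q.lru.push_front(e);
                index[e.key] = {level, q.lru.begin()};
                if (q.lru.size() <= q.cap) return;
                Entry victim = q.lru.back();      // least recently seen at this level
                q.lru.pop_back();
                if (level == 0) { index.erase(victim.key); return; }
                --level;                          // demote the victim one level down
                e = victim;
            }
        }

        // Record one observation of `key` and return its current approximate count.
        uint64_t touch(uint64_t key) {
            Entry e{key, 1};
            size_t level = 0;
            auto it = index.find(key);
            if (it != index.end()) {
                level = it->second.level;
                e = *it->second.it;
                ++e.count;
                levels[level].lru.erase(it->second.it);
                index.erase(it);
                // Promote while the count qualifies for the next, stricter queue.
                while (level + 1 < levels.size() && e.count >= threshold(level + 1))
                    ++level;
            }
            place(level, e);
            return e.count;
        }
    };

    int main() {
        HierarchicalLRU counts(1 << 20);   // total budget of roughly 1M tracked triples
        counts.touch(42);                  // one observation of the (packed) key 42
        return 0;
    }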

  • 2021-02-02 17:38

    If you have 1000 web pages then you have 1 billion possible 3-page sequences. If you have a simple array of 32-bit counters then you'd use 4GB of memory. There might be ways to prune this down by discarding data as you go, but if you want to guarantee to get the correct answer then this is always going to be your worst case - there's no avoiding it, and inventing ways to save memory in the average case will make the worst case even more memory hungry.

    On top of that, you have to track the users. For each user you need to store the last two pages they visited. Assuming the users are referred to by name in the logs, you'd need to store the users' names in a hash table, plus the two page numbers, so let's say 24 bytes per user on average (probably conservative - I'm assuming short user names). With 1000 users that would be 24KB; with 1000000 users 24MB.

    Clearly the sequence counters dominate the memory problem.
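
    For concreteness, a minimal sketch of that layout, assuming page IDs in [0, 1000) and users identified by name (all names here are illustrative, and the vector really does allocate ~4GB):

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    constexpr std::size_t kNumPages = 1000;

    // One 32-bit counter per ordered triple: 1000^3 entries, i.e. ~4GB of RAM.
    std::vector<uint32_t> counters(kNumPages * kNumPages * kNumPages);

    // Flatten an ordered triple (p1, p2, p3) into an index in [0, 10^9).
    std::size_t triple_index(std::size_t p1, std::size_t p2, std::size_t p3) {
        return (p1 * kNumPages + p2) * kNumPages + p3;
    }

    // Per-user state: the last two pages visited, keyed by user name (roughly the
    // 24 bytes per user estimated above, once string and hash-table overhead count).
    std::unordered_map<std::string, std::array<uint32_t, 2>> last_two;

    int main() {
        last_two["alice"] = {12, 7};             // alice's last two pages were 12, then 7
        ++counters[triple_index(12, 7, 42)];     // she now visits page 42: count the triple
        return 0;
    }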

    If you do only have 1000 pages then 4GB of memory is not unreasonable on a modern 64-bit machine, especially with a good amount of disk-backed virtual memory. If you don't have enough swap space, you could just create an mmapped file (on Linux - I presume Windows has something similar), and rely on the OS to keep the most used counters cached in memory.
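
    A hedged sketch of that mmap idea on 64-bit Linux, with minimal error handling (the file name is illustrative):

    #include <cstdint>
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
        // One 32-bit counter per ordered triple of 1000 pages: ~4GB on disk.
        const size_t bytes = 1000ULL * 1000 * 1000 * sizeof(uint32_t);

        int fd = open("counters.bin", O_RDWR | O_CREAT, 0644);   // illustrative file name
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, bytes) != 0) { perror("ftruncate"); return 1; }  // size the file

        // MAP_SHARED lets the kernel page dirty counters back to the file as needed.
        void* mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED) { perror("mmap"); return 1; }
        uint32_t* counters = static_cast<uint32_t*>(mem);

        // ... count triples exactly as with an ordinary in-memory array ...
        ++counters[(12ULL * 1000 + 7) * 1000 + 42];   // e.g. the triple (12, 7, 42)

        munmap(mem, bytes);
        close(fd);
        return 0;
    }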

    So, basically, the maths dictates that if you have a large number of pages to track, and you want to be able to cope with the worst case, then you're going to have to accept that you'll have to use disk files.

    I think that a limited-capacity hash table is probably the right answer. You could probably optimize it for a specific machine by sizing it according to the memory available. Having done that, you need to handle the case where the table reaches capacity, which may not need to be terribly efficient if you rarely get there. Here are some ideas:

    • Evict the least commonly used sequences to file, keeping the most common in memory. You'd need two passes over the table: one to determine what count level is below average, and another to do the eviction. Then, whenever you get a hash miss, you'd somehow need to know where you had put each evicted entry, which might prove tricky.

    • Just dump the whole table to file, and build a new one from scratch. Repeat. Finally, recombine the matching entries from all the tables (see the sketch after this list). The last part might also prove tricky.

    • Use an mmapped file to extend the table. Ensure that the file is used primarily for the least-commonly used sequences, as in my first suggestion. Basically, you'd simply use it as virtual memory - the file would be meaningless later, after the addresses have been forgotten, but you wouldn't need to keep it that long. I'm assuming there isn't enough regular virtual memory here, and/or you don't want to use it. Obviously, this is for 64-bit systems only.
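
    A minimal sketch of the second idea (dump and recombine), assuming triples are packed into 64-bit keys read from stdin; the capacity, partition count and file name are illustrative:

    #include <cstdint>
    #include <cstdio>
    #include <iostream>
    #include <unordered_map>

    constexpr std::size_t kTableCap = 1u << 20;   // spill once the table holds ~1M triples
    constexpr std::size_t kPartitions = 16;       // recombine in 16 hash partitions

    struct Record { uint64_t key; uint64_t count; };

    static void spill(std::unordered_map<uint64_t, uint64_t>& table, std::FILE* f) {
        for (const auto& [key, count] : table) {
            Record r{key, count};
            std::fwrite(&r, sizeof r, 1, f);
        }
        table.clear();                            // start over with an empty table
    }

    int main() {
        std::FILE* f = std::fopen("spill.bin", "wb+");   // illustrative file name
        std::unordered_map<uint64_t, uint64_t> table;

        // For the sketch, packed triple keys arrive one per line on stdin.
        uint64_t key;
        while (std::cin >> key) {
            ++table[key];
            if (table.size() >= kTableCap) spill(table, f);
        }
        spill(table, f);                          // flush whatever is left

        // Recombine: scan the spill file once per hash partition, summing counts,
        // so no single pass needs more memory than one partition's worth of keys.
        uint64_t best_key = 0, best_count = 0;
        for (std::size_t p = 0; p < kPartitions; ++p) {
            std::fseek(f, 0, SEEK_SET);
            std::unordered_map<uint64_t, uint64_t> merged;
            Record r;
            while (std::fread(&r, sizeof r, 1, f) == 1)
                if (r.key % kPartitions == p) merged[r.key] += r.count;
            for (const auto& [k, c] : merged)
                if (c > best_count) { best_count = c; best_key = k; }
        }
        std::cout << "most common triple key: " << best_key
                  << " with " << best_count << " visits\n";
        std::fclose(f);
        return 0;
    }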

  • 2021-02-02 17:47

    There are probably syntax errors galore here, but this should take a limited amount of RAM for a virtually unlimited-length log file.

    #include <array>
    #include <climits>
    #include <fstream>
    #include <iostream>
    #include <unordered_map>
    
    typedef int pageid;
    typedef int userid;
    typedef std::array<pageid, 3> sequence;   // [0] = newest page, [2] = oldest
    typedef int sequence_count;
    
    // std::unordered_map needs a hash for the array-of-three-pages key.
    struct sequence_hash {
        size_t operator()(const sequence& s) const {
            return (size_t(s[0]) * 1000003 + s[1]) * 1000003 + s[2];
        }
    };
    
    const int num_pages = 1000; //where 1-1000 inclusive are valid pageids
    const int num_passes = 4;
    
    int main() {
        std::unordered_map<userid, sequence> userhistory;
        std::unordered_map<sequence, sequence_count, sequence_hash> visits;
        sequence_count max_count = 0;
        sequence max_sequence = {};
        userid curuser;
        pageid curpage;
        for (int pass = 0; pass < num_passes; ++pass) { //have to go in four passes
            std::ifstream logfile("log.log");
            pageid minpage = num_pages/num_passes*pass; //where the triple's first page is in range
            pageid maxpage = num_pages/num_passes*(pass+1)+1;
            if (pass == num_passes-1) //if it's the last pass, fix rounding errors
                maxpage = INT_MAX;
            userhistory.clear(); //so leftover history from the last pass can't form bogus triples
            visits.clear();      //and so each pass only holds counters for its own page range
            while (logfile >> curuser >> curpage) { //read in one "userid pageid" line
                sequence& curhistory = userhistory[curuser]; //find that user's history
                curhistory[2] = curhistory[1];
                curhistory[1] = curhistory[0];
                curhistory[0] = curpage; //push back new page for that user
                //if they visited three pages in a row, and the oldest is in this pass's range
                if (curhistory[2] > minpage && curhistory[2] < maxpage) {
                    sequence_count& count = visits[curhistory]; //get times sequence was hit
                    ++count; //and increase it
                    if (count > max_count) { //if that's new max
                        max_count = count;  //update the max
                        max_sequence = curhistory; //std::array, so this copies the triple
                    }
                }
            }
        }
        std::cout << "The sequence visited the most is:\n";
        std::cout << max_sequence[2] << '\n';
        std::cout << max_sequence[1] << '\n';
        std::cout << max_sequence[0] << '\n';
        std::cout << "with " << max_count << " visits.\n";
        return 0;
    }
    

    Note that if your pageid or userid are strings instead of ints, you'll take a significant speed/size/caching penalty.

    [EDIT2] It now works in 4 (customizable) passes, which means it uses less memory, making this work realistically in RAM. It just goes proportionately slower.

  • 2021-02-02 17:56

    I think you only have to store the most recently seen triple for each userid, right? So you have two hash tables. The first has a key of userid and a value of that user's most recently seen triple; its size equals the number of userids.

    EDIT: this assumes the file is already sorted by timestamp.

    The second hash table has a key of userid:page-triple, and a value of count of times seen.

    I know you said c++ but here's some awk which does this in a single pass (should be pretty straight-forward to convert to c++):

    #  $1 is userid, $2 is pageid
    
    {
        old = ids[$1];          # map of userid -> most-recently-seen triple
        split(old, oldarr, "-");
        oldarr[1] = oldarr[2];
        oldarr[2] = oldarr[3];
        oldarr[3] = $2;
        ids[$1] = oldarr[1]"-"oldarr[2]"-"oldarr[3]; # save new most-recently-seen
        tripleid = $1":"ids[$1];  # build a triple-id of userid:triple
        if (oldarr[1] != "") {    # don't accumulate incomplete triples
            triples[tripleid]++;  # count this triple-id
        }
    }
    END {
        max_count = 0;
        for (tid in triples) {
            print tid" "triples[tid];
            if (triples[tid] > max_count) {
                max_count = triples[tid];   # remember both the best count...
                max_tid = tid;              # ...and which triple-id achieved it
            }
        }
        print "MAX is->" max_tid" seen "max_count" times";
    }
    
  • 2021-02-02 17:56

    If you are using Unix, the sort command can cope with arbitrarily large files. So you could do something like this:

    1. sort -k1,1 -s logfile > sorted (note that this is a stable sort (-s) on the first column)
    2. Perform some custom processing of sorted that outputs each triplet as a new line to another file, say triplets, using either C++ or a shell script (see the sketch after this list). So in the example given you get a file with three lines: 1-2-3, 2-3-4, 2-3-4. This processing is quick because step 1 means that you are only dealing with one user's visits at a time, so you can work through the sorted file a line at a time.
    3. sort triplets | uniq -c | sort -r -n | head -1 should give the most common triplet and its count (it sorts the triplets, counts the occurrences of each, sorts them in descending order of count and takes the top one).
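
    A minimal C++ sketch of step 2, assuming each line of sorted is "userid pageid" (the file names follow the steps above):

    #include <fstream>
    #include <string>

    int main() {
        std::ifstream in("sorted");       // output of step 1
        std::ofstream out("triplets");    // input for step 3
        std::string user, prev_user, page, p1, p2;
        int seen = 0;                     // pages seen so far for the current user

        while (in >> user >> page) {      // assumed format: "userid pageid" per line
            if (user != prev_user) { seen = 0; prev_user = user; }
            ++seen;
            if (seen >= 3) out << p1 << '-' << p2 << '-' << page << '\n';
            p1 = p2;                      // slide the 3-page window
            p2 = page;
        }
        return 0;
    }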

    This approach might not have optimal performance, but it shouldn't run out of memory.
