I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most often visited) three-page
If you want to quickly get an approximate result, use hash tables, as you intended, but add a limited-size queue to each hash table to drop least recently used entries.
If you want exact result, use external sort procedure to sort logs by userid, then combine every 3 consecutive entries and sort again, this time - by page IDs.
Update (sort by timestamp)
Some preprocessing may be needed to properly use logfiles' timestamps:
Update2 (Improving approximate method)
Approximate method with LRU queue should produce quite good results for randomly distributed data. But webpage visits may have different patterns at different time of day, or may be different on weekends. The original approach may give poor results for such data. To improve this, hierarchical LRU queue may be used.
Partition LRU queue into log(N) smaller queues. With sizes N/2, N/4, ... Largest one should contain any elements, next one - only elements, seen at least 2 times, next one - at least 4 times, ... If element is removed from some sub-queue, it is added to other one, so it lives in all sub-queues, which are lower in hierarchy, before it is completely removed. Such a priority queue is still of O(1) complexity, but allows much better approximation for most popular page.
If you have 1000 web pages then you have 1 billion possible 3-page sequences. If you have a simple array of 32-bit counters then you'd use 4GB of memory. There might be ways to prune this down by discarding data as you go, but if you want to guarantee to get the correct answer then this is always going to be your worst case - there's no avoiding it, and inventing ways to save memory in the average case will make the worst case even more memory hungry.
On top of that, you have to track the users. For each user you need to store the last two pages they visited. Assuming the users are referred to by name in the logs, you'd need to store the users' names in a hash table, plus the two page numbers, so let's say 24 bytes per user on average (probably conservative - I'm assuming short user names). With 1000 users that would be 24KB; with 1000000 users 24MB.
Clearly the sequence counters dominate the memory problem.
If you do only have 1000 pages then 4GB of memory is not unreasonable in a modern 64-bit machine, especially with a good amount of disk-backed virtual memory. If you don't have enough swap space, you could just create an mmapped file (on Linux - I presume Windows has something similar), and rely on the OS to always have to most used cases cached in memory.
So, basically, the maths dictates that if you have a large number of pages to track, and you want to be able to cope with the worst case, then you're going to have to accept that you'll have to use disk files.
I think that a limited-capacity hash table is probably the right answer. You could probably optimize it for a specific machine by sizing it according to the memory available. Having got that you need to handle the case where the table reaches capacity. It may not need to be terribly efficient if it's likely you rarely get there. Here's some ideas:
Evict the least commonly used sequences to file, keeping the most common in memory. I'd need two passes over the table to determine what level is below average, and then to do the eviction. Somehow you'd need to know where you'd put each entry, whenever you get a hash-miss, which might prove tricky.
Just dump the whole table to file, and build a new one from scratch. Repeat. Finally, recombine the matching entries from all the tables. The last part might also prove tricky.
Use an mmapped file to extend the table. Ensure that the file is used primarily for the least-commonly used sequences, as in my first suggestion. Basically, you'd simply use it as virtual memory - the file would be meaningless later, after the addresses have been forgotten, but you wouldn't need to keep it that long. I'm assuming there isn't enough regular virtual memory here, and/or you don't want to use it. Obviously, this is for 64-bit systems only.
There's probably syntax errors galore here, but this should take a limited amount of RAM for a virtually unlimited length log file.
typedef int pageid;
typedef int userid;
typedef pageid[3] sequence;
typedef int sequence_count;
const int num_pages = 1000; //where 1-1000 inclusive are valid pageids
const int num_passes = 4;
std::unordered_map<userid, sequence> userhistory;
std::unordered_map<sequence, sequence_count> visits;
sequence_count max_count=0;
sequence max_sequence={};
userid curuser;
pageid curpage;
for(int pass=0; pass<num_passes; ++pass) { //have to go in four passes
std::ifstream logfile("log.log");
pageid minpage = num_pages/num_passes*pass; //where first page is in a range
pageid maxpage = num_pages/num_passes*(pass+1)+1;
if (pass==num_passes-1) //if it's last pass, fix rounding errors
maxpage = MAX_INT;
while(logfile >> curuser >> curpage) { //read in line
sequence& curhistory = userhistory[curuser]; //find that user's history
curhistory[2] = curhistory[1];
curhistory[1] = curhistory[0];
curhistory[0] = curhistory[curpage]; //push back new page for that user
//if they visited three pages in a row
if (curhistory[2] > minpage && curhistory[2]<maxpage) {
sequence_count& count = visits[curhistory]; //get times sequence was hit
++count; //and increase it
if (count > max_count) { //if that's new max
max_count = count; //update the max
max_sequence = curhistory; //arrays, so this is memcpy or something
}
}
}
}
std::cout << "The sequence visited the most is :\n";
std::cout << max_sequence[2] << '\n';
std::cout << max_sequence[1] << '\n';
std::cout << max_sequence[0] << '\n';
std::cout << "with " << max_count << " visits.\n";
Note that If you pageid
or userid
are string
s instead of int
s, you'll take a significant speed/size/caching penalty.
[EDIT2] It now works in 4 (customizable) passes, which means it uses less memory, making this work realistically in RAM. It just goes proportionately slower.
I think you only have to store the most recently seen triple for each userid right? So you have two hash tables. The first containing key of userid, value of most recently seen triple has size equal to number of userids.
EDIT: assumes file sorted by timestamp already.
The second hash table has a key of userid:page-triple, and a value of count of times seen.
I know you said c++ but here's some awk which does this in a single pass (should be pretty straight-forward to convert to c++):
# $1 is userid, $2 is pageid
{
old = ids[$1]; # map with id, most-recently-seen triple
split(old,oldarr,"-");
oldarr[1]=oldarr[2];
oldarr[2]=oldarr[3];
oldarr[3] = $2;
ids[$1]=oldarr[1]"-"oldarr[2]"-"oldarr[3]; # save new most-recently-seen
tripleid = $1":"ids[$1]; # build a triple-id of userid:triple
if (oldarr[1] != "") { # don't accumulate incomplete triples
triples[tripleid]++; } # count this triple-id
}
END {
MAX = 0;
for (tid in triples) {
print tid" "triples[tid];
if (triples[tid] > MAX) MAX = tid;
}
print "MAX is->" MAX" seen "triples[tid]" times";
}
If you are using Unix, the sort
command can cope with arbitrarily large files. So you could do something like this:
sort -k1,1 -s logfile > sorted
(note that this is a stable sort (-s
) on the first column)sorted
that outputs each triplet as a new line to another file, say triplets
, using either C++ or a shell script. So in the example given you get a file with three lines: 1-2-3, 2-3-4, 2-3-4. This processing is quick because Step 1 means that you are only dealing with one user's visits at a time, so you can work through the sorted
file a line at a time.sort triplets | uniq -c | sort -r -n | head -1
should give the most common triplet and its count (it sorts the triplets, counts the occurrences of each, sorts them in descending order of count and takes the top one).This approach might not have optimal performance, but it shouldn't run out of memory.