Finding the most common three-item sequence in a very large file

耶瑟儿~ 2021-02-02 17:09

I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most often visited) three-page

5 answers
  •  离开以前
    2021-02-02 17:47

    There are probably syntax errors galore here, but this should take a limited amount of RAM for a log file of virtually unlimited length.

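    #include <array>
    #include <climits>
    #include <cstddef>
    #include <fstream>
    #include <functional>
    #include <iostream>
    #include <unordered_map>

    typedef int pageid;
    typedef int userid;
    typedef std::array<pageid, 3> sequence; //[0] is newest, [2] is oldest, 0 means "no page yet"
    typedef int sequence_count;

    struct sequence_hash { //std::array has no std::hash, so combine the three ids
        std::size_t operator()(const sequence& s) const {
            std::size_t h = 0;
            for (pageid p : s) h = h * 1000003u + std::hash<pageid>()(p);
            return h;
        }
    };

    const int num_pages = 1000; //where 1-1000 inclusive are valid pageids
    const int num_passes = 4;

    int main() {
        sequence_count max_count = 0;
        sequence max_sequence = {};
        userid curuser;
        pageid curpage;
        for (int pass = 0; pass < num_passes; ++pass) { //one pass per slice of pageids
            std::ifstream logfile("log.log"); //placeholder name; assumes "userid pageid" lines in time order
            std::unordered_map<userid, sequence> userhistory; //rebuilt on every pass
            std::unordered_map<sequence, sequence_count, sequence_hash> visits; //this slice only
            pageid minpage = num_pages / num_passes * pass; //exclusive lower bound
            pageid maxpage = num_pages / num_passes * (pass + 1) + 1; //exclusive upper bound
            if (pass == num_passes - 1) //last pass absorbs any rounding remainder
                maxpage = INT_MAX;
            while (logfile >> curuser >> curpage) { //read in line
                sequence& curhistory = userhistory[curuser]; //find that user's history
                curhistory[2] = curhistory[1];
                curhistory[1] = curhistory[0];
                curhistory[0] = curpage; //push back new page for that user
                //if they visited three pages in a row, and the oldest is in this pass's slice
                if (curhistory[2] > minpage && curhistory[2] < maxpage) {
                    sequence_count& count = visits[curhistory]; //get times sequence was hit
                    ++count; //and increase it
                    if (count > max_count) { //if that's new max
                        max_count = count;  //update the max
                        max_sequence = curhistory; //std::array is copy-assignable
                    }
                }
            }
        }
        std::cout << "The sequence visited the most is :\n";
        std::cout << max_sequence[2] << '\n';
        std::cout << max_sequence[1] << '\n';
        std::cout << max_sequence[0] << '\n';
        std::cout << "with " << max_count << " visits.\n";
    }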
    

    Note that if your pageid or userid are strings instead of ints, you'll take a significant speed/size/caching penalty.
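
    One way to sidestep that penalty, if the log really does contain string ids, is to map each distinct string to a small integer once and run the counting on the integers. The following is only a sketch: `Interner`, `id_of` and `name_of` are hypothetical names that do not appear in the code above, and it assumes the set of distinct strings fits in RAM. Ids start at 1 so that 0 can keep meaning "no page yet".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: hands out dense integer ids 1, 2, 3, ... for strings.
class Interner {
public:
    // Returns the id already assigned to `name`, or assigns the next free one.
    int id_of(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        names_.push_back(name);
        int id = static_cast<int>(names_.size()); // first id is 1
        ids_.emplace(name, id);
        return id;
    }

    // Maps an id back to its string, e.g. to print the winning sequence.
    const std::string& name_of(int id) const {
        assert(id >= 1 && static_cast<std::size_t>(id) <= names_.size());
        return names_[static_cast<std::size_t>(id) - 1];
    }

    // Number of distinct strings seen so far.
    std::size_t size() const { return names_.size(); }

private:
    std::unordered_map<std::string, int> ids_; // string -> id
    std::vector<std::string> names_;           // id - 1 -> string
};
```

    If the log is scanned more than once, the same `Interner` would have to stay alive across scans so that a given string keeps the same id every time.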

    [EDIT2] It now works in 4 (customizable) passes, which means it uses less memory, making this work realistically in RAM. It just goes proportionately slower.
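
    To illustrate that the number of passes trades time for memory without changing the answer, here is a small self-contained sketch of the same idea that reads the log from a string instead of a file. The names `Triple`, `TripleHash` and `most_common_triple` are made up for this example, and it assumes lines of the form "userid pageid" in time order with page ids in 1..num_pages. When several sequences tie for first place, which one is reported can depend on the pass count.

```cpp
#include <array>
#include <cassert>
#include <climits>
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>

using Triple = std::array<int, 3>; // [0] newest page, [2] oldest; 0 = empty slot

struct TripleHash { // simple hash combiner, since std::array has no std::hash
    std::size_t operator()(const Triple& t) const {
        std::size_t h = 0;
        for (int p : t) h = h * 1000003u + std::hash<int>()(p);
        return h;
    }
};

// Scans the log num_passes times. Each pass only counts triples whose oldest
// page falls in that pass's slice of 1..num_pages, so the `visits` table
// holds roughly 1/num_passes of the distinct triples at any one time.
std::pair<Triple, int> most_common_triple(const std::string& log,
                                          int num_pages, int num_passes) {
    assert(num_passes > 0 && num_pages >= num_passes);
    Triple best = {};
    int best_count = 0;
    for (int pass = 0; pass < num_passes; ++pass) {
        std::istringstream in(log); // stands in for reopening the log file
        std::unordered_map<int, Triple> history;            // per-user last 3 pages
        std::unordered_map<Triple, int, TripleHash> visits; // this slice only
        int lo = num_pages / num_passes * pass;             // exclusive lower bound
        int hi = num_pages / num_passes * (pass + 1) + 1;   // exclusive upper bound
        if (pass == num_passes - 1) hi = INT_MAX;           // absorb rounding remainder
        int user, page;
        while (in >> user >> page) {
            Triple& h = history[user];
            h[2] = h[1];
            h[1] = h[0];
            h[0] = page;
            if (h[2] > lo && h[2] < hi) { // three pages seen, oldest in slice
                int c = ++visits[h];
                if (c > best_count) {
                    best_count = c;
                    best = h;
                }
            }
        }
    }
    return {best, best_count};
}
```

    Calling it with `num_passes` set to 1 and then to 4 on the same log should return the same sequence and count as long as the maximum is unique; only the peak size of `visits` and the number of scans differ.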
