Algorithmic issue: determining “user sessions”

日久生厌 2021-02-20 06:12

I've got a real little interesting (at least to me) problem to solve (and, no, it is not homework). It is equivalent to this: you need to determine "sessions" and "sessions

4 answers
  •  悲哀的现实
    2021-02-20 06:47

    Maximum Delay
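    If the log entries have a "maximum delay" (e.g. with a maximum delay of 2 hours, an 8:12 event will never be listed after a 10:12 event), you could look ahead and sort: buffer incoming events, and emit an event only once a newer event shows that nothing earlier can still arrive.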
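
    A minimal sketch of that look-ahead, assuming the timestamps are plain numbers and the bound is known up front. The function name `sort_with_max_delay` and its parameters are made up for illustration. A min-heap holds the events that are not yet safe to emit:

```python
import heapq
from typing import Iterable, Iterator, List


def sort_with_max_delay(timestamps: Iterable[int], max_delay: int) -> Iterator[int]:
    """Yield timestamps in ascending order.

    Assumption: an event with time t is never listed after an event
    with time >= t + max_delay.
    """
    pending: List[int] = []      # min-heap of events not yet safe to emit
    newest = float("-inf")       # largest timestamp seen so far
    for ts in timestamps:
        heapq.heappush(pending, ts)
        newest = max(newest, ts)
        # Every event still to come is later than newest - max_delay,
        # so anything at or below that threshold can be emitted now.
        while pending and pending[0] <= newest - max_delay:
            yield heapq.heappop(pending)
    # End of input: flush whatever is left, in order.
    while pending:
        yield heapq.heappop(pending)


log = [10, 8, 12, 11, 15, 14, 20]   # out of order by less than 3
print(list(sort_with_max_delay(log, max_delay=3)))
```

    Memory use is bounded by the number of events that fall inside one `max_delay` window, not by the size of the log.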

    Do Sort
    Alternatively, I'd first try sorting, at least to make sure it doesn't work. A timestamp can reasonably be stored in 8 bytes (even 4 for your purposes; you could fit 250 million of them into a gigabyte). Quicksort might not be the best choice here, as it has low locality. Insertion sort is almost perfect for almost-sorted data (though it has bad locality, too). Alternatively, quicksorting chunk-wise and then merging the chunks with a merge sort should do, even though it increases memory requirements.
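
    The chunk-wise variant could look roughly like this. It is only a sketch: `chunked_sort` and `chunk_size` are illustrative names, and Python's built-in `sorted` stands in for the per-chunk quicksort, with `heapq.merge` doing the k-way merge:

```python
import heapq
from typing import Iterator, List, Sequence


def chunked_sort(timestamps: Sequence[int], chunk_size: int = 1024) -> Iterator[int]:
    """Sort fixed-size chunks independently, then lazily merge them."""
    chunks: List[List[int]] = [
        sorted(timestamps[i:i + chunk_size])   # per-chunk sort (stand-in for quicksort)
        for i in range(0, len(timestamps), chunk_size)
    ]
    # heapq.merge walks all sorted chunks at once and yields one ordered stream.
    return heapq.merge(*chunks)


print(list(chunked_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3)))
```

    The extra memory mentioned above shows up here as the list of sorted chunk copies that exists alongside the input.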

    Squash and conquer
    Alternatively, you can use the following strategy:

    1. Transform each event into a "session of duration 0".
    2. Split your list of sessions into chunks (e.g. 1K values / chunk).
    3. Within each chunk, sort by session start.
    4. Merge all sessions that can be merged (having sorted before allows you to reduce your look-ahead).
    5. Compact the list of remaining sessions into a single large list.
    6. Repeat from step 2 until the list doesn't get any shorter.
    7. Sort and merge over the whole list.

    If your log files have the kind of "temporal locality" your question suggests, a single pass should already reduce the data enough to allow a "full" sort.
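
    The seven steps can be sketched as follows. This assumes a session is a `(start, end)` pair and that two sessions belong together when they are at most `gap` apart; `gap`, `chunk_size`, and the function names are assumptions made for the example, not something the question specifies:

```python
from typing import List, Tuple

Session = Tuple[int, int]  # (start, end)


def merge_sorted(sessions: List[Session], gap: int) -> List[Session]:
    """Merge neighbours in a start-sorted list that are at most `gap` apart."""
    merged: List[Session] = []
    for start, end in sessions:
        if merged and start - merged[-1][1] <= gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def squash_and_conquer(events: List[int], gap: int, chunk_size: int = 1000) -> List[Session]:
    sessions: List[Session] = [(t, t) for t in events]    # step 1
    while True:
        squashed: List[Session] = []
        for i in range(0, len(sessions), chunk_size):     # step 2
            chunk = sorted(sessions[i:i + chunk_size])    # step 3
            squashed.extend(merge_sorted(chunk, gap))     # steps 4 and 5
        if len(squashed) == len(sessions):                # step 6: no progress
            break
        sessions = squashed
    return merge_sorted(sorted(sessions), gap)            # step 7


events = [100, 105, 103, 500, 502, 101, 900, 498]
print(squash_and_conquer(events, gap=10, chunk_size=3))
```

    Merging inside a chunk is safe because two sessions within `gap` of each other end up in the same final session anyway; the early passes only shrink the list that the final full sort has to handle.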

    [edit] [This site]1 demonstrates an "optimized quicksort with insertion sort finish" that's quite good on almost-sorted data. So does this guy's std::sort.
