How to speed up external merge sort in Java

忘掉有多难 2021-01-21 05:02

I am writing code for the external merge sort. The idea is that the input files contain too many numbers to be stored in an array so you read some of it and put it into files to

4 answers
  • 2021-01-21 05:43

    We have implemented a public domain external sort in Java:

    http://code.google.com/p/externalsortinginjava/

    It might be faster than yours. We use strings and not integers, but you could easily modify our code by substituting integers for strings (the code was made hackable by design). At the very least, you can compare with our design.

    Looking at your code, you appear to read the data one integer at a time, so I would guess that I/O is the bottleneck. With external-memory algorithms, you want to read and write blocks of data, especially in Java.
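
    As a rough illustration of block-wise I/O, here is a minimal sketch that wraps the file streams in large buffers. It assumes the numbers are stored as 4-byte big-endian ints, which may not match the format in the question; the class name `BlockIo`, its methods, and the 64 KiB buffer size are made up for this example.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper: moves ints through large buffers so that each disk
// access transfers a block instead of a single value.
class BlockIo {
    static final int BUFFER_BYTES = 1 << 16; // 64 KiB, an arbitrary choice

    // Writes the values as 4-byte big-endian ints (assumed file format).
    static void writeInts(Path file, int[] values) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file), BUFFER_BYTES))) {
            for (int v : values) {
                out.writeInt(v);
            }
        }
    }

    // Opens a file for buffered reading of ints.
    static DataInputStream open(Path file) throws IOException {
        return new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file), BUFFER_BYTES));
    }

    // Reads up to maxCount ints; returns a shorter array at end of file.
    static int[] readChunk(DataInputStream in, int maxCount) throws IOException {
        int[] chunk = new int[maxCount];
        int n = 0;
        try {
            while (n < maxCount) {
                chunk[n] = in.readInt();
                n++;
            }
        } catch (EOFException endOfFile) {
            // Fewer than maxCount values were left in the file.
        }
        return Arrays.copyOf(chunk, n);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("ints", ".bin");
        writeInts(file, new int[] {5, 3, 9, 1});
        try (DataInputStream in = open(file)) {
            int[] chunk = readChunk(in, 3); // one in-memory run
            Arrays.sort(chunk);
            System.out.println(Arrays.toString(chunk)); // [3, 5, 9]
        }
        Files.delete(file);
    }
}
```

    In an external sort, each sorted chunk would then be written to its own run file with `writeInts` before the merge phase.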

  • 2021-01-21 05:51

    I would use memory-mapped files. They can be as much as 10x faster than this type of I/O, and I suspect they will be much faster in this case as well. Mapped buffers use virtual memory rather than heap space to store data, and they can be larger than your available physical memory.
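
    A minimal sketch of the memory-mapped approach, assuming a file of 4-byte big-endian ints. The class `MappedSort` and the method `sortRegion` are hypothetical names. For simplicity the region is copied to a heap array for sorting, so each region must fit in memory; a single mapping is also limited to `Integer.MAX_VALUE` bytes, so a large file would be processed region by region.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical helper: turns regions of a file into sorted runs in place.
class MappedSort {
    // Sorts `count` ints starting at byte offset `offsetBytes`.
    static void sortRegion(Path file, long offsetBytes, int count) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE,
                    offsetBytes, (long) count * Integer.BYTES);
            IntBuffer ints = map.asIntBuffer();
            int[] values = new int[count];
            ints.get(values);   // bulk copy out of the mapping
            Arrays.sort(values);
            ints.rewind();
            ints.put(values);   // bulk copy back into the mapping
            map.force();        // ask the OS to write the changes to disk
        }
    }

    // Reads a whole (small) file of ints; used only by the demo below.
    static int[] readAll(Path file) throws IOException {
        IntBuffer ints = ByteBuffer.wrap(Files.readAllBytes(file)).asIntBuffer();
        int[] values = new int[ints.remaining()];
        ints.get(values);
        return values;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mapped", ".bin");
        file.toFile().deleteOnExit();
        ByteBuffer buf = ByteBuffer.allocate(8 * Integer.BYTES);
        buf.asIntBuffer().put(new int[] {8, 7, 6, 5, 4, 3, 2, 1});
        Files.write(file, buf.array());

        sortRegion(file, 0, 4);                  // first run
        sortRegion(file, 4L * Integer.BYTES, 4); // second run
        System.out.println(Arrays.toString(readAll(file))); // [5, 6, 7, 8, 1, 2, 3, 4]
    }
}
```

    The two sorted runs left in the file would still need to be merged.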

  • 2021-01-21 05:53

    You are sorting integers, so you should check out radix sort. The core idea of radix sort is that you can sort n-byte integers in n passes through the data using radix 256, that is, one byte per pass.

    You can combine this with merge sort theory.
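
    For reference, a minimal in-memory sketch of a least-significant-digit radix sort with radix 256 on 32-bit ints (so four passes). The class name `RadixSort` is made up, and the sign-bit flip on the top byte is my addition so that negative values sort correctly.

```java
import java.util.Arrays;

// Hypothetical example: LSD radix sort for 32-bit ints, one byte per pass.
class RadixSort {
    static void sort(int[] a) {
        int[] src = a;
        int[] dst = new int[a.length];
        for (int shift = 0; shift < 32; shift += 8) {
            int[] count = new int[257];
            for (int v : src) {
                count[digit(v, shift) + 1]++;
            }
            for (int i = 0; i < 256; i++) {
                count[i + 1] += count[i]; // count[d] = first output slot for digit d
            }
            for (int v : src) {
                dst[count[digit(v, shift)]++] = v; // stable scatter
            }
            int[] tmp = src;
            src = dst;
            dst = tmp;
        }
        // Four passes is an even number, so the sorted data ends up back in a.
    }

    // Extracts the byte at the given shift; the top byte has its sign bit
    // flipped so that negative numbers come before non-negative ones.
    static int digit(int v, int shift) {
        int d = (v >>> shift) & 0xFF;
        return shift == 24 ? d ^ 0x80 : d;
    }

    public static void main(String[] args) {
        int[] data = {170, -45, 75, -90, 802, 24, 2, 66};
        sort(data);
        System.out.println(Arrays.toString(data)); // [-90, -45, 2, 24, 66, 75, 170, 802]
    }
}
```

    One way to combine the two ideas would be to radix-sort each chunk that fits in memory and then merge the resulting runs.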

  • 2021-01-21 05:56

    You might wish to merge k > 2 segments at a time. This reduces the amount of I/O from n log n / log 2 to n log n / log k, because the number of merge passes drops from a base-2 logarithm to a base-k logarithm.
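
    To get a feel for the numbers, here is a small sketch that counts merge passes for a given number of initial sorted runs. The class `MergePasses` is hypothetical; it assumes every pass reads and writes all of the data once, so total I/O is proportional to the pass count.

```java
// Hypothetical example: how many merge passes a k-way merge needs.
class MergePasses {
    // Passes needed to reduce `runs` sorted runs to one when each pass
    // merges up to k runs at a time (k >= 2).
    static int passes(long runs, int k) {
        int p = 0;
        while (runs > 1) {
            runs = (runs + k - 1) / k; // ceil(runs / k)
            p++;
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println(passes(1024, 2));  // 10
        System.out.println(passes(1024, 32)); // 2
    }
}
```

    With 1024 initial runs, a 32-way merge touches the data in 2 passes instead of 10.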

    Edit: In pseudocode, this would look something like this:

    void sort(List list) {
        if (list fits in memory) {
            list.sort();
        } else {
            sublists = partition list into k roughly equal sublists
            for (sublist : sublists) {
                sort(sublist);
            }
            replace the contents of list with merge(sublists)
        }
    }

    List merge(List[] sortedSublists) {
        keep a pointer in each sublist, initially pointing at its first element
        while (some pointer has not reached the end of its sublist) {
            among those pointers, find the one pointing at the smallest element
            add the element it points to to the result list
            advance that pointer
        }
        return the result list
    }
    

    To efficiently find the "smallest" pointer, you might employ a PriorityQueue.
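
    In Java, the merge step with a PriorityQueue could be sketched roughly like this. The class `KWayMerge` and its `Cursor` helper are made-up names, and the runs are plain in-memory iterators here; in a real external sort each run would be a buffered reader over a run file and the output would be written to disk rather than collected in a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical example: k-way merge of sorted runs using a min-heap.
class KWayMerge {
    // One cursor per run: the current head value plus the rest of the run.
    private static final class Cursor implements Comparable<Cursor> {
        int head;
        final Iterator<Integer> rest;

        Cursor(int head, Iterator<Integer> rest) {
            this.head = head;
            this.rest = rest;
        }

        @Override
        public int compareTo(Cursor other) {
            return Integer.compare(head, other.head);
        }
    }

    static List<Integer> merge(List<Iterator<Integer>> runs) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (Iterator<Integer> run : runs) {
            if (run.hasNext()) {            // skip empty runs
                heap.add(new Cursor(run.next(), run));
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();  // cursor with the smallest head
            out.add(smallest.head);
            if (smallest.rest.hasNext()) {  // advance and re-insert
                smallest.head = smallest.rest.next();
                heap.add(smallest);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Iterator<Integer>> runs = new ArrayList<>();
        runs.add(Arrays.asList(1, 4, 9).iterator());
        runs.add(Arrays.asList(2, 3, 10).iterator());
        runs.add(Collections.<Integer>emptyList().iterator());
        runs.add(Arrays.asList(5).iterator());
        System.out.println(merge(runs)); // [1, 2, 3, 4, 5, 9, 10]
    }
}
```

    Each output element costs one `poll` and at most one `add`, so the merge takes O(n log k) comparisons for n elements spread over k runs.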
