Radix Sort on an Array of Strings?

I've been researching this, and while I've figured out the general idea of using radix sort to alphabetize an array of strings, I know I'm going in the wrong direction.

2 Answers

囚心锁ツ

2021-01-03 14:43
The slides you've found are great! But where did those queues come from in your code?

Anyway, here you are (live example):
```
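#include <algorithm>  // std::max_element
#include <array>
#include <cstddef>    // size_t
#include <numeric>    // std::iota, std::partial_sum
#include <string>
#include <vector>

// Bin index of `elem` at position `digit`: 0 if the element is shorter than that.
template <typename E>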
size_t bin(const E& elem, size_t digit)
{
    return elem.size() > digit ? size_t(elem[digit]) + 1 : 0;
}

template <size_t R, typename C, typename P>
void radix_sort(P& pos, const C& data, size_t digit)
{
    using A = std::array<size_t, R + 1>;
    A count = {};
    P prev(pos);

    for (auto i : prev)
        ++count[bin(data[i], digit)];

    A done = {}, offset = {{0}};
    std::partial_sum(count.begin(), count.end() - 1, offset.begin() + 1);

    for (auto i : prev)
    {
        size_t b = bin(data[i], digit);
        pos[offset[b] + done[b]++] = i;
    }
}

struct shorter
{
    template <typename A>
    bool operator()(const A& a, const A& b) { return a.size() < b.size(); }
};

template <size_t R, typename C>
std::vector<size_t> radix_sort(const C& data)
{
    std::vector<size_t> pos(data.size());
    std::iota(pos.begin(), pos.end(), 0);

    size_t width = std::max_element(data.begin(), data.end(), shorter())->size();

    for (long digit = long(width) - 1; digit >= 0; --digit)
        radix_sort<R>(pos, data, size_t(digit));

    return pos;
}
```
which you can use like this:
```
#include <iostream>

int main()
{
    std::vector<std::string> data = generate();
    std::vector<size_t> pos = radix_sort<128>(data);
    for (auto i : pos)
        std::cout << data[i] << std::endl;
}
```
where generate() is included in the live example and generates the strings given in your question.
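
Since radix_sort() returns positions rather than sorted strings, a sorted copy has to be built from them if one is needed. A minimal sketch, assuming a hypothetical helper named apply_positions (it is not part of the code above):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of the answer's code): build a sorted copy
// of `data` from a vector of positions such as the one radix_sort() returns.
template <typename C>
std::vector<typename C::value_type>
apply_positions(const C& data, const std::vector<std::size_t>& pos)
{
    std::vector<typename C::value_type> sorted;
    sorted.reserve(pos.size());
    for (std::size_t i : pos)
        sorted.push_back(data[i]);   // pos[k] is the index of the k-th element in sorted order
    return sorted;
}
```

With the main() above, this would be called as apply_positions(data, pos).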

I am not going to explain how this works here; I assume you can figure it out, since you are working on the problem. But a few comments are in order.
- We are neither sorting the input sequence in-place, nor returning a sorted copy; we are just returning a sequence of positions of input elements in the sorted sequence.
- We are processing strings from right to left.
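- The complexity is O(lw), where l is the input length (number of input strings) and w is the maximum input width (max. length of all input strings). More precisely, each of the w passes costs O(l + R), because it also touches the R + 1 bins. Every string is processed as if it had the maximum width, so this algorithm makes sense if the string width does not vary too much.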
- The first template parameter R of radix_sort() is the number of possible values for each digit (letter) in the input. E.g. R = 128 means that possible values are 0..127. This should be fine for your input. I haven't tried to do anything clever with respect to ASCII codes, but you can customize function bin() for that.
- In the output of bin(), value 0 is reserved to mean "we are past the end of this string". Such strings are placed before others that are still continuing.
- I have tried to give self-explanatory names to variables and functions, and use standard library calls for common tasks where possible.
- The code is generic, e.g. it can sort any random access container containing random access containers, not just vectors of strings.
- I am using C++11 features here and there for convenience, but nothing is really necessary: one could easily do the same just with C++03.
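
As an illustration of customizing bin(), here is a minimal sketch of a case-insensitive variant. The name bin_ci is hypothetical, and the sketch assumes plain ASCII input and the default "C" locale. It keeps the same convention as bin(): 0 means "past the end of the string", so real characters map to 1..R.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical variant of bin() that folds ASCII letters to lower case.
// As in the original, 0 is reserved for "past the end of the string".
inline std::size_t bin_ci(const std::string& elem, std::size_t digit)
{
    if (digit >= elem.size())
        return 0;
    // std::tolower requires a value representable as unsigned char.
    unsigned char c = static_cast<unsigned char>(elem[digit]);
    return static_cast<std::size_t>(std::tolower(c)) + 1;
}
```

To use it, the calls to bin() inside radix_sort() would have to be replaced by calls to bin_ci(); strings that differ only in case then land in the same bins and keep their input order.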

庸人自扰

2021-01-03 14:45

Very similar to iavr's answer, but sorting in place (benchmarked against iavr's solution with g++ -O3, it takes about 2020 ms compared to iavr's 1780 ms), with a regular interface and reusable code. The problem with iavr's implementation is that its logic only works with containers of strings and is not easily extensible to other types. Obviously his specialized version is more efficient, but it might be worth sacrificing some performance for regularity. You can find the rest of the code at radix sort implementation.

General Radix sort:

```
template <typename T>
using Iter_value = typename std::iterator_traits<T>::value_type;

// intermediate struct to get partial template specialization
template <typename Iter, typename T, size_t range = 256>
struct rdx_impl {
    static void rdx_sort(Iter begin, Iter end, int bits) {
        // bits is the # of bits to consider, if a max value is known ahead of time
        // most efficient (theoretically) when digits are base n, having lg(n) bits
        constexpr size_t digit_bits {8};   // # bits in a digit, 8 works well for 32 and 64 bit vals

        size_t d {0};                      // current digit #
        for (long long mask = (1LL << digit_bits) - 1;   // ex. 0x000000ff selects the lower 8 bits of a 32 bit num
             d * digit_bits < static_cast<size_t>(bits); ++d)
            cnt_sort(begin, end, range, Digit_cmp<T>(mask, digit_bits * d));
    }
};

// specialization of rdx_sort for strings
struct Shorter {
    template <typename Seq>
    bool operator()(const Seq& a, const Seq& b) const { return a.size() < b.size(); }
};
template <typename Iter>
struct rdx_impl<Iter, std::string> {    // enough to hold ASCII char range
    static void rdx_sort(Iter begin, Iter end, int) {
        // ignore additional int argument
        int len_max = std::max_element(begin, end, Shorter())->size();
        for (int d = len_max - 1; d >= 0; --d)
            cnt_sort(begin, end, 128, Digit_cmp<std::string>(d));
    }
};

// generic call interface for all iterators
template <typename Iter>   // use intermediate struct for partial specialization
void rdx_sort(Iter begin, Iter end, int bits) {
    rdx_impl<Iter, Iter_value<Iter>>::rdx_sort(begin, end, bits);
}
```

Counting sort to sort on each digit (in place):

```
template <typename Iter, typename Op>
void cnt_sort(Iter begin, Iter end, size_t range, Op op) {
    if (begin == end) return;           // empty range: the backward loop below must not decrement begin
    using T = typename std::iterator_traits<Iter>::value_type;   // also works for raw pointers
    std::vector<int> counts(range);     // init to 0
    for (auto i = begin; i != end; ++i) // counts[k] = # elems with digit == k
        ++counts[op(*i)];
    for (size_t i = 1; i < range; ++i)
        counts[i] += counts[i-1];       // turn into # elems with digit <= i
    std::vector<T> res(end - begin);
    for (auto j = end;;) {              // walk backwards so that equal digits keep their order (stable)
        --j;
        res[--counts[op(*j)]] = *j;
        if (j == begin) break;
    }
    // ~18% of time is spent on copying
    std::copy(res.begin(), res.end(), begin);
}
```
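
To see why the backward pass matters, here is a minimal self-contained sketch of the same counting-sort idea on (key, label) pairs. The function name counting_sort_by_key and the pair representation are assumptions made for this illustration only; they are not part of the answer's code.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Stable counting sort of (key, label) pairs by key; keys must lie in [0, range).
std::vector<std::pair<int, char>>
counting_sort_by_key(const std::vector<std::pair<int, char>>& in, std::size_t range)
{
    std::vector<std::size_t> counts(range, 0);
    for (const auto& p : in)
        ++counts[static_cast<std::size_t>(p.first)];   // counts[k] = # elements with key k
    for (std::size_t k = 1; k < range; ++k)
        counts[k] += counts[k - 1];                    // counts[k] = # elements with key <= k

    std::vector<std::pair<int, char>> out(in.size());
    for (std::size_t i = in.size(); i-- > 0;)          // back to front keeps equal keys in input order
        out[--counts[static_cast<std::size_t>(in[i].first)]] = in[i];
    return out;
}
```

Sorting {2,a}, {1,b}, {2,c}, {0,d}, {1,e} with range 3 gives {0,d}, {1,b}, {1,e}, {2,a}, {2,c}: labels with equal keys stay in their original order. That stability is what lets each radix pass preserve the work of the earlier passes.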

Extract value of digit:

```
// Note: Digit_cmp must be declared before rdx_impl, which uses it.
template <typename T>   // specialize Digit_cmp for non-integral types to provide radix sort with digits
class Digit_cmp {   // functor for extracting a "digit" (particular bits)
    const long long mask; // 0..63 bitfield to test against
    const size_t to_shift;
public:
    Digit_cmp(long long m, size_t ts) : mask{m}, to_shift{ts} {}
    // by default assumes integral, just shifts
    size_t operator()(T n) const {    // char assuming r = 8
        return (n >> to_shift) & mask; // shift then mask for unit digit
    }
};
// specialization for strings
template <>
class Digit_cmp<std::string> {
    const size_t digit;
public:
    Digit_cmp(size_t d) : digit{d} {}
    size_t operator()(const std::string& str) const {
        // 0 indicates past the end of the string
        return str.size() > digit ? str[digit] : 0;
    }
};
```
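
The shift-then-mask step for integers can be checked in isolation. A minimal sketch, assuming 8-bit digits and a hypothetical free function byte_digit (not part of the answer's code):

```cpp
#include <cstddef>
#include <cstdint>

// Extract the d-th base-256 digit (byte) of n, counting from the least significant byte.
inline std::size_t byte_digit(std::uint32_t n, std::size_t d)
{
    const std::uint32_t mask = 0xFF;        // lowest 8 bits set
    return (n >> (8 * d)) & mask;           // shift the wanted byte down, then mask it off
}
```

For n = 0x12345678, the digits for d = 0, 1, 2, 3 are 0x78, 0x56, 0x34 and 0x12, which is the order in which a least-significant-digit radix sort visits them.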
