I want to sort a large array of integers (say 1 millon elements) lexicographically.
Example:
input [] = { 100, 21 , 22 , 99 , 1 , 927 }
sorted[] = { 1
Based on @Oswald's answer, below is some code that does the same.
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;
bool compare(string a, string b){
// Check each digit
int i = 0, j = 0;
while(i < a.size() && j < b.size()){
// If different digits
if(a[i] - '0' != b[j] - '0')
return (a[i] - '0' < b[j] - '0');
i++, j++;
}
// Different sizes
return (a.size() < b.size());
}
int main(){
vector<string> array = {"1","2","3","4","5","6","7","8","9","10","11","12"};
sort(array.begin(), array.end(), compare);
for(auto value : array)
cout << value << " ";
return 0;
}
Input: 1 2 3 4 5 6 7 8 9 10 11 12
Output: 1 10 11 12 2 3 4 5 6 7 8 9
To provide any custom sort ordering, you can provide a comparator to std::sort
. In this case, it's going to be somewhat complex, using logarithms to inspect individual digits of your number in base 10.
Here's an example — comments inline describe what's going on.
#include <iostream>
#include <algorithm>
#include <cmath>
#include <cassert>
int main() {
int input[] { 100, 21, 22, 99, 1, 927, -50, -24, -160 };
/**
* Sorts the array lexicographically.
*
* The trick is that we have to compare digits left-to-right
* (considering typical Latin decimal notation) and that each of
* two numbers to compare may have a different number of digits.
*
* This is very efficient in storage space, but inefficient in
* execution time; an approach that pre-visits each element and
* stores a translated representation will at least double your
* storage requirements (possibly a problem with large inputs)
* but require only a single translation of each element.
*/
std::sort(
std::begin(input),
std::end(input),
[](int lhs, int rhs) -> bool {
// Returns true if lhs < rhs
// Returns false otherwise
const auto BASE = 10;
const bool LHS_FIRST = true;
const bool RHS_FIRST = false;
const bool EQUAL = false;
// There's no point in doing anything at all
// if both inputs are the same; strict-weak
// ordering requires that we return `false`
// in this case.
if (lhs == rhs) {
return EQUAL;
}
// Compensate for sign
if (lhs < 0 && rhs < 0) {
// When both are negative, sign on its own yields
// no clear ordering between the two arguments.
//
// Remove the sign and continue as for positive
// numbers.
lhs *= -1;
rhs *= -1;
}
else if (lhs < 0) {
// When the LHS is negative but the RHS is not,
// consider the LHS "first" always as we wish to
// prioritise the leading '-'.
return LHS_FIRST;
}
else if (rhs < 0) {
// When the RHS is negative but the LHS is not,
// consider the RHS "first" always as we wish to
// prioritise the leading '-'.
return RHS_FIRST;
}
// Counting the number of digits in both the LHS and RHS
// arguments is *almost* trivial.
const auto lhs_digits = (
lhs == 0
? 1
: std::ceil(std::log(lhs+1)/std::log(BASE))
);
const auto rhs_digits = (
rhs == 0
? 1
: std::ceil(std::log(rhs+1)/std::log(BASE))
);
// Now we loop through the positions, left-to-right,
// calculating the digit at these positions for each
// input, and comparing them numerically. The
// lexicographic nature of the sorting comes from the
// fact that we are doing this per-digit comparison
// rather than considering the input value as a whole.
const auto max_pos = std::max(lhs_digits, rhs_digits);
for (auto pos = 0; pos < max_pos; pos++) {
if (lhs_digits - pos == 0) {
// Ran out of digits on the LHS;
// prioritise the shorter input
return LHS_FIRST;
}
else if (rhs_digits - pos == 0) {
// Ran out of digits on the RHS;
// prioritise the shorter input
return RHS_FIRST;
}
else {
const auto lhs_x = (lhs / static_cast<decltype(BASE)>(std::pow(BASE, lhs_digits - 1 - pos))) % BASE;
const auto rhs_x = (rhs / static_cast<decltype(BASE)>(std::pow(BASE, rhs_digits - 1 - pos))) % BASE;
if (lhs_x < rhs_x)
return LHS_FIRST;
else if (rhs_x < lhs_x)
return RHS_FIRST;
}
}
// If we reached the end and everything still
// matches up, then something probably went wrong
// as I'd have expected to catch this in the tests
// for equality.
assert("Unknown case encountered");
}
);
std::cout << '{';
for (auto x : input)
std::cout << x << ", ";
std::cout << '}';
// Output: {-160, -24, -50, 1, 100, 21, 22, 927, 99, }
}
There are quicker ways to calculate the number of digits in a number, but the above will get you started.
Here's another algorithm which does some of the computation before sorting. It seems to be quite fast, despite the additional copying (see comparisons).
Note:
std::numeric_limits<int>::max()/10
N.B. you can optimize count_digits
and my_pow10
; for example, see Three Optimization Tips for C++ from Andrei Alexandrescu and Any way faster than pow() to compute an integer power of 10 in C++?
Helpers:
#include <random>
#include <vector>
#include <utility>
#include <cmath>
#include <iostream>
#include <algorithm>
#include <limits>
#include <iterator>
// non-optimized version
int count_digits(int p) // returns `0` for `p == 0`
{
int res = 0;
for(; p != 0; ++res)
{
p /= 10;
}
return res;
}
// non-optimized version
int my_pow10(unsigned exp)
{
int res = 1;
for(; exp != 0; --exp)
{
res *= 10;
}
return res;
}
Algorithm (note - not in-place):
// helper to provide integers with the same number of digits
template<class T, class U>
std::pair<T, T> lexicographic_pair_helper(T const p, U const maxDigits)
{
auto const digits = count_digits(p);
// append zeros so that `l` has `maxDigits` digits
auto const l = static_cast<T>( p * my_pow10(maxDigits-digits) );
return {l, p};
}
template<class RaIt>
using pair_vec
= std::vector<std::pair<typename std::iterator_traits<RaIt>::value_type,
typename std::iterator_traits<RaIt>::value_type>>;
template<class RaIt>
pair_vec<RaIt> lexicographic_sort(RaIt p_beg, RaIt p_end)
{
if(p_beg == p_end) return {};
auto max = *std::max_element(p_beg, p_end);
auto maxDigits = count_digits(max);
pair_vec<RaIt> result;
result.reserve( std::distance(p_beg, p_end) );
for(auto i = p_beg; i != p_end; ++i)
result.push_back( lexicographic_pair_helper(*i, maxDigits) );
using value_type = typename pair_vec<RaIt>::value_type;
std::sort(begin(result), end(result),
[](value_type const& l, value_type const& r)
{
if(l.first < r.first) return true;
if(l.first > r.first) return false;
return l.second < r.second; }
);
return result;
}
Usage example:
int main()
{
std::vector<int> input = { 100, 21 , 22 , 99 , 1 , 927 };
// generate some numbers
/*{
constexpr int number_of_elements = 1E6;
std::random_device rd;
std::mt19937 gen( rd() );
std::uniform_int_distribution<>
dist(0, std::numeric_limits<int>::max()/10);
for(int i = 0; i < number_of_elements; ++i)
input.push_back( dist(gen) );
}*/
std::cout << "unsorted: ";
for(auto const& e : input) std::cout << e << ", ";
std::cout << "\n\n";
auto sorted = lexicographic_sort(begin(input), end(input));
std::cout << "sorted: ";
for(auto const& e : sorted) std::cout << e.second << ", ";
std::cout << "\n\n";
}
Use std::sort() with a suitable comparison function. This cuts down on the memory requirements.
The comparison function can use n % 10
, n / 10 % 10
, n / 100 % 10
etc. to access the individual digits (for positive integers; negative integers work a bit differently).
I believe the following works as a sort comparison function for positive integers provided the integer type used is substantially narrower than the double
type (e.g., 32-bit int
and 64-bit double
) and the log10
routine used returns exactly correct results for exact powers of 10 (which a good implementation does):
static const double limit = .5 * (log(INT_MAX) - log(INT_MAX-1));
double lx = log10(x);
double ly = log10(y);
double fx = lx - floor(lx); // Get the mantissa of lx.
double fy = ly - floor(ly); // Get the mantissa of ly.
return fabs(fx - fy) < limit ? lx < ly : fx < fy;
It works by comparing the mantissas of the logarithms. The mantissas are the fractional parts of the logarithm, and they indicate the value of the significant digits of a number without the magnitude (e.g., the logarithms of 31, 3.1, and 310 have exactly the same mantissa).
The purpose of fabs(fx - fy) < limit
is to allow for errors in taking the logarithm, which occur both because implementations of log10
are imperfect and because the floating-point format forces some error. (The integer portions of the logarithms of 31 and 310 use different numbers of bits, so there are different numbers of bits left for the significand, so they end up being rounded to slightly different values.) As long as the integer type is substantially narrower than the double
type, the calculated limit
will be much larger than the error in log10
. Thus, the test fabs(fx - fy) < limit
essentially tells us whether two calculated mantissas would be equal if calculated exactly.
If the mantissas differ, they indicate the lexicographic order, so we return fx < fy
. If they are equal, then the integer portion of the logarithm tells us the order, so we return lx < ly
.
It is simple to test whether log10
returns correct results for every power of ten, since there are so few of them. If it does not, adjustments can be made easily: Insert if (1-fx < limit) fx = 0; if (1-fu < limit) fy = 0;
. This allows for when log10
returns something like 4.99999… when it should have returned 5.
This method has the advantage of not using loops or division (which is time-consuming on many processors).
The task sounds like a natural fit for an MSD variant of Radix Sort with padding ( http://en.wikipedia.org/wiki/Radix_sort ).
Depends on how much code you want to throw at it. The simple code as the others show is O(log n) complexity, while a fully optimized radix sort would be O(kn).