Given the following array:
y = %w[A1 A2 B5 B12 A6 A8 B10 B3 B4 B8]
=> ["A1", "A2", "B5", "B12", "A6", "A8", "B10", "B3", "B4", "B8"]
You are sorting strings. Strings are sorted like strings, not like numbers. If you want to sort like numbers, then you should sort numbers, not strings. The string 'B10' is lexicographically smaller than the string 'B3'. That's not something unique to Ruby, or even to programming: it's how lexicographic sorting of text works pretty much everywhere, in programming, databases, lexicons, dictionaries, phonebooks, etc.
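A quick check of that claim, as a minimal sketch (the variable names are only illustrative):

```ruby
# String comparison goes character by character: "B" == "B", then "1" < "3".
string_order = "B10" <=> "B3"    # -1, so "B10" sorts first

# Integer comparison looks at the numeric value instead.
integer_order = 10 <=> 3         # 1, so 10 sorts after 3

sorted_strings = %w[B3 B10].sort # ["B10", "B3"]

p string_order, integer_order, sorted_strings
```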
You should split your strings into their numerical and non-numerical components, and convert the numerical components to numbers. Array sorting is lexicographic, so this will end up sorting exactly right:
y.sort_by { |s|                    # use `sort_by` for a keyed sort, not `sort`
  s.split(/(\d+)/).map { |part|    # split numeric parts from non-numeric
    # parse numeric parts as decimal integers, leave everything else unchanged
    begin Integer(part, 10); rescue ArgumentError; part end
  }
}
#=> ["A1", "A2", "A6", "A8", "B3", "B4", "B5", "B8", "B10", "B12"]
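To see what the sort keys look like, here is a small sketch. `natural_key` is a hypothetical helper name, and it uses a regex test instead of rescuing ArgumentError; for plain runs of ASCII digits the keys should come out the same as with `Integer(s, 10)`.

```ruby
# Hypothetical helper: builds the key that sort_by compares.
def natural_key(str)
  # The capture group makes split keep the digit runs in the result.
  str.split(/(\d+)/).map { |part| part.match?(/\A\d+\z/) ? part.to_i : part }
end

p natural_key("B12")    # ["B", 12]
p natural_key("A1B22")  # ["A", 1, "B", 22]
p natural_key("12")     # ["", 12]

y = %w[A1 A2 B5 B12 A6 A8 B10 B3 B4 B8]
sorted = y.sort_by { |s| natural_key(s) }
p sorted
```

Because a string that starts with digits gets an empty string as its first key element, strings and integers sit at the same positions in every key, so a string is never compared against an integer.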
A natural sort, rather than a standard character-value-based sort, is what's needed. Something like these gems would be a starting point: https://github.com/dogweather/naturally, https://github.com/johnnyshields/naturalsort
Humans treat a string like "A2" as "A" followed by the number 2, and sort by using character-string sorting for the string part and numeric sorting for the numeric part. Standard sort() uses character-value sorting, treating the string as a sequence of characters regardless of what the characters are. So for sort(), "A10" and "A2" look like [ 'A', '1', '0' ] and [ 'A', '2' ]; since '1' sorts before '2' and the following characters can't change that order, "A10" sorts before "A2". For humans the same strings look like [ "A", 10 ] and [ "A", 2 ]; 10 sorts after 2, so we get the opposite result. The strings can be manipulated to make the character-value-based sort() produce the expected result by making the numeric portion fixed-width and zero-padding it on the left to avoid embedded spaces, making "A2" turn into "A02", which does sort before "A10" using standard sort().
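The comparisons described above can be checked directly; a minimal sketch with illustrative variable names:

```ruby
p "A10".chars                     # ["A", "1", "0"]
p "A2".chars                      # ["A", "2"]

# Character-value comparison: "1" < "2" decides it, the trailing "0" is never reached.
plain_comparison  = "A10" < "A2"  # true, "A10" sorts first

# After zero-padding "A2" to "A02", the same comparison gives the expected order.
padded_comparison = "A02" < "A10" # true, "A02" sorts first

p plain_comparison, padded_comparison
```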
Here are a couple of ways to do that.
arr = ["A1", "A2", "B5", "B12", "A6", "AB12", "A8", "B10", "B3", "B4",
"B8", "AB2"]
Sort on a 2-element array
arr.sort_by { |s| [s[/\D+/], s[/\d+/].to_i] }
#=> ["A1", "A2", "A6", "A8", "AB2", "AB12", "B3", "B4", "B5", "B8",
# "B10", "B12"]
This is similar to @Jorg's solution except I've computed the two elements of the comparison array separately, rather than splitting the string into two parts and converting the latter to an integer.
Enumerable#sort_by compares the values the block returns for each pair of elements of arr with the spaceship method, <=>. As the values being compared are arrays, the method Array#<=> is used. See in particular the third paragraph of the Array#<=> documentation.
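As a small illustration of how Array#<=> orders such keys (the variable names are only for this sketch):

```ruby
# Equal first elements, so the second elements decide: 10 > 2.
same_prefix = ["A", 10] <=> ["A", 2]    # 1

# "A" sorts before "AB", so the numbers are never compared.
diff_prefix = ["A", 10] <=> ["AB", 2]   # -1

# Identical keys compare as equal.
equal_keys  = ["B", 3] <=> ["B", 3]     # 0

p same_prefix, diff_prefix, equal_keys
```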
sort_by compares the following 2-element arrays:
arr.each { |s| puts "%s-> [%s, %d]" %
  ["\"#{s}\"".ljust(7), "\"#{s[/\D+/]}\"".ljust(4), s[/\d+/].to_i] }
"A1"   -> ["A" , 1]
"A2"   -> ["A" , 2]
"B5"   -> ["B" , 5]
"B12"  -> ["B" , 12]
"A6"   -> ["A" , 6]
"AB12" -> ["AB", 12]
"A8"   -> ["A" , 8]
"B10"  -> ["B" , 10]
"B3"   -> ["B" , 3]
"B4"   -> ["B" , 4]
"B8"   -> ["B" , 8]
"AB2"  -> ["AB", 2]
Insert spaces between the alphabetic and numeric parts of the string
max_len = arr.max_by(&:size).size
#=> 4
arr.sort_by { |s| "%s%s%d" % [s[/\D+/], " "*(max_len-s.size), s[/\d+/].to_i] }
#=> ["A1", "A2", "A6", "A8", "AB2", "AB12", "B3", "B4", "B5", "B8",
# "B10", "B12"]
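This padding relies on the space character sorting before both digits and letters. A minimal sketch of that assumption, using two of the padded keys this approach produces for max_len = 4:

```ruby
# Character codes: space < digits < uppercase letters.
codes = [" ", "0", "A"].map(&:ord)      # [32, 48, 65]

# "A1" is padded to "A  1" and "AB2" to "AB 2": the space beats the "B".
letters_first = "A  1" < "AB 2"         # true

# "B3" is padded to "B  3" and "B10" to "B 10": the space beats the "1".
numbers_first = "B  3" < "B 10"         # true

p codes, letters_first, numbers_first
```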
Here sort_by compares the following strings:
arr.each { |s| puts "%s-> \"%s\"" %
  ["\"#{s}\"".ljust(7), s[/\D+/] + " "*(max_len-s.size) + s[/\d+/]] }
"A1"   -> "A  1"
"A2"   -> "A  2"
"B5"   -> "B  5"
"B12"  -> "B 12"
"A6"   -> "A  6"
"AB12" -> "AB12"
"A8"   -> "A  8"
"B10"  -> "B 10"
"B3"   -> "B  3"
"B4"   -> "B  4"
"B8"   -> "B  8"
"AB2"  -> "AB 2"
If you know the maximum number of digits in your numbers, you can also prefix your numbers with 0 during comparison.
y.sort_by { |string| string.gsub(/\d+/) { |digits| format('%02d', digits.to_i) } }
#=> ["A1", "A2", "A6", "A8", "B3", "B4", "B5", "B8", "B10", "B12"]
Here '%02d' specifies the following: the % denotes the formatting of a value, the 0 specifies that the number is prefixed with 0s, the 2 specifies the minimum total width of the number, and the d specifies that you want the output as a decimal (base 10) integer. Additional details are in the documentation for Kernel#format.
This means that 'A1' will be converted to 'A01', 'B8' will become 'B08', and 'B12' will stay 'B12', since it already has 2 digits. This is only used during comparison.
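If the maximum number of digits isn't known up front, one option is to compute it from the data. This is only a sketch of that idea; `width` and `pad_key` are names made up for the example.

```ruby
y = %w[A1 A2 B5 B12 A6 A8 B10 B3 B4 B8]

# Length of the longest digit run anywhere in the data (2 for this array).
width = y.flat_map { |s| s.scan(/\d+/) }.map(&:size).max

# Zero-pad every digit run to that width; used only as the comparison key.
pad_key = ->(s) { s.gsub(/\d+/) { |digits| format("%0#{width}d", digits.to_i) } }

p pad_key.call("B8")    # "B08"
p pad_key.call("B12")   # "B12"

sorted = y.sort_by(&pad_key)
p sorted

# The width in the format string is a minimum, so longer numbers are not truncated.
p format("%02d", 123)   # "123"
```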