I need to write a Java Comparator class that compares Strings, however with one twist. If the two strings it is comparing are the same at the beginning and end of the strin
The Alphanum Algorithm
From the website
"People sort strings with numbers differently than software. Most sorting algorithms compare ASCII values, which produces an ordering that is inconsistent with human logic. Here's how to fix it."
Edit: Here's a link to the Java Comparator Implementation from that site.
Here is the solution with the following advantages over Alphanum Algorithm:
"0001"
equals "1"
, "01234"
is less than "4567"
)public class NumberAwareComparator implements Comparator<String>
{
@Override
public int compare(String s1, String s2)
{
int len1 = s1.length();
int len2 = s2.length();
int i1 = 0;
int i2 = 0;
while (true)
{
// handle the case when one string is longer than another
if (i1 == len1)
return i2 == len2 ? 0 : -1;
if (i2 == len2)
return 1;
char ch1 = s1.charAt(i1);
char ch2 = s2.charAt(i2);
if (Character.isDigit(ch1) && Character.isDigit(ch2))
{
// skip leading zeros
while (i1 < len1 && s1.charAt(i1) == '0')
i1++;
while (i2 < len2 && s2.charAt(i2) == '0')
i2++;
// find the ends of the numbers
int end1 = i1;
int end2 = i2;
while (end1 < len1 && Character.isDigit(s1.charAt(end1)))
end1++;
while (end2 < len2 && Character.isDigit(s2.charAt(end2)))
end2++;
int diglen1 = end1 - i1;
int diglen2 = end2 - i2;
// if the lengths are different, then the longer number is bigger
if (diglen1 != diglen2)
return diglen1 - diglen2;
// compare numbers digit by digit
while (i1 < end1)
{
if (s1.charAt(i1) != s2.charAt(i2))
return s1.charAt(i1) - s2.charAt(i2);
i1++;
i2++;
}
}
else
{
// plain characters comparison
if (ch1 != ch2)
return ch1 - ch2;
i1++;
i2++;
}
}
}
}
Instead of reinventing the wheel, I'd suggest to use a locale-aware Unicode-compliant string comparator that has built-in number sorting from the ICU4J library.
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
public class CollatorExample {
public static void main(String[] args) {
// Make sure to choose correct locale: in Turkish uppercase of "i" is "İ", not "I"
RuleBasedCollator collator = (RuleBasedCollator) Collator.getInstance(Locale.US);
collator.setNumericCollation(true); // Place "10" after "2"
collator.setStrength(Collator.PRIMARY); // Case-insensitive
List<String> strings = Arrays.asList("10", "20", "A20", "2", "t1ab", "01", "T010T01", "t1aB",
"_2", "001", "_200", "1", "A 02", "t1Ab", "a2", "_1", "t1A", "_01",
"100", "02", "T0010T01", "t1AB", "10", "A01", "010", "t1a"
);
strings.sort(collator);
System.out.println(String.join(", ", strings));
// Output: _1, _01, _2, _200, 01, 001, 1,
// 2, 02, 10, 10, 010, 20, 100, A 02, A01,
// a2, A20, t1A, t1a, t1ab, t1aB, t1Ab, t1AB,
// T010T01, T0010T01
}
}
The implementation I propose here is simple and efficient. It does not allocate any extra memory, directly or indirectly by using regular expressions or methods such as substring(), split(), toCharArray(), etc.
This implementation first goes across both strings to search for the first characters that are different, at maximal speed, without doing any special processing during this. Specific number comparison is triggered only when these characters are both digits. A side-effect of this implementation is that a digit is considered as greater than other letters, contrarily to default lexicographic order.
public static final int compareNatural (String s1, String s2)
{
// Skip all identical characters
int len1 = s1.length();
int len2 = s2.length();
int i;
char c1, c2;
for (i = 0, c1 = 0, c2 = 0; (i < len1) && (i < len2) && (c1 = s1.charAt(i)) == (c2 = s2.charAt(i)); i++);
// Check end of string
if (c1 == c2)
return(len1 - len2);
// Check digit in first string
if (Character.isDigit(c1))
{
// Check digit only in first string
if (!Character.isDigit(c2))
return(1);
// Scan all integer digits
int x1, x2;
for (x1 = i + 1; (x1 < len1) && Character.isDigit(s1.charAt(x1)); x1++);
for (x2 = i + 1; (x2 < len2) && Character.isDigit(s2.charAt(x2)); x2++);
// Longer integer wins, first digit otherwise
return(x2 == x1 ? c1 - c2 : x1 - x2);
}
// Check digit only in second string
if (Character.isDigit(c2))
return(-1);
// No digits
return(c1 - c2);
}
I realize you're in java, but you can take a look at how StrCmpLogicalW works. It's what Explorer uses to sort filenames in Windows. You can look at the WINE implementation here.