Natural Sort in MySQL

南旧 2020-11-22 02:25

Is there an elegant way to have performant, natural sorting in a MySQL database?

For example if I have this data set:

  • Final Fantasy
  • Final Fant
21 Answers
  • 2020-11-22 02:54

    A lot of other answers I see here (and in the duplicate questions) basically only work for very specifically formatted data, e.g. a string that's entirely a number, or for which there's a fixed-length alphabetic prefix. This isn't going to work in the general case.

    It's true that there's not really any way to implement a 100% general nat-sort in MySQL, because to do it what you really need is a modified comparison function, that switches between lexicographic sorting of the strings and numeric sort if/when it encounters a number. Such code could implement any algorithm you could desire for recognising and comparing the numeric portions within two strings. Unfortunately, though, the comparison function in MySQL is internal to its code, and cannot be changed by the user.

    This leaves a hack of some kind, where you try to create a sort key for your string in which the numeric parts are re-formatted so that the standard lexicographic sort actually sorts them the way you want.

    For plain integers up to some maximum number of digits, the obvious solution is to simply left-pad them with zeros so that they're all fixed width. This is the approach taken by the Drupal plugin, and the solutions of @plalx / @RichardToth. (@Christian has a different and much more complex solution, but it offers no advantages that I can see).
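
    As a rough illustration of the zero-padding idea (written in Python rather than SQL, purely so it can be run on its own), the sketch below pads every run of digits to a fixed width. The function name `zero_pad_key`, the width of 10, and the sample titles are assumptions made for this example; they are not taken from the Drupal plugin or from the other answers.

```python
import re

def zero_pad_key(text: str, width: int = 10) -> str:
    """Left-pad every run of ASCII digits with zeros up to `width` characters.

    After padding, an ordinary string comparison orders the numbers
    numerically, as long as no number has more than `width` digits.
    """
    return re.sub(r"[0-9]+", lambda m: m.group().zfill(width), text)

# Illustrative data only.
titles = ["Final Fantasy 10", "Final Fantasy 2", "Final Fantasy 1"]
print(sorted(titles, key=zero_pad_key))
```

    The weakness is the fixed width: any number longer than `width` digits is left unpadded and sorts incorrectly against shorter ones.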

    As @tye points out, you can improve on this by prepending a fixed-digit length to each number, rather than simply left-padding it. There's much, much more you can improve on, though, even given the limitations of what is essentially an awkward hack. Yet there don't seem to be any pre-built solutions out there!
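
    A minimal Python sketch of the length-prefix idea, assuming a 2-digit decimal length (so numbers of up to 99 significant digits). The name `length_prefix_key` is made up for this example, and this is a simplification for illustration, not @tye's actual code.

```python
import re

def length_prefix_key(text: str) -> str:
    """Prefix each run of digits with its 2-digit length (leading zeros stripped).

    A shorter number always gets a smaller prefix, so it sorts first;
    numbers of equal length then compare digit by digit as usual.
    Assumption: no number has more than 99 significant digits.
    """
    def encode(match: re.Match) -> str:
        digits = match.group().lstrip("0")
        return f"{len(digits):02d}{digits}"
    return re.sub(r"[0-9]+", encode, text)

print(length_prefix_key("x10"))  # x0210
print(sorted(["x10", "x9", "x100"], key=length_prefix_key))
```

    Because leading zeros are simply discarded here, "x007" and "x7" receive the same key; keeping them distinct needs extra information in the key.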

    For example, what about:

    • Plus and minus signs? +10 vs 10 vs -10
    • Decimals? 8.2, 8.5, 1.006, .75
    • Leading zeros? 020, 030, 00000922
    • Thousand separators? "1,001 Dalmatians" vs "1001 Dalmatians"
    • Version numbers? MariaDB v10.3.18 vs MariaDB v10.3.3
    • Very long numbers? 103,768,276,592,092,364,859,236,487,687,870,234,598.55

    Extending on @tye's method, I've created a fairly compact NatSortKey() stored function that will convert an arbitrary string into a nat-sort key, and that handles all of the above cases, is reasonably efficient, and preserves a total sort-order (no two different strings have sort keys that compare equal). A second parameter can be used to limit the number of numbers processed in each string (e.g. to the first 10 numbers, say), which can be used to ensure the output fits within a given length.

    NOTE: Sort-key strings generated with a given value of this 2nd parameter should only be sorted against other strings generated with the same value for the parameter, or else they might not sort correctly!

    You can use it directly in ordering, e.g.

    SELECT myString FROM myTable ORDER BY NatSortKey(myString,0);  ### 0 means process all numbers - resulting sort key might be quite long for certain inputs
    

    But for efficient sorting of large tables, it's better to pre-store the sort key in another column (possibly with an index on it):

    INSERT INTO myTable (myString,myStringNSK) VALUES (@theStringValue,NatSortKey(@theStringValue,10)), ...
    ...
    SELECT myString FROM myTable ORDER BY myStringNSK;
    

    [Ideally, you'd make this happen automatically by creating the key column as a computed stored column, using something like:

    CREATE TABLE myTable (
    ...
    myString varchar(100),
    myStringNSK varchar(150) AS (NatSortKey(myString,10)) STORED,
    ...
    KEY (myStringNSK),
    ...);
    

    But for now neither MySQL nor MariaDB allows stored functions in computed columns, so unfortunately you can't yet do this.]


    My function affects sorting of numbers only. If you want to do other sort-normalization things, such as removing all punctuation, or trimming whitespace off each end, or replacing multi-whitespace sequences with single spaces, you could either extend the function, or it could be done before or after NatSortKey() is applied to your data. (I'd recommend using REGEXP_REPLACE() for this purpose).

    It's also somewhat Anglo-centric in that I assume '.' for a decimal point and ',' for the thousands-separator, but it should be easy enough to modify if you want the reverse, or if you want that to be switchable as a parameter.

    It might be amenable to further improvement in other ways; for example, it currently sorts negative numbers by absolute value, so -1 comes before -2, rather than the other way around. There's also no way to specify a DESC sort order for numbers while retaining ASC lexicographical sort for text. Both of these issues can be fixed with a little more work; I will update the code if/when I get the time.

    There are lots of other details to be aware of - including some critical dependencies on the charset and collation that you're using - but I've put them all into a comment block within the SQL code. Please read this carefully before using the function for yourself!

    So, here's the code. If you find a bug, or have an improvement I haven't mentioned, please let me know in the comments!


    delimiter $$
    CREATE DEFINER=CURRENT_USER FUNCTION NatSortKey (s varchar(100), n int) RETURNS varchar(350) DETERMINISTIC
    BEGIN
    /****
      Converts numbers in the input string s into a format such that sorting results in a nat-sort.
      Numbers of up to 359 digits (before the decimal point, if one is present) are supported.  Sort results are undefined if the input string contains numbers longer than this.
      For n>0, only the first n numbers in the input string will be converted for nat-sort (so strings that differ only after the first n numbers will not nat-sort amongst themselves).
      Total sort-ordering is preserved, i.e. if s1!=s2, then NatSortKey(s1,n)!=NatSortKey(s2,n), for any given n.
      Numbers may contain ',' as a thousands separator, and '.' as a decimal point.  To reverse these (as appropriate for some European locales), the code would require modification.
      Numbers preceded by '+' sort with numbers not preceded with either a '+' or '-' sign.
      Negative numbers (preceded with '-') sort before positive numbers, but are sorted in order of ascending absolute value (so -7 sorts BEFORE -1001).
      Numbers with leading zeros sort after the same number with no (or fewer) leading zeros.
      Decimal-part-only numbers (like .75) are recognised, provided the decimal point is not immediately preceded by either another '.', or by a letter-type character.
      Numbers with thousand separators sort after the same number without them.
      Thousand separators are only recognised in numbers with no leading zeros that don't immediately follow a ',', and when they format the number correctly.
      (When not recognised as a thousand separator, a ',' will instead be treated as separating two distinct numbers).
      Version-number-like sequences consisting of 3 or more numbers separated by '.' are treated as distinct entities, and each component number will be nat-sorted.
      The entire entity will sort after any number beginning with the first component (so e.g. 10.2.1 sorts after both 10 and 10.995, but before 11)
      Note that the first number component in an entity like this is also permitted to contain thousand separators.
    
      To achieve this, numbers within the input string are prefixed and suffixed according to the following format:
      - The number is prefixed by a 2-digit base-36 number representing its length, excluding leading zeros.  If there is a decimal point, this length only includes the integer part of the number.
      - A 3-character suffix is appended after the number (after the decimals if present).
        - The first character is a space, or a '+' sign if the number was preceded by '+'.  Any preceding '+' sign is also removed from the front of the number.
        - This is followed by a 2-digit base-36 number that encodes the number of leading zeros and whether the number was expressed in comma-separated form (e.g. 1,000,000.25 vs 1000000.25)
        - The value of this 2-digit number is: (number of leading zeros)*2 + (1 if comma-separated, 0 otherwise)
      - For version number sequences, each component number has the prefix in front of it, and the separating dots are removed.
        Then there is a single suffix that consists of a ' ' or '+' character, followed by a pair of base-36 digits for each number component in the sequence.
    
      e.g. here is how some simple sample strings get converted:
      'Foo055' --> 'Foo0255 02'
      'Absolute zero is around -273 centigrade' --> 'Absolute zero is around -03273 00 centigrade'
      'The $1,000,000 prize' --> 'The $071000000 01 prize'
      '+99.74 degrees' --> '0299.74+00 degrees'
      'I have 0 apples' --> 'I have 00 02 apples'
      '.5 is the same value as 0000.5000' --> '00.5 00 is the same value as 00.5000 08'
      'MariaDB v10.3.0018' --> 'MariaDB v02100130218 000004'
    
      The restriction to numbers of up to 359 digits comes from the fact that the first character of the base-36 prefix MUST be a decimal digit, and so the highest permitted prefix value is '9Z' or 359 decimal.
      The code could be modified to handle longer numbers by increasing the size of (both) the prefix and suffix.
      A higher base could also be used (by replacing CONV() with a custom function), provided that the collation you are using sorts the "digits" of the base in the correct order, starting with 0123456789.
      However, while the maximum number length may be increased this way, note that the technique this function uses is NOT applicable where strings may contain numbers of unlimited length.
    
      The function definition does not specify the charset or collation to be used for string-type parameters or variables:  The default database charset & collation at the time the function is defined will be used.
      This is to make the function code more portable.  However, there are some important restrictions:
    
      - Collation is important here only when comparing (or storing) the output value from this function, but it MUST order the characters " +0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ" in that order for the natural sort to work.
        This is true for most collations, but not all of them, e.g. in Lithuanian 'Y' comes before 'J' (according to Wikipedia).
        To adapt the function to work with such collations, replace CONV() in the function code with a custom function that emits "digits" above 9 that are characters ordered according to the collation in use.
    
      - For efficiency, the function code uses LENGTH() rather than CHAR_LENGTH() to measure the length of strings that consist only of digits 0-9, '.', and ',' characters.
        This works for any single-byte charset, as well as any charset that maps standard ASCII characters to single bytes (such as utf8 or utf8mb4).
        If using a charset that maps these characters to multiple bytes (such as, e.g. utf16 or utf32), you MUST replace all instances of LENGTH() in the function definition with CHAR_LENGTH()
    
      Length of the output:
    
      Each number converted adds 5 characters (2 prefix + 3 suffix) to the length of the string. n is the maximum count of numbers to convert;
      This parameter is provided as a means to limit the maximum output length (to input length + 5*n).
      If you do not require the total-ordering property, you could edit the code to use suffixes of 1 character (space or plus) only; this would reduce the maximum output length for any given n.
      Since a string of length L has at most ((L+1) DIV 2) individual numbers in it (every 2nd character a digit), for n<=0 the maximum output length is (inputlength + 5*((inputlength+1) DIV 2))
      So for the current input length of 100, the maximum output length is 350.
      If changing the input length, the output length must be modified according to the above formula.  The DECLARE statements for x,y,r, and suf must also be modified, as the code comments indicate.
    ****/
      DECLARE x,y varchar(100);            # need to be same length as input s
      DECLARE r varchar(350) DEFAULT '';   # return value:  needs to be same length as return type
      DECLARE suf varchar(101);   # suffix for a number or version string. Must be (((inputlength+1) DIV 2)*2 + 1) chars to support version strings (e.g. '1.2.33.5'), though it's usually just 3 chars. (Max version string e.g. 1.2. ... .5 has ((length of input + 1) DIV 2) numeric components)
      DECLARE i,j,k int UNSIGNED;
      IF n<=0 THEN SET n := -1; END IF;   # n<=0 means "process all numbers"
      LOOP
        SET i := REGEXP_INSTR(s,'\\d');   # find position of next digit
        IF i=0 OR n=0 THEN RETURN CONCAT(r,s); END IF;   # no more numbers to process -> we're done
        SET n := n-1, suf := ' ';
        IF i>1 THEN
          IF SUBSTRING(s,i-1,1)='.' AND (i=2 OR SUBSTRING(s,i-2,1) RLIKE '[^.\\p{L}\\p{N}\\p{M}\\x{608}\\x{200C}\\x{200D}\\x{2100}-\\x{214F}\\x{24B6}-\\x{24E9}\\x{1F130}-\\x{1F149}\\x{1F150}-\\x{1F169}\\x{1F170}-\\x{1F189}]') AND (SUBSTRING(s,i) NOT RLIKE '^\\d++\\.\\d') THEN SET i:=i-1; END IF;   # Allow decimal number (but not version string) to begin with a '.', provided preceding char is neither another '.', nor a member of the unicode character classes: "Alphabetic", "Letter", "Block=Letterlike Symbols" "Number", "Mark", "Join_Control"
          IF i>1 AND SUBSTRING(s,i-1,1)='+' THEN SET suf := '+', j := i-1; ELSE SET j := i; END IF;   # move any preceding '+' into the suffix, so equal numbers with and without preceding "+" signs sort together
          SET r := CONCAT(r,SUBSTRING(s,1,j-1)); SET s = SUBSTRING(s,i);   # add everything before the number to r and strip it from the start of s; preceding '+' is dropped (not included in either r or s)
        END IF;
        SET x := REGEXP_SUBSTR(s,IF(SUBSTRING(s,1,1) IN ('0','.') OR (SUBSTRING(r,-1)=',' AND suf=' '),'^\\d*+(?:\\.\\d++)*','^(?:[1-9]\\d{0,2}(?:,\\d{3}(?!\\d))++|\\d++)(?:\\.\\d++)*+'));   # capture the number + following decimals (including multiple consecutive '.<digits>' sequences)
        SET s := SUBSTRING(s,LENGTH(x)+1);   # NOTE: LENGTH() can be safely used instead of CHAR_LENGTH() here & below PROVIDED we're using a charset that represents digits, ',' and '.' characters using single bytes (e.g. latin1, utf8)
        SET i := INSTR(x,'.');
        IF i=0 THEN SET y := ''; ELSE SET y := SUBSTRING(x,i); SET x := SUBSTRING(x,1,i-1); END IF;   # move any following decimals into y
        SET i := LENGTH(x);
        SET x := REPLACE(x,',','');
        SET j := LENGTH(x);
        SET x := TRIM(LEADING '0' FROM x);   # strip leading zeros
        SET k := LENGTH(x);
        SET suf := CONCAT(suf,LPAD(CONV(LEAST((j-k)*2,1294) + IF(i=j,0,1),10,36),2,'0'));   # (j-k)*2 + IF(i=j,0,1) = (count of leading zeros)*2 + (1 if there are thousands-separators, 0 otherwise)  Note the first term is bounded to <= base-36 'ZY' as it must fit within 2 characters
        SET i := LOCATE('.',y,2);
        IF i=0 THEN
          SET r := CONCAT(r,LPAD(CONV(LEAST(k,359),10,36),2,'0'),x,y,suf);   # k = count of digits in number, bounded to be <= '9Z' base-36
        ELSE   # encode a version number (like 3.12.707, etc)
          SET r := CONCAT(r,LPAD(CONV(LEAST(k,359),10,36),2,'0'),x);   # k = count of digits in number, bounded to be <= '9Z' base-36
          WHILE LENGTH(y)>0 AND n!=0 DO
            IF i=0 THEN SET x := SUBSTRING(y,2); SET y := ''; ELSE SET x := SUBSTRING(y,2,i-2); SET y := SUBSTRING(y,i); SET i := LOCATE('.',y,2); END IF;
            SET j := LENGTH(x);
            SET x := TRIM(LEADING '0' FROM x);   # strip leading zeros
            SET k := LENGTH(x);
            SET r := CONCAT(r,LPAD(CONV(LEAST(k,359),10,36),2,'0'),x);   # k = count of digits in number, bounded to be <= '9Z' base-36
            SET suf := CONCAT(suf,LPAD(CONV(LEAST((j-k)*2,1294),10,36),2,'0'));   # (j-k)*2 = (count of leading zeros)*2, bounded to fit within 2 base-36 digits
            SET n := n-1;
          END WHILE;
          SET r := CONCAT(r,y,suf);
        END IF;
      END LOOP;
    END
    $$
    delimiter ;
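
    To make the prefix/suffix format from the comment block easier to follow, here is a heavily reduced Python sketch that handles plain unsigned integers only. It is not a port of NatSortKey(): signs, decimals, thousand separators, version strings and the n limit are all left out, and the names `base36_2` and `int_only_key` are invented for this example.

```python
import re

DIGITS36 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base36_2(value: int) -> str:
    """Encode an integer in the range 0..1295 as exactly two base-36 digits."""
    return DIGITS36[value // 36] + DIGITS36[value % 36]

def int_only_key(text: str) -> str:
    """Apply the length prefix and leading-zero suffix to each run of digits.

    Assumption: plain unsigned integers only, so the suffix always starts
    with a space and the "comma-separated" bit is always 0.
    """
    def encode(match: re.Match) -> str:
        raw = match.group()
        digits = raw.lstrip("0")                 # number without leading zeros
        leading_zeros = len(raw) - len(digits)
        prefix = base36_2(min(len(digits), 359))           # capped at '9Z'
        suffix = " " + base36_2(min(leading_zeros * 2, 1294))  # capped at 'ZY'
        return prefix + digits + suffix
    return re.sub(r"[0-9]+", encode, text)

print(int_only_key("Foo055"))           # Foo0255 02
print(int_only_key("I have 0 apples"))  # I have 00 02 apples
```

    Both outputs match the corresponding sample conversions listed in the comment block, which is a quick sanity check on the reading of the format.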
    
  • 2020-11-22 02:56

    Another option is to do the sorting in memory after pulling the data from MySQL. While it won't be the best option from a performance standpoint, if you are not sorting huge lists you should be fine.

    If you take a look at Jeff's post, "Sorting for Humans: Natural Sort Order", you can find plenty of algorithms for whatever language you might be working with.
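
    For instance, a common way to do this in Python is to split each string into alternating text and digit chunks and compare the digit chunks as integers. This is a sketch under the assumption that the rows have already been fetched into a list; the name `natural_key` and the sample values are made up here.

```python
import re

def natural_key(text: str) -> list:
    """Build a sort key of alternating text and integer chunks.

    re.split() with a capturing group always yields text at even
    indices and digit runs at odd indices, so keys from different
    strings stay comparable position by position.
    """
    parts = re.split(r"([0-9]+)", text)
    return [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]

# Stand-in for rows fetched from the database.
rows = ["img12", "img10", "img2", "img1"]
print(sorted(rows, key=natural_key))
```

    Text chunks are case-folded, so this particular key ignores case; drop the `casefold()` call if case should matter.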

  • 2020-11-22 02:57

    A simplified non-UDF version of the best response of @plalx/Richard Toth/Luke Hoggett, which works only for the first integer in the field, is:

    SELECT name,
    LEAST(
        IFNULL(NULLIF(LOCATE('0', name), 0), ~0),
        IFNULL(NULLIF(LOCATE('1', name), 0), ~0),
        IFNULL(NULLIF(LOCATE('2', name), 0), ~0),
        IFNULL(NULLIF(LOCATE('3', name), 0), ~0),
        IFNULL(NULLIF(LOCATE('4', name), 0), ~0),
        IFNULL(NULLIF(LOCATE('5', name), 0), ~0),
        IFNULL(NULLIF(LOCATE('6', name), 0), ~0),
        IFNULL(NULLIF(LOCATE('7', name), 0), ~0),
        IFNULL(NULLIF(LOCATE('8', name), 0), ~0),
        IFNULL(NULLIF(LOCATE('9', name), 0), ~0)
    ) AS first_int
    FROM `table`   -- placeholder name; TABLE is a reserved word, so it must be quoted
    ORDER BY IF(first_int = ~0, name, CONCAT(
        SUBSTR(name, 1, first_int - 1),
        LPAD(CAST(SUBSTR(name, first_int) AS UNSIGNED), LENGTH(~0), '0'),
        SUBSTR(name, first_int + LENGTH(CAST(SUBSTR(name, first_int) AS UNSIGNED)))
    )) ASC
    
  • 2020-11-22 02:58

    Regarding the best response from Richard Toth https://stackoverflow.com/a/12257917/4052357

    Watch out for UTF-8 encoded strings that contain both numbers and 2-byte (or longer) characters, e.g.

    12 南新宿
    

    Using MySQL's LENGTH() in the udf_NaturalSortFormat function returns the byte length of the string, which is incorrect here; use CHAR_LENGTH() instead, which returns the character length.
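
    The gap between the two measurements is easy to see outside the database. The Python snippet below is only an analogy: `len()` on a string counts characters, as CHAR_LENGTH() does, while the length of the UTF-8 encoded bytes corresponds to what LENGTH() reports for UTF-8 data.

```python
text = "12 南新宿"

char_count = len(text)                  # characters, comparable to CHAR_LENGTH()
byte_count = len(text.encode("utf-8"))  # bytes, comparable to LENGTH() for UTF-8 data

print(char_count, byte_count)  # 6 12
```

    Each of the three CJK characters takes 3 bytes in UTF-8, so any position arithmetic based on the byte length overshoots the real end of the string.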

    In my case, using LENGTH() caused queries to never complete and drove MySQL to 100% CPU usage.

    DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
    DELIMITER ;;
    CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50)) 
    RETURNS varchar(4000)
    LANGUAGE SQL
    DETERMINISTIC
    NO SQL
    SQL SECURITY INVOKER
    BEGIN
        DECLARE sortString varchar(4000);
        DECLARE numStartIndex int;
        DECLARE numEndIndex int;
        DECLARE padLength int;
        DECLARE totalPadLength int;
        DECLARE i int;
        DECLARE sameOrderCharsLen int;
    
        SET totalPadLength = 0;
        SET instring = TRIM(instring);
        SET sortString = instring;
        SET numStartIndex = udf_FirstNumberPos(instring);
        SET numEndIndex = 0;
        SET i = 1;
        SET sameOrderCharsLen = CHAR_LENGTH(sameOrderChars);
    
        WHILE (i <= sameOrderCharsLen) DO
            SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
            SET i = i + 1;
        END WHILE;
    
        WHILE (numStartIndex <> 0) DO
            SET numStartIndex = numStartIndex + numEndIndex;
            SET numEndIndex = numStartIndex;
    
            WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
                SET numEndIndex = numEndIndex + 1;
            END WHILE;
    
            SET numEndIndex = numEndIndex - 1;
    
            SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);
    
            IF padLength < 0 THEN
                SET padLength = 0;
            END IF;
    
            SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));
    
            SET totalPadLength = totalPadLength + padLength;
            SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, CHAR_LENGTH(instring) - numEndIndex));
        END WHILE;
    
        RETURN sortString;
    END
    ;;
    DELIMITER ;
    

    p.s. I would have added this as a comment to the original but I don't have enough reputation (yet)

  • 2020-11-22 03:00

    To order:
    0
    1
    2
    10
    23
    101
    205
    1000
    a
    aac
    b
    casdsadsa
    css

    Use this query:

    SELECT 
        column_name 
    FROM 
        table_name 
    ORDER BY
        column_name REGEXP '^\d*[^\da-z&\.\' \-\"\!\@\#\$\%\^\*\(\)\;\:\\,\?\/\~\`\|\_\-]' DESC, 
        column_name + 0, 
        column_name;
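
    For comparison, the target ordering itself (purely numeric values first in numeric order, then everything else alphabetically) can be sketched in Python. This is not a translation of the regular expression above, only an illustration of the intended result; `numbers_first_key` is a name invented for this example.

```python
def numbers_first_key(value: str) -> tuple:
    """Sort purely numeric strings first by numeric value, then all other strings."""
    if value.isascii() and value.isdigit():
        return (0, int(value), value)
    return (1, 0, value)

values = ["css", "10", "a", "1000", "2", "b", "0", "aac",
          "205", "1", "casdsadsa", "23", "101"]
print(sorted(values, key=numbers_first_key))
```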
    
  • 2020-11-22 03:02

    I have tried several solutions, but it is actually very simple when the values differ only in a trailing number: sort by length first, then by value.

    SELECT test_column FROM test_table ORDER BY LENGTH(test_column), test_column
    
    /* 
    Result 
    --------
    value_1
    value_2
    value_3
    value_4
    value_5
    value_6
    value_7
    value_8
    value_9
    value_10
    value_11
    value_12
    value_13
    value_14
    value_15
    ...
    */
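
    The same length-then-value ordering can be tried out in Python to see where it works and where it does not. The key function below is a sketch with an invented name, `length_then_value`, and the sample data is illustrative.

```python
def length_then_value(value: str) -> tuple:
    """Order by string length first, then by ordinary string comparison."""
    return (len(value), value)

# Works when every value shares the same prefix and has no leading zeros.
values = [f"value_{i}" for i in range(15, 0, -1)]
print(sorted(values, key=length_then_value))

# Breaks down once prefixes differ: the shorter string always wins.
print(sorted(["a10", "b2"], key=length_then_value))  # ['b2', 'a10']
```

    So the trick is only safe for columns where the non-numeric part is identical in every row.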
    