Case Insensitive String comp in C

天涯浪人 2020-11-27 03:45

I have two postcodes char* that I want to compare, ignoring case. Is there a function to do this?

Or do I have to loop through each string, call tolower on every character, and then compare?

11 answers
  • 2020-11-27 03:55

    Simple solution:

    #include <ctype.h>  // tolower

    int str_case_ins_cmp(const char* a, const char* b) {
      int rc;
    
      while (1) {
        rc = tolower((unsigned char)*a) - tolower((unsigned char)*b);
        if (rc || !*a) {
          break;
        }
    
        ++a;
        ++b;
      }
    
      return rc;
    }
    
  • 2020-11-27 04:02

    If your library doesn't provide one, you can get an idea of how to implement an efficient version from here

    It uses a lookup table covering all 256 char values.

    • For every char except the upper case letters, the table entry is the char's own ASCII code.
    • For upper case letter codes, the table entry is the code of the corresponding lower case letter.

    Then we just need to traverse the strings and compare the table entries for each pair of chars:

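    // unsigned char, so that bytes >= 0x80 are never used as negative table indexes
    const unsigned char *cm = charmap,
            *us1 = (const unsigned char *)s1,
            *us2 = (const unsigned char *)s2;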
    while (cm[*us1] == cm[*us2++])
        if (*us1++ == '\0')
            return (0);
    return (cm[*us1] - cm[*--us2]);
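
    The fragment above is only a function body and assumes a prebuilt 256-entry charmap. A minimal self-contained sketch of the same idea follows. It fills the table at run time and handles ASCII letters only; the names charmap_init and table_strcasecmp are invented for this example and do not come from any library.

```c
#include <limits.h>

/* One entry per possible unsigned char value. */
static unsigned char charmap[UCHAR_MAX + 1];

/* Fill the table: identity for every code except 'A'..'Z',
   which map to 'a'..'z'. Assumes an ASCII-compatible character set. */
void charmap_init(void) {
  for (int i = 0; i <= UCHAR_MAX; i++) {
    charmap[i] = (unsigned char)i;
  }
  for (int c = 'A'; c <= 'Z'; c++) {
    charmap[c] = (unsigned char)(c - 'A' + 'a');
  }
}

/* Compare through the table; call charmap_init() once beforehand. */
int table_strcasecmp(const char *s1, const char *s2) {
  const unsigned char *us1 = (const unsigned char *)s1;
  const unsigned char *us2 = (const unsigned char *)s2;

  while (charmap[*us1] == charmap[*us2]) {
    if (*us1 == '\0') {
      return 0; /* both strings ended at the same position */
    }
    us1++;
    us2++;
  }
  return charmap[*us1] - charmap[*us2];
}
```

    After one call to charmap_init(), table_strcasecmp("Hello", "hELLO") returns 0, and a string that is a proper prefix of the other compares as smaller.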
    
  • 2020-11-27 04:05

    I'm not really a fan of the most-upvoted answer here, in part because it appears to be incorrect: it should keep going when it reads a null terminator in only one of the two strings, and it doesn't. So I wrote my own.

    This is a direct drop-in replacement for strncmp(), and has been tested with numerous test cases, as shown below.

    It is identical to strncmp() except:

    1. It is case-insensitive.
    2. The behavior is NOT undefined (it is well-defined) if either string is a null ptr. Regular strncmp() has undefined behavior if either string is a null ptr (see: https://en.cppreference.com/w/cpp/string/byte/strncmp).
    3. It returns INT_MIN as a special sentinel error value if either input string is a NULL ptr.

    LIMITATIONS: Note that this code works on the original 7-bit ASCII character set only (decimal values 0 to 127, inclusive), NOT on Unicode text in encodings such as UTF-8 (the most popular), UTF-16, or UTF-32.

    Here is the code only (no comments):

    #include <ctype.h>
    #include <limits.h>
    #include <stddef.h>

    int strncmpci(const char * str1, const char * str2, size_t num)
    {
        int ret_code = 0;
        size_t chars_compared = 0;
    
        if (!str1 || !str2)
        {
            ret_code = INT_MIN;
            return ret_code;
        }
    
        while ((*str1 || *str2) && (chars_compared < num))
        {
            ret_code = tolower((unsigned char)(*str1)) - tolower((unsigned char)(*str2));
            if (ret_code != 0)
            {
                break;
            }
            chars_compared++;
            str1++;
            str2++;
        }
    
        return ret_code;
    }
    

    Fully-commented version:

    /// \brief      Perform a case-insensitive string compare (`strncmp()` case-insensitive) to see
    ///             if two C-strings are equal.
    /// \note       1. Identical to `strncmp()` except:
    ///               1. It is case-insensitive.
    ///               2. The behavior is NOT undefined (it is well-defined) if either string is a null
    ///               ptr. Regular `strncmp()` has undefined behavior if either string is a null ptr
    ///               (see: https://en.cppreference.com/w/cpp/string/byte/strncmp).
    ///               3. It returns `INT_MIN` as a special sentinel value for certain errors.
    ///             - Posted as an answer here: https://stackoverflow.com/a/55293507/4561887.
    ///               - Aided/inspired, in part, by `strcicmp()` here:
    ///                 https://stackoverflow.com/a/5820991/4561887.
    /// \param[in]  str1        C string 1 to be compared.
    /// \param[in]  str2        C string 2 to be compared.
    /// \param[in]  num         max number of chars to compare
    /// \return     A comparison code (identical to `strncmp()`, except with the addition
    ///             of `INT_MIN` as a special sentinel value):
    ///
    ///             INT_MIN (usually -2147483648 for int32_t integers)  Invalid arguments (one or both
    ///                      of the input strings is a NULL pointer).
    ///             <0       The first character that does not match has a lower value in str1 than
    ///                      in str2.
    ///              0       The contents of both strings are equal.
    ///             >0       The first character that does not match has a greater value in str1 than
    ///                      in str2.
    int strncmpci(const char * str1, const char * str2, size_t num)
    {
        int ret_code = 0;
        size_t chars_compared = 0;
    
        // Check for NULL pointers
        if (!str1 || !str2)
        {
            ret_code = INT_MIN;
            return ret_code;
        }
    
        // Continue doing case-insensitive comparisons, one-character-at-a-time, of `str1` to `str2`,
        // as long as at least one of the strings still has more characters in it, and we have
        // not yet compared `num` chars.
        while ((*str1 || *str2) && (chars_compared < num))
        {
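            // Cast to unsigned char first: passing a negative char value to tolower() is undefined
            ret_code = tolower((unsigned char)(*str1)) - tolower((unsigned char)(*str2));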
            if (ret_code != 0)
            {
                // The 2 chars just compared don't match
                break;
            }
            chars_compared++;
            str1++;
            str2++;
        }
    
        return ret_code;
    }
    

    Test code:

    Download the entire sample code, with unit tests, from my eRCaGuy_hello_world repository here: "strncmpci.c":

    (this is just a snippet)

    int main()
    {
        printf("-----------------------\n"
               "String Comparison Tests\n"
               "-----------------------\n\n");
    
        int num_failures_expected = 0;
    
        printf("INTENTIONAL UNIT TEST FAILURE to show what a unit test failure looks like!\n");
        EXPECT_EQUALS(strncmpci("hey", "HEY", 3), 'h' - 'H');
        num_failures_expected++;
        printf("------ beginning ------\n\n");
    
    
        const char * str1;
        const char * str2;
        size_t n;
    
        // NULL ptr checks
        EXPECT_EQUALS(strncmpci(NULL, "", 0), INT_MIN);
        EXPECT_EQUALS(strncmpci("", NULL, 0), INT_MIN);
        EXPECT_EQUALS(strncmpci(NULL, NULL, 0), INT_MIN);
        EXPECT_EQUALS(strncmpci(NULL, "", 10), INT_MIN);
        EXPECT_EQUALS(strncmpci("", NULL, 10), INT_MIN);
        EXPECT_EQUALS(strncmpci(NULL, NULL, 10), INT_MIN);
    
        EXPECT_EQUALS(strncmpci("", "", 0), 0);
        EXPECT_EQUALS(strncmp("", "", 0), 0);
    
        str1 = "";
        str2 = "";
        n = 0;
        EXPECT_EQUALS(strncmpci(str1, str2, n), 0);
        EXPECT_EQUALS(strncmp(str1, str2, n), 0);
    
        str1 = "hey";
        str2 = "HEY";
        n = 0;
        EXPECT_EQUALS(strncmpci(str1, str2, n), 0);
        EXPECT_EQUALS(strncmp(str1, str2, n), 0);
    
        str1 = "hey";
        str2 = "HEY";
        n = 3;
        EXPECT_EQUALS(strncmpci(str1, str2, n), 0);
        EXPECT_EQUALS(strncmp(str1, str2, n), 'h' - 'H');
    
        str1 = "heY";
        str2 = "HeY";
        n = 3;
        EXPECT_EQUALS(strncmpci(str1, str2, n), 0);
        EXPECT_EQUALS(strncmp(str1, str2, n), 'h' - 'H');
    
        str1 = "hey";
        str2 = "HEdY";
        n = 3;
        EXPECT_EQUALS(strncmpci(str1, str2, n), 'y' - 'd');
        EXPECT_EQUALS(strncmp(str1, str2, n), 'h' - 'H');
    
        str1 = "heY";
        str2 = "hEYd";
        n = 3;
        EXPECT_EQUALS(strncmpci(str1, str2, n), 0);
        EXPECT_EQUALS(strncmp(str1, str2, n), 'e' - 'E');
    
        str1 = "heY";
        str2 = "heyd";
        n = 6;
        EXPECT_EQUALS(strncmpci(str1, str2, n), -'d');
        EXPECT_EQUALS(strncmp(str1, str2, n), 'Y' - 'y');
    
        str1 = "hey";
        str2 = "hey";
        n = 6;
        EXPECT_EQUALS(strncmpci(str1, str2, n), 0);
        EXPECT_EQUALS(strncmp(str1, str2, n), 0);
    
        str1 = "hey";
        str2 = "heyd";
        n = 6;
        EXPECT_EQUALS(strncmpci(str1, str2, n), -'d');
        EXPECT_EQUALS(strncmp(str1, str2, n), -'d');
    
        str1 = "hey";
        str2 = "heyd";
        n = 3;
        EXPECT_EQUALS(strncmpci(str1, str2, n), 0);
        EXPECT_EQUALS(strncmp(str1, str2, n), 0);
    
        str1 = "hEY";
        str2 = "heyYOU";
        n = 3;
        EXPECT_EQUALS(strncmpci(str1, str2, n), 0);
        EXPECT_EQUALS(strncmp(str1, str2, n), 'E' - 'e');
    
        str1 = "hEY";
        str2 = "heyYOU";
        n = 10;
        EXPECT_EQUALS(strncmpci(str1, str2, n), -'y');
        EXPECT_EQUALS(strncmp(str1, str2, n), 'E' - 'e');
    
        str1 = "hEYHowAre";
        str2 = "heyYOU";
        n = 10;
        EXPECT_EQUALS(strncmpci(str1, str2, n), 'h' - 'y');
        EXPECT_EQUALS(strncmp(str1, str2, n), 'E' - 'e');
    
        EXPECT_EQUALS(strncmpci("nice to meet you.,;", "NICE TO MEET YOU.,;", 100), 0);
        EXPECT_EQUALS(strncmp(  "nice to meet you.,;", "NICE TO MEET YOU.,;", 100), 'n' - 'N');
        EXPECT_EQUALS(strncmp(  "nice to meet you.,;", "nice to meet you.,;", 100), 0);
    
        EXPECT_EQUALS(strncmpci("nice to meet you.,;", "NICE TO UEET YOU.,;", 100), 'm' - 'u');
        EXPECT_EQUALS(strncmp(  "nice to meet you.,;", "nice to uEET YOU.,;", 100), 'm' - 'u');
        EXPECT_EQUALS(strncmp(  "nice to meet you.,;", "nice to UEET YOU.,;", 100), 'm' - 'U');
    
        EXPECT_EQUALS(strncmpci("nice to meet you.,;", "NICE TO MEET YOU.,;", 5), 0);
        EXPECT_EQUALS(strncmp(  "nice to meet you.,;", "NICE TO MEET YOU.,;", 5), 'n' - 'N');
    
        EXPECT_EQUALS(strncmpci("nice to meet you.,;", "NICE eo UEET YOU.,;", 5), 0);
        EXPECT_EQUALS(strncmp(  "nice to meet you.,;", "nice eo uEET YOU.,;", 5), 0);
    
        EXPECT_EQUALS(strncmpci("nice to meet you.,;", "NICE eo UEET YOU.,;", 100), 't' - 'e');
        EXPECT_EQUALS(strncmp(  "nice to meet you.,;", "nice eo uEET YOU.,;", 100), 't' - 'e');
    
        EXPECT_EQUALS(strncmpci("nice to meet you.,;", "nice-eo UEET YOU.,;", 5), ' ' - '-');
        EXPECT_EQUALS(strncmp(  "nice to meet you.,;", "nice-eo UEET YOU.,;", 5), ' ' - '-');
    
    
        if (globals.error_count == num_failures_expected)
        {
            printf(ANSI_COLOR_GRN "All unit tests passed!" ANSI_COLOR_OFF "\n");
        }
        else
        {
            printf(ANSI_COLOR_RED "FAILED UNIT TESTS! NUMBER OF UNEXPECTED FAILURES = %i"
                ANSI_COLOR_OFF "\n", globals.error_count - num_failures_expected);
        }
    
        assert(globals.error_count == num_failures_expected);
        return globals.error_count;
    }
    

    Sample output:

    $ gcc -Wall -Wextra -Werror -ggdb -std=c11 -o ./bin/tmp strncmpci.c && ./bin/tmp
    -----------------------
    String Comparison Tests
    -----------------------
    
    INTENTIONAL UNIT TEST FAILURE to show what a unit test failure looks like!
    FAILED at line 250 in function main! strncmpci("hey", "HEY", 3) != 'h' - 'H'
      a: strncmpci("hey", "HEY", 3) is 0
      b: 'h' - 'H' is 32
    
    ------ beginning ------
    
    All unit tests passed!
    

    References:

    1. This question & other answers here served as inspiration and gave some insight (Case Insensitive String comp in C)
    2. http://www.cplusplus.com/reference/cstring/strncmp/
    3. https://en.wikipedia.org/wiki/ASCII
    4. https://en.cppreference.com/w/c/language/operator_precedence

    Topics to further research

    1. (Note: this is C++, not C) Lowercase of Unicode character
    2. tolower_tests.c on OnlineGDB: https://onlinegdb.com/HyZieXcew

    TODO:

    1. Make a version of this code which also works on Unicode's UTF-8 implementation (character encoding)!
  • 2020-11-27 04:06

    I found built-in functions for this in <strings.h>, which provides string functions in addition to those in the standard header <string.h>.

    Here are the relevant signatures:

    int  strcasecmp(const char *, const char *);
    int  strncasecmp(const char *, const char *, size_t);
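
    A short sketch of how these two calls might be used, assuming a POSIX system where <strings.h> is available. The helper names equal_ci and has_prefix_ci are made up for illustration.

```c
#include <string.h>   /* strlen */
#include <strings.h>  /* strcasecmp, strncasecmp (POSIX, not ISO C) */

/* Hypothetical helper: 1 if the two strings are equal ignoring case, else 0. */
int equal_ci(const char *a, const char *b) {
  return strcasecmp(a, b) == 0;
}

/* Hypothetical helper: 1 if s begins with prefix ignoring case, else 0.
   Only the first strlen(prefix) characters are compared. */
int has_prefix_ci(const char *s, const char *prefix) {
  return strncasecmp(s, prefix, strlen(prefix)) == 0;
}
```

    For the postcode case from the question, equal_ci("sw1a 1aa", "SW1A 1AA") returns 1.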
    

    I also found its counterpart in the XNU kernel (osfmk/device/subrs.c), implemented as shown below, so you shouldn't expect its numeric return values to behave differently from those of the original strcmp function.

    int tolower(unsigned char ch) {
        if (ch >= 'A' && ch <= 'Z')
            ch = 'a' + (ch - 'A');
        return ch;
    }
    
    int strcasecmp(const char *s1, const char *s2) {
        const unsigned char *us1 = (const u_char *)s1,
                            *us2 = (const u_char *)s2;
    
        while (tolower(*us1) == tolower(*us2++))
            if (*us1++ == '\0')
                return (0);
        return (tolower(*us1) - tolower(*--us2));
    }
    
  • 2020-11-27 04:06

    Additional pitfalls to watch out for when doing case insensitive compares:


    Comparing as lower or as upper case? (common enough issue)

    Both below will return 0 with strcicmpL("A", "a") and strcicmpU("A", "a").
    Yet strcicmpL("A", "_") and strcicmpU("A", "_") can return different signed results as '_' is often between the upper and lower case letters.

    This affects the sort order when used with qsort(..., ..., ..., strcicmp). C functions outside the standard library, such as the commonly available stricmp() or strcasecmp(), tend to be well defined and favor comparing via lowercase. Yet variations exist.

    int strcicmpL(char const *a, char const *b) {
      while (*a) {
        int d = tolower(*a) - tolower(*b);
        if (d) {
            return d;
        } 
        a++;
        b++;
      } 
      return -tolower(*b);  // a has ended: 0 if b ended too, negative if b is longer
    }
    
    int strcicmpU(char const *a, char const *b) {
      while (*a) {
        int d = toupper(*a) - toupper(*b);
        if (d) {
            return d;
        } 
        a++;
        b++;
      } 
      return -toupper(*b);  // a has ended: 0 if b ended too, negative if b is longer
    }
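
    To make the sort-order point concrete: in ASCII, '_' (95) lies between 'A' (65) and 'a' (97), so the two folding strategies disagree about which side of '_' a letter falls on. The sketch below assumes an ASCII execution character set; the function names are invented for this illustration.

```c
#include <ctype.h>

/* Difference of two characters after folding both to lower case. */
int diff_lower(char x, char y) {
  return tolower((unsigned char)x) - tolower((unsigned char)y);
}

/* Difference of two characters after folding both to upper case. */
int diff_upper(char x, char y) {
  return toupper((unsigned char)x) - toupper((unsigned char)y);
}
```

    With ASCII, diff_lower('A', '_') is 'a' - '_' = 97 - 95 = 2, while diff_upper('A', '_') is 'A' - '_' = 65 - 95 = -30. A qsort() comparator built on the first places "A" after "_", and one built on the second places it before.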
    

    char can have a negative value. (not rare)

    toupper(int) and tolower(int) are specified for unsigned char values and the negative EOF. Further, strcmp() returns results as if each char was converted to unsigned char, regardless of whether char is signed or unsigned.

    tolower(*a); // Potential UB
    tolower((unsigned char) *a); // Correct
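
    Applied to a whole comparison loop, the cast might look like the sketch below. The name strcicmp_uc is made up, and folding to lower case is an arbitrary choice here.

```c
#include <ctype.h>

int strcicmp_uc(const char *a, const char *b) {
  for (;; a++, b++) {
    /* Convert to unsigned char first so tolower() never sees a negative value. */
    int ca = tolower((unsigned char)*a);
    int cb = tolower((unsigned char)*b);
    /* Stop at the first difference, or when both strings end together. */
    if (ca != cb || ca == 0) {
      return ca - cb;
    }
  }
}
```

    Because both operands go through unsigned char, a byte such as '\xE4' compares greater than 'a' in the default "C" locale, which matches the ordering strcmp() uses.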
    

    Locale (less common)

    Although character sets using ASCII code (0-127) are ubiquitous, the remaining codes tend to have locale-specific issues. So strcasecmp("\xE4", "a") might return 0 on one system and non-zero on another.
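
    One way to observe this is sketched below. In the default "C" locale only A-Z and a-z are case-mapped, so a byte such as 0xC4 is returned unchanged; under a locale that uses ISO-8859-1 it may fold to 0xE4. The helper name lower_in_locale is invented for this example, and whether any particular non-"C" locale name exists depends on the system, which is why the setlocale() result is checked.

```c
#include <ctype.h>
#include <locale.h>

/* Fold one byte to lower case under the named LC_CTYPE locale.
   Returns -1 if that locale is not available on this system. */
int lower_in_locale(const char *locale_name, unsigned char byte) {
  if (setlocale(LC_CTYPE, locale_name) == NULL) {
    return -1;
  }
  int result = tolower(byte);
  setlocale(LC_CTYPE, "C"); /* switch back to the "C" locale for later code */
  return result;
}
```

    lower_in_locale("C", 0xC4) returns 0xC4. A call with a locale name such as "de_DE.ISO-8859-1" (an assumption, not installed everywhere) may return 0xE4 where that locale exists, or -1 where it does not.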


    Unicode (the way of the future)

    If a solution needs to handle more than ASCII consider a unicode_strcicmp(). As C lib does not provide such a function, a pre-coded function from some alternate library is recommended. Writing your own unicode_strcicmp() is a daunting task.


    Do all letters map one lower to one upper? (pedantic)

    [A-Z] maps one-to-one with [a-z], yet various locales map various lower case characters to one upper and vice versa. Further, some uppercase characters may lack a lower case equivalent and again, vice versa.

    This obliges code to convert through both toupper() and tolower().

    int d = tolower(toupper(*a)) - tolower(toupper(*b));
    

    Again, sorting results can differ depending on whether code uses tolower(toupper(*a)) or toupper(tolower(*a)).


    Portability

    @B. Nadolson recommends against rolling your own strcicmp(), which is reasonable except when code needs equivalent functionality that is highly portable.

    Below is an approach that even performed faster than some system-provided functions. It does a single compare per loop rather than two by using 2 different tables that differ only in their entry for '\0'. Your results may vary.

    static unsigned char low1[UCHAR_MAX + 1] = {
      0, 1, 2, 3, ...
      '@', 'a', 'b', 'c', ... 'z', '[', ...  // @ABC... Z[...
      '`', 'a', 'b', 'c', ... 'z', '{', ...  // `abc... z{...
    };
    static unsigned char low2[UCHAR_MAX + 1] = {
    // v--- Not zero, but A which matches none in `low1[]`
      'A', 1, 2, 3, ...
      '@', 'a', 'b', 'c', ... 'z', '[', ...
      '`', 'a', 'b', 'c', ... 'z', '{', ...
    };
    
    int strcicmp_ch(char const *a, char const *b) {
      // compare using tables that differ slightly.
      while (low1[(unsigned char)*a] == low2[(unsigned char)*b]) {
        a++;
        b++;
      }
      // Either strings differ or null character detected.
      // Perform subtraction using same table.
      return (low1[(unsigned char)*a] - low1[(unsigned char)*b]);
    }
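
    The tables above are abbreviated with "...". One way to try the idea without typing out 256 entries is to fill both tables at run time, as in the sketch below. It assumes an ASCII-compatible character set; low_tables_init is a name invented here, and it must be called once before strcicmp_ch().

```c
#include <ctype.h>
#include <limits.h>

static unsigned char low1[UCHAR_MAX + 1];
static unsigned char low2[UCHAR_MAX + 1];

/* Fill both tables with the lower-case fold of each code, then give
   low2[0] a value that low1[] never contains, so the loop in
   strcicmp_ch() stops whenever b reaches its terminator. */
void low_tables_init(void) {
  for (int i = 0; i <= UCHAR_MAX; i++) {
    low1[i] = low2[i] = (unsigned char)tolower(i);
  }
  low2[0] = 'A'; /* no low1[] entry is 'A', because low1['A'] is 'a' */
}

int strcicmp_ch(char const *a, char const *b) {
  /* One compare per character, using tables that differ only at index 0. */
  while (low1[(unsigned char)*a] == low2[(unsigned char)*b]) {
    a++;
    b++;
  }
  /* Either the strings differ or b's null character was reached.
     Perform the subtraction using the same table for both. */
  return low1[(unsigned char)*a] - low1[(unsigned char)*b];
}
```

    The loop also stops when only a ends, since low1[0] is 0 and no non-null character folds to 0, so neither pointer runs past its terminator.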
    
  • 2020-11-27 04:13

    As others have stated, there is no portable function that works on all systems. You can partially work around this with a simple #ifdef:

    #include <stdio.h>
    
    #ifdef _WIN32
    #include <string.h>
    #define strcasecmp _stricmp
    #else // assuming POSIX or BSD compliant system
    #include <strings.h>
    #endif
    
    int main(void) {
        printf("%d\n", strcasecmp("teSt", "TEst"));  // prints 0: the strings match ignoring case
        return 0;
    }
    