How to correctly count æ ø å (Unicode as UTF-8) characters in C?

Question

I am writing a simple program that counts the characters in a UTF-8 text file, which I read into a linked list. Everything seems to work well, except that it counts æ, ø and å (the last three letters of the Norwegian alphabet) twice for each instance. So if the string is æøå, I get 6 instead of 3. How can I fix this?

int length()
{
  pointer = root; // Reset pointer
  int i; // Looping through data in node 
  int len = 0; // Counting characters
  int sizedata = sizeof(pointer->data); // Sets size limit for data in node

  while(pointer != NULL)
    {
      for(i = 0; i < sizedata; i++) // Looping through data in node
        {
          if(pointer->data[i] == '\0') break; // Stops count on end of string
          len++; // Counting characters
        }
      pointer = pointer->next; // Linking to next node
    }
  printf("Length of text is: %d characters\n", len);
}

Answer 1:

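I changed the code according to this site. Everything is the same except for the if statement before len++; that test skips UTF-8 continuation bytes (bit pattern 10xxxxxx), so each multi-byte sequence is counted only once.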

int length()
{
    pointer = root; // Reset pointer
    int i; // Looping through data in node 
    int len = 0; // Counting characters
    int sizedata = sizeof(pointer->data); // Sets size limit for data in node

    while(pointer != NULL)
    {
        for(i = 0; i < sizedata; i++) // Looping through data in node
        {
            if(pointer->data[i] == '\0') break; // Stops count on end of string
            if ((pointer->data[i] & 0xC0) != 0x80) // Skip continuation bytes (10xxxxxx)
                len++; // Count one per code point
        }
        pointer = pointer->next; // Linking to next node
    }
    printf("Length of text is: %d characters\n", len);
    return len; // The function is declared int, so return the count
}

Note (thanks @Eljay): This counts Unicode code points (encoded in UTF-8), not characters (glyphs). Some characters are made up of multiple code points. For example, x̝̌ is 78 cc 9d cc 8c: the x followed by two combining code points. This routine would count that one character as a length of 3 (code points).
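To make the note concrete, here is a minimal standalone sketch of the same continuation-byte test applied to a plain NUL-terminated string instead of the linked list. The function name count_code_points is my own, not something from the question, and the sketch assumes the input is well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count UTF-8 code points in a NUL-terminated string by counting every
 * byte that is NOT a continuation byte (bit pattern 10xxxxxx).
 * Hypothetical helper; assumes well-formed UTF-8 input.
 */
size_t count_code_points(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Cast to unsigned char so the mask works on the raw byte value. */
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

Called on the bytes c3 a6 c3 b8 c3 a5 (æøå) it returns 3, and called on 78 cc 9d cc 8c (x̝̌) it also returns 3, even though the latter displays as a single glyph. Counting glyphs would need grapheme cluster segmentation, which this sketch does not attempt.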

Answer 2:

Your text file seems to be encoded in UTF-8. In that case you should respect the encoded length of each character, which can be derived from the first byte of its byte sequence; see this Wikipedia article.
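As a rough illustration of "derived from the first byte", the sequence length can be computed with a few range checks. This is only a sketch; the name utf8_seq_len is made up for this example.

```c
/*
 * Return the length in bytes (1..4) of the UTF-8 sequence that starts
 * with lead byte b, or -1 if b cannot start a sequence.
 * Hypothetical helper for illustration.
 */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;   /* 0xxxxxxx: ASCII */
    if (b < 0xC2) return -1;  /* 0x80..0xBF: continuation; 0xC0, 0xC1: overlong */
    if (b < 0xE0) return 2;   /* 110xxxxx */
    if (b < 0xF0) return 3;   /* 1110xxxx */
    if (b < 0xF5) return 4;   /* 11110xxx, up to U+10FFFF */
    return -1;                /* 0xF5..0xFF: never valid in UTF-8 */
}
```

For æ, whose first byte is c3, this returns 2, which is why a byte-by-byte count reports two "characters" for it.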

The code below just counts the length of a single string, i.e. it does not make use of your linked list structure.

The length of a byte sequence is stored in an array for easy look-up; -1 marks an illegal value for a first byte in a sequence. The bytes marked cont are continuation bytes and should only occur after the first byte of a sequence in well-formed UTF-8 strings. Igor's solution, which is admirably concise in comparison to this one, just skips them.

I've cast the char pointer to a pointer to bytes (uint8_t from <stdint.h>), so that the array indices are guaranteed to be non-negative.

This solution is probably needlessly long, but may serve as a starting point when trying to decode the characters (as opposed to just counting them).

Anyway, here goes:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

/* signed char: plain char may be unsigned, which would turn -1 into 255 */
static const signed char utf8_len[256] = {
    1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
    1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
    1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
    1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
    1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
    1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
    1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
    1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
   -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,   /* cont */
   -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,   /* cont */
   -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,   /* cont */
   -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,   /* cont */
   -1, -1,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
    2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
    3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
    4,  4,  4,  4,  4, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
};

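/*
 *      Return the character count of a UTF-8 encoded string; -1 indicates
 *      a decoding error.
 */
int str_length(const char *s)
{
    const uint8_t *p = (const uint8_t *) s;
    int len = 0;

    while (*p) {
        int cl = utf8_len[*p];
        int i;

        if (cl <= 0) return -1;

        /* Each following byte must be a continuation byte (10xxxxxx).
           This also keeps us from stepping past the terminating '\0'
           when the last sequence is truncated. */
        for (i = 1; i < cl; i++) {
            if ((p[i] & 0xC0) != 0x80) return -1;
        }

        len++;
        p += cl;
    }

    return len;
}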
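Since I mentioned decoding as a possible next step, here is a rough sketch of what decoding a single code point could look like. The name utf8_decode and its signature are my own invention for this example. It checks lead and continuation bytes, but it does not reject overlong three- and four-byte forms, surrogates, or values above U+10FFFF, so treat it as a starting point rather than a validating decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one code point starting at s. On success, store it in *cp and
 * return the number of bytes consumed (1..4); return -1 on a bad lead
 * byte or a missing continuation byte. Hypothetical helper.
 */
int utf8_decode(const char *s, uint32_t *cp)
{
    const uint8_t *p = (const uint8_t *) s;
    uint32_t c;
    int len, i;

    assert(s != NULL && cp != NULL);

    /* The lead byte gives the sequence length and the top payload bits. */
    if (p[0] < 0x80)      { c = p[0];        len = 1; }
    else if (p[0] < 0xC2) { return -1; }
    else if (p[0] < 0xE0) { c = p[0] & 0x1F; len = 2; }
    else if (p[0] < 0xF0) { c = p[0] & 0x0F; len = 3; }
    else if (p[0] < 0xF5) { c = p[0] & 0x07; len = 4; }
    else                  { return -1; }

    /* Each continuation byte contributes six more payload bits.
       A '\0' fails the test, so we never read past the terminator. */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return -1;
        c = (c << 6) | (p[i] & 0x3F);
    }

    *cp = c;
    return len;
}
```

For the bytes c3 a6 this stores 0xE6 (U+00E6, æ) and returns 2. Advancing the pointer by the return value lets you walk a string one code point at a time.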

Source: https://stackoverflow.com/questions/25803627/how-to-correctly-count-%c3%a6-%c3%b8-%c3%a5-unicode-as-utf-8-characters-in-c

Tags

unicode

character

counting