Question
I am writing a simple program that counts the characters in a UTF-8 text file, which I have read into a linked list. Everything seems to work well, except that it counts æ, ø and å (the last three letters of the Norwegian alphabet) twice for each instance. So if the string is æøå, I get 6 instead of 3. How can I fix this?
int length()
{
    pointer = root;                         // Reset pointer
    int i;                                  // Index into the data of a node
    int len = 0;                            // Counting characters
    int sizedata = sizeof(pointer->data);   // Sets size limit for data in node
    while (pointer != NULL)
    {
        for (i = 0; i < sizedata; i++)      // Looping through data in node
        {
            if (pointer->data[i] == '\0') break;   // Stops count on end of string
            len++;                          // Counts every byte, not every character
        }
        pointer = pointer->next;            // Linking to next node
    }
    printf("Length of text is: %d characters\n", len);
    return len;                             // The function is declared int, so return the count
}
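Answer 1:
I changed the code according to this site. Everything is the same except for the if statement before len++;. The added test skips UTF-8 continuation bytes, which always have the bit pattern 10xxxxxx, so only the first byte of each sequence is counted.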
int length()
{
    pointer = root;                         // Reset pointer
    int i;                                  // Index into the data of a node
    int len = 0;                            // Counting characters
    int sizedata = sizeof(pointer->data);   // Sets size limit for data in node
    while (pointer != NULL)
    {
        for (i = 0; i < sizedata; i++)      // Looping through data in node
        {
            if (pointer->data[i] == '\0') break;          // Stops count on end of string
            if ((pointer->data[i] & 0xC0) != 0x80)        // Skip continuation bytes (10xxxxxx)
                len++;                                    // Count one per code point
        }
        pointer = pointer->next;            // Linking to next node
    }
    printf("Length of text is: %d characters\n", len);
    return len;                             // The function is declared int, so return the count
}
Note (thanks @Eljay): This is counting Unicode code points (that are encoded in UTF-8), but not characters (glyphs). Some characters are made up of multiple code points. For example, x̝̌ is 78 cc 9d cc 8c, for the x and the two combining code points. This routine would count that 1 character as a length of 3 (code points).
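To make the note concrete, here is a minimal sketch of the same continuation-byte test applied to a plain NUL-terminated string instead of the linked list. The name count_code_points is my own and is not part of the answer above.

```c
#include <assert.h>
#include <stddef.h>

/* Count UTF-8 code points in a NUL-terminated string by skipping
 * continuation bytes (bit pattern 10xxxxxx).
 * count_code_points is a hypothetical helper name used for illustration.
 * It assumes well-formed UTF-8 and does no validation. */
size_t count_code_points(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;                /* first byte of a sequence, or ASCII */
    }
    return n;
}
```

Called on "\xC3\xA6\xC3\xB8\xC3\xA5" (æøå, six bytes) it returns 3, which is what the question asks for. Called on "x\xCC\x9D\xCC\x8C" (x̝̌) it also returns 3, although a reader sees a single character there.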
Answer 2:
Your text file seems to be encoded in UTF-8, so you should respect the encoded length of each character, which can be derived from the first byte of its byte sequence; see this Wikipedia article.
The code below just counts the length of a single string, i.e. it does not make use of your linked list structure.
The length of a byte sequence is stored in an array for easy look-up; -1 marks an illegal value for a first byte in a sequence. The bytes marked cont are continuation bytes and should only occur after the first byte of a sequence in well-formed UTF-8 strings. Igor's solution, which is admirably concise in comparison to this one, just skips them.
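For comparison, the same classification can be written with range tests instead of a table. This is only a sketch of an equivalent mapping; utf8_seq_len is a name I made up, and the look-up array remains the approach this answer actually uses.

```c
/* Sequence length implied by a UTF-8 first byte, or -1 if the byte
 * cannot start a sequence. utf8_seq_len is a hypothetical name. */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;     /* 0xxxxxxx: single-byte (ASCII) */
    if (b < 0xC2) return -1;    /* 0x80..0xBF continuation, 0xC0/0xC1 overlong */
    if (b < 0xE0) return 2;     /* 110xxxxx */
    if (b < 0xF0) return 3;     /* 1110xxxx */
    if (b < 0xF5) return 4;     /* 11110xxx, up to U+10FFFF */
    return -1;                  /* 0xF5..0xFF never occur in UTF-8 */
}
```

The table trades 256 bytes of memory for a single indexed load; the range tests are shorter to read and harder to mistype.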
I've cast the char pointer to a byte pointer (uint8_t, from <stdint.h>), so that the array indices are guaranteed to be non-negative.
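The reason for the cast, as a small sketch: whether plain char is signed is implementation-defined, and where it is signed, every byte from 0x80 upwards is negative and would index before the start of the table. The helper name byte_index is hypothetical.

```c
#include <stdint.h>

/* Turn a char into a table index in the range 0..255.
 * Converting to uint8_t first makes the result independent of
 * whether plain char is signed on the target platform.
 * byte_index is a hypothetical helper name. */
unsigned byte_index(char c)
{
    return (uint8_t)c;
}
```

For the first byte of æ (0xC3) this gives 195 on every platform, whereas using the char directly as an index would give -61 wherever char is signed.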
This solution is probably needlessly long, but may serve as a starting point when trying to decode the characters (as opposed to just counting them).
Anyway, here goes:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
static const signed char utf8_len[256] = {   /* signed: plain char may be unsigned, which would turn -1 into 255 */
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, /* cont */
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, /* cont */
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, /* cont */
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, /* cont */
-1, -1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
};
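/*
 * Return the character (code point) count of a UTF-8 encoded string;
 * -1 indicates a decoding error.
 */
int str_length(const char *s)
{
    const uint8_t *p = (const uint8_t *) s;
    int len = 0;

    while (*p) {
        int cl = utf8_len[*p];
        int i;

        if (cl <= 0) return -1;                     /* illegal first byte */
        for (i = 1; i < cl; i++) {                  /* the rest must be 10xxxxxx */
            if ((p[i] & 0xC0) != 0x80) return -1;   /* also stops at a premature '\0' */
        }
        len++;
        p += cl;
    }
    return len;
}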
Source: https://stackoverflow.com/questions/25803627/how-to-correctly-count-%c3%a6-%c3%b8-%c3%a5-unicode-as-utf-8-characters-in-c