UTF-8 to Unicode converter for embedded system display

Question

I have an embedded system that gets UTF-8 encoded data to display via UPnP. The display device can display characters, so I need a way to convert the UTF-8 data I receive via UPnP to Unicode. The display is driven by a PIC, and the data is sent to it by a UPnP bridge running Linux. Is there a simple way to do the conversion on the Linux side before I send the data to the display board?

Answer 1:

If you have a real operating system and hosted C environment at your disposal, the best approach would be to simply ensure that your program runs in a locale that uses UTF-8 as its encoding and use mbrtowc or mbtowc to convert UTF-8 sequences to Unicode codepoint values (wchar_t is a Unicode codepoint number on Linux and any C implementation that defines __STDC_ISO_10646__).
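
As a rough illustration of the locale-based approach, here is a minimal sketch built on setlocale and mbrtowc. The helper names select_utf8_locale and utf8_to_wchars are hypothetical, and the locale names "C.UTF-8" and "en_US.UTF-8" are assumptions about what is installed on the bridge; adjust them to whatever UTF-8 locale your system provides.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: try a few common UTF-8 locale names.
   Returns 1 if one could be selected, 0 otherwise. */
static int select_utf8_locale(void) {
  static const char *const names[] = { "C.UTF-8", "en_US.UTF-8" };
  for (size_t k = 0; k < sizeof names / sizeof names[0]; k++) {
    if (setlocale(LC_CTYPE, names[k]) != NULL) {
      return 1;
    }
  }
  return 0;
}

/* Hypothetical helper: decode len bytes of UTF-8 into out[0..cap).
   Returns the number of wide characters written, or (size_t)-1 if the
   input is malformed or truncated, or if out is too small. */
static size_t utf8_to_wchars(const char *s, size_t len,
                             wchar_t *out, size_t cap) {
  assert(s != NULL && out != NULL);
  mbstate_t st;
  memset(&st, 0, sizeof st);          /* start in the initial shift state */
  size_t n = 0;
  while (len > 0) {
    wchar_t wc;
    size_t r = mbrtowc(&wc, s, len, &st);
    if (r == (size_t)-1 || r == (size_t)-2) {
      return (size_t)-1;              /* invalid or incomplete sequence */
    }
    if (r == 0) {
      r = 1;                          /* embedded NUL byte: consume it */
    }
    if (n == cap) {
      return (size_t)-1;              /* output buffer full */
    }
    out[n++] = wc;
    s += r;
    len -= r;
  }
  return n;
}
```

Call select_utf8_locale() once at startup and check its result before decoding. On Linux each resulting wchar_t holds a Unicode code point, so the values can be forwarded to the display board directly.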

If you do want to skip the system library routines and do UTF-8 decoding yourself, be careful. I once did a casual survey using Google code search and found that somewhere between 1/3 and 2/3 of the UTF-8 code out in the wild was dangerously wrong. Here is a fully correct, fast, and simple implementation I would highly recommend:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

My implementation in musl is somewhat smaller in binary size and seems to be faster, but it's also a bit harder to understand.
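
To make "dangerously wrong" more concrete, the sketch below shows the checks that careless decoders tend to skip: overlong encodings, surrogate code points, values above U+10FFFF, stray continuation bytes, and truncated input. It is neither the DFA decoder linked above nor the musl implementation, only a simple illustration, and the name utf8_decode_strict is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from s[0..n).
   Returns the number of bytes consumed, or 0 if the input is malformed. */
static size_t utf8_decode_strict(const unsigned char *s, size_t n,
                                 uint32_t *cp) {
  assert(s != NULL && cp != NULL);
  if (n == 0) {
    return 0;
  }
  size_t len;
  uint32_t min;
  uint32_t c = s[0];
  if (c <= 0x7F) {                       /* ASCII */
    *cp = c;
    return 1;
  } else if (c >= 0xC2 && c <= 0xDF) {   /* 2-byte lead (C0/C1 are overlong) */
    len = 2; min = 0x80; c &= 0x1F;
  } else if (c >= 0xE0 && c <= 0xEF) {   /* 3-byte lead */
    len = 3; min = 0x800; c &= 0x0F;
  } else if (c >= 0xF0 && c <= 0xF4) {   /* 4-byte lead (F5..FF never valid) */
    len = 4; min = 0x10000; c &= 0x07;
  } else {
    return 0;                            /* stray continuation or bad lead */
  }
  if (n < len) {
    return 0;                            /* truncated sequence */
  }
  for (size_t k = 1; k < len; k++) {
    if ((s[k] & 0xC0) != 0x80) {
      return 0;                          /* missing continuation byte */
    }
    c = (c << 6) | (s[k] & 0x3Fu);
  }
  if (c < min) {
    return 0;                            /* overlong encoding */
  }
  if (c > 0x10FFFF) {
    return 0;                            /* beyond the Unicode range */
  }
  if (c >= 0xD800 && c <= 0xDFFF) {
    return 0;                            /* UTF-16 surrogate */
  }
  *cp = c;
  return len;
}
```

A decoder that accepts the overlong sequence C0 AF as '/' can let a forbidden character slip past a filter that only scans the raw bytes, which is one reason these checks matter.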

Answer 2:

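To convert an array of bytes encoded as UTF-8 into an array of Unicode code points:

The tricky part is detecting encoding errors: overlong sequences, surrogate code points, values above U+10FFFF, and missing or unexpected continuation bytes.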

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
  uint32_t UnicodePoint;  // Accumulated code point
  uint32_t Min;           // Minimum acceptable codepoint
  int i;                  // Number of continuation bytes still expected
  bool e;                 // Error flag
} UTF_T;

static bool IsSurrogate(unsigned c) {
  return (c >= 0xD800) && (c <= 0xDFFF);
}

// Return true if more bytes needed to complete codepoint
static bool Put8(UTF_T *U, unsigned ch) {
  ch &= 0xFF;
  if (U->i == 0) {
    if (ch <= 0x7F) {
      U->UnicodePoint = ch;
      return false; /* No more needed */
    } else if (ch <= 0xBF) {
      goto fail;
    } else if (ch <= 0xDF) {
      U->Min = 0x80;
      U->UnicodePoint = ch & 0x1F;
      U->i = 1;
    } else if (ch <= 0xEF) {
      U->Min = 0x800;
      U->UnicodePoint = ch & 0x0F;
      U->i = 2;
    } else if (ch <= 0xF7) {
      U->Min = 0x10000;
      U->UnicodePoint = ch & 0x07;
      U->i = 3;
    } else {
      goto fail;
    }
    return true; /* More needed */
  }
  // If expected continuation character missing ...
  if ((ch & (~0x3F)) != 0x80) {
    goto fail;
  }
  U->UnicodePoint <<= 6;
  U->UnicodePoint |= (ch & 0x3F);
  // If last continuation character ...
  if (--(U->i) == 0) {
    // If codepoint out of range ...
    if ((U->UnicodePoint < U->Min) || (U->UnicodePoint > 0x10FFFF) 
        || IsSurrogate(U->UnicodePoint)) {
      goto fail;
    }
    return false /* No more needed */;
  }
  return true; /* More needed */

  fail:
  U->UnicodePoint = UINT32_MAX;  // Sentinel: no valid code point
  U->i = 0;
  U->e = true;
  return false /* No more needed */;
}

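/* Returns false on success, true on error (malformed or truncated input).
   CodePoints must have room for up to Length entries. */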
bool ConvertUTF8toUnicodeCodepoints(const char *UTF8, size_t Length, 
    uint32_t *CodePoints, size_t *OutLen) {
  UTF_T U = { 0 };
  *OutLen = 0;
  for (size_t i = 0; i < Length;) {
    while (Put8(&U, UTF8[i++])) {
      // Needed bytes not available?
      if (i >= Length) {
        return true;
      }
    }
    if (U.e) break;
    CodePoints[(*OutLen)++] = U.UnicodePoint;
  }
  return U.e;
}

This is based on some old code and may not be up to current standards, so corrections are welcome.
It is not the prettiest, with its goto and magic numbers.

What is nice about this approach is that the consumer is separate from the decoder. Instead of storing each code point with CodePoints[(*OutLen)++] = U.UnicodePoint, one could write different consumer code for the UTF_T block, for example to emit UTF-16 (BE or LE), without changing the UTF-8 -> code point part.
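
Such a UTF-16 consumer might look like the sketch below. The function names CodepointToUTF16 and CodepointToUTF16Bytes are hypothetical and not part of the code above; they take an already decoded code point, such as U.UnicodePoint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical consumer: encode one code point as UTF-16 code units.
   Returns the number of units written (1 or 2), or 0 if cp is a
   surrogate or lies above U+10FFFF. */
static size_t CodepointToUTF16(uint32_t cp, uint16_t units[2]) {
  assert(units != NULL);
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    return 0;
  }
  if (cp < 0x10000) {
    units[0] = (uint16_t)cp;           /* BMP: a single code unit */
    return 1;
  }
  cp -= 0x10000;                       /* 20 bits split into 10 + 10 */
  units[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
  units[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
  return 2;
}

/* Hypothetical consumer: serialize one code point as UTF-16 bytes in the
   requested byte order. Returns the number of bytes written (2 or 4),
   or 0 if cp is not encodable. */
static size_t CodepointToUTF16Bytes(uint32_t cp, bool big_endian,
                                    uint8_t out[4]) {
  assert(out != NULL);
  uint16_t units[2];
  size_t n = CodepointToUTF16(cp, units);
  for (size_t k = 0; k < n; k++) {
    uint8_t hi = (uint8_t)(units[k] >> 8);
    uint8_t lo = (uint8_t)(units[k] & 0xFF);
    out[2 * k] = big_endian ? hi : lo;
    out[2 * k + 1] = big_endian ? lo : hi;
  }
  return 2 * n;
}
```

For example, U+20AC becomes the single unit 0x20AC, while U+1F600 becomes the surrogate pair 0xD83D 0xDE00.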

Answer 3:

I would use the Unicode manipulation functions of GLib, an LGPL-licensed utility library. It sounds like g_utf8_to_ucs4() is what you are looking for.

Source: https://stackoverflow.com/questions/19915999/utf-8-to-unicode-converter-for-embeded-system-display

Tags

Linux

upnp