My setup: gcc-4.9.2, UTF-8 environment.
The following C program works with ASCII, but not with UTF-8.
Create input file:
echo -n 'привет мир' > вход
This is more of a corollary to the other answers, but I'll try to explain this from a slightly different angle.
Here is Jonathan Leffler's version of your code, with three slight changes: (1) I made explicit the actual individual bytes in the UTF-8 strings; (2) I modified the width specifier in the sprintf format string to hopefully do what you are actually attempting to do; and, tangentially, (3) I used perror to get a slightly more useful error message when something fails.
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SIZE 40

    int main(void)
    {
        char buf[SIZE + 1];
        char *pat = "\320\277\321\200\320\270\320\262\320\265\321\202"
                    " \320\274\320\270\321\200";  /* "привет мир" */
        char str[SIZE + 3];  /* padding byte + up to SIZE bytes + '\n' + NUL */
        FILE *f1 = fopen("\320\262\321\205\320\276\320\264", "r");  /* "вход" */
        FILE *f2 = fopen("\320\262\321\213\321\205\320\276\320\264", "w");  /* "выход" */

        if (f1 == 0 || f2 == 0)
        {
            perror("Failed to open one or both files");  /* use perror() */
            return(1);
        }

        size_t nbytes;
        if ((nbytes = fread(buf, 1, SIZE, f1)) > 0)
        {
            buf[nbytes] = 0;
            if (strncmp(buf, pat, nbytes) == 0)
            {
                sprintf(str, "%*s\n", 1+(int)nbytes, buf);  /* nbytes+1 length specifier */
                fwrite(str, 1, 1+nbytes, f2);  /* +1 here too */
            }
        }

        fclose(f1);
        fclose(f2);
        return(0);
    }
With a positive numeric width specifier, sprintf pads with spaces on the left, so the literal space you tried to use is superfluous. But the field must be wider than the string you are printing for any padding to actually take place. Note that the width is counted in bytes, not characters, which is why nbytes + 1 produces exactly one leading space here.
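To make the padding rule concrete, here is a minimal sketch. The helper name pad_left is hypothetical (it is not part of the program above), and it uses snprintf rather than sprintf so that a too-small buffer is reported instead of overrun.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: right-align s in a field that is `width` BYTES wide.
 * Returns the number of bytes written (excluding the NUL),
 * or -1 if formatting failed or the result did not fit in dst. */
int pad_left(char *dst, size_t dstsize, const char *s, int width)
{
    int n = snprintf(dst, dstsize, "%*s", width, s);

    if (n < 0 || (size_t)n >= dstsize)
        return -1;
    return n;
}
```

With the three-letter, six-byte string "\320\274\320\270\321\200" ("мир"), a width of 6 adds no padding at all, while a width of 7 adds exactly one leading space. A width of 4, which looks sufficient if you count letters, also adds nothing.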
Just to make this answer self-contained, I will repeat what others have already said. A traditional char is always exactly one byte, but one character in UTF-8 is usually not exactly one byte, except when all your characters are actually ASCII. One of the attractions of UTF-8 is that legacy C code doesn't need to know anything about UTF-8 in order to continue to work, but of course, the assumption that one char is one glyph cannot hold. (As you can see, for example, the glyph п in "привет мир" maps to two bytes, and hence two chars: "\320\277".)
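The byte/character mismatch can be checked directly. The sketch below uses a hypothetical helper, utf8_codepoints, which assumes its input is valid UTF-8 and counts code points (which is close to, but not always the same as, glyphs) by skipping continuation bytes of the form 10xxxxxx.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a NUL-terminated string
 * that is assumed to be valid UTF-8. Every byte that is NOT a
 * continuation byte (bit pattern 10xxxxxx) starts a new code point. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}
```

For the pattern used in the program, strlen reports 19 (nine two-byte Cyrillic letters plus one space), while utf8_codepoints reports 10.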
This is clearly less than ideal, but it demonstrates that you can treat UTF-8 as "just bytes" if your code doesn't particularly care about glyph semantics. If yours does, you are better off switching to wchar_t, as outlined e.g. here: http://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html
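As a rough sketch of the wchar_t route, the standard mbstowcs function converts a multibyte string to wide characters according to the current locale. The wrapper name to_wide is hypothetical, and the result depends on the locale actually in effect, so treat this as an illustration rather than a drop-in replacement.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper around mbstowcs(): convert src to wide characters,
 * always NUL-terminating dst (which has room for dstlen wchar_t values).
 * Returns the number of wide characters stored, excluding the terminator,
 * or (size_t)-1 if dstlen is 0 or src is not valid in the current locale. */
size_t to_wide(wchar_t *dst, size_t dstlen, const char *src)
{
    size_t n;

    if (dstlen == 0)
        return (size_t)-1;

    /* Leave room for the terminator: mbstowcs() does not add one
     * when it fills the whole buffer. */
    n = mbstowcs(dst, src, dstlen - 1);
    if (n == (size_t)-1)
        return n;

    dst[n] = L'\0';
    return n;
}
```

The caller is expected to call setlocale(LC_ALL, "") once at startup so that the environment's locale is used. In a UTF-8 locale, the 19-byte "привет мир" should then come out as 10 wide characters, so one array element corresponds to one character. In a locale that is not UTF-8, the conversion may fail or give a different count.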
However, the standard wchar_t is less than ideal when the standard expectation is UTF-8. See e.g. the GNU libunistring documentation for a less intrusive alternative, and a bit of background. With that, you should be able to replace char with uint8_t and the various str* functions with u8_str* replacements and be done. The assumption that one glyph equals one byte will still need to be addressed, but that becomes a minor technicality in your example program. An adaptation is available at http://ideone.com/p0VfXq (though unfortunately the library is not available on http://ideone.com/ so it cannot be demonstrated there).