Question
Consider this program:
#include <stdio.h>
int main(int argc, char* argv[]) {
    printf("%s\n", argv[1]);
    return 0;
}
I compile it like this:
x86_64-w64-mingw32-gcc -o alpha alpha.c
The problem appears when I give it a non-ASCII argument:
$ ./alpha róisín
r�is�n
How can I write and/or compile this program such that it accepts non-ASCII characters?
To respond to alk: no, the program is printing wrongly. See this example:
$ echo Ω | od -tx1c
0000000 ce a9 0a
316 251 \n
0000003
$ ./alpha Ω | od -tx1c
0000000 4f 0d 0a
O \r \n
0000003
Answer 1:
The easiest way to do this is with wmain (with MinGW-w64, compile with -municode so that wmain is used as the entry point):
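#include <fcntl.h>
#include <io.h>
#include <stdio.h>
#include <wchar.h>
int wmain (int argc, wchar_t** argv) {
    /* Switch stdout to UTF-16 text mode so wide output is not narrowed. */
    _setmode(_fileno(stdout), _O_WTEXT);
    if (argc < 2)
        return 1;
    /* %ls means a wide string in both the Microsoft and ISO C dialects. */
    wprintf(L"%ls\n", argv[1]);
    return 0;
}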
It can also be done with GetCommandLineW; here is a simple version of the code found at the HandBrake repo:
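#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
/* Rebuild argv as UTF-8 from the UTF-16 command line. Returns 0 on success. */
int get_argv_utf8(int* argc_ptr, char*** argv_ptr) {
    int argc, i, offset, size;
    char** argv;
    wchar_t** argv_utf16 = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (argv_utf16 == NULL)
        return -1;
    /* One allocation holds the pointer table (argc entries plus a
       terminating NULL) followed by the UTF-8 strings themselves. */
    offset = (int) ((argc + 1) * sizeof(char*));
    size = offset;
    /* First pass: with no output buffer, the call returns the number of
       bytes needed, including the terminating NUL. */
    for (i = 0; i < argc; i++)
        size += WideCharToMultiByte(CP_UTF8, 0, argv_utf16[i], -1, 0, 0, 0, 0);
    argv = malloc(size);
    if (argv == NULL) {
        LocalFree(argv_utf16);
        return -1;
    }
    /* Second pass: convert each argument into its slot in the block. */
    for (i = 0; i < argc; i++) {
        argv[i] = (char*) argv + offset;
        offset += WideCharToMultiByte(CP_UTF8, 0, argv_utf16[i], -1,
                                      argv[i], size-offset, 0, 0);
    }
    argv[argc] = NULL;
    LocalFree(argv_utf16);
    *argc_ptr = argc;
    *argv_ptr = argv;
    return 0;
}
int main(int argc, char** argv) {
    if (get_argv_utf8(&argc, &argv) != 0 || argc < 2)
        return 1;
    printf("%s\n", argv[1]);
    free(argv);
    return 0;
}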
Answer 2:
Since you're using MinGW (actually MinGW-w64, but that shouldn't matter in this case), you have access to the Windows API, so the following should work for you. It could probably be cleaner and actually tested properly, but it should provide a good idea at the least:
#define _WIN32_WINNT 0x0600
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <windows.h>
int main (void)
{
    int argc;
    int i;
    LPWSTR *argv;
    LPSTR error = NULL;
    argv = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (argv == NULL)
    {
        /* With FORMAT_MESSAGE_ALLOCATE_BUFFER the buffer argument receives a
           pointer to a LocalAlloc'd message, hence the cast of &error. */
        FormatMessageA(
            (
                FORMAT_MESSAGE_ALLOCATE_BUFFER |
                FORMAT_MESSAGE_FROM_SYSTEM |
                FORMAT_MESSAGE_IGNORE_INSERTS),
            NULL,
            GetLastError(),
            0,
            (LPSTR)&error, 0,
            NULL);
        /* Never pass the message itself as the format string. */
        if (error != NULL)
            fprintf(stderr, "%s\n", error);
        LocalFree(error);
        return EXIT_FAILURE;
    }
    for (i = 0; i < argc; ++i)
        wprintf(L"argv[%d]: %ls\n", i, argv[i]);
    // You must free argv using LocalFree!
    LocalFree(argv);
    return 0;
}
Bear in mind this one issue with it: Windows will not compose your strings for you. I use my own Windows keyboard layout that uses combining characters (I'm weird), so when I type
example -o àlf
in my Windows Command Prompt, I get the following output:
argv[0]: example
argv[1]: -o
argv[2]: a\u0300lf
The a\u0300 is U+0061 (LATIN SMALL LETTER A) followed by a representation of the Unicode code point U+0300 (COMBINING GRAVE ACCENT). If I instead use
example -o àlf
which uses the precomposed character U+00E0 (LATIN SMALL LETTER A WITH GRAVE), the output differs:
argv[0]: example
argv[1]: -o
argv[2]: \u00E0lf
where \u00E0 is a representation of the precomposed character à, Unicode code point U+00E0.
I may be an odd person for typing this way, but Vietnamese code page 1258 actually includes combining characters. This shouldn't ordinarily affect filename handling, but it may cause some difficulty.
For arguments that are just strings, you may want to look into normalization with the NormalizeString function. The documentation and examples linked in it should help you to understand how the function works. Normalization and a few other things in Unicode can be a long journey, but if this sort of thing excites you, it's also a fun journey.
Answer 3:
Try compiling and running the following program:
#include <stdio.h>
int main()
{
    int i = 0;
    for( i=0; i<256; i++){
        printf("\nASCII Character #%d:%c ", i, i);
    }
    printf("\n");
    return 0;
}
In your output you should see those little question marks from number 128 onward. FYI, I am using Ubuntu, and when I compile and run this program (with GNOME Terminal) this happens to me as well.
However, if I go to Terminal > Set character encoding... and select Western (WINDOWS-1252) as opposed to Unicode (UTF-8), and rerun the program, the extended ASCII characters display properly.
I don't know the exact steps for Windows/MinGW, but, in short, changing the character encoding should fix your problem.
Source: https://stackoverflow.com/questions/30832756/accept-non-ascii-characters