This is a follow-up to my previous question: Does .NET interop copy array data back and forth, or does it pin the array? My method is a COM interface method (rather than a P/Invoke call to an exported DLL function).
I think this is a good question, and the char (System.Char) interop behavior does deserve some attention.
In managed code, sizeof(char) always equals 2 (two bytes), because in .NET characters are always Unicode (UTF-16).
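As a trivial sketch of that point (nothing here is specific to interop):

using System;

class CharSizeDemo
{
    static void Main()
    {
        // sizeof(char) is a compile-time constant: always 2 bytes,
        // regardless of the OS code page or any marshaling settings.
        Console.WriteLine(sizeof(char)); // prints 2
    }
}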
Nevertheless, the marshaling rules for char differ between P/Invoke (calling an exported DLL API) and COM (calling a COM interface method).
For P/Invoke, CharSet can be used explicitly with any [DllImport] attribute, or implicitly via [module|assembly: DefaultCharSet(CharSet.Auto|Ansi|Unicode)] to change the default setting for all [DllImport] declarations per module or per assembly.
The default value is CharSet.Ansi, which means there will be a Unicode-to-ANSI conversion. I usually change the default to Unicode with [module: DefaultCharSet(CharSet.Unicode)], and then selectively use [DllImport(CharSet = CharSet.Ansi)] in those rare cases where I need to call an ANSI API (sketched below).
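For instance, a minimal sketch of that setup might look like this. MessageBox is just a familiar Win32 API used to show the Unicode default kicking in; "Legacy.dll" and LegacyAnsiApi are made-up placeholders for some ANSI-only export:

using System;
using System.Runtime.InteropServices;

// Make Unicode the default for every [DllImport] in this module.
[module: DefaultCharSet(CharSet.Unicode)]

static class NativeMethods
{
    // Inherits CharSet.Unicode from the module default, so the string
    // arguments are marshaled as UTF-16 (resolves to MessageBoxW).
    [DllImport("user32.dll")]
    internal static extern int MessageBox(IntPtr hWnd, string text,
        string caption, uint type);

    // Selectively switched back to ANSI; the string is converted to the
    // OS ANSI code page before the call. The DLL and export name here
    // are purely illustrative.
    [DllImport("Legacy.dll", CharSet = CharSet.Ansi)]
    internal static extern bool LegacyAnsiApi(string text);
}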
It is also possible to alter the marshaling of any specific char-typed parameter with MarshalAs(UnmanagedType.U1|U2), or with MarshalAs(UnmanagedType.LPArray, ArraySubType = UnmanagedType.U1|U2) for a char[] parameter. E.g., you may have something like this:
[DllImport("Test.dll", ExactSpelling = true, CharSet = CharSet.Unicode)]
static extern bool TestApi(
int length,
[In, Out, MarshalAs(UnmanagedType.LPArray] char[] buff1,
[In, Out, MarshalAs(UnmanagedType.LPArray,
ArraySubType = UnmanagedType.U1)] char[] buff2);
In this case, buff1 will be passed as an array of double-byte values (as is), but buff2 will be converted to and from an array of single-byte values. Note that this will still be a smart, Unicode-to-OS-current-code-page (and back) conversion for buff2. E.g., the Unicode character '\x20AC' (€) will become \x80 in the unmanaged code (provided the OS code page is Windows-1252). This is how marshaling of [MarshalAs(UnmanagedType.LPArray, ArraySubType = UnmanagedType.U1)] char[] buff differs from [MarshalAs(UnmanagedType.LPArray, ArraySubType = UnmanagedType.U1)] ushort[] buff. For ushort, 0x20AC would simply be truncated to 0xAC, as the sketch below illustrates.
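To make that concrete, here is a small sketch that mimics the two behaviors outside of interop (assuming Windows-1252 as the ANSI code page; this only illustrates the conversions, it is not the marshaler itself):

using System;
using System.Text;

class ConversionDemo
{
    static void Main()
    {
        // Roughly what the marshaler does for the U1 char[] case: a real
        // code-page conversion. On .NET Core / .NET 5+ this requires
        // registering CodePagesEncodingProvider first; on .NET Framework
        // it works as-is.
        Encoding ansi = Encoding.GetEncoding(1252);
        byte[] bytes = ansi.GetBytes(new[] { '\u20AC' }); // the € sign
        Console.WriteLine("0x{0:X2}", bytes[0]);          // 0x80

        // The ushort[] case has no code-page awareness; the value is
        // simply narrowed: 0x20AC -> 0xAC.
        ushort value = 0x20AC;
        Console.WriteLine("0x{0:X2}", (byte)value);       // 0xAC
    }
}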
For calling a COM interface method, the story is quite different. There, char is always treated as a double-byte value representing a Unicode character. Perhaps the reason for such a design decision can be inferred from Don Box's "Essential COM" (quoting the footnote from this page):

"The OLECHAR type was chosen in favor of the common TCHAR data type used by the Win32 API to alleviate the need to support two versions of each interface (CHAR and WCHAR). By supporting only one character type, object developers are decoupled from the state of the UNICODE preprocessor symbol used by their clients."
Apparently, the same concept made its way to .NET. I'm pretty confident this is true even for legacy ANSI platforms (like Windows 95, where Marshal.SystemDefaultCharSize == 1).
Note that DefaultCharSet doesn't have any effect on char when it's part of a COM interface method signature, nor is there a way to apply CharSet explicitly. However, you still have full control over the marshaling behavior of each individual parameter with MarshalAs, in exactly the same way as for P/Invoke above. E.g., your Next method might look like the following, in case the unmanaged COM code expects a buffer of ANSI characters:
void Next(ref int pcch,
    [In, Out, MarshalAs(UnmanagedType.LPArray,
        ArraySubType = UnmanagedType.U1, SizeParamIndex = 0)] char[] pchText);
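For context, here is a sketch of how such a declaration might sit inside an imported COM interface; the interface name and the IID below are placeholders I've made up, not taken from your actual interface:

using System;
using System.Runtime.InteropServices;

[ComImport]
[Guid("00000000-0000-0000-0000-000000000000")] // hypothetical IID
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
interface ITextEnumerator // hypothetical interface name
{
    // pcch (parameter index 0) carries the character count, which is why
    // SizeParamIndex = 0; pchText is marshaled as single-byte ANSI chars.
    void Next(ref int pcch,
        [In, Out, MarshalAs(UnmanagedType.LPArray,
            ArraySubType = UnmanagedType.U1, SizeParamIndex = 0)] char[] pchText);
}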