I'm trying to use the Japanese morphological analyzer MeCab in a C# program (Visual Studio 2010 Express, Windows 7), and something's going wrong with the encoding. If my input (pasted into a textbox) is this:
一方、広義の「ネコ」は、ネコ類(ネコ科動物)の一部、あるいはその全ての獣を指す包括的名称を指す。
Then my output (in another textbox) looks like this:
? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ( åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ) åè©ž,サ変接続,*,*,*,*,* ? åè©ž,サ変接続,*,*,*,*,* ????????????????????????? åè©ž,サ変接続,*,*,*,*,* EOS
I would guess that that's text in some other encoding being mistaken for UTF-8-encoded text. But assuming that it's EUC-JP and using Encoding.Convert to turn it into UTF-8 doesn't change the output; assuming that it's Shift-JIS and doing the same gives different gibberish. Also, while it's definitely processing the text - that's how MeCab output is supposed to be formatted - it doesn't appear to be interpreting the input as UTF-8, either. If it were doing so, there wouldn't be all those identical lines in the output starting with one-character "compounds," which it's clearly unable to identify.
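One observation that may help narrow this down: the `åè©ž` fragments are consistent with UTF-8 bytes being decoded through a single-byte Western code page. The MeCab feature label 名詞 (noun) is E5 90 8D E8 A9 9E in UTF-8, and Windows-1252 renders the bytes E5, E8, A9 and 9E as å, è, © and ž. The sketch below reproduces the effect. It assumes ISO-8859-1 as a stand-in for the system ANSI code page (the two agree on E5, E8 and A9), and the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

class MojibakeDemo
{
    // Decodes the UTF-8 bytes of `text` with a single-byte encoding,
    // mimicking a marshaler that treats UTF-8 output as ANSI text.
    // ISO-8859-1 is an assumption standing in for the real ANSI code page.
    public static string DecodeUtf8AsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }

    static void Main()
    {
        string garbled = DecodeUtf8AsLatin1("名詞");

        // Each kanji takes three UTF-8 bytes, so two characters become six.
        Console.WriteLine(garbled.Length); // prints "6"

        // \u00E5 = å, \u00E8 = è, \u00A9 = ©
        Console.WriteLine(garbled[0] == '\u00E5'
                          && garbled[3] == '\u00E8'
                          && garbled[4] == '\u00A9'); // prints "True"
    }
}
```

If that reading is right, the output side is UTF-8 that gets decoded as ANSI on its way back into a .NET string, which is a marshaling problem rather than a dictionary problem.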
I get yet another different-looking set of gibberish when I run the sentence through MeCab's command line. But, again, it's just a row of single question marks and parentheses going down the left, so it's not just the problem that the Windows command line doesn't support fonts with Japanese characters; again, it's just not reading the input in as UTF-8. (I did install MeCab in UTF-8 mode.)
The relevant parts of the code look like this:
[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
private extern static IntPtr mecab_new2(string arg);

[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
[return: MarshalAs(UnmanagedType.AnsiBStr)]
private extern static string mecab_sparse_tostr(IntPtr m, string str);

[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
private extern static void mecab_destroy(IntPtr m);

private string meCabParse(string jpnText)
{
    IntPtr mecab = mecab_new2("");
    string parsedText = mecab_sparse_tostr(mecab, jpnText);
    mecab_destroy(mecab);
    return parsedText;
}
(In terms of fiddling with plausible-looking things to see if they make a difference, I've tried switching "UnmanagedType.AnsiBStr" to "UnmanagedType.BStr," which gives the error "AccessViolationException was unhandled," and adding "CharSet=CharSet.Unicode" to the DllImport parameters, which turned the output into just "EOS".)
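A likely explanation for the bare "EOS" result, offered as a sketch rather than a confirmed diagnosis: with `CharSet.Unicode` the input string is handed to the native function as UTF-16, and the first character of the sample sentence, 一 (U+4E00), begins with a zero byte in little-endian UTF-16. A C function that expects a null-terminated `char*` would then see an empty string. The helper name below is hypothetical.

```csharp
using System;
using System.Text;

class Utf16AsCStringDemo
{
    // Number of bytes a C function would read before the first zero byte.
    public static int CStringLength(byte[] bytes)
    {
        int length = 0;
        while (length < bytes.Length && bytes[length] != 0)
            length++;
        return length;
    }

    static void Main()
    {
        string text = "一方"; // first two characters of the sample sentence

        // UTF-16LE, which is how CharSet.Unicode marshals a string: bytes 00 4E B9 65
        Console.WriteLine(CStringLength(Encoding.Unicode.GetBytes(text))); // prints "0"

        // UTF-8 never contains a zero byte for these characters
        Console.WriteLine(CStringLength(Encoding.UTF8.GetBytes(text)));    // prints "6"
    }
}
```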
This is how I've been doing the conversion:
// 65001 = UTF-8 codepage, 20932 = EUC-JP codepage
private string convertEncoding(string sourceString, int sourceCodepage, int targetCodepage)
{
    Encoding sourceEncoding = Encoding.GetEncoding(sourceCodepage);
    Encoding targetEncoding = Encoding.GetEncoding(targetCodepage);

    // convert source string into byte array
    byte[] sourceBytes = sourceEncoding.GetBytes(sourceString);

    // convert those bytes into target encoding
    byte[] targetBytes = Encoding.Convert(sourceEncoding, targetEncoding, sourceBytes);

    // byte array to char array
    char[] targetChars = new char[targetEncoding.GetCharCount(targetBytes, 0, targetBytes.Length)];

    // char array to target-encoded string
    targetEncoding.GetChars(targetBytes, 0, targetBytes.Length, targetChars, 0);
    string targetString = new string(targetChars);

    return targetString;
}

private string meCabParse(string jpnText)
{
    // convert the text from UTF-8 to EUC-JP
    jpnText = convertEncoding(jpnText, 65001, 20932);

    IntPtr mecab = mecab_new2("");
    string parsedText = mecab_sparse_tostr(mecab, jpnText);

    // annnd convert back to UTF-8
    parsedText = convertEncoding(parsedText, 20932, 65001);

    mecab_destroy(mecab);
    return parsedText;
}
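A note on why this kind of conversion tends to have no visible effect: a .NET `string` is always UTF-16 internally, so encoding it, converting the bytes, and decoding them again returns an equal string whenever the text is representable in both encodings. The sketch below condenses the same steps as `convertEncoding`. It uses UTF-8 and big-endian UTF-16 only because both are built into every runtime, which is an assumption made for portability; the class name is hypothetical.

```csharp
using System;
using System.Text;

class RoundTripDemo
{
    // Same steps as convertEncoding, condensed: encode, convert, decode.
    public static string ConvertEncoding(string source, Encoding from, Encoding to)
    {
        byte[] sourceBytes = from.GetBytes(source);
        byte[] targetBytes = Encoding.Convert(from, to, sourceBytes);
        return to.GetString(targetBytes);
    }

    static void Main()
    {
        string text = "ネコ";
        string converted = ConvertEncoding(text, Encoding.UTF8, Encoding.BigEndianUnicode);

        // The string comes back unchanged; only the intermediate bytes differed.
        Console.WriteLine(converted == text); // prints "True"

        // The encoding only exists at the byte level, so bytes are what
        // would have to cross the native boundary.
        Console.WriteLine(Encoding.UTF8.GetBytes(text).Length); // prints "6"
    }
}
```

In other words, the encoding has to be chosen at the point where the string becomes bytes for the native call, not by transforming one managed string into another.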
Suggestions/taunts?
I came across this thread while looking for a way to do the same thing. I used your code as a starting point, and this blog post to figure out how to marshal UTF-8 strings.
The following code gives me properly encoded output:
using System;
using System.Runtime.InteropServices;
using System.Text;

public class Mecab
{
    // mecab_new2 takes a narrow (char*) option string, so it is marshaled as ANSI.
    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Ansi)]
    private extern static IntPtr mecab_new2(string arg);

    // The input is passed as raw UTF-8 bytes; the result points to memory owned by MeCab.
    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
    private extern static IntPtr mecab_sparse_tostr(IntPtr m, byte[] str);

    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
    private extern static void mecab_destroy(IntPtr m);

    public static String Parse(String input)
    {
        IntPtr mecab = mecab_new2("");
        // Append "\0" so the native side sees a null-terminated C string.
        IntPtr nativeStr = mecab_sparse_tostr(mecab, Encoding.UTF8.GetBytes(input + "\0"));
        // Copy the result before mecab_destroy releases the buffer it lives in.
        int size = nativeArraySize(nativeStr);
        byte[] data = new byte[size];
        Marshal.Copy(nativeStr, data, 0, size);
        mecab_destroy(mecab);
        // Drop the newline that follows the final EOS marker.
        return Encoding.UTF8.GetString(data).TrimEnd('\n');
    }

    // Counts the bytes before the terminating null of a native string.
    private static int nativeArraySize(IntPtr ptr)
    {
        int size = 0;
        while (Marshal.ReadByte(ptr, size) > 0)
            size++;
        return size;
    }
}
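For anyone who wants to try the pointer-reading part without MeCab installed, here is a minimal sketch of the same idea: walk the bytes up to the terminating null, copy them, and decode as UTF-8. `AllocUtf8` is a hypothetical stand-in for the native call; it just places null-terminated UTF-8 in unmanaged memory. Both helper names are made up for this example.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

class NativeUtf8Demo
{
    // Reads a null-terminated UTF-8 string starting at ptr.
    public static string ReadUtf8(IntPtr ptr)
    {
        if (ptr == IntPtr.Zero)
            return null;

        int length = 0;
        while (Marshal.ReadByte(ptr, length) != 0)
            length++;

        byte[] buffer = new byte[length];
        Marshal.Copy(ptr, buffer, 0, length);
        return Encoding.UTF8.GetString(buffer);
    }

    // Stand-in for a native function: copies null-terminated UTF-8
    // into unmanaged memory. The caller must free it with FreeHGlobal.
    public static IntPtr AllocUtf8(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text + "\0");
        IntPtr ptr = Marshal.AllocHGlobal(bytes.Length);
        Marshal.Copy(bytes, 0, ptr, bytes.Length);
        return ptr;
    }

    static void Main()
    {
        string expected = "名詞,サ変接続\nEOS\n";
        IntPtr ptr = AllocUtf8(expected);
        try
        {
            Console.WriteLine(ReadUtf8(ptr) == expected); // prints "True"
        }
        finally
        {
            Marshal.FreeHGlobal(ptr);
        }
    }
}
```

Newer .NET runtimes also provide `Marshal.PtrToStringUTF8`, which does the same job; the manual loop is only needed on older frameworks such as the one that ships with Visual Studio 2010.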
Source: https://stackoverflow.com/questions/6365931/trying-to-get-libmecab-dll-mecab-to-work-with-c-sharp