Question
I'm hitting an issue converting a UTF-8 string to a UCS-2 string with the ICU library. The library offers several ways to do this, but so far none of them seem to work. Given how popular the library is, I assume I'm doing something wrong.
First, the common code. In all cases I create a string and pass it along on an object, but nothing manipulates it before it reaches the conversion steps.
The UTF-8 string currently being used is simply "ĩ".
For simplicity, the string is represented as uniString in this code:
UErrorCode resultCode = U_ZERO_ERROR;
UConverter* m_pConv = ucnv_open("ISO-8859-1", &resultCode);
// Change the callback to error out instead of the default
const void* oldContext;
UConverterFromUCallback oldFromAction;
UConverterToUCallback oldToAction;
ucnv_setFromUCallBack(m_pConv, UCNV_FROM_U_CALLBACK_STOP, NULL, &oldFromAction, &oldContext, &resultCode);
ucnv_setToUCallBack(m_pConv, UCNV_TO_U_CALLBACK_STOP, NULL, &oldToAction, &oldContext, &resultCode);
int32_t outputLength = 0;
int bodySize = uniString.length();
int targetSize = bodySize * 4;
char* target = new char[targetSize];
printf("Body: %s\n", uniString.c_str());
if (U_SUCCESS(resultCode))
{
    // outputLength = ucnv_convert("ISO-8859-1", "UTF-8", target, targetSize, uniString.c_str(), bodySize, &resultCode);
    outputLength = ucnv_fromAlgorithmic(m_pConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
                                        uniString.length(), &resultCode);
    ucnv_close(m_pConv);
}
printf("ISO-8859-1 DGF just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(),
       outputLength ? target : "invalid_char", resultCode, outputLength);
if (resultCode == U_INVALID_CHAR_FOUND || resultCode == U_ILLEGAL_CHAR_FOUND || resultCode == U_TRUNCATED_CHAR_FOUND)
{
    if (resultCode == U_INVALID_CHAR_FOUND)
    {
        printf("Unmapped input character, cannot be converted to Latin1");
        m_pConv = ucnv_open("UCS-2", &resultCode);
        if (U_SUCCESS(resultCode))
        {
            // outputLength = ucnv_convert("UCS-2", "UTF-8", target, targetSize, uniString.c_str(), bodySize, &resultCode);
            outputLength = ucnv_fromAlgorithmic(m_pConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
                                                uniString.length(), &resultCode);
            ucnv_close(m_pConv);
        }
        printf("UCS-2 DGF just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(),
               outputLength ? target : "invalid_char", resultCode, outputLength);
        if (U_SUCCESS(resultCode))
        {
            pdus = SegmentText(target, pText, SEGMENT_SIZE_UNICODE_MAX, true);
        }
    }
    else
    {
        printf("DecodeText(): Text contents does not appear to be valid UTF-8");
    }
}
else
{
    printf("DecodeText(): Text successfully converted to Latin1");
    std::string newBody(target, outputLength);
    pdus = SegmentText(newBody, pPdu, SEGMENT_SIZE_MAX);
}
The problem is that the ucnv_fromAlgorithmic function reports the error U_INVALID_CHAR_FOUND for the UCS-2 conversion. That makes sense for the ISO-8859-1 attempt, but not for the UCS-2 one.
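For reference, here is a minimal sketch of the code point arithmetic behind that expectation. It assumes the input is the two-byte UTF-8 encoding of "ĩ" (bytes 0xC4 0xA9, which decode to U+0129). The helper names are made up for illustration and are not part of ICU, and the decoder only handles two-byte sequences without rejecting overlong forms:

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not ICU): decode exactly one two-byte UTF-8 sequence
// of the form 110xxxxx 10xxxxxx. Returns 0xFFFFFFFF for any other input.
inline std::uint32_t decodeTwoByteUtf8(const std::string& bytes) {
    if (bytes.size() != 2) return 0xFFFFFFFFu;
    const unsigned char lead = static_cast<unsigned char>(bytes[0]);
    const unsigned char trail = static_cast<unsigned char>(bytes[1]);
    if ((lead & 0xE0) != 0xC0 || (trail & 0xC0) != 0x80) return 0xFFFFFFFFu;
    // 5 payload bits from the lead byte, 6 from the trail byte.
    return (static_cast<std::uint32_t>(lead & 0x1F) << 6) | (trail & 0x3F);
}

// ISO-8859-1 covers only U+0000..U+00FF.
inline bool fitsLatin1(std::uint32_t codePoint) { return codePoint <= 0xFF; }

// UCS-2 covers the Basic Multilingual Plane, excluding surrogate code points.
inline bool fitsUcs2(std::uint32_t codePoint) {
    return codePoint <= 0xFFFF && !(codePoint >= 0xD800 && codePoint <= 0xDFFF);
}
```

With these helpers, decodeTwoByteUtf8("\xC4\xA9") gives 0x129, which fails fitsLatin1 but passes fitsUcs2, so an unmapped-character error is only expected from the ISO-8859-1 attempt.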
The other attempt was to use ucnv_convert, which you can see is commented out. That function did perform a conversion, but it didn't fail on the ISO-8859-1 attempt as it should have.
So the question is: does anyone with experience of these functions see something incorrect in the code, or is my assumption about how this character should convert wrong?
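Answer 1:
You need to reset resultCode to U_ZERO_ERROR before calling ucnv_open for the UCS-2 converter, because the failed ISO-8859-1 attempt leaves U_INVALID_CHAR_FOUND in it. Quote from the manual: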
"ICU functions that take a reference (C++) or a pointer (C) to a UErrorCode first test if(U_FAILURE(errorCode)) { return immediately; } so that in a chain of such functions the first one that sets an error code causes the following ones to not perform any operation"
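The effect of that convention on a Latin-1-then-UCS-2 fallback can be sketched with a small toy model. Everything below (Status, openConverter, convert, latin1ThenUcs2) is a hypothetical stand-in written for illustration, not ICU API. It only mimics the "return immediately if the error code already indicates failure" behaviour described in the quote:

```cpp
#include <string>

// Hypothetical stand-in for UErrorCode; not an ICU type.
enum class Status { Ok, InvalidChar };

inline bool failed(Status s) { return s != Status::Ok; }

// Mimics an ICU-style "open" call: does nothing if *status already holds a failure.
inline bool openConverter(const std::string& name, Status* status) {
    if (failed(*status)) return false;  // early return, as in the quoted convention
    return !name.empty();
}

// Mimics a conversion into an encoding that cannot represent code points
// above maxCodePoint. Returns the number of characters converted.
inline int convert(bool converterOpen, unsigned codePoint, unsigned maxCodePoint, Status* status) {
    if (failed(*status) || !converterOpen) return 0;  // skipped on a stale error
    if (codePoint > maxCodePoint) {
        *status = Status::InvalidChar;
        return 0;
    }
    return 1;
}

// Tries Latin-1 first and falls back to UCS-2; resetBetween toggles the fix.
inline Status latin1ThenUcs2(unsigned codePoint, bool resetBetween) {
    Status status = Status::Ok;
    bool open = openConverter("ISO-8859-1", &status);
    convert(open, codePoint, 0xFF, &status);
    if (!failed(status)) return status;  // Latin-1 was enough

    if (resetBetween) status = Status::Ok;  // the step the answer asks for
    open = openConverter("UCS-2", &status);
    convert(open, codePoint, 0xFFFF, &status);
    return status;
}
```

In this model, latin1ThenUcs2(0x129, false) still ends in Status::InvalidChar because the second open and convert calls are skipped, while latin1ThenUcs2(0x129, true) ends in Status::Ok. In the question's code, the corresponding change would be to assign resultCode = U_ZERO_ERROR; just before ucnv_open("UCS-2", &resultCode).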
Source: https://stackoverflow.com/questions/22209841/utf-8-to-ucs-2-conversion-with-icu-library