I am writing a web crawler to fetch some Chinese web files. The fetched files are encoded in UTF-8, and I need to read those files to parse them, such as extracting the UR
If you need to display the characters correctly, you can use GNU libiconv. If you only need to process URLs, std::string works fine: the problem is the Windows console's code page, not the string itself. How locale behaves depends on the OS and on the standard C++ library's implementation, so I don't encourage using it.
Windows' MultiByteToWideChar may help, but check Microsoft's documentation on how these functions perform conversions on strings.
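To illustrate why std::string is enough for URL processing, here is a minimal sketch. It assumes the links appear as double-quoted href attributes. The function name extract_hrefs is made up for this example and is not part of any library. The search is done on raw bytes, which is safe because every byte of a multi-byte UTF-8 sequence is 0x80 or above and so cannot collide with the ASCII marker.

```cpp
#include <string>
#include <vector>

// Collect the values of href="..." attributes from a UTF-8 encoded page.
// Hypothetical helper for illustration: it only handles double-quoted
// attributes and does no HTML parsing beyond a plain substring search.
std::vector<std::string> extract_hrefs(const std::string& utf8_html)
{
    static const std::string marker = "href=\"";
    std::vector<std::string> urls;
    std::string::size_type pos = 0;
    while ((pos = utf8_html.find(marker, pos)) != std::string::npos) {
        const std::string::size_type start = pos + marker.size();
        const std::string::size_type end = utf8_html.find('"', start);
        if (end == std::string::npos)
            break;  // unterminated attribute: stop scanning
        // The bytes between the quotes are copied untouched, so any
        // Chinese characters inside the URL stay valid UTF-8.
        urls.push_back(utf8_html.substr(start, end - start));
        pos = end + 1;
    }
    return urls;
}
```

The returned strings are still UTF-8, so they only need converting if you want to print them on a console that uses a different code page.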
This code may help (it was compiled with VC++ 2010). I tested it with a UTF-8 file containing non-Latin characters and it seems to work, but I don't know whether it will work with Chinese characters. Check the following links for more information: _setmode and codecvt_utf8.
#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <codecvt>
#include <fcntl.h>
#include <io.h>
using namespace std; // Sorry for this!
void read_all_lines(const wchar_t *filename)
{
    wifstream wifs;
    wstring txtline;
    int c = 0;

    wifs.open(filename);
    if(!wifs.is_open())
    {
        wcerr << L"Unable to open file" << endl;
        return;
    }
    // We are going to read an UTF-8 file
    wifs.imbue(locale(wifs.getloc(), new codecvt_utf8<wchar_t, 0x10ffff, consume_header>()));
    while(getline(wifs, txtline))
        wcout << ++c << L'\t' << txtline << L'\n';
    wcout << endl;
}
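// wmain is the wide-character entry point in VC++; it is what _tmain
// expands to in a Unicode build, and it does not require <tchar.h>.
int wmain(int argc, wchar_t* argv[])
{
    // Console output will be UTF-16 characters
    _setmode(_fileno(stdout), _O_U16TEXT);
    if(argc < 2)
    {
        wcerr << L"Filename expected!" << endl;
        return 1;
    }
    read_all_lines(argv[1]);
    return 0;
}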
If Chinese characters don't look as expected, make sure the console is using a font that contains the required glyphs (i.e. don't use bitmap fonts).
In general, use the w variants (wstring, wfstream, wcout), set your locales to match the requirements, and hang an L on the front of string literals. locale::global(locale("")) sets up to match the environment default; then imbue each stream that isn't running according to that default, e.g. wcout.imbue(locale("Chinese_China.936")) might be Microsoft's name for your terminal's locale settings. This has always been enough to do what I want; hope it works as well for you.
#include <iostream>
#include <locale>
using namespace std;
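int main() {
    // Use the environment's default locale for wide-character I/O
    locale::global(locale(""));
    wstring word;
    // Echo each whitespace-separated word until end of input
    while (wcin >> word)
        wcout << word << L'\n';
    wcout << L"好運\n";
}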
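One caveat with named locales: names such as "Chinese_China.936" are platform-specific, and the std::locale constructor throws std::runtime_error when the name is not known to the runtime. A minimal sketch of a fallback, assuming you have a list of candidate names to try (the helper name first_available_locale is made up for this example):

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Return the first locale in `candidates` that the runtime can construct,
// or the classic "C" locale if none of the names is available.
// Hypothetical helper for illustration, not part of the standard library.
std::locale first_available_locale(const std::vector<std::string>& candidates)
{
    for (const std::string& name : candidates) {
        try {
            return std::locale(name);
        } catch (const std::runtime_error&) {
            // This name is not known on this system; try the next one.
        }
    }
    return std::locale::classic();
}
```

You could then write something like wcout.imbue(first_available_locale({"Chinese_China.936", "zh_CN.UTF-8"})), where the second name is only a guess at what a Linux system might call the equivalent locale.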