How to read a UTF-8 encoded file containing Chinese characters and output them correctly on the console?

生来不讨喜 2020-12-10 07:55

I am writing a web crawler to fetch some Chinese web pages. The fetched files are encoded in UTF-8, and I need to read those files to parse them, such as extracting the UR

3 Answers
  • 2020-12-10 08:44

    If you need to display the characters correctly, you can use GNU libiconv. If you only need to process URLs, std::string works fine. The problem is the Windows console's code page, not the string itself. How locales behave depends on the OS and on the C++ standard library implementation, so I don't recommend relying on them.

    Windows' MultiByteToWideChar may help, but check Microsoft's documentation on how these functions convert strings.
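
    The point that std::string is enough for URL processing can be sketched as follows. This is only an illustration: extract_hrefs is a hypothetical helper, and it assumes links appear as double-quoted href attributes. It works on raw bytes because the delimiters are ASCII and every byte of a multi-byte UTF-8 character is 0x80 or above.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: collect the values of href="..." attributes from a
// UTF-8 encoded page held in a std::string (raw bytes, no conversion).
// The search is byte-wise; ASCII delimiters cannot match inside a
// multi-byte UTF-8 character, so Chinese text elsewhere is left untouched.
std::vector<std::string> extract_hrefs(const std::string& html)
{
    const std::string key = "href=\"";
    std::vector<std::string> urls;
    std::string::size_type pos = 0;
    while ((pos = html.find(key, pos)) != std::string::npos) {
        const std::string::size_type start = pos + key.size();
        const std::string::size_type end = html.find('"', start);
        if (end == std::string::npos)
            break;  // unterminated attribute: stop scanning
        urls.push_back(html.substr(start, end - start));
        pos = end + 1;
    }
    return urls;
}
```

    The returned strings are still UTF-8; they only need converting when they are printed to a console that expects another encoding.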

  • 2020-12-10 08:49

    This code may help (it was compiled with VC++ 2010). I tested it with a UTF-8 file containing non-Latin characters and it seems to work, but I don't know whether it will work with Chinese characters. See the documentation for _setmode and codecvt_utf8 for more information.

    #include <iostream>
    #include <fstream>
    #include <string>
    #include <locale>
    #include <codecvt>
    #include <fcntl.h>
    #include <io.h>
    
    using namespace std;    // Sorry for this!
    
    void read_all_lines(const wchar_t *filename)
    {
        wifstream wifs;
        wstring txtline;
        int c = 0;
    
        wifs.open(filename);
        if(!wifs.is_open())
        {
            wcerr << L"Unable to open file" << endl;
            return;
        }
        // We are going to read a UTF-8 file (consume_header skips a leading BOM)
        wifs.imbue(locale(wifs.getloc(), new codecvt_utf8<wchar_t, 0x10ffff, consume_header>()));
        while(getline(wifs, txtline))
            wcout << ++c << L'\t' << txtline << L'\n';
        wcout << endl;
    }
    
    // Microsoft-specific wide-character entry point: argv arrives as wchar_t strings
    int wmain(int argc, wchar_t* argv[])
    {
        // Console output will be UTF-16 characters
        _setmode(_fileno(stdout), _O_U16TEXT);
        if(argc < 2)
        {
            wcerr << L"Filename expected!" << endl;
            return 1;
        }
        read_all_lines(argv[1]);
        return 0;
    }
    

    If Chinese characters don't look as expected, make sure the console is using a font that contains the required glyphs (i.e. don't use raster/bitmap fonts).
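
    To show what the UTF-8 conversion step does conceptually, here is a hand-written decoder. It is not codecvt_utf8 and not how the library implements it; decode_utf8 is a hypothetical name, and the sketch is simplified (it does not reject overlong forms or surrogate values). It skips a leading byte order mark, similar in spirit to consume_header.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into Unicode code points.
// A leading BOM (EF BB BF) is skipped. Malformed sequences are replaced
// with U+FFFD. Simplified: overlong encodings are not rejected.
std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0)
        i = 3;  // skip the byte order mark
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= bytes.size() ||
                (static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80) {
                ok = false;  // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}
```

    Most Chinese characters take three bytes in UTF-8 and decode to a single code point below U+10000, which is why one wchar_t per character is enough for them on Windows.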

  • 2020-12-10 08:59

    In general, use the wide variants (wstring, wfstream, wcout), set your locales to match the requirements, and put an L prefix on string literals. locale::global(locale("")) sets the global locale to the environment's default. Then imbue each stream that should not follow that default, e.g. wcout.imbue(locale("Chinese_China.936")), which might be Microsoft's name for your terminal's locale settings. This has always been enough for what I need; I hope it works as well for you.

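    #include <iostream>
    #include <locale>
    #include <string>
    using namespace std;
    int main() {
      // Set the global locale from the environment's defaults
      locale::global(locale(""));
      wstring word;
      // Echo each whitespace-separated word read from standard input
      while (wcin >> word)
        wcout << word << L'\n';
      wcout << L"好運\n";
    }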
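
    Because locale names differ between platforms, constructing one by name can throw std::runtime_error. A minimal sketch of a guarded version follows; locale_or_classic is a hypothetical helper, and falling back to the classic "C" locale is just one possible choice.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: try to construct the named locale and fall back to
// the classic "C" locale when the platform does not recognize the name.
std::locale locale_or_classic(const char* name)
{
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

    It could then be used as wcout.imbue(locale_or_classic("Chinese_China.936")), so the program still runs on systems where that name is unknown, although the output may then not be converted as intended.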
    