Question
I need to read a file containing Cyrillic text, pick a random number of lines from it at random, and write the resulting text to a different file. This works with Latin letters, but with Cyrillic text I get garbage. Here is what I tried.
Say the file input.txt is:
ааааааа
ббббббб
ввввввв
I have to read it, and put every line into a vector:
vector<wstring> inputVector;
wstring inputString, result;
wifstream inputStream;
inputStream.open("input.txt");
while(!inputStream.eof())
{
    getline(inputStream, inputString);
    inputVector.push_back(inputString);
}
inputStream.close();
srand(time(NULL));
int numLines = rand() % inputVector.size();
for(int i = 0; i < numLines; i++)
{
    int randomLine = rand() % inputVector.size();
    result += inputVector[randomLine];
}
wofstream resultStream;
resultStream.open("result.txt");
resultStream << result;
resultStream.close();
So how can I work with Cyrillic text so that the output is readable, not just garbage symbols?
Answer 1:
Because you saw something like ■a a a a a a a 1♦1♦1♦1♦1♦1♦1♦ 2♦2♦2♦2♦2♦2♦2♦ printed to the console, it appears that input.txt is encoded in UTF-16, probably UTF-16 LE with a BOM. You can use your original code if you change the encoding of the file to UTF-8.
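One way to check that guess is to look at the first bytes of the file. The sketch below is only an illustration: the Bom enum and the detectBom function are hypothetical names, not part of any library, and it deliberately ignores UTF-32 (whose little-endian BOM also begins with FF FE).

```cpp
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper type: which byte order mark, if any, starts the file.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Reads up to three raw bytes and compares them with the well-known BOMs.
Bom detectBom(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    unsigned char b[3] = {0, 0, 0};
    in.read(reinterpret_cast<char*>(b), 3);
    const std::streamsize n = in.gcount(); // bytes actually read (0 if open failed)

    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return Bom::Utf8;
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return Bom::Utf16LE;
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return Bom::Utf16BE;
    return Bom::None;
}
```

If detectBom("input.txt") returns Bom::Utf16LE, the diagnosis above is very likely right. A result of Bom::None proves nothing either way, since UTF-8 and UTF-16 files are often saved without a BOM.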
The reason for using UTF-8 is that, regardless of the char type of the file stream, basic_fstream's underlying basic_filebuf uses a codecvt object to convert between a stream of char objects (the file contents) and a stream of objects of the stream's own char type. That is, when reading, the char stream read from the file is converted to a wchar_t stream, and when writing, a wchar_t stream is converted to a char stream that is then written to the file. In the case of std::wifstream, the codecvt object is an instance of the standard std::codecvt<wchar_t, char, mbstate_t>, which converts between the narrow multibyte encoding of the stream's locale and wchar_t. With a UTF-8 locale that generally means UTF-8 to UTF-16 (or UCS-2) on Windows and UTF-8 to UTF-32 on most other platforms.
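To make that conversion explicit rather than relying on whatever locale the stream happens to have, a UTF-8 codecvt facet can be imbued before the file is opened. The following is a minimal sketch, not a definitive implementation: readUtf8Lines and writeUtf8Lines are hypothetical helper names, and std::codecvt_utf8 is a real standard facet from <codecvt> that has been deprecated since C++17, so a compiler may warn about it.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Reads a UTF-8 encoded file into wide strings, one per line.
std::vector<std::wstring> readUtf8Lines(const std::string& path)
{
    std::wifstream in;
    // Imbue before open so the facet is used from the first byte on.
    // consume_header skips a leading UTF-8 BOM if one is present.
    in.imbue(std::locale(in.getloc(),
        new std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header>));
    in.open(path);

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}

// Writes wide strings back out as UTF-8, one per line.
void writeUtf8Lines(const std::string& path,
                    const std::vector<std::wstring>& lines)
{
    std::wofstream out;
    out.imbue(std::locale(out.getloc(), new std::codecvt_utf8<wchar_t>));
    out.open(path);
    for (const std::wstring& line : lines)
        out << line << L'\n';
}
```

The locale takes ownership of the facet created with new, so no delete is needed. On platforms where wchar_t is 16 bits wide this facet handles only the Basic Multilingual Plane, which is enough for Cyrillic.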
As explained on the MSDN documentation page for basic_filebuf:
Objects of type basic_filebuf are created with an internal buffer of type char * regardless of the char_type specified by the type parameter Elem. This means that a Unicode string (containing wchar_t characters) will be converted to an ANSI string (containing char characters) before it is written to the internal buffer.
Similarly, when reading a Unicode string (containing wchar_t characters), the basic_filebuf converts the ANSI string read from the file to the wchar_t string returned to getline and other read operations.
If you change the encoding of input.txt to UTF-8, your original program should work correctly.
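Most text editors can re-save the file as UTF-8. If the conversion has to happen in code instead, it can be done by hand on the raw bytes. The sketch below assumes the input really is UTF-16 LE (with or without a BOM); appendUtf8, readUnit, utf16leToUtf8 and convertFileToUtf8 are hypothetical names introduced here for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>

// Appends one Unicode code point to out, encoded as UTF-8.
void appendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Reads the little-endian 16-bit code unit that starts at byte k.
std::uint32_t readUnit(const std::string& bytes, std::size_t k)
{
    return static_cast<std::uint32_t>(
        static_cast<unsigned char>(bytes[k]) |
        (static_cast<unsigned char>(bytes[k + 1]) << 8));
}

// Converts raw UTF-16 LE bytes to UTF-8. A leading BOM is skipped,
// unpaired surrogates become U+FFFD, an odd trailing byte is ignored.
std::string utf16leToUtf8(const std::string& bytes)
{
    std::string out;
    std::size_t i = 0;
    if (bytes.size() >= 2 && readUnit(bytes, 0) == 0xFEFF)
        i = 2; // the bytes FF FE read as a little-endian unit give 0xFEFF
    while (i + 1 < bytes.size()) {
        std::uint32_t cp = readUnit(bytes, i);
        i += 2;
        if (cp >= 0xD800 && cp <= 0xDBFF) {          // high surrogate
            std::uint32_t low = (i + 1 < bytes.size()) ? readUnit(bytes, i) : 0;
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {   // stray low surrogate
            cp = 0xFFFD;
        }
        appendUtf8(out, cp);
    }
    return out;
}

// Converts a whole file; returns false if either file could not be used.
bool convertFileToUtf8(const std::string& inPath, const std::string& outPath)
{
    std::ifstream in(inPath, std::ios::binary);
    if (!in) return false;
    std::ostringstream buffer;
    buffer << in.rdbuf();
    std::ofstream out(outPath, std::ios::binary);
    out << utf16leToUtf8(buffer.str());
    return static_cast<bool>(out);
}
```

A call such as convertFileToUtf8("input.txt", "input_utf8.txt") would then produce a file the original program can read. Both files are opened in binary mode so that no newline translation interferes with the two-byte code units.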
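For reference, this works for me. It is your program with the headers added, the read loop testing getline directly instead of eof() (so no spurious empty line is stored at end of file), and a guard for an empty input:
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
int main()
{
    using namespace std;
    vector<wstring> inputVector;
    wstring inputString, result;
    // Read every line of the input file into the vector.
    wifstream inputStream;
    inputStream.open("input.txt");
    while (getline(inputStream, inputString))
    {
        inputVector.push_back(inputString);
    }
    inputStream.close();
    // Nothing to pick from; this also avoids a modulo by zero below.
    if (inputVector.empty())
    {
        return EXIT_FAILURE;
    }
    // Pick a random number of lines, each chosen at random.
    srand(static_cast<unsigned>(time(NULL)));
    int numLines = rand() % inputVector.size();
    for (int i = 0; i < numLines; i++)
    {
        int randomLine = rand() % inputVector.size();
        result += inputVector[randomLine];
    }
    // Write the concatenated lines to the output file.
    wofstream resultStream;
    resultStream.open("result.txt");
    resultStream << result;
    resultStream.close();
    return EXIT_SUCCESS;
}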
Note that the encoding of result.txt will also be UTF-8 (generally).
Answer 2:
Why would you use wifstream? Are you confident that your file consists of a sequence of (system-dependent) wide characters? Almost certainly that is not the case, most notably because the system's wide character set isn't actually defined outside the scope of a C++ program.
Instead, just read the input byte stream as it is and echo it accordingly:
std::ifstream infile(thefile);
std::string line;
std::vector<std::string> input;
while (std::getline(infile, line)) // like this!!
{
    input.push_back(line);
}
// etc.
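Filling in the "etc." part, the whole task can be done on raw bytes without ever decoding them. This is a sketch under the assumption that input and output should share the same encoding (for example UTF-8) and that lines are separated by '\n'. readLines and pickRandomLines are hypothetical names, and <random> is swapped in for the question's rand() calls.

```cpp
#include <cstddef>
#include <fstream>
#include <random>
#include <string>
#include <vector>

// Reads the file line by line as plain bytes; no decoding takes place.
std::vector<std::string> readLines(const std::string& path)
{
    std::ifstream infile(path);
    std::vector<std::string> input;
    std::string line;
    while (std::getline(infile, line))
        input.push_back(line);
    return input;
}

// Picks between 0 and size-1 lines (like rand() % size in the question),
// each chosen uniformly at random, and joins them with newlines.
std::string pickRandomLines(const std::vector<std::string>& lines,
                            std::mt19937& rng)
{
    std::string result;
    if (lines.empty())
        return result; // nothing to choose from
    std::uniform_int_distribution<std::size_t> pick(0, lines.size() - 1);
    const std::size_t count = pick(rng);
    for (std::size_t i = 0; i < count; ++i) {
        result += lines[pick(rng)];
        result += '\n'; // the question's code concatenated without a separator
    }
    return result;
}
```

A caller would seed the generator once, for example std::mt19937 rng(std::random_device{}()), and then write pickRandomLines(readLines("input.txt"), rng) to a std::ofstream. Because whole lines are copied byte for byte, multibyte characters are never split. This does not work for a UTF-16 file, where the newline is itself two bytes.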
Source: https://stackoverflow.com/questions/7521842/reading-and-writing-files-in-cyrillic-in-c