Why does Java read a big file faster than C++?

太阳男子 2021-01-31 14:19

I have a 2 GB file (iputfile.txt) in which every line is a single word, for example:

apple
red
beautiful
smell
spark
input

5 answers
  •  借酒劲吻你
    2021-01-31 14:38

    The two languages handle I/O in a number of significantly different ways, any of which can affect performance in one direction or the other.

    Perhaps the first (and most important) question is how the data in the text file is encoded. If it uses 8-bit code units (ISO 8859-1 or UTF-8), Java has to convert it to UTF-16 before processing; depending on the locale, C++ may (or may not) also convert it or do some additional checking.

    As has been pointed out (partially, at least), in C++ >> uses a locale-specific isspace, whereas getline simply compares each character against '\n', which is probably faster. (Typical implementations of isspace use a table of bit flags, which means an additional memory access for each character.)
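
    To make the two reading styles concrete, here is a minimal sketch of both loops. The function names count_with_extraction and count_with_getline are made up for illustration, and both take any std::istream so they can be tried on a file or on an in-memory string. It shows only the difference in how input is split, not a benchmark.

    ```cpp
    #include <cstddef>
    #include <istream>
    #include <string>

    // Hypothetical helper: count tokens with operator>>.
    // Each extraction skips leading whitespace and stops at the next
    // whitespace character, as classified by the stream's locale.
    std::size_t count_with_extraction(std::istream& in) {
        std::string word;
        std::size_t count = 0;
        while (in >> word) {
            ++count;
        }
        return count;
    }

    // Hypothetical helper: count lines with std::getline.
    // Each call scans only for the '\n' delimiter and discards it.
    std::size_t count_with_getline(std::istream& in) {
        std::string line;
        std::size_t count = 0;
        while (std::getline(in, line)) {
            ++count;
        }
        return count;
    }
    ```

    For a file with exactly one word per line the two functions return the same count; they differ as soon as a line contains a space or is empty. Passing a std::ifstream opened on the input file to each one, and timing the calls, would be one way to compare them on a given platform.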

    Optimization levels and specific library implementations may also vary. It's not unusual in C++ for one library implementation to be 2 or 3 times faster than another.

    Finally, one very significant difference: C++ distinguishes between text files and binary files. You've opened the file in text mode, which means it is "preprocessed" at the lowest level, before the extraction operators even see it. This depends on the platform: on Unix platforms the "preprocessing" is a no-op; on Windows it converts CRLF pairs into '\n', which has a definite impact on performance.

    If I recall correctly (I've not used Java for some years), Java expects higher-level functions to handle this, so functions like readLine will be slightly more complicated. Just guessing here, but I suspect that the additional logic at the higher level costs less at runtime than the buffer preprocessing at the lower level.

    If you are testing under Windows, you might experiment with opening the file in binary mode in C++. This should make no difference to the behavior of the program when you use >>; any extra CR will be treated as white space. With getline, you'll have to add logic to your code to remove any trailing '\r'.
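
    The binary-mode experiment could look roughly like the sketch below. The names strip_trailing_cr, read_lines_strip_cr and read_file_binary are hypothetical, and collecting the lines into a vector is only for demonstration; for a 2 GB file you would more likely process each line inside the loop.

    ```cpp
    #include <fstream>
    #include <istream>
    #include <string>
    #include <vector>

    // Hypothetical helper: drop one trailing '\r', which is what remains of a
    // CRLF line ending once getline has consumed the '\n'.
    void strip_trailing_cr(std::string& line) {
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();
        }
    }

    // Read every line from the stream, removing a trailing '\r' from each.
    std::vector<std::string> read_lines_strip_cr(std::istream& in) {
        std::vector<std::string> lines;
        std::string line;
        while (std::getline(in, line)) {
            strip_trailing_cr(line);
            lines.push_back(line);
        }
        return lines;
    }

    // Open the file in binary mode so the runtime performs no CRLF translation,
    // then read it line by line. Returns an empty vector if the file cannot
    // be opened.
    std::vector<std::string> read_file_binary(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        return read_lines_strip_cr(in);
    }
    ```

    Calling read_file_binary("iputfile.txt") should give the same words whether the file uses LF or CRLF line endings, since the only CR handling is done explicitly in strip_trailing_cr.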
