Ruby Reads Different File Sizes for Line Reads

落爺英雄遲暮 submitted on 2019-12-24 21:01:09

Question


I am working on something where exact file sizes are crucial, and the following code produces strange results:

filename = "testThis.txt"
total_chars = 0
# Open in text mode ("r"); the block form closes the file automatically.
File.open(filename, "r") do |file|
  while (line = file.gets)
    total_chars += line.length # length of the line as returned by gets
  end
end
puts "original size #{File.size(filename)}"
puts "Totals #{total_chars}"

It prints:

original size 20121
Totals 20061

Why is the second one coming up short?

Edit: The answerers' hunches are right: the test file has 60 lines in it. If I change the counting line to

  total_chars += line.length + 1

it works perfectly. But wouldn't this change be wrong on *nix?

Edit: Follow up is now here. Thanks!


Answer 1:


There are special characters stored in the file that delineate the lines:

  • CR LF (0x0D 0x0A) (\r\n) on Windows/DOS and
  • LF (0x0A) (\n) on UNIX systems.

Ruby's gets returns lines terminated with the UNIX-style \n. When a file is opened in text mode on Windows, each \r\n pair is converted to a single \n as it is read, so you lose 1 byte for every line.

Also, String#length is not a good measure of a string's size in bytes: it returns the number of characters in the string, not the number of bytes. If the string is not pure ASCII, one character may be stored as more than one byte (for example, in UTF-8). To count bytes, use String#bytesize.
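To make the difference between characters and bytes concrete, here is a minimal sketch. The sample strings are made up for illustration, and it assumes a Ruby version where string literals are UTF-8 by default (2.0 or later):

```ruby
# Sample strings chosen for illustration only.
ascii    = "hello"               # five ASCII characters
accented = "h\u00E9llo"          # "e" replaced by e-acute (U+00E9)
japanese = "\u65E5\u672C\u8A9E"  # three CJK characters

[ascii, accented, japanese].each do |s|
  # length counts characters; bytesize counts bytes in the string's encoding.
  puts "#{s.inspect}: length=#{s.length} bytesize=#{s.bytesize}"
end
```

Only the pure ASCII string has equal values for length and bytesize.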

To get the size of a file, use File.size(file_name).
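If a per-line total really is needed, one possible approach is to read in binary mode and sum byte sizes, so that no line-ending translation or character counting gets in the way. The sketch below is an assumption about what the asker wants rather than code from the question; the helper name total_bytes_by_line and the temporary file are hypothetical:

```ruby
require "tempfile"

# Hypothetical helper: sum the byte size of every line. Binary mode ("rb")
# disables line-ending translation on every platform.
def total_bytes_by_line(path)
  total = 0
  File.open(path, "rb") do |file|
    file.each_line { |line| total += line.bytesize }
  end
  total
end

# Demonstration on a throwaway file with Windows-style line endings.
Tempfile.create("line_endings") do |tmp|
  tmp.binmode
  tmp.write("first\r\nsecond\r\n")
  tmp.flush
  puts "File.size:   #{File.size(tmp.path)}"
  puts "sum by line: #{total_bytes_by_line(tmp.path)}"
end
```

Read this way, the two numbers should agree regardless of platform, because each line still carries its original terminator bytes.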




Answer 2:


My guess would be that you are on Windows, and your "testThis.txt" file has \r\n line endings. When the file is opened in text mode, each line ending will be converted to a single \n character. Therefore you'll lose 1 character per line.

Does your test file have 60 lines in it? That would be consistent with this explanation.
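The arithmetic behind this guess can be sketched in a few lines. The sample data is hypothetical, and the gsub call only simulates what text mode on Windows does to line endings; it is not how Ruby performs the translation internally:

```ruby
# Hypothetical sample: three lines with Windows-style endings.
raw = "alpha\r\nbeta\r\ngamma\r\n"

# Simulate text-mode reading on Windows: each "\r\n" becomes "\n".
translated = raw.gsub("\r\n", "\n")

crlf_count = raw.scan("\r\n").length
difference = raw.bytesize - translated.bytesize

puts "bytes on disk:       #{raw.bytesize}"
puts "bytes after reading: #{translated.bytesize}"
puts "difference:          #{difference}"
puts "CRLF line endings:   #{crlf_count}"
```

The difference equals the number of line endings, which matches the question: 20121 - 20061 = 60, one lost byte for each of the 60 lines.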




Answer 3:


The line-ending issue is the most likely culprit here.

It's also worth noting that if the character encoding of the text file is something other than ASCII, you will see a discrepancy between the two numbers as well. If the file is UTF-8, the counts will match for English and any other text that uses only standard ASCII characters. Beyond that, the file size and the character count can diverge widely: a single character takes up to 4 bytes in UTF-8 (up to 6 in the original, now-obsolete definition of the encoding), so the file size can be several times the character count.
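As a rough illustration of how far the two counts can drift apart, the sketch below prints the UTF-8 width of one sample character per byte length. The characters are arbitrary examples, and the snippet assumes UTF-8 string literals (the default since Ruby 2.0):

```ruby
# One sample character for each UTF-8 width (chosen purely for illustration).
samples = {
  "A"         => "Latin capital A (ASCII)",
  "\u00E9"    => "e with acute accent",
  "\u20AC"    => "euro sign",
  "\u{1F600}" => "grinning face emoji"
}

# Byte width of each sample character, in the order listed above.
widths = samples.keys.map(&:bytesize)

samples.each do |char, description|
  puts format("U+%04X  %-26s %d byte(s)", char.ord, description, char.bytesize)
end

text = samples.keys.join
puts "characters: #{text.length}, bytes: #{text.bytesize}"
```

A file holding just these four characters would be 10 bytes long while containing only 4 characters.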

Relying on '1 character = 1 byte' is just asking for trouble as it is almost certainly going to fail at some point.



Source: https://stackoverflow.com/questions/625733/ruby-reads-different-file-sizes-for-line-reads
