I need the size in bytes of each line in a file, so I can calculate the percentage of the file that has been read. I already have the size of the file from file.length(), but how do I get the size of each line?
You probably use something like the following to read the file:
FileInputStream fis = new FileInputStream(path);
BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
String line;
while ((line = br.readLine()) != null) {
    /* process line */
    /* report percentage */
}
You should specify the encoding explicitly right at the beginning. If you don't, you get the platform default, which on Android is UTF-8. That default can in principle be changed, but I would assume that no device actually does that.
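You can check at runtime what the default actually is, in case you want to verify the assumption:

```java
import java.nio.charset.Charset;

public class DefaultCharset {
    public static void main(String[] args) {
        // Prints the platform default encoding, e.g. UTF-8 on Android
        System.out.println(Charset.defaultCharset());
    }
}
```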
To repeat what the other answers already state: the character count is not always the same as the byte count. The UTF encodings in particular are tricky. There are currently 249,764 assigned Unicode characters and potentially over a million (WP), and UTF uses 1 to 4 bytes per character to be able to encode all of them. UTF-32 is the simplest case since it always uses 4 bytes. UTF-8 encodes dynamically with 1 to 4 bytes; simple ASCII characters take just 1 byte. (source: UTF & BOM FAQ)
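To make the variable width concrete, here is a quick sketch (the example characters are my own) showing the four possible UTF-8 widths:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Widths {
    public static void main(String[] args) {
        // One example character per UTF-8 byte width:
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // 1 byte  (ASCII)
        System.out.println("ä".getBytes(StandardCharsets.UTF_8).length);  // 2 bytes (Latin-1 supplement)
        System.out.println("€".getBytes(StandardCharsets.UTF_8).length);  // 3 bytes (rest of the BMP)
        System.out.println("😀".getBytes(StandardCharsets.UTF_8).length); // 4 bytes (supplementary plane)
    }
}
```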
To get the number of bytes you can use e.g. line.getBytes("UTF-8").length (note that length is a field on the array, not a method). One big disadvantage is that this is quite inefficient, since it creates a copy of the String's internal array each time and throws that copy away immediately afterwards. That is performance tip #1 at Android | Performance Tips.
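If the temporary arrays become a problem, the UTF-8 byte length can also be computed directly from the code points without allocating anything. This is only a sketch, and utf8Length is my own helper name; for well-formed strings it matches getBytes("UTF-8").length:

```java
public class Utf8Length {
    // UTF-8 byte length of a string, computed from code points instead of
    // allocating a temporary byte[] via getBytes()
    static long utf8Length(String s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80)        bytes += 1; // ASCII
            else if (cp < 0x800)  bytes += 2; // e.g. Latin supplements
            else if (cp < 0x10000) bytes += 3; // rest of the BMP
            else                  bytes += 4; // supplementary planes
            i += Character.charCount(cp);     // surrogate pairs are 2 chars
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("Hello World")); // 11
        System.out.println(utf8Length("Aä€"));         // 1 + 2 + 3 = 6
    }
}
```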
It is also not 100% accurate in terms of the actual bytes read from the file, for the following reasons:
UTF-16 text files, for example, often start with a special 2-byte BOM (Byte Order Mark) to signal whether they have to be interpreted little- or big-endian. Those 2 bytes (UTF-8: 3, UTF-32: 4) are not reported when you just look at the String
you get from your reader. So you are already some bytes off here.
Encoding every line back to a byte array as UTF-16 includes those BOM bytes again, for each line. So getBytes
will report 2 bytes too many per line.
Line-ending characters are not part of the resulting line String
. To make things worse, there are different ways of signaling the end of a line: usually the Unix-style '\n'
, which is a single character, or the Windows-style "\r\n"
, which is two characters. The BufferedReader
simply skips those. Here your calculation is missing a variable number of bytes: from 1 byte per line for Unix/UTF-8 up to 8 bytes per line for Windows/UTF-32.
The last two effects would cancel each other out for Unix/UTF-16, but that is probably not the typical case. The impact of the error also depends on line length: if you have an error of 4 bytes for each line in a file whose lines are only 10 bytes long, your progress will be quite considerably wrong (if my math is right, your progress would sit at 140% or 60% after the last line, depending on whether your calculation is off by +4 or -4 bytes per line).
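The line-terminator part of the error is easy to demonstrate: summing line.getBytes(...).length over all lines falls short of the real file size by exactly the terminator bytes. A small sketch using a temporary file with Windows-style line endings:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LineByteError {
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("lines", ".txt");
        // Windows-style line endings: each line costs 2 extra bytes on disk
        Files.write(tmp, "one\r\ntwo\r\nthree\r\n".getBytes(StandardCharsets.UTF_8));

        long sum = 0;
        try (BufferedReader br = Files.newBufferedReader(tmp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                sum += line.getBytes(StandardCharsets.UTF_8).length;
            }
        }
        System.out.println(sum);             // 11 (3 + 3 + 5)
        System.out.println(Files.size(tmp)); // 17 (11 + 3 * 2 terminator bytes)
        Files.delete(tmp);
    }
}
```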
That means that, so far, regardless of what you do, you get no more than an approximation.
Getting the actual byte count could probably be done by writing your own byte-counting Reader
, but that would be quite a lot of work.
An alternative would be to use a custom InputStream
that counts how many bytes are actually read from the underlying stream. That's not too hard to do, and it does not care about encodings.
The big disadvantage is that the count does not increase linearly with the lines you read, since BufferedReader
will fill its internal buffer and serve lines from there, then read the next chunk from the file, and so on. If the buffer is large enough, you are at 100% after the first line already. But I assume your files are big enough, or you would not want to report progress in the first place.
The following, for example, would be such an implementation. It works, but I can't guarantee that it is perfect. It won't work if streams use mark()
and reset()
. File reading should not do that though.
static class CountingInputStream extends FilterInputStream {
    private long bytesRead;

    protected CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int result = super.read();
        if (result != -1) bytesRead += 1;
        return result;
    }

    @Override
    public int read(byte[] b) throws IOException {
        int result = super.read(b);
        if (result != -1) bytesRead += result;
        return result;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int result = super.read(b, off, len);
        if (result != -1) bytesRead += result;
        return result;
    }

    @Override
    public long skip(long n) throws IOException {
        // skip() never returns -1; it reports the number of bytes skipped
        long result = super.skip(n);
        if (result > 0) bytesRead += result;
        return result;
    }

    public long getBytesRead() {
        return bytesRead;
    }
}
Using the following code
File file = new File("mytestfile.txt");
int linesRead = 0;
long progress = 0;
long fileLength = file.length();
String line;
CountingInputStream cis = new CountingInputStream(new FileInputStream(file));
BufferedReader br = new BufferedReader(new InputStreamReader(cis, "UTF-8"), 8192);
while ((line = br.readLine()) != null) {
    long newProgress = cis.getBytesRead();
    if (progress != newProgress) {
        progress = newProgress;
        int percent = (int) ((progress * 100) / fileLength);
        System.out.println(String.format("At line: %4d, bytes: %6d = %3d%%", linesRead, progress, percent));
    }
    linesRead++;
}
System.out.println("Total lines: " + linesRead);
System.out.println("Total bytes: " + fileLength);
br.close();
I get output like
At line: 0, bytes: 8192 = 5%
At line: 82, bytes: 16384 = 10%
At line: 178, bytes: 24576 = 15%
....
At line: 1621, bytes: 155648 = 97%
At line: 1687, bytes: 159805 = 100%
Total lines: 1756
Total bytes: 159805
or, for the same file encoded as UTF-16,
At line: 0, bytes: 24576 = 7%
At line: 82, bytes: 40960 = 12%
At line: 178, bytes: 57344 = 17%
.....
At line: 1529, bytes: 303104 = 94%
At line: 1621, bytes: 319488 = 99%
At line: 1687, bytes: 319612 = 100%
Total lines: 1756
Total bytes: 319612
Instead of printing, you would update your progress indicator.
So, what is the best approach?

- Use String#length() (and maybe add +1 or +2 for the line ending). String#length() is fast and simple, and as long as you know what kind of files you have, you should have no problems.
- Use String#getBytes(): the longer processing a line takes, the lower the relative impact of the temporary arrays and their garbage collection. The inaccuracy should be within acceptable bounds; just make sure not to crash if progress is > 100% or < 100% at the end.
- Write your own byte-counting Reader that combines the jobs of InputStreamReader and BufferedReader, since Reader already operates on characters. Android's implementation may help as a starting point.

You need to know the encoding - otherwise it's a meaningless question. For example, "foo" is 6 bytes in UTF-16, but 3 bytes in ASCII. Assuming you're reading a line at a time (given your question), you should know which encoding you're using, as you should have specified it when you started to read.
You can call String.getBytes(charset)
to get the encoded representation of a particular string.
Do not just call String.getBytes()
as that will use the platform default encoding.
Note that all of this is somewhat make-work... you've read the bytes, decoded them to text, then you're re-encoding them into bytes...
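A quick illustration of how much the charset choice matters (note that java.nio's "UTF-16" charset prepends a 2-byte BOM when encoding, which is why it reports 8 rather than 6 below):

```java
import java.nio.charset.StandardCharsets;

public class CharsetSizes {
    public static void main(String[] args) {
        String s = "foo";
        // Always pass the charset explicitly; the no-argument getBytes()
        // silently uses the platform default encoding.
        System.out.println(s.getBytes(StandardCharsets.US_ASCII).length); // 3
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 6 (no BOM)
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length);   // 8 (2-byte BOM + 6)
    }
}
```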
final String hello_str = "Hello World";
hello_str.getBytes().length is the "byte size", i.e. the number of bytes in the platform's default encoding. Pass an explicit charset to getBytes() if you need the size in a specific encoding.
If the file is an ASCII file, then you can use String.length(); otherwise it gets more complex.
Consider you have a string variable called hello_str
final String hello_str = "Hello World";
// Check character length
hello_str.length()    // output will be 11

// Check encoded sizes
final byte[] utf8Bytes = hello_str.getBytes("UTF-8");
utf8Bytes.length      // output will be 11

final byte[] utf16Bytes = hello_str.getBytes("UTF-16");
utf16Bytes.length     // output will be 24 (2-byte BOM + 11 * 2)

final byte[] utf32Bytes = hello_str.getBytes("UTF-32");
utf32Bytes.length     // output will be 44