Is there any way to get the size in bytes of a string in Java?

旧巷少年郎 2021-01-21 19:20

I need the size in bytes of each line in a file, so I can get a percentage of the file read. I already got the size of the file with file.length(), but how do I get

5 answers
  • 2021-01-21 19:33

    You are probably reading the file with something like the following:

    FileInputStream fis = new FileInputStream(path);
    BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
    String line;
    while ((line = br.readLine()) != null) {
       /* process line */
       /* report percentage */
    }
    

    Specify the encoding explicitly when you create the reader. If you don't, the platform default is used; on Android that is UTF-8. The default can in principle be changed, but I would assume that no device does that.

    To repeat what the other answers already stated: the character count is not always the same as the byte count, and the UTF encodings in particular are tricky. There are currently 249,764 assigned Unicode characters, with room for over a million (WP), and the UTF encodings use 1 to 4 bytes per character to be able to encode all of them. UTF-32 is the simplest case since it always uses 4 bytes. UTF-8 is variable-length and uses 1 to 4 bytes; simple ASCII characters use just 1 byte. (source: UTF & BOM FAQ)

    To get the number of bytes you can use e.g. line.getBytes("UTF-8").length (length is a field on arrays, not a method). One big disadvantage is that this is very inefficient, since every call encodes the String into a new byte array that is thrown away immediately afterwards. Avoiding such unnecessary object creation is the first item addressed in Android | Performance Tips.
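As a rough illustration of the gap between character count and byte count, the sketch below encodes a few sample strings. The class name EncodedSize and the sample strings are made up for this example, and it uses the BOM-free UTF_16BE charset so that only the payload bytes are counted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
final class EncodedSize {
    // Number of bytes the text occupies once encoded with the given charset.
    static int of(String text, Charset charset) {
        return text.getBytes(charset).length; // allocates a temporary byte[]
    }

    public static void main(String[] args) {
        // "abc", e with acute accent, euro sign, and an emoji outside the BMP
        String[] samples = {"abc", "\u00e9", "\u20ac", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.printf("chars=%d utf8=%d utf16=%d%n",
                    s.length(),
                    of(s, StandardCharsets.UTF_8),
                    of(s, StandardCharsets.UTF_16BE));
        }
    }
}
```

The euro sign is a single char but needs 3 bytes in UTF-8, and the emoji is 2 chars (a surrogate pair) but 4 bytes in both UTF-8 and UTF-16, so String#length() matches neither byte count in general.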

    It is also not 100% accurate in terms of actual bytes read from the file, for the following reasons:

    • UTF-16 text files, for example, often start with a special 2-byte BOM (Byte Order Mark) that signals whether they have to be interpreted as little or big endian. Those 2 bytes (UTF-8: 3, UTF-32: 4) are not reported when you just look at the String you get from your reader, so you are already a few bytes off here.

    • Encoding each line back to bytes with getBytes("UTF-16") adds such a BOM to every line, so getBytes will report 2 bytes too many for each line.

    • Line ending characters are not part of the resulting line String. To make things worse, there are different ways of signaling the end of a line: usually the Unix-style "\n", which is only 1 character, or the Windows-style "\r\n", which is 2 characters. The BufferedReader simply drops them, so your calculation is missing a variable number of bytes per line, from 1 byte for Unix/UTF-8 to 8 bytes for Windows/UTF-32.

    The last two effects cancel each other out if you have Unix/UTF-16, but that is probably not the typical case. The effect of the error also depends on line length: if you are off by 4 bytes on each line and a line is only 10 bytes long in total, your progress will be considerably wrong (if my math is right, it would be at 140% or 60% after the last line, depending on whether your calculation is off by +4 or -4 bytes per line).
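To make these error sources concrete, here is a small sketch that sums the per-line getBytes lengths for an in-memory text and compares the result with the real encoded size. The class LineByteEstimate and the sample content are invented for the illustration; a StringReader stands in for the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
final class LineByteEstimate {
    // Adds up getBytes(charset).length for every line, like a naive progress counter.
    static long sumOfLineBytes(String content, Charset charset) {
        long total = 0;
        try (BufferedReader br = new BufferedReader(new StringReader(content))) {
            String line;
            while ((line = br.readLine()) != null) {
                total += line.getBytes(charset).length;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory reader
        }
        return total;
    }

    public static void main(String[] args) {
        String content = "first\r\nsecond\r\nthird\r\n"; // Windows-style line endings
        // Bytes a UTF-8 file with this content would really contain.
        long actual = content.getBytes(StandardCharsets.UTF_8).length;
        // This estimate misses the 2 bytes of "\r\n" on every line.
        long utf8Estimate = sumOfLineBytes(content, StandardCharsets.UTF_8);
        // With the "UTF-16" charset, getBytes adds a 2-byte BOM to every line.
        long utf16Estimate = sumOfLineBytes(content, StandardCharsets.UTF_16);
        System.out.println(actual + " " + utf8Estimate + " " + utf16Estimate);
    }
}
```

For this sample a UTF-8 file would hold 22 bytes while the per-line sum only reaches 16, so a progress bar based on it would stop at about 72%.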

    So far that means that, whatever you do with the decoded lines, you get no more than an approximation.

    Getting the actual byte-count could probably be done if you write your own special byte counting Reader but that would be quite a lot of work.

    An alternative would be to use a custom InputStream that counts how many bytes are actually read from the underlying stream. That's not too hard to do and it does not care about encodings.

    The big disadvantage is that it does not increase linearly with the lines you read, since BufferedReader will fill its internal buffer and read lines from there, then read the next chunk from the file and so on. If the buffer is large enough you are at 100% at the first line already. But I assume your files are big enough or you would not want to find out about the progress.

    This, for example, would be such an implementation. It works but I can't guarantee that it is perfect. It won't work if streams use mark() and reset(). File reading should not do that though.

    // Meant to be nested inside another class. Needs java.io.FilterInputStream,
    // java.io.IOException and java.io.InputStream.
    static class CountingInputStream extends FilterInputStream {
        private long bytesRead;

        protected CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int result = super.read();
            if (result != -1) bytesRead += 1;
            return result;
        }

        // read(byte[]) is deliberately not overridden: FilterInputStream
        // implements it as read(b, 0, b.length), so counting in both
        // methods would count those bytes twice.
        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            int result = super.read(b, off, len);
            if (result != -1) bytesRead += result;
            return result;
        }

        @Override
        public long skip(long n) throws IOException {
            // skip() returns the number of bytes actually skipped
            long result = super.skip(n);
            if (result > 0) bytesRead += result;
            return result;
        }

        public long getBytesRead() {
            return bytesRead;
        }
    }
    

    Using the following code

    File file = new File("mytestfile.txt");
    int linesRead = 0;
    long progress = 0;
    long fileLength = file.length();
    String line;
    
    CountingInputStream cis = new CountingInputStream(new FileInputStream(file));
    BufferedReader br = new BufferedReader(new InputStreamReader(cis, "UTF-8"), 8192);
    while ((line = br.readLine()) != null) {
        long newProgress = cis.getBytesRead();
        if (progress != newProgress) {
            progress = newProgress;
            int percent = (int) ((progress * 100) / fileLength);
            System.out.println(String.format("At line: %4d, bytes: %6d = %3d%%", linesRead, progress, percent));
        }
        linesRead++;
    }
    System.out.println("Total lines: " + linesRead);
    System.out.println("Total bytes: " + fileLength);
    br.close();
    

    I get output like

    At line:    0, bytes:   8192 =   5%
    At line:   82, bytes:  16384 =  10%
    At line:  178, bytes:  24576 =  15%
    ....
    At line: 1621, bytes: 155648 =  97%
    At line: 1687, bytes: 159805 = 100%
    Total lines: 1756
    Total bytes: 159805
    

    or in case of the same file UTF-16 encoded

    At line:    0, bytes:  24576 =   7%
    At line:   82, bytes:  40960 =  12%
    At line:  178, bytes:  57344 =  17%
    .....
    At line: 1529, bytes: 303104 =  94%
    At line: 1621, bytes: 319488 =  99%
    At line: 1687, bytes: 319612 = 100%
    Total lines: 1756
    Total bytes: 319612
    

    Instead of printing that you could update your progress.
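When feeding these numbers into a progress indicator, it may help to clamp the value so that an inexact byte count or an empty file cannot produce a result outside 0 to 100. The following is only a sketch; the class Progress and its percent method are hypothetical names, not part of any library.

```java
// Hypothetical helper for turning byte counts into a safe percentage.
final class Progress {
    // Returns a value between 0 and 100, even if bytesRead overshoots
    // totalBytes or the file is empty.
    static int percent(long bytesRead, long totalBytes) {
        if (totalBytes <= 0) {
            return 100; // nothing to read, so treat it as done
        }
        long raw = bytesRead * 100 / totalBytes;
        return (int) Math.max(0, Math.min(100, raw));
    }

    public static void main(String[] args) {
        System.out.println(percent(8192, 159805));   // first chunk of the sample file
        System.out.println(percent(159805, 159805)); // end of file
    }
}
```

With the file size from the sample output above, the first 8192-byte chunk maps to 5% and the final count to 100%.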

    So, what is the best approach?

    • If you know that you have simple ASCII text in an encoding that uses only 1 byte for those characters: just use String#length() (and maybe add 1 or 2 for the line ending). String#length() is fast and simple, and as long as you know what files you have you should have no problems.
    • If you have international text where the simple approach won't work:
      • For smaller files where processing each line takes rather long: String#getBytes(). The longer processing one line takes, the lower the impact of the temporary arrays and their garbage collection. The inaccuracy should be within acceptable bounds. Just make sure not to crash if progress ends up above or below 100% at the end.
      • For larger files: the counting stream approach above. The larger the file, the better. Updating progress in 0.001% steps just slows things down. Decreasing the reader's buffer size would increase the accuracy, but it also decreases the read performance.
    • If you have enough time: write your own Reader that tells you the exact byte position, maybe a combination of InputStreamReader and BufferedReader, since Reader already operates on characters. Android's implementation may help as a starting point.
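For the last option, one possible shape is sketched below. It is not a java.io.Reader subclass but a small line reader that works directly on bytes, and it only handles UTF-8 (or ASCII) input. That restriction matters: in UTF-8 the byte 0x0A never occurs inside a multi-byte sequence, so splitting on it is safe, which is not true for UTF-16 or UTF-32. The class name Utf8LineReader is invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical class: reads UTF-8 lines and tracks the exact byte offset.
final class Utf8LineReader {
    private final InputStream in;
    private long bytesConsumed;

    Utf8LineReader(InputStream in) {
        this.in = in; // pass a BufferedInputStream when reading real files
    }

    // Returns the next line without its terminator, or null at end of stream.
    String readLine() throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        boolean readAnything = false;
        boolean terminated = false;
        int b;
        while ((b = in.read()) != -1) {
            bytesConsumed++; // counts terminators and any BOM as well
            readAnything = true;
            if (b == '\n') {
                terminated = true;
                break;
            }
            line.write(b);
        }
        if (!readAnything) {
            return null;
        }
        byte[] raw = line.toByteArray();
        int length = raw.length;
        if (terminated && length > 0 && raw[length - 1] == '\r') {
            length--; // drop the '\r' of a Windows-style "\r\n"
        }
        return new String(raw, 0, length, StandardCharsets.UTF_8);
    }

    // Exact number of bytes taken from the stream so far.
    long getBytesConsumed() {
        return bytesConsumed;
    }
}
```

After each readLine() call, getBytesConsumed() divided by file.length() gives the exact fraction of the file processed. A leading BOM, if present, is counted but also ends up as the first character of the first line, so a caller would have to strip it.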
  • 2021-01-21 19:36

    You need to know the encoding - otherwise it's a meaningless question. For example, "foo" is 6 bytes in UTF-16, but 3 bytes in ASCII. Assuming you're reading a line at a time (given your question) you should know which encoding you're using as you should have specified it when you started to read.

    You can call String.getBytes(charset) to get the encoded representation of a particular string.

    Do not just call String.getBytes() as that will use the platform default encoding.

    Note that all of this is somewhat make-work... you've read the bytes, decoded them to text, then you're re-encoding them into bytes...

  • 2021-01-21 19:54
    final String hello_str = "Hello World";
    
    hello_str.getBytes().length is the "byte size", i.e. the number of bytes
    
  • 2021-01-21 19:57

    If the file is an ASCII file, then you can use String.length(); otherwise it gets more complex.
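Whether a given line qualifies for that shortcut can be checked cheaply. The sketch below assumes the file is stored in an ASCII-compatible encoding such as UTF-8 or ISO-8859-1; the class name AsciiCheck is made up for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not from the original answer.
final class AsciiCheck {
    // True if every char is in the 7-bit ASCII range. For such text,
    // length() equals the UTF-8 byte count.
    static boolean isAscii(String text) {
        return text.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        String line = "Hello World";
        if (isAscii(line)) {
            System.out.println("bytes = " + line.length());
        } else {
            System.out.println("bytes = " + line.getBytes(StandardCharsets.UTF_8).length);
        }
    }
}
```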

  • 2021-01-21 19:59

    Suppose you have a String variable called hello_str:

    final String hello_str = "Hello World";

    // Character count (UTF-16 code units)
    System.out.println(hello_str.length()); // 11

    // Encoded sizes (Charset and StandardCharsets are in java.nio.charset)
    final byte[] utf8Bytes = hello_str.getBytes(StandardCharsets.UTF_8);
    System.out.println(utf8Bytes.length); // 11

    final byte[] utf16Bytes = hello_str.getBytes(StandardCharsets.UTF_16);
    System.out.println(utf16Bytes.length); // 24: 2 bytes per character plus a 2-byte BOM

    final byte[] utf32Bytes = hello_str.getBytes(Charset.forName("UTF-32"));
    System.out.println(utf32Bytes.length); // 44: 4 bytes per character, no BOM
    