Fastest way to check if a byte array is all zeros

栀梦 2021-02-03 18:06

I have a byte[4096] and was wondering what the fastest way is to check if all values are zero?

Is there any way faster than doing:

byte[] b = ...; // the 4096-byte array to check
for (int i = 0; i < b.length; i++) {
    if (b[i] != 0) {
        return false;
    }
}
return true;

5 Answers
  • 2021-02-03 18:43

    Someone suggested checking 4 or 8 bytes at a time. You actually can do this in Java:

    LongBuffer longBuffer = ByteBuffer.wrap(b).asLongBuffer();
    while (longBuffer.hasRemaining()) {
        if (longBuffer.get() != 0) {
            return false;
        }
    }
    return true;
    

    Whether this is faster than checking byte values is uncertain, since there is so much potential for optimization.
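As a hedged sketch of the same idea wrapped into a complete method (the name isAllZeros is my own): note that asLongBuffer() only exposes the first length / 8 longs, so any trailing bytes, when the length is not a multiple of 8, need a separate check:

```java
import java.nio.ByteBuffer;
import java.nio.LongBuffer;

class ZeroCheck {
    // Checks eight bytes at a time through a LongBuffer view,
    // then falls back to per-byte checks for the leftover tail
    // (asLongBuffer() silently ignores the last length % 8 bytes).
    static boolean isAllZeros(byte[] b) {
        LongBuffer longs = ByteBuffer.wrap(b).asLongBuffer();
        while (longs.hasRemaining()) {
            if (longs.get() != 0) {
                return false;
            }
        }
        for (int i = b.length - (b.length % 8); i < b.length; i++) {
            if (b[i] != 0) {
                return false;
            }
        }
        return true;
    }
}
```

For a 4096-byte array the tail loop never runs, since 4096 is a multiple of 8.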

  • 2021-02-03 18:44

    This may not be the fastest or most memory-efficient solution, but it's a one-liner:

    byte[] arr = randomByteArray();
    assert Arrays.equals(arr, new byte[arr.length]);
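One caveat: this allocates a fresh zero-filled array on every call. If the check runs often with a known size, a cached comparison array avoids the allocation; the ZEROS constant below is my own addition, assuming the 4096-byte case from the question:

```java
import java.util.Arrays;

class ZeroCompare {
    // Cached all-zero array, reused so the one-liner no longer
    // allocates on every call for the common 4096-byte case.
    private static final byte[] ZEROS = new byte[4096];

    static boolean isAllZeros(byte[] arr) {
        // Fall back to a fresh array only for unexpected lengths.
        byte[] reference = (arr.length == ZEROS.length) ? ZEROS : new byte[arr.length];
        return Arrays.equals(arr, reference);
    }
}
```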
    
  • 2021-02-03 18:56

    I have rewritten this answer: I was originally summing all the bytes, but that is incorrect because Java bytes are signed, so the values need to be ORed together instead. I have also fixed the JVM warmup.

    Your best bet really is to simply loop over all values.

    I suppose you have three major options available:

    1. OR all elements together and check the result.
    2. Do branchless comparisons.
    3. Do comparisons with a branch.

    I don't know how fast Java's byte additions are at a low level, but I do know that the CPU's branch predictor comes into play when you use branched comparisons.

    Therefore I expect the following to happen for:

    byte[] array = new byte[4096];
    for (byte b : array) {
        if (b != 0) {
            return false;
        }
    }
    
    1. Relatively slow comparison in the first few iterations when the branch predictor is still seeding itself.
    2. Very fast branch comparisons due to branch prediction as every value should be zero anyway.

    If the loop hits a non-zero value, the branch prediction fails, slowing down that comparison, but at that point you are also at the end of your computation, since you want to return false either way. I think the cost of one failed branch prediction is an order of magnitude smaller than the cost of continuing to iterate over the array.

    Furthermore, I believe for (byte b : array) should be fine here, as it should get compiled directly into indexed array iteration; as far as I know there is no such thing as a PrimitiveArrayIterator that would cause extra method calls (as iterating over a List does) until the code gets inlined.

    Update

    I wrote my own benchmarks, which gave some interesting results. Unfortunately I couldn't use any of the existing benchmark tools, as they are pretty hard to get set up correctly.

    I also decided to group options 1 and 2 together, as I think they are effectively the same: in branchless code you usually OR everything together (minus the condition) and check the final result at the end. The condition here is x != 0, and ORing in a zero is presumably a no-op.

    The code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.function.Consumer;
    import java.util.stream.IntStream;

    public class Benchmark {
        private void start() {
            //setup byte arrays
            List<byte[]> arrays = createByteArrays(700_000);
    
            //warmup and benchmark repeated
            arrays.forEach(this::byteArrayCheck12);
            benchmark(arrays, this::byteArrayCheck12, "byteArrayCheck12");
    
            arrays.forEach(this::byteArrayCheck3);
            benchmark(arrays, this::byteArrayCheck3, "byteArrayCheck3");
    
            arrays.forEach(this::byteArrayCheck4);
            benchmark(arrays, this::byteArrayCheck4, "byteArrayCheck4");
    
            arrays.forEach(this::byteArrayCheck5);
            benchmark(arrays, this::byteArrayCheck5, "byteArrayCheck5");
        }
    
        private void benchmark(final List<byte[]> arrays, final Consumer<byte[]> method, final String name) {
            long start = System.nanoTime();
            arrays.forEach(method);
            long end = System.nanoTime();
            double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
            System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
        }
    
        private List<byte[]> createByteArrays(final int amount) {
            Random random = new Random();
            List<byte[]> resultList = new ArrayList<>();
            for (int i = 0; i < amount; i++) {
                byte[] byteArray = new byte[4096];
                byteArray[random.nextInt(4096)] = 1;
                resultList.add(byteArray);
            }
            return resultList;
        }
    
        private boolean byteArrayCheck12(final byte[] array) {
            int sum = 0;
            for (byte b : array) {
                sum |= b;
            }
            return (sum == 0);
        }
    
        private boolean byteArrayCheck3(final byte[] array) {
            for (byte b : array) {
                if (b != 0) {
                    return false;
                }
            }
            return true;
        }
    
        private boolean byteArrayCheck4(final byte[] array) {
            return (IntStream.range(0, array.length).map(i -> array[i]).reduce(0, (a, b) -> a | b) == 0);
        }
    
        private boolean byteArrayCheck5(final byte[] array) {
            return IntStream.range(0, array.length).map(i -> array[i]).noneMatch(i -> i != 0);
        }
    
        public static void main(String[] args) {
            new Benchmark().start();
        }
    }
    

    The surprising results:

    Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 50.18817142857143ns
    Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 767.7371985714286ns
    Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 21145.03219857143ns
    Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 10376.119144285714ns

    This shows that ORing is a whole lot faster than the branched version, which is rather surprising, so I assume some low-level optimizations are being done.

    As an extra, I've included the stream variants, which I did not expect to be that fast anyway.

    Ran on a stock-clocked Intel i7-3770, 16GB 1600MHz RAM.

    So I think the final answer is: it depends. It depends on how many times you are going to check the array consecutively. The byteArrayCheck3 solution stays steady at 700~800ns.

    Follow up update

    Things actually took another interesting turn: it turns out the JIT was optimizing almost all of the calculations away, because the resulting values were never used.

    Thus I have the following new benchmark method:

    private void benchmark(final List<byte[]> arrays, final Predicate<byte[]> method, final String name) {
        long start = System.nanoTime();
        boolean someUnrelatedResult = false;
        for (byte[] array : arrays) {
            someUnrelatedResult |= method.test(array);
        }
        long end = System.nanoTime();
        double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
        System.out.println("Result: " + someUnrelatedResult);
        System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
    }
    

    This ensures that the results of the benchmarked methods cannot be optimized away. The major issue was that byteArrayCheck12's return value was discarded by the Consumer, so the JIT noticed that (sum == 0) was never used and optimized the entire method away.

    Thus we have the following new result (omitted the result prints for clarity):

    Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 1370.6987942857143ns
    Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 736.1096242857143ns
    Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 20671.230327142857ns
    Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 9845.388841428572ns

    So it seems we can finally conclude that branch prediction wins. However, this could also be due to the early returns, since on average the offending byte sits in the middle of the array. So it is time for another method that does not return early:

    private boolean byteArrayCheck3b(final byte[] array) {
        int hits = 0;
        for (byte b : array) {
            if (b != 0) {
                hits++;
            }
        }
        return (hits == 0);
    }
    

    This way we still benefit from branch prediction, but we make sure we cannot return early.

    Which in turn gives us more interesting results again!

    Benchmark: byteArrayCheck12 / iterations: 700000 / time per iteration: 1327.2817714285713ns
    Benchmark: byteArrayCheck3 / iterations: 700000 / time per iteration: 753.31376ns
    Benchmark: byteArrayCheck3b / iterations: 700000 / time per iteration: 1506.6772842857142ns
    Benchmark: byteArrayCheck4 / iterations: 700000 / time per iteration: 21655.950115714284ns
    Benchmark: byteArrayCheck5 / iterations: 700000 / time per iteration: 10608.70917857143ns

    I think we can now finally conclude that the fastest approach combines early returns with branch prediction (byteArrayCheck3), followed by ORing (byteArrayCheck12), followed by branch prediction without early returns (byteArrayCheck3b). I suspect all of these operations are highly optimized in native code.
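As a hedged sketch of combining the two winners, one could OR within fixed-size chunks and branch only once per chunk, so a non-zero byte still causes an early return without paying for a branch per element. This is my own variation, not one of the benchmarked methods, and the chunk size of 256 is an arbitrary choice:

```java
class HybridCheck {
    // Branchless OR inside each chunk, one branch per chunk:
    // combines the OR accumulation of byteArrayCheck12 with the
    // early return of byteArrayCheck3.
    static boolean isAllZeros(byte[] array) {
        final int chunkSize = 256; // arbitrary choice for this sketch
        int i = 0;
        while (i < array.length) {
            int limit = Math.min(i + chunkSize, array.length);
            int acc = 0;
            for (; i < limit; i++) {
                acc |= array[i];
            }
            if (acc != 0) {
                return false; // early exit, at most one chunk late
            }
        }
        return true;
    }
}
```

Whether this beats the plain loop would need benchmarking; it merely illustrates the trade-off described above.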

    Update: some additional benchmarks using long and int arrays

    After seeing suggestions to use long[] and int[], I decided it was worth investigating. These attempts are no longer fully in line with the original question, but they may still be interesting.

    Firstly, I changed the benchmark method to use generics:

    private <T> void benchmark(final List<T> arrays, final Predicate<T> method, final String name) {
        long start = System.nanoTime();
        boolean someUnrelatedResult = false;
        for (T array : arrays) {
            someUnrelatedResult |= method.test(array);
        }
        long end = System.nanoTime();
        double nanosecondsPerIteration = (end - start) * 1d / arrays.size();
        System.out.println("Result: " + someUnrelatedResult);
        System.out.println("Benchmark: " + name + " / iterations: " + arrays.size() + " / time per iteration: " + nanosecondsPerIteration + "ns");
    }
    

    Then, before the benchmarks, I converted the byte[] arrays to long[] and int[] respectively; it was also necessary to raise the maximum heap size to 10 GB.

    List<long[]> longArrays = arrays.stream().map(byteArray -> {
        long[] longArray = new long[4096 / 8];
        ByteBuffer.wrap(byteArray).asLongBuffer().get(longArray);
        return longArray;
    }).collect(Collectors.toList());
    longArrays.forEach(this::byteArrayCheck8);
    benchmark(longArrays, this::byteArrayCheck8, "byteArrayCheck8");
    
    List<int[]> intArrays = arrays.stream().map(byteArray -> {
        int[] intArray = new int[4096 / 4];
        ByteBuffer.wrap(byteArray).asIntBuffer().get(intArray);
        return intArray;
    }).collect(Collectors.toList());
    intArrays.forEach(this::byteArrayCheck9);
    benchmark(intArrays, this::byteArrayCheck9, "byteArrayCheck9");
    
    private boolean byteArrayCheck8(final long[] array) {
        for (long l : array) {
            if (l != 0) {
                return false;
            }
        }
        return true;
    }
    
    private boolean byteArrayCheck9(final int[] array) {
        for (int i : array) {
            if (i != 0) {
                return false;
            }
        }
        return true;
    }
    

    Which gave the following results:

    Benchmark: byteArrayCheck8 / iterations: 700000 / time per iteration: 259.8157614285714ns
    Benchmark: byteArrayCheck9 / iterations: 700000 / time per iteration: 266.38013714285717ns

    This path may be worth exploring if you can get the bytes in that format to begin with. When the conversion was done inside the benchmarked method, however, the times were around 2000 nanoseconds per iteration, so it is not worth it if you have to do the conversion yourself.
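If the data starts out as a byte[], a possible middle ground (my own sketch, not benchmarked above) is to read longs directly through a ByteBuffer view instead of copying into a long[] first; this assumes the length is a multiple of 8, which holds for the 4096-byte buffers here:

```java
import java.nio.ByteBuffer;

class ViewCheck {
    // Reads the byte[] eight bytes at a time through a ByteBuffer,
    // avoiding the copy into a separate long[].
    // Assumes array.length is a multiple of 8 (true for 4096);
    // any trailing bytes would be skipped by this loop.
    static boolean isAllZeros(byte[] array) {
        ByteBuffer buffer = ByteBuffer.wrap(array);
        while (buffer.remaining() >= 8) {
            if (buffer.getLong() != 0) {
                return false;
            }
        }
        return true;
    }
}
```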

  • 2021-02-03 18:59

    For Java 8, you can simply use this:

    public static boolean isEmpty(final byte[] data){
        return IntStream.range(0, data.length).parallel().allMatch(i -> data[i] == 0);
    }
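Note that parallel() adds thread-coordination overhead that may outweigh any gain for an array as small as 4096 bytes; a sequential variant (my own adaptation) still short-circuits at the first non-zero byte:

```java
import java.util.stream.IntStream;

class StreamCheck {
    // Sequential version of the same check: allMatch
    // short-circuits as soon as a non-zero byte is found.
    static boolean isEmpty(final byte[] data) {
        return IntStream.range(0, data.length).allMatch(i -> data[i] == 0);
    }
}
```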
    
  • 2021-02-03 19:02

    I think that theoretically your way is the fastest; in practice you might be able to benefit from wider comparisons, as one of the commenters suggested (a 1-byte comparison takes one instruction, but so does an 8-byte comparison on a 64-bit system).

    Also, in languages closer to the hardware (C and its variants), you can make use of vectorization, performing a number of comparisons/additions simultaneously. It looks like Java still doesn't have native support for it, but based on this answer you might be able to get some use of it.

    Also, in line with the other comments, I would say that with a 4 KB buffer it's probably not worth the time trying to optimize it (unless it is being called very often).

    0 讨论(0)