Compute the sum of array values in parallel with Metal and Swift

Happy的楠姐 2021-02-01 09:49

I am trying to compute the sum of a large array in parallel with Metal in Swift.

Is there a good way to do it?

My plan was to divide my array into sub-arrays, compute

3 Answers
  • 2021-02-01 10:01

    I took the time to create a fully working example of this problem with Metal. The explanation is in the comments:

    let count = 10_000_000
    let elementsPerSum = 10_000
    
    // Data type, has to be the same as in the shader
    typealias DataType = CInt
    
    let device = MTLCreateSystemDefaultDevice()!
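    // `library(device:)` is a helper of the author's that is not shown here;
    // it returns the MTLLibrary that contains the `parsum` kernel.
    let library = self.library(device: device)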
    let parsum = library.makeFunction(name: "parsum")!
    let pipeline = try! device.makeComputePipelineState(function: parsum)
    
    // Our data, randomly generated:
    var data = (0..<count).map{ _ in DataType(arc4random_uniform(100)) }
    var dataCount = CUnsignedInt(count)
    var elementsPerSumC = CUnsignedInt(elementsPerSum)
    // Number of individual results = count / elementsPerSum (rounded up):
    let resultsCount = (count + elementsPerSum - 1) / elementsPerSum
    
    // Our data in a buffer (copied):
    let dataBuffer = device.makeBuffer(bytes: &data, length: MemoryLayout<DataType>.stride * count, options: [])!
    // A buffer for individual results (zero initialized)
    let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * resultsCount, options: [])!
    // Our results in convenient form to compute the actual result later:
    let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultsCount)
    let results = UnsafeBufferPointer<DataType>(start: pointer, count: resultsCount)
    
    let queue = device.makeCommandQueue()!
    let cmds = queue.makeCommandBuffer()!
    let encoder = cmds.makeComputeCommandEncoder()!
    
    encoder.setComputePipelineState(pipeline)
    
    encoder.setBuffer(dataBuffer, offset: 0, index: 0)
    
    encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
    encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
    encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)
    
    // We need `resultsCount` partial sums, one per thread. Each threadgroup runs
    // `threadExecutionWidth` threads, so the number of threadgroups is
    // `resultsCount` / `threadExecutionWidth` (rounded up).
    let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)
    
    // Each threadgroup runs `threadExecutionWidth` threads. The only thing that matters
    // for performance is that this number is a multiple of `threadExecutionWidth` (here 1x).
    let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)
    
    encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    encoder.endEncoding()
    
    var start, end : UInt64
    var result : DataType = 0
    
    // DispatchTime reports nanoseconds directly; mach_absolute_time() returns ticks,
    // which are not nanoseconds on every machine.
    start = DispatchTime.now().uptimeNanoseconds
    cmds.commit()
    cmds.waitUntilCompleted()
    // Add up the partial sums on the CPU
    for elem in results {
        result += elem
    }
    
    end = DispatchTime.now().uptimeNanoseconds
    
    print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
    result = 0
    
    start = DispatchTime.now().uptimeNanoseconds
    data.withUnsafeBufferPointer { buffer in
        for elem in buffer {
            result += elem
        }
    }
    end = DispatchTime.now().uptimeNanoseconds
    
    print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
    

    I used my Mac to test it, but it should work just fine on iOS.

    Output:

    Metal result: 494936505, time: 0.024611456
    CPU result: 494936505, time: 0.163341018
    

    The Metal version is about 7 times faster. I'm sure you can get more speed if you implement something like divide-and-conquer with cutoff or whatever.
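
    As a rough illustration of what "divide-and-conquer with cutoff" could mean here, the following CPU-only Swift sketch splits a range in half until it is no longer than a cutoff and then sums it with a plain loop. The name `dcSum` and the `cutoff` parameter are made up for this example; it is not a Metal implementation, only the shape of the recursion.

```swift
// Hypothetical helper: recursive sum with a cutoff (CPU-only illustration).
func dcSum(_ values: [Int32], in range: Range<Int>, cutoff: Int) -> Int64 {
    precondition(cutoff >= 1, "cutoff must be at least 1")
    // Small enough: sum sequentially.
    if range.count <= cutoff {
        var total: Int64 = 0
        for i in range { total += Int64(values[i]) }
        return total
    }
    // Otherwise split in half and combine the two partial sums.
    let mid = range.lowerBound + range.count / 2
    let left = dcSum(values, in: range.lowerBound..<mid, cutoff: cutoff)
    let right = dcSum(values, in: mid..<range.upperBound, cutoff: cutoff)
    return left + right
}

let sample: [Int32] = [1, 2, 3, 4, 5, 6, 7]
print(dcSum(sample, in: 0..<sample.count, cutoff: 2)) // 28
```

    On a GPU the two halves would be handled by different threads rather than by recursive calls, and the cutoff would correspond to the chunk size each thread sums on its own.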

  • 2021-02-01 10:17

    The accepted answer is annoyingly missing the kernel that was written for it. The source is here, but here is the full program and shader, which can be run as a Swift command-line application.

    /*
     * Command line Metal Compute Shader for data processing
     */
    
    import Metal
    import Foundation
    //------------------------------------------------------------------------------
    let count = 10_000_000
    let elementsPerSum = 10_000
    //------------------------------------------------------------------------------
    typealias DataType = CInt // Data type, has to be the same as in the shader
    //------------------------------------------------------------------------------
    let device = MTLCreateSystemDefaultDevice()!
    let library = device.makeDefaultLibrary()!
    let parsum = library.makeFunction(name: "parsum")!
    let pipeline = try! device.makeComputePipelineState(function: parsum)
    //------------------------------------------------------------------------------
    // Our data, randomly generated:
    var data = (0..<count).map{ _ in DataType(arc4random_uniform(100)) }
    var dataCount = CUnsignedInt(count)
    var elementsPerSumC = CUnsignedInt(elementsPerSum)
    // Number of individual results = count / elementsPerSum (rounded up):
    let resultsCount = (count + elementsPerSum - 1) / elementsPerSum
    //------------------------------------------------------------------------------
    // Our data in a buffer (copied):
    let dataBuffer = device.makeBuffer(bytes: &data, length: MemoryLayout<DataType>.stride * count, options: [])!
    // A buffer for individual results (zero initialized)
    let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * resultsCount, options: [])!
    // Our results in convenient form to compute the actual result later:
    let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultsCount)
    let results = UnsafeBufferPointer<DataType>(start: pointer, count: resultsCount)
    //------------------------------------------------------------------------------
    let queue = device.makeCommandQueue()!
    let cmds = queue.makeCommandBuffer()!
    let encoder = cmds.makeComputeCommandEncoder()!
    //------------------------------------------------------------------------------
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(dataBuffer, offset: 0, index: 0)
    encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
    encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
    encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)
    //------------------------------------------------------------------------------
    // We need `resultsCount` partial sums, one per thread. Each threadgroup runs
    // `threadExecutionWidth` threads, so the number of threadgroups is
    // `resultsCount` / `threadExecutionWidth` (rounded up).
    let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)
    
    // Each threadgroup runs `threadExecutionWidth` threads. The only thing that matters
    // for performance is that this number is a multiple of `threadExecutionWidth` (here 1x).
    let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)
    //------------------------------------------------------------------------------
    encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    encoder.endEncoding()
    //------------------------------------------------------------------------------
    var start, end : UInt64
    var result : DataType = 0
    //------------------------------------------------------------------------------
    // DispatchTime reports nanoseconds directly; mach_absolute_time() returns ticks,
    // which are not nanoseconds on every machine.
    start = DispatchTime.now().uptimeNanoseconds
    cmds.commit()
    cmds.waitUntilCompleted()
    // Add up the partial sums on the CPU
    for elem in results {
        result += elem
    }
    
    end = DispatchTime.now().uptimeNanoseconds
    //------------------------------------------------------------------------------
    print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
    //------------------------------------------------------------------------------
    result = 0
    
    start = DispatchTime.now().uptimeNanoseconds
    data.withUnsafeBufferPointer { buffer in
        for elem in buffer {
            result += elem
        }
    }
    end = DispatchTime.now().uptimeNanoseconds
    
    print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
    //------------------------------------------------------------------------------
    
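    // Shader source (Metal Shading Language). It has to be part of the default
    // library, because the program loads it with `makeDefaultLibrary()`.
    #include <metal_stdlib>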
    using namespace metal;
    
    typedef unsigned int uint;
    typedef int DataType;
    
    kernel void parsum(const device DataType* data [[ buffer(0) ]],
                       const device uint& dataLength [[ buffer(1) ]],
                       device DataType* sums [[ buffer(2) ]],
                       const device uint& elementsPerSum [[ buffer(3) ]],
    
                       const uint tgPos [[ threadgroup_position_in_grid ]],
                       const uint tPerTg [[ threads_per_threadgroup ]],
                       const uint tPos [[ thread_position_in_threadgroup ]]) {
    
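        // Global index of this thread, which is also the index of the partial sum it writes
        uint resultIndex = tgPos * tPerTg + tPos;
    
        uint dataIndex = resultIndex * elementsPerSum; // Where the summation should begin
        // The index where summation should end, clamped to the end of the data
        uint endIndex = dataIndex + elementsPerSum < dataLength ? dataIndex + elementsPerSum : dataLength;
    
        // Threads past the last chunk start at dataIndex >= endIndex, so they do nothing
        for (; dataIndex < endIndex; dataIndex++)
            sums[resultIndex] += data[dataIndex];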
    }
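
    To sanity-check the kernel's index arithmetic without a GPU, the same chunking can be mirrored on the CPU. The sketch below is a stand-in and not part of the original program: `partialSums` is a hypothetical helper, and `DispatchQueue.concurrentPerform` plays the role of the GPU threads, with each iteration computing one entry of the result the way one kernel thread would.

```swift
import Foundation

// Hypothetical CPU mirror of the `parsum` kernel: one partial sum per chunk.
func partialSums(_ data: [Int32], elementsPerSum: Int) -> [Int32] {
    precondition(elementsPerSum > 0, "elementsPerSum must be positive")
    // Number of chunks, rounded up (same formula as `resultsCount`).
    let resultsCount = (data.count + elementsPerSum - 1) / elementsPerSum
    var sums = [Int32](repeating: 0, count: resultsCount)
    data.withUnsafeBufferPointer { input in
        sums.withUnsafeMutableBufferPointer { output in
            // Each iteration writes only its own slot, so no locking is needed.
            DispatchQueue.concurrentPerform(iterations: resultsCount) { resultIndex in
                let begin = resultIndex * elementsPerSum
                let end = min(begin + elementsPerSum, input.count) // clamp the last chunk
                var sum: Int32 = 0
                for i in begin..<end { sum += input[i] }
                output[resultIndex] = sum
            }
        }
    }
    return sums
}

let chunkSums = partialSums([1, 2, 3, 4, 5, 6, 7], elementsPerSum: 3)
print(chunkSums)               // [6, 15, 7]
print(chunkSums.reduce(0, +))  // 28
```

    Comparing these partial sums with the contents of `resultsBuffer` for the same input is one way to confirm that the shader and the host code agree on the chunk boundaries.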
    
  • 2021-02-01 10:19

    I've been running the app on a GT 740 (384 cores) vs. an i7-4790 with a multithreaded vector sum implementation, and here are my figures:

    Metal lap time: 19.959092
    cpu MT lap time: 4.353881
    

    That's roughly a 5/1 ratio in favor of the CPU, so unless you have a powerful GPU, using shaders is not worth it.

    I've been testing the same code on an i7-3610QM with an Intel HD 4000 iGPU, and surprisingly the results are much better for Metal: 2/1.

    Edited: after tweaking the thread parameters I've finally improved GPU performance; it is now up to 16x the CPU.
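
    The answer does not say which parameter was tweaked. One candidate is `elementsPerSum` from the code in the other answers, because it decides how many GPU threads do any work at all: with 10,000,000 elements and 10,000 elements per sum there are only 1,000 partial sums, hence only 1,000 working threads. The sketch below just evaluates the dispatch arithmetic for a few values; `dispatchShape` is a hypothetical helper, and the execution width of 32 is an assumed value (the real one comes from `pipeline.threadExecutionWidth`).

```swift
// Hypothetical helper: reproduces the rounded-up divisions used for the dispatch.
func dispatchShape(count: Int, elementsPerSum: Int, executionWidth: Int)
    -> (workingThreads: Int, threadgroups: Int) {
    // One working thread per partial sum.
    let workingThreads = (count + elementsPerSum - 1) / elementsPerSum
    // Threadgroups needed when each group runs `executionWidth` threads.
    let threadgroups = (workingThreads + executionWidth - 1) / executionWidth
    return (workingThreads, threadgroups)
}

for elementsPerSum in [10_000, 1_000, 100] {
    let shape = dispatchShape(count: 10_000_000,
                              elementsPerSum: elementsPerSum,
                              executionWidth: 32) // assumed width
    print(elementsPerSum, shape.workingThreads, shape.threadgroups)
}
// prints:
// 10000 1000 32
// 1000 10000 313
// 100 100000 3125
```

    Fewer elements per thread means more threads and more parallelism, at the cost of a longer final summation on the CPU. Whether this is what produced the 16x figure is not stated.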
