Compute sum of array values in parallel with Metal and Swift

Happy的楠姐 2021-02-01 09:49

I am trying to compute the sum of a large array in parallel with Metal and Swift.

Is there a good way to do it?

My plan was to divide my array into sub-arrays, compute

3 answers
  •  南笙 (original poster)
    2021-02-01 10:01

    I took the time to create a fully working example of this problem with Metal. The explanation is in the comments:

    let count = 10_000_000
    let elementsPerSum = 10_000
    
    // Data type, has to be the same as in the shader
    typealias DataType = CInt
    
    let device = MTLCreateSystemDefaultDevice()!
    let library = self.library(device: device)
    let parsum = library.makeFunction(name: "parsum")!
    let pipeline = try! device.makeComputePipelineState(function: parsum)
    
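    // Our data, randomly generated (the value range used here is an assumption):
    var data = (0..<count).map { _ in DataType.random(in: 0..<100) }
    // Sizes as passed to the shader (assumed to be `uint` parameters there):
    var dataCount = CUnsignedInt(count)
    var elementsPerSumC = CUnsignedInt(elementsPerSum)
    // Number of individual results = count / elementsPerSum (rounded up):
    let resultsCount = (count + elementsPerSum - 1) / elementsPerSum
    
    // Our data in a buffer (copied):
    let dataBuffer = device.makeBuffer(bytes: data, length: MemoryLayout<DataType>.stride * count, options: [])!
    // A buffer for individual results (zero initialized)
    let resultsBuffer = device.makeBuffer(length: MemoryLayout<DataType>.stride * resultsCount, options: [])!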
    // Our results in convenient form to compute the actual result later:
    let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultsCount)
    let results = UnsafeBufferPointer(start: pointer, count: resultsCount)
    
    let queue = device.makeCommandQueue()!
    let cmds = queue.makeCommandBuffer()!
    let encoder = cmds.makeComputeCommandEncoder()!
    
    encoder.setComputePipelineState(pipeline)
    
    encoder.setBuffer(dataBuffer, offset: 0, index: 0)
    
    encoder.setBytes(&dataCount, length: MemoryLayout<CUnsignedInt>.size, index: 1)
    encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
    encoder.setBytes(&elementsPerSumC, length: MemoryLayout<CUnsignedInt>.size, index: 3)
    
    // We have to calculate the sum `resultsCount` times => the number of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up), because each threadgroup will process `threadExecutionWidth` threads
    let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)
    
    // Each threadgroup processes `threadExecutionWidth` threads. The only important thing for performance is that this number is a multiple of `threadExecutionWidth` (here exactly 1 times)
    let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)
    
    encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
    encoder.endEncoding()
    
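    var start, end : UInt64
    var result : DataType = 0
    
    // DispatchTime reports nanoseconds directly; raw mach_absolute_time() ticks
    // are only nanoseconds when the Mach timebase happens to be 1:1.
    start = DispatchTime.now().uptimeNanoseconds
    cmds.commit()
    cmds.waitUntilCompleted()
    // Add up the partial sums written by the GPU
    for elem in results {
        result += elem
    }
    
    end = DispatchTime.now().uptimeNanoseconds
    
    print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
    result = 0
    
    start = DispatchTime.now().uptimeNanoseconds
    data.withUnsafeBufferPointer { buffer in
        for elem in buffer {
            result += elem
        }
    }
    end = DispatchTime.now().uptimeNanoseconds
    
    print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")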
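
    The `parsum` kernel itself is not shown in the listing. As a rough CPU-side model of what the dispatch implies, assuming each GPU thread sums one contiguous chunk of `elementsPerSum` values and writes the result to its own slot in the results buffer, the scheme could be sketched in plain Swift. `partialSums` and its parameter names are hypothetical and not part of the answer's code:

    ```swift
    // Hypothetical CPU model of the per-thread work; the real kernel is not shown.
    typealias DataType = CInt

    /// Returns one partial sum per chunk of `elementsPerSum` values.
    /// Chunk i covers indices i * elementsPerSum up to (but not including)
    /// min((i + 1) * elementsPerSum, data.count).
    func partialSums(of data: [DataType], elementsPerSum: Int) -> [DataType] {
        precondition(elementsPerSum > 0, "chunk size must be positive")
        // Same rounding-up division the listing uses for `resultsCount`.
        let resultsCount = (data.count + elementsPerSum - 1) / elementsPerSum
        var sums = [DataType](repeating: 0, count: resultsCount)
        for resultIndex in 0..<resultsCount {   // one iteration stands in for one GPU thread
            let begin = resultIndex * elementsPerSum
            let end = min(begin + elementsPerSum, data.count)   // the last chunk may be short
            for i in begin..<end {
                sums[resultIndex] += data[i]
            }
        }
        return sums
    }

    let sample: [DataType] = [1, 2, 3, 4, 5, 6, 7]
    let partial = partialSums(of: sample, elementsPerSum: 3)   // chunks: [1, 2, 3], [4, 5, 6], [7]
    let total = partial.reduce(0, +)   // final pass over the partial sums, as the listing does on the CPU
    print(partial, total)              // prints "[6, 15, 7] 28"
    ```

    With the listing's values (`count = 10_000_000`, `elementsPerSum = 10_000`) this gives 1,000 partial sums. If `threadExecutionWidth` were 32 (an assumed value, it depends on the device), the dispatch would round up to 32 threadgroups, i.e. 1,024 threads, so a kernel following this model would have to leave the last 24 threads without work.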
    

    I used my Mac to test it, but it should work just fine on iOS.

    Output:

    Metal result: 494936505, time: 0.024611456
    CPU result: 494936505, time: 0.163341018
    

    The Metal version is about 7 times faster. I'm sure you can get more speed if you implement something like divide-and-conquer with cutoff or whatever.
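
    One possible reading of "divide-and-conquer with cutoff", sketched on the CPU rather than in Metal: split the range in half recursively and fall back to a plain loop once a range holds at most `cutoff` elements. `sumDivideAndConquer` and `cutoff` are hypothetical names, and this version runs sequentially; it only illustrates the shape of the recursion, not a tuned GPU reduction.

    ```swift
    // Hypothetical sketch: recursive halving with a sequential cutoff (CPU only).
    typealias DataType = CInt

    func sumDivideAndConquer(_ data: ArraySlice<DataType>, cutoff: Int) -> DataType {
        precondition(cutoff > 0, "cutoff must be positive")
        // Small ranges: a plain loop is cheaper than splitting further.
        if data.count <= cutoff {
            return data.reduce(0, +)
        }
        // ArraySlice keeps the parent's indices, so split relative to startIndex.
        let mid = data.startIndex + data.count / 2
        // The two halves are independent of each other, which is what would
        // allow them to be handed to separate threads or GPU passes.
        let left = sumDivideAndConquer(data[data.startIndex..<mid], cutoff: cutoff)
        let right = sumDivideAndConquer(data[mid..<data.endIndex], cutoff: cutoff)
        return left + right
    }

    let values = (1...1_000).map { DataType($0) }
    let sum = sumDivideAndConquer(values[...], cutoff: 64)
    print(sum)   // prints "500500"
    ```

    The cutoff plays the same role as `elementsPerSum` in the listing: below that size, the overhead of creating more parallel work outweighs the gain.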
