I am trying to compute sum of large array in parallel with metal swift.
Is there a god way to do it?
My plane was that I divide my array to sub arrays, compute
I took the time to create a fully working example of this problem with Metal. The explanation is in the comments:
let count = 10_000_000
let elementsPerSum = 10_000
// Data type, has to be the same as in the shader
typealias DataType = CInt
let device = MTLCreateSystemDefaultDevice()!
let library = self.library(device: device)
let parsum = library.makeFunction(name: "parsum")!
let pipeline = try! device.makeComputePipelineState(function: parsum)
// Our data, randomly generated:
var data = (0...stride * count, options: [])!
// A buffer for individual results (zero initialized)
let resultsBuffer = device.makeBuffer(length: MemoryLayout.stride * resultsCount, options: [])!
// Our results in convenient form to compute the actual result later:
let pointer = resultsBuffer.contents().bindMemory(to: DataType.self, capacity: resultsCount)
let results = UnsafeBufferPointer(start: pointer, count: resultsCount)
let queue = device.makeCommandQueue()!
let cmds = queue.makeCommandBuffer()!
let encoder = cmds.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(dataBuffer, offset: 0, index: 0)
encoder.setBytes(&dataCount, length: MemoryLayout.size, index: 1)
encoder.setBuffer(resultsBuffer, offset: 0, index: 2)
encoder.setBytes(&elementsPerSumC, length: MemoryLayout.size, index: 3)
// We have to calculate the sum `resultCount` times => amount of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up) because each threadgroup will process `threadExecutionWidth` threads
let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)
// Here we set that each threadgroup should process `threadExecutionWidth` threads, the only important thing for performance is that this number is a multiple of `threadExecutionWidth` (here 1 times)
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)
encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
encoder.endEncoding()
var start, end : UInt64
var result : DataType = 0
start = mach_absolute_time()
cmds.commit()
cmds.waitUntilCompleted()
for elem in results {
result += elem
}
end = mach_absolute_time()
print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
result = 0
start = mach_absolute_time()
data.withUnsafeBufferPointer { buffer in
for elem in buffer {
result += elem
}
}
end = mach_absolute_time()
print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
I used my Mac to test it, but it should work just fine on iOS.
Output:
Metal result: 494936505, time: 0.024611456
CPU result: 494936505, time: 0.163341018
The Metal version is about 7 times faster. I'm sure you can get more speed if you implement something like divide-and-conquer with cutoff or whatever.