I'm new to Grand Central Dispatch and have been running some tests with it, doing some processing on an image. Basically I'm running a grayscale algorithm both sequentially and concurrently with GCD, and comparing the two.
In my tests, I found that if I just focused on the concurrent B&W conversion, I achieved something close to the "twice the speed" you were expecting (the parallel rendition took 53% as long as the serial rendition). When I also included the ancillary portions of the process (not only the conversion, but also the retrieval of the image, the preparation of the output pixel buffer, the creation of the new image, etc.), the resulting improvement was less spectacular, with an elapsed time 79% as long as the serial rendition's.
As for why you might not achieve an absolute doubling of performance, even if you focus only on the portion that can enjoy concurrency, Apple attributes this to the overhead of scheduling code for execution. In the discussion of dispatch_apply in the Performing Loop Iterations Concurrently section of the Concurrency Programming Guide, they weigh the performance gain of concurrent tasks against the overhead that each dispatched block entails:
You should make sure that your task code does a reasonable amount of work through each iteration. As with any block or function you dispatch to a queue, there is overhead to scheduling that code for execution. If each iteration of your loop performs only a small amount of work, the overhead of scheduling the code may outweigh the performance benefits you might achieve from dispatching it to a queue. If you find this is true during your testing, you can use striding to increase the amount of work performed during each loop iteration. With striding, you group together multiple iterations of your original loop into a single block and reduce the iteration count proportionately. For example, if you perform 100 iterations initially but decide to use a stride of 4, you now perform 4 loop iterations from each block and your iteration count is 25. For an example of how to implement striding, see “Improving on Loop Code.”
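To make the striding idea concrete, here is a minimal sketch, assuming a hypothetical processRow block that stands in for whatever per-iteration work you actually need to do:

// Minimal striding sketch: each dispatched block handles `stride` iterations of the original loop.
// The processRow block is a hypothetical stand-in for the real per-row work.
void processRowsStrided(size_t height, size_t stride, void (^processRow)(size_t row))
{
    size_t iterations = (height + stride - 1) / stride;   // round up so no rows are skipped

    dispatch_apply(iterations, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^(size_t idx) {
        size_t start = idx * stride;
        size_t stop  = MIN(start + stride, height);
        for (size_t row = start; row < stop; row++) {
            processRow(row);    // one dispatched block performs `stride` rows of work
        }
    });
}

The point is simply that each dispatched block amortizes the scheduling overhead over several iterations of the original loop.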
As an aside, I think it might be worth considering creating your own concurrent queue and using dispatch_apply. It is designed for precisely this purpose, optimizing for loops that can enjoy concurrency.
Here is the code I used for my benchmarking:
- (UIImage *)convertImage:(UIImage *)image algorithm:(NSString *)algorithm
{
    CGImageRef imageRef = image.CGImage;
    NSAssert(imageRef, @"Unable to get CGImageRef");

    CGDataProviderRef provider = CGImageGetDataProvider(imageRef);
    NSAssert(provider, @"Unable to get provider");

    NSData *data = CFBridgingRelease(CGDataProviderCopyData(provider));
    NSAssert(data, @"Unable to copy image data");

    NSInteger bitsPerComponent = CGImageGetBitsPerComponent(imageRef);
    NSInteger bitsPerPixel    = CGImageGetBitsPerPixel(imageRef);
    CGBitmapInfo bitmapInfo   = CGImageGetBitmapInfo(imageRef);
    NSInteger bytesPerRow     = CGImageGetBytesPerRow(imageRef);
    NSInteger width           = CGImageGetWidth(imageRef);
    NSInteger height          = CGImageGetHeight(imageRef);
    CGColorSpaceRef colorspace = CGImageGetColorSpace(imageRef);

    size_t outputBufferSize = width * height * bitsPerPixel / 8;
    void *outputBuffer = malloc(outputBufferSize);
    NSAssert(outputBuffer, @"Unable to allocate buffer");

    uint8_t *buffer = (uint8_t *)[data bytes];

    CFAbsoluteTime start = CFAbsoluteTimeGetCurrent();

    if ([algorithm isEqualToString:kImageAlgorithmSimple]) {
        [self convertToBWSimpleFromBuffer:buffer toBuffer:outputBuffer width:width height:height];
    } else if ([algorithm isEqualToString:kImageAlgorithmDispatchApply]) {
        [self convertToBWConcurrentFromBuffer:buffer toBuffer:outputBuffer width:width height:height count:2];
    } else if ([algorithm isEqualToString:kImageAlgorithmDispatchApply4]) {
        [self convertToBWConcurrentFromBuffer:buffer toBuffer:outputBuffer width:width height:height count:4];
    } else if ([algorithm isEqualToString:kImageAlgorithmDispatchApply8]) {
        [self convertToBWConcurrentFromBuffer:buffer toBuffer:outputBuffer width:width height:height count:8];
    }

    NSLog(@"%@: %.2f", algorithm, CFAbsoluteTimeGetCurrent() - start);

    // The data provider takes ownership of the full output buffer and frees it via releaseData.
    CGDataProviderRef outputProvider = CGDataProviderCreateWithData(NULL, outputBuffer, outputBufferSize, releaseData);

    CGImageRef outputImageRef = CGImageCreate(width,
                                              height,
                                              bitsPerComponent,
                                              bitsPerPixel,
                                              bytesPerRow,
                                              colorspace,
                                              bitmapInfo,
                                              outputProvider,
                                              NULL,
                                              NO,
                                              kCGRenderingIntentDefault);

    UIImage *outputImage = [UIImage imageWithCGImage:outputImageRef];

    CGImageRelease(outputImageRef);
    CGDataProviderRelease(outputProvider);

    return outputImage;
}
/** Convert the image to B&W as a single (non-parallel) task.
 *
 * This assumes the pixel buffer is in RGBA format, 8 bits per component (32 bits per pixel).
 *
 * @param inputBuffer  The input pixel buffer.
 * @param outputBuffer The output pixel buffer.
 * @param width        The image width in pixels.
 * @param height       The image height in pixels.
 */
- (void)convertToBWSimpleFromBuffer:(uint8_t *)inputBuffer toBuffer:(uint8_t *)outputBuffer width:(NSInteger)width height:(NSInteger)height
{
    for (NSInteger row = 0; row < height; row++) {
        for (NSInteger col = 0; col < width; col++) {
            NSUInteger offset = (col + row * width) * 4;
            uint8_t *rgba = inputBuffer + offset;
            uint8_t red   = rgba[0];
            uint8_t green = rgba[1];
            uint8_t blue  = rgba[2];
            uint8_t alpha = rgba[3];
            uint8_t gray  = 0.2126 * red + 0.7152 * green + 0.0722 * blue;
            outputBuffer[offset]     = gray;
            outputBuffer[offset + 1] = gray;
            outputBuffer[offset + 2] = gray;
            outputBuffer[offset + 3] = alpha;
        }
    }
}
/** Convert the image to B&W, using GCD to split the conversion into several concurrent tasks.
 *
 * This assumes the pixel buffer is in RGBA format, 8 bits per component (32 bits per pixel).
 *
 * @param inputBuffer  The input pixel buffer.
 * @param outputBuffer The output pixel buffer.
 * @param width        The image width in pixels.
 * @param height       The image height in pixels.
 * @param count        The number of GCD tasks to split the conversion into.
 */
- (void)convertToBWConcurrentFromBuffer:(uint8_t *)inputBuffer toBuffer:(uint8_t *)outputBuffer width:(NSInteger)width height:(NSInteger)height count:(NSInteger)count
{
    dispatch_queue_t queue = dispatch_queue_create("com.domain.app", DISPATCH_QUEUE_CONCURRENT);

    NSInteger stride = height / count;

    // Round the iteration count up so a trailing partial chunk of rows is not dropped.
    dispatch_apply((height + stride - 1) / stride, queue, ^(size_t idx) {
        size_t j = idx * stride;
        size_t j_stop = MIN(j + stride, height);
        for (NSInteger row = j; row < j_stop; row++) {
            for (NSInteger col = 0; col < width; col++) {
                NSUInteger offset = (col + row * width) * 4;
                uint8_t *rgba = inputBuffer + offset;
                uint8_t red   = rgba[0];
                uint8_t green = rgba[1];
                uint8_t blue  = rgba[2];
                uint8_t alpha = rgba[3];
                uint8_t gray  = 0.2126 * red + 0.7152 * green + 0.0722 * blue;
                outputBuffer[offset]     = gray;
                outputBuffer[offset + 1] = gray;
                outputBuffer[offset + 2] = gray;
                outputBuffer[offset + 3] = alpha;
            }
        }
    });
}
// Data provider release callback: frees the malloc'ed output buffer once Core Graphics is done with it.
void releaseData(void *info, const void *data, size_t size)
{
    free((void *)data);
}
On an iPhone 5, this took 2.24 seconds to convert a 7360 × 4912 image with the simple, serial method, and took 1.18 seconds when I used dispatch_apply with two loops. When I tried 4 or 8 dispatch_apply loops, I saw no further performance gain.
My most likely guess: in the single-threaded case, you are CPU bound; in the multi-threaded case, you are memory bound. In other words, the two cores are reading the data from DRAM at the maximum bus bandwidth, and as a result they end up idling while waiting for more data to process.
You can test my theory by doing a true luminance calculation:
int value = floor( 0.299 * red + 0.587 * green + 0.114 * blue );
That calculation will yield grayscale values in the range from 0 to 255, given 8-bit RGB values. It also gives the processors more work to do per pixel. If you change that line of code, the time for the single-threaded case should increase somewhat. And, if I'm correct, the multi-threaded case should show a better performance improvement, as a percentage of the single-threaded time.
I decided to run some benchmarks of my own, both on the simulator and on an iPad2. The structure of my code was as follows.
Single Threaded
start = TimeStamp();

for ( y = 0; y < 2048; y++ )
    for ( x = 0; x < 1536; x++ )
        computePixel();

end = TimeStamp();

NSLog( @"single = %8.3lf msec", (end - start) * 1e3 );
Two Threads using GCD
dispatch_group_t tasks = dispatch_group_create();
dispatch_queue_t queue = dispatch_get_global_queue( DISPATCH_QUEUE_PRIORITY_HIGH, 0 );

start = TimeStamp();

dispatch_group_async( tasks, queue,
^{
    topStart = TimeStamp();
    for ( y = 0; y < 1024; y++ )
        for ( x = 0; x < 1536; x++ )
            computePixel();
    topEnd = TimeStamp();
});

dispatch_group_async( tasks, queue,
^{
    bottomStart = TimeStamp();
    for ( y = 1024; y < 2048; y++ )
        for ( x = 0; x < 1536; x++ )
            computePixel();
    bottomEnd = TimeStamp();
});

wait = TimeStamp();
dispatch_group_wait( tasks, DISPATCH_TIME_FOREVER );
end = TimeStamp();

NSLog( @"wait = %8.3lf msec", (wait - start) * 1e3 );
NSLog( @"topStart = %8.3lf msec", (topStart - start) * 1e3 );
NSLog( @"bottomStart = %8.3lf msec", (bottomStart - start) * 1e3 );
NSLog( @" " );
NSLog( @"topTime = %8.3lf msec", (topEnd - topStart) * 1e3 );
NSLog( @"bottomTime = %8.3lf msec", (bottomEnd - bottomStart) * 1e3 );
NSLog( @"overallTime = %8.3lf msec", (end - start) * 1e3 );
Here are my results.
Running (r+g+b)/3 on the simulator
2014-04-03 23:16:22.239 GcdTest[1406:c07] single = 21.546 msec
2014-04-03 23:16:22.239 GcdTest[1406:c07]
2014-04-03 23:16:25.388 GcdTest[1406:c07] wait = 0.009 msec
2014-04-03 23:16:25.388 GcdTest[1406:c07] topStart = 0.031 msec
2014-04-03 23:16:25.388 GcdTest[1406:c07] bottomStart = 0.057 msec
2014-04-03 23:16:25.389 GcdTest[1406:c07]
2014-04-03 23:16:25.389 GcdTest[1406:c07] topTime = 10.865 msec
2014-04-03 23:16:25.389 GcdTest[1406:c07] bottomTime = 10.879 msec
2014-04-03 23:16:25.390 GcdTest[1406:c07] overallTime = 10.961 msec
Running (.299r + .587g + .114b) on the simulator
2014-04-03 23:17:27.984 GcdTest[1422:c07] single = 55.738 msec
2014-04-03 23:17:27.985 GcdTest[1422:c07]
2014-04-03 23:17:29.306 GcdTest[1422:c07] wait = 0.008 msec
2014-04-03 23:17:29.307 GcdTest[1422:c07] topStart = 0.054 msec
2014-04-03 23:17:29.307 GcdTest[1422:c07] bottomStart = 0.060 msec
2014-04-03 23:17:29.307 GcdTest[1422:c07]
2014-04-03 23:17:29.308 GcdTest[1422:c07] topTime = 28.881 msec
2014-04-03 23:17:29.308 GcdTest[1422:c07] bottomTime = 29.330 msec
2014-04-03 23:17:29.308 GcdTest[1422:c07] overallTime = 29.446 msec
Running (r+g+b)/3 on the iPad2
2014-04-03 23:27:19.601 GcdTest[13032:907] single = 298.799 msec
2014-04-03 23:27:19.602 GcdTest[13032:907]
2014-04-03 23:27:20.536 GcdTest[13032:907] wait = 0.060 msec
2014-04-03 23:27:20.537 GcdTest[13032:907] topStart = 0.246 msec
2014-04-03 23:27:20.539 GcdTest[13032:907] bottomStart = 2.906 msec
2014-04-03 23:27:20.541 GcdTest[13032:907]
2014-04-03 23:27:20.542 GcdTest[13032:907] topTime = 149.596 msec
2014-04-03 23:27:20.544 GcdTest[13032:907] bottomTime = 149.209 msec
2014-04-03 23:27:20.545 GcdTest[13032:907] overallTime = 152.164 msec
Running (.299r + .587g + .114b) on the iPad2
2014-04-03 23:30:29.618 GcdTest[13045:907] single = 282.767 msec
2014-04-03 23:30:29.620 GcdTest[13045:907]
2014-04-03 23:30:34.008 GcdTest[13045:907] wait = 0.046 msec
2014-04-03 23:30:34.010 GcdTest[13045:907] topStart = 0.270 msec
2014-04-03 23:30:34.011 GcdTest[13045:907] bottomStart = 3.043 msec
2014-04-03 23:30:34.013 GcdTest[13045:907]
2014-04-03 23:30:34.014 GcdTest[13045:907] topTime = 143.078 msec
2014-04-03 23:30:34.015 GcdTest[13045:907] bottomTime = 143.249 msec
2014-04-03 23:30:34.017 GcdTest[13045:907] overallTime = 146.350 msec
Running ((.299r + .587g + .114b) ^ 2.2) on the iPad2
2014-04-03 23:41:28.959 GcdTest[13078:907] single = 1258.818 msec
2014-04-03 23:41:28.961 GcdTest[13078:907]
2014-04-03 23:41:30.768 GcdTest[13078:907] wait = 0.048 msec
2014-04-03 23:41:30.769 GcdTest[13078:907] topStart = 0.264 msec
2014-04-03 23:41:30.771 GcdTest[13078:907] bottomStart = 3.037 msec
2014-04-03 23:41:30.772 GcdTest[13078:907]
2014-04-03 23:41:30.773 GcdTest[13078:907] topTime = 635.952 msec
2014-04-03 23:41:30.775 GcdTest[13078:907] bottomTime = 634.749 msec
2014-04-03 23:41:30.776 GcdTest[13078:907] overallTime = 637.829 msec