Question
I'm processing huge TIFF images (grayscale, 8 or 16 bit, up to 4 GB) to be used as high-resolution input data for a machine. Each image needs to be rotated by 90 degrees (clockwise). The input TIFF can be LZW-compressed or uncompressed; the output may be uncompressed.
So far I have implemented my own TIFF reader class in Objective-C (including LZW decompression) which is able to handle huge files and does some caching in memory as well. At the moment the TIFF reader class is used for visualization and measurement inside the image, and it performs quite well.
For my latest challenge, rotating a TIFF, I need a new approach, because the current implementation is VERY slow. Even for a "medium" sized TIFF (30,000 x 4,000) it takes approx. 30 minutes to rotate the image. At the moment I loop through all pixels and pick the one with reversed x and y coordinates, put all of them into a buffer, and write the buffer to disk as soon as one line is complete. The main problem is reading from the TIFF, since data is organized in strips and not guaranteed to be distributed linearly inside the file (and in the case of LZW-compressed strips, nothing is linear either).
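For reference, the current per-pixel approach boils down to something like the sketch below. It is simplified and hypothetical: it assumes an 8-bit grayscale image sitting in one memory buffer, while my real code fetches each source pixel through the strip cache instead.

    // Simplified sketch of the current approach (hypothetical buffer version).
    // src is width x height, dst is height x width (90 degrees clockwise).
    void RotateNaive(const unsigned char *src, unsigned char *dst,
                     int width, int height)
    {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                dst[(size_t)x * height + (height - 1 - y)] =
                    src[(size_t)y * width + x];
    }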
I profiled my software and found that most of the time is spent copying memory blocks (memmove), so I decided to bypass the caching inside my reader class for the rotation. Now the whole process is about 5% faster, which isn't much, and all of the time is spent inside fread(). I assume that my cache performs at least almost as well as the system's fread() cache.
Another test using ImageMagick with the same 30,000 x 4,000 file took only around 10 seconds to complete. AFAIK ImageMagick reads the whole file into memory, processes it in memory, and then writes back to disk. This works well up to a few hundred megabytes of image data.
What I'm looking for is some kind of "meta optimization", like another approach for handling the pixel data. Is there another strategy than swapping pixels one by one (and needing to read from file locations far away from each other)? Should I create some intermediate file to speed up the process? Any suggestions welcome.
Answer 1:
OK, given that you have to do pixel munging, let's look at your overall problem. A medium image of 30,000 x 4,000 pixels is 120 MB of image data for 8-bit gray and 240 MB for 16-bit. So if you're looking at the data this way, you need to ask "is 30 minutes reasonable?" In order to do a 90 degree rotate, you are inducing a worst-case problem, memory-wise: you are touching every pixel in a single column in order to fill one row. If you work row-wise, at least you're not going to double the memory footprint.
So - 120M of pixels means that you're doing 120M reads and 120M writes, or 240M data accesses. This means that you are processing roughly 66,667 pixels per second, which I think is too slow. I think you should be processing at least half a million pixels per second, probably way more.
If this were me, I'd run my profiling tools and see where the bottlenecks are and cut them out.
Without knowing your exact structure and having to guess, I would do the following:
Attempt to use one contiguous block of memory for the source image
I'd prefer to see a rotate function like this:
void RotateColumn(int column, char *sourceImage, int bytesPerRow,
                  int bytesPerPixel, int height, char *destRow)
{
    // For a clockwise rotation, read source column `column` from the
    // bottom row up, writing the destination row left to right.
    // (The size_t casts avoid int overflow on multi-GB images.)
    char *src = sourceImage + (size_t)bytesPerRow * (height - 1)
                            + (size_t)bytesPerPixel * column;
    if (bytesPerPixel == 1) {
        for (int y = 0; y < height; y++) {
            *destRow++ = *src;
            src -= bytesPerRow;
        }
    }
    else if (bytesPerPixel == 2) {
        for (int y = 0; y < height; y++) {
            *destRow++ = *src;
            *destRow++ = *(src + 1);
            src -= bytesPerRow;
            // although I doubt it would be faster, you could try this:
            // *destRow++ = *src++;
            // *destRow++ = *src;
            // src -= bytesPerRow + 1;
        }
    }
    else { /* error out */ }
}
I'm guessing that the inside of the loop will turn into maybe 8 instructions. On a 2GHz processor (let's say nominally 4 cycles per instruction, which is just a guess), that's 32 cycles per pixel, so you should be able to rotate roughly 62.5 million pixels per second.
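Driving that function would look something like the sketch below. Rotate90CW and the contiguous-buffer layout are my assumptions for illustration, not anything from your code:

    // Hypothetical driver: with the bottom-up RotateColumn above, source
    // column x becomes destination row x of the clockwise-rotated image.
    // Assumes the decoded source sits in one contiguous buffer.
    void Rotate90CW(char *srcImage, char *dstImage,
                    int width, int height, int bytesPerPixel)
    {
        int srcBytesPerRow = width * bytesPerPixel;
        int dstBytesPerRow = height * bytesPerPixel;
        for (int x = 0; x < width; x++)
            RotateColumn(x, srcImage, srcBytesPerRow, bytesPerPixel,
                         height, dstImage + (size_t)x * dstBytesPerRow);
    }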
If you can't do contiguous, work on multiple dest scanlines at once.
If the source image is broken into blocks or you have a scanline abstraction of memory, what you do is get a scanline from the source image and rotate, say, a few dozen columns at once into a buffer of dest scanlines.
Let's assume that you have a mechanism for accessing scanlines abstractly, wherein you can acquire and release and write to scanlines.
Then what you're going to do is figure out how many source columns you're willing to process at once, because your code will look something like this:
void RotateNColumns(Pixels &source, Pixels &dest, int startColumn, int nCols)
{
    // Acquire the nCols destination rows that will receive the rotated columns.
    std::vector<PixelRow *> rows(nCols);
    for (int i = 0; i < nCols; i++)
        rows[i] = &dest.AcquireRow(i + startColumn);

    // One pass over the source: each source row contributes one pixel
    // to every destination row being built.
    for (int y = 0; y < source.Height(); y++) {
        PixelRow &srcRow = source.AcquireRow(y);
        for (int i = 0; i < nCols; i++) {
            // CopyPixel(int srcX, PixelRow &destRow, int dstX, int nPixels);
            srcRow.CopyPixel(startColumn + i, *rows[i], y, 1);
        }
        source.ReleaseRow(srcRow);
    }

    for (int i = 0; i < nCols; i++)
        dest.ReleaseAndWrite(*rows[i]);
}
In this case, if you buffer up your source pixels in large-ish blocks of scanlines, you're not necessarily fragmenting your heap and you have the choice of possibly flushing decoded rows out to disk. You process n columns at a time and your memory locality should improve by a factor of n. Then it becomes a question of how expensive your caching is.
Can the problem be solved with parallel processing?
Honestly, I think your problem should be I/O bound, not CPU bound. I'd think that your decoding time will dominate, but let's pretend it isn't, for grins.
Think about it this way - if you read the source image a whole row at a time, you could toss that decoded row to a thread that will write it into the appropriate column of the destination image. So write your decoder so that it has a method like OnRowDecoded(byte *row, int y, int width, int bytesPerPixel); then you're rotating while you're decoding. OnRowDecoded() packs up the information and hands it to a thread that owns the dest image and writes the entire decoded row into the correct dest column. That thread does all the writing to the dest while the main thread is busy decoding the next row. Likely the worker thread will finish first, but maybe not.
You will need to make SetPixel() on the dest thread-safe, but other than that, there's no reason this should be a serial task. In fact, if your source images use the TIFF feature of being divided up into strips or tiles, you can and should be decoding them in parallel.
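Here's a minimal sketch of that hand-off using standard C++ threads. Everything in it (RowQueue, DecodedRow, DestImage) is a placeholder I'm inventing to show the shape, not anything from your codebase, and it handles 8-bit gray only:

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // One decoded source row and its y coordinate.
    struct DecodedRow {
        int y;
        std::vector<unsigned char> pixels;
    };

    // Single-producer/single-consumer hand-off between decoder and writer.
    class RowQueue {
    public:
        void Push(DecodedRow row) {
            { std::lock_guard<std::mutex> lk(m); q.push(std::move(row)); }
            cv.notify_one();
        }
        bool Pop(DecodedRow &row) {   // returns false once closed and drained
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !q.empty() || done; });
            if (q.empty()) return false;
            row = std::move(q.front());
            q.pop();
            return true;
        }
        void Close() {
            { std::lock_guard<std::mutex> lk(m); done = true; }
            cv.notify_all();
        }
    private:
        std::mutex m;
        std::condition_variable cv;
        std::queue<DecodedRow> q;
        bool done = false;
    };

    // Stand-in for whatever owns the rotated output image.
    struct DestImage {
        int width = 0, height = 0;          // for a 90° rotate: width = srcHeight
        std::vector<unsigned char> pixels;  // 8-bit gray, row-major
        void SetPixel(int x, int y, unsigned char v) {
            pixels[(size_t)y * width + x] = v;
        }
    };

    // Writer thread: only this thread touches dest, so SetPixel needs no lock.
    void WriterLoop(RowQueue &queue, DestImage &dest, int srcHeight) {
        DecodedRow row;
        while (queue.Pop(row)) {
            // Source row y becomes dest column (srcHeight - 1 - y) for 90° CW.
            for (int x = 0; x < (int)row.pixels.size(); x++)
                dest.SetPixel(srcHeight - 1 - row.y, x, row.pixels[x]);
        }
    }

    // Spawn with:
    //   std::thread writer(WriterLoop, std::ref(queue), std::ref(dest), srcHeight);

Your OnRowDecoded() would then just wrap the row bytes in a DecodedRow and Push() it; call Close() after the last strip is decoded so the writer drains the queue and exits.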
Answer 2:
If you look at the TIFF spec, there is a tag (Orientation, tag 274) that can be set in an image's IFD to declare the image orientation. If you set this tag appropriately, you can change the image rotation without having to decode and re-encode the image.
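With libtiff, rewriting the tag in place looks roughly like the sketch below (a sketch under the assumption that your files are well-formed; ORIENTATION_RIGHTTOP, value 6, is the 90-degrees-clockwise orientation):

    #include <tiffio.h>

    // Sketch: mark an existing TIFF as rotated 90 degrees clockwise by
    // rewriting its Orientation tag in place; no pixel data is touched.
    int MarkRotated90CW(const char *path)
    {
        TIFF *tif = TIFFOpen(path, "r+");   // open existing file read/write
        if (!tif)
            return -1;
        TIFFSetField(tif, TIFFTAG_ORIENTATION, ORIENTATION_RIGHTTOP);
        int ok = TIFFRewriteDirectory(tif); // write the updated IFD back
        TIFFClose(tif);
        return ok ? 0 : -1;
    }

libtiff's tiffset tool does the same from the shell (tiffset -s 274 6 image.tif). Be aware, though, that plenty of TIFF consumers simply ignore Orientation, so check that whatever is downstream actually honors it.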
However - and this is a big however - you should be aware that while rewriting IFDs in a TIFF appears straightforward, if not trivial, handling all the aberrant TIFFs in the ecosystem is decidedly non-trivial, so be careful how you go about it.
Source: https://stackoverflow.com/questions/13358919/how-can-i-speed-up-rotating-a-huge-tiff-by-90-degrees