Video Compression: What is discrete cosine transform?

问题

I've implemented an image/video transformation technique called discrete cosine transform. This technique is used in MPEG video encoding. I based my algorithm on the ideas presented at the following URL:

http://vsr.informatik.tu-chemnitz.de/~jan/MPEG/HTML/mpeg_tech.html

Now I can transform an 8x8 section of a black and white image, such as:

0140  0124  0124  0132  0130  0139  0102  0088  
0140  0123  0126  0132  0134  0134  0088  0117  
0143  0126  0126  0133  0134  0138  0081  0082  
0148  0126  0128  0136  0137  0134  0079  0130  
0147  0128  0126  0137  0138  0145  0132  0144  
0147  0131  0123  0138  0137  0140  0145  0137  
0142  0135  0122  0137  0140  0138  0143  0112  
0140  0138  0125  0137  0140  0140  0148  0143

Into this an image with all the important information at the top right. The transformed block looks like this:

1041  0039  -023  0044  0027  0000  0021  -019  
-050  0044  -029  0000  0009  -014  0032  -010  
0000  0000  0000  0000  -018  0010  -017  0000  
0014  -019  0010  0000  0000  0016  -012  0000  
0010  -010  0000  0000  0000  0000  0000  0000  
-016  0021  -014  0010  0000  0000  0000  0000  
0000  0000  0000  0000  0000  0000  0000  0000  
0000  0000  -010  0013  -014  0010  0000  0000

Now, I need to know how can I take advantage of this transformation? I'd like to detect other 8x8 blocks in the same image ( or another image ) that represent a good match.

Also, What does this transformation give me? Why is the information stored in the top right of the converted image important?

回答1:

The result of a DCT is a transformation of the original source into the frequency domain. The top left entry stores the "amplitude" the "base" frequency and frequency increases both along the horizontal and vertical axes. The outcome of the DCT is usually a collection of amplitudes at the more usual lower frequencies (the top left quadrant) and less entries at the higher frequencies. As lassevk mentioned, it is usual to just zero out these higher frequencies as they typically constitute very minor parts of the source. However, this does result in loss of information. To complete the compression it is usual to use a lossless compression over the DCT'd source. This is where the compression comes in as all those runs of zeros get packed down to almost nothing.

One possible advantage of using the DCT to find similar regions is that you can do a first pass match on low frequency values (top-left corner). This reduces the number of values you need to match against. If you find matches of low frequency values, you can increase into comparing the higher frequencies.

Hope this helps

回答2:

I learned everything I know about the DCT from The Data Compression Book. In addition to being a great introduction to the field of data compression, it has a chapter near the end on lossy image compression which introduces JPEG and the DCT.

回答3:

The concepts underlying these kinds of transformations are more easily seen by first looking at a one dimensional case. The image here shows a square wave along with several of the first terms of an infinite series. Looking at it, note that if the functions for the terms are added together, they begin to approximate the shape of the square wave. The more terms you add up, the better the approximation. But, to get from an approximation to the exact signal, you have to sum an infinite number of terms. The reason for this is that the square wave is discontinuous. If you think of a square wave as a function of time, it goes from -1 to 1 in zero time. To represent such a thing requires an infinite series. Take another look at the plot of the series terms. The first is red, the second yellow. Successive terms have more "up and down" transitions. These are from the increasing frequency of each term. Sticking with the square wave as a function of time, and each series term a function of frequency there are two equivalent representations: a function of time and a function of frequency (1/time).

In the real world, there are no square waves. Nothing happens in zero time. Audio signals, for example occupy the range 20Hz to 20KHz, where Hz is 1/time. Such things can be represented with finite series'.

For images, the mathematics are the same, but two things are different. First, it's two dimensional. Second the notion of time makes no sense. In the 1D sense, the square wave is merely a function that gives some numerical value for for an argument that we said was time. An (static) image is a function that gives a numerical value for for every pair of row, column indeces. In other words, the image is a function of a 2D space, that being a rectangular region. A function like that can be represented in terms of its spatial frequency. To understand what spatial frequency is, consider an 8 bit grey level image and a pair of adjacent pixels. The most abrupt transistion that can occur in the image is going from 0 (say black) to 255 (say white) over the distance of 1 pixel. This corresponds directly with the highest frequency (last) term of a series representation.

A two dimensional Fourier (or Cosine) transformation of the image results in an array of values the same size as the image, representing the same information not as a function of space, but a function of 1/space. The information is ordered from lowest to highest frequency along the diagonal from the origin highest row and column indeces. An example is here.

For image compression, you can transform an image, discard some number of higher frequency terms and inverse transform the remaining ones back to an image, which has less detail than the original. Although it transforms back to an image of the same size (with the removed terms replaced by zero), in the frequency domain, it occupies less space.

Another way to look at it is reducing an image to a smaller size. If, for example you try to reduce the size of an image by throwing away three of every four pixels in a row, and three of every four rows, you'll have an array 1/4 the size but the image will look terrible. In most cases, this is accomplished with a 2D interpolator, which produces new pixels by averaging rectangular groups of the larger image's pixels. In so doing, the interpolation has an effect similar throwing away series terms in the frequency domain, only it's much faster to compute.

To do more things, I'm going to refer to a Fourier transformation as an example. Any good discussion of the topic will illustrate how the Fourier and Cosine transformation are related. The Fourier transformation of an image can't be viewed directly as such, because it's made of complex numbers. It's already segregated into two kinds of information, the Real and Imaginary parts of the numbers. Typically, you'll see images or plots of these. But it's more meaningful (usually) to separate the complex numbers into their magnitude and phase angle. This is simply taking a complex number on the complex plane and switching to polar coordinates.

For the audio signal, think of the combined sin and cosine functions taking an attitional quantity in their arguments to shift the function back and forth (as a part of the signal representation). For an image, the phase information describes how each term of the series is shifted with respect to the other terms in frequency space. In images, edges are (hopefully) so distinct that they are well characterized by lowest frequency terms in the frequency domain. This happens not because they are abrupt transitions, but because they have e.g. a lot of black area adjacent to a lot of lighter area. Consider a one dimensional slice of an edge. The grey level is zero then transitions up and stays there. Visualize the sine wave that woud be the first approximation term where it crosses the signal transition's midpoint at sin(0). The phase angle of this term corresponds to a displacement in the image space. A great illustraion of this is available here. If you are trying to find shapes and can make a reference shape, this is one way to recognize them.

回答4:

If I remember correctly, this matrix allows you to save the data to a file with compression.

If you read further down, you'll find the zig-zag pattern of data to read from that final matrix. The most important data are in the top left corner, and least important in the bottom right corner. As such, if you stop writing at some point and just consider the rest as 0's, even though they aren't, you'll get a lossy approximation of the image.

The number of values you throw away increases compression at the cost of image fidelity.

But I'm sure someone else can give you a better explanation.

回答5:

I'd recommend picking up a copy of Digital Video Compression - it's a really good overview of compression algorithms for images and video.

回答6:

Anthony Cramp's answer looked good to me. As he mentions the DCT transforms the data into the frequency domain. The DCT is heavily used in video compression as the human visual system is must less sensitive to high frequency changes, therefore zeroing out the higher frequency values results in a smaller file, with little effect on a human's perception of the video quality.

In terms of using the DCT to compare images, I guess the only real benefit is if you cut away the higher frequency data and therefore have a smaller set of data to search/match. Something like Harr wavelets may give better image matching results.

来源：https://stackoverflow.com/questions/4582/video-compression-what-is-discrete-cosine-transform

标签

video

compression

dct