I\'ve been requested to take a given video, probably a simple cartoon, and return an array of its scenes.
I need to use the opencv libary in order to do it, and the res
Technically, a scene is a group shots which are successively taken together at a single location. A shot is a basic narrative element of the video which is composed of a number of frames that are presented from a continuous viewpoint.
Automatically dividing a video into its shots is called the shot boundary detection problem in which the basic idea is identifying consecutive frames that form a transition from one shot to another.
Identifying transitions generally involve calculating a similarity value between two frames. This value can be calculated using low level image features such as color, edge or motion. A simple similarity metric could be:
where f1 and f2 represent two distinct video frames and N represents number pixels in those frames. This is the average first order (Manhattan) pixel color distance between two frames.
Say you have a video composed of frames { f1, f2 ... fM } and you have calculated this distance between neighboring frames. A simple decision measure could be labeling a transition from fa to fb as a shot boundary if s(fa, fb) is below a certain threshold.
A successful shot boundary detector uses distances of second order (or more) such as Euclidean distance or Pearson correlation coefficient and utilizes a combination of different features instead of using only one, say color.
Usually, a camera or object movement breaks the pixel correspondence between frames. Using frequencies of low level details with the help of histograms will be a cure here.
Also, performing decision making over more than two frames helps in finding smooth transitions where one shot dissolves into or replaces another for a duration. Deciding for a group of frames also help us in identifying false transitions caused by light flashes or fast moving cameras.
For your problem, please start from basic approaches like comparing RGB colors and edge responses between video frames. Analyze your results and data together and try to adapt new features, distance metrics and decision making methods for better performance.
The best way of segmenting a video into shots will vary depending on your data. Machine learning approaches like probabilistically modeling frame transitions with Gaussian mixture models or classification through support vector machines are expected to perform better than hand selected thresholds. However it is important that you learn the basics before effectively choosing input features.
Automatically finding shot boundaries will be sufficent to divide your video into meaningful parts. Dividing your video into scenes, on the other hand, is considered a harder semantic problem. Nevertheless, shot segmentation is the first step to it.