Question
I have run the face detection algorithm built into OpenCV to extract the faces in each frame of a video (sampled at 1 fps). I have also resized every face image to the same size and cropped away a fraction of each image to remove background noise and hair. The problem is that I now have to cluster these face images, with each cluster corresponding to one person. I implemented the algorithm described here: http://bitsearch.blogspot.in/2013/02/unsupervised-face-clustering-with-opencv.html
Basically, the algorithm above uses OpenCV's LBPH face recognizer iteratively to cluster the images. Even in the description on that page the results are not satisfactory, and in my implementation they are worse. Can anyone suggest a better way to cluster faces, perhaps using some other feature and some other clustering algorithm? The number of clusters is unknown.
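For reference, a minimal sketch of the extraction step described above, using OpenCV's built-in Haar cascade; the file name, target size, and crop fraction are illustrative assumptions:

```python
# Sketch of the preprocessing described above: sample a video at roughly 1 fps,
# detect faces with OpenCV's built-in Haar cascade, crop and resize them.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture("video.mp4")              # hypothetical input file
fps = int(round(cap.get(cv2.CAP_PROP_FPS))) or 1
faces, frame_idx = [], 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % fps == 0:                     # roughly one frame per second
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
            face = gray[y:y + h, x:x + w]
            m = int(0.15 * w)                    # crop a fraction to drop hair/background
            face = face[m:h - m, m:w - m]
            faces.append(cv2.resize(face, (96, 96)))  # assumed common size
    frame_idx += 1

cap.release()
```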
Answer 1:
I suggest having a look at
FaceNet: A Unified Embedding for Face Recognition and Clustering
My shortscience summary (go there if you want to see the Math parts rendered correctly):
FaceNet directly maps face images to $\mathbb{R}^{128}$, where distances directly correspond to a measure of face similarity. They use a triplet loss function. The triplet is (face of person A, another face of person A, face of a person who is not A). Later, this is called (anchor, positive, negative).
The embedding is learned with a loss function inspired by LMNN. The idea is to minimize the distance between the two images of the same person and maximize the distance to the other person's image.
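Since the number of clusters is unknown in your setting, one way to use such embeddings is to cluster them directly on Euclidean distance. The sketch below uses scikit-learn's DBSCAN (my choice, not part of the paper), and the random `embeddings` array only stands in for the output of a real embedding network:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# `embeddings` stands in for the (n, 128) array an embedding network would
# produce for n preprocessed face crops; random data is used here only so
# the snippet runs on its own.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(40, 128))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# DBSCAN groups points closer than `eps` and does not need the number of
# clusters in advance. eps/min_samples are assumed values that would need
# tuning on real embeddings.
labels = DBSCAN(eps=0.6, min_samples=3, metric="euclidean").fit_predict(embeddings)
print(labels)  # -1 marks faces treated as noise; every other label is one person
```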
LMNN
Large Margin Nearest Neighbor (LMNN) is learning a pseudo-metric
$$d(x, y) = (x -y) M (x -y)^T$$
where $M$ is a positive semi-definite matrix. The only difference between a pseudo-metric and a metric is that the implication $d(x, y) = 0 \Rightarrow x = y$ does not have to hold.
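A minimal numerical sketch of this pseudo-metric; the toy matrix $M$ below is my own example, chosen only to show that $d(x, y)$ can be $0$ for $x \neq y$:

```python
# Evaluate the LMNN-style pseudo-metric d(x, y) = (x - y) M (x - y)^T
# for a positive semi-definite M (row-vector convention as in the text).
import numpy as np

def pseudo_metric(x, y, M):
    d = x - y
    return float(d @ M @ d)          # (x - y) M (x - y)^T

M = np.diag([1.0, 0.0])              # PSD but not positive-definite
x = np.array([1.0, 2.0])
y = np.array([1.0, 5.0])
print(pseudo_metric(x, y, M))        # 0.0 although x != y
```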
Curriculum Learning: Triplet selection
Show simple examples first, then increase the difficulty. This is done by selecting the triplets.
They use the triplets which are hard: for the positive example, this means the distance between the anchor and the positive example is high; for the negative example, it means the distance between the anchor and the negative example is low.
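A hedged sketch of this selection rule on a batch of embeddings; the paper's actual online / semi-hard mining strategy is more involved than this:

```python
# For each anchor in a batch, take the same-identity sample with the LARGEST
# distance as the positive and the different-identity sample with the SMALLEST
# distance as the negative ("hard" triplets as described above).
import numpy as np

def hard_triplets(embeddings, labels):
    """embeddings: (n, d) array, labels: (n,) identity ids -> list of (a, p, n) index triplets."""
    labels = np.asarray(labels)
    dist = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    triplets = []
    for a in range(len(labels)):
        pos = np.where((labels == labels[a]) & (np.arange(len(labels)) != a))[0]
        neg = np.where(labels != labels[a])[0]
        if len(pos) == 0 or len(neg) == 0:
            continue
        hardest_pos = pos[np.argmax(dist[a, pos])]   # far positive
        hardest_neg = neg[np.argmin(dist[a, neg])]   # close negative
        triplets.append((a, hardest_pos, hardest_neg))
    return triplets
```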
They want to have
$$||f(x_i^a) - f(x_i^p)||_2^2 + \alpha < ||f(x_i^a) - f(x_i^n)||_2^2$$
where $\alpha$ is a margin, $x_i^a$ is the anchor, $x_i^p$ is the positive face example and $x_i^n$ is the negative example. They increase $\alpha$ over time. It is crucial that $f$ maps the images not into all of $\mathbb{R}^{128}$ but onto the unit sphere; otherwise the margin could be satisfied trivially by scaling the embedding, e.g. by using $f' = 2 \cdot f$.
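As a sketch, this constraint can be turned into a hinge-style triplet loss with unit-sphere normalization; the value `alpha=0.2` below is an assumed default:

```python
# Triplet loss corresponding to the margin condition above, with the
# embeddings L2-normalized onto the unit sphere so rescaling f cannot
# trivially satisfy the margin.
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    a = anchor / np.linalg.norm(anchor)      # project onto the unit sphere
    p = positive / np.linalg.norm(positive)
    n = negative / np.linalg.norm(negative)
    d_ap = np.sum((a - p) ** 2)              # ||f(x^a) - f(x^p)||_2^2
    d_an = np.sum((a - n) ** 2)              # ||f(x^a) - f(x^n)||_2^2
    return max(0.0, d_ap - d_an + alpha)     # 0 once the margin is respected
```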
Tasks
- Face verification: Is this the same person?
- Face recognition: Who is this person?
Datasets
- 99.63% accuracy on Labeled Faces in the Wild (LFW)
- 95.12% accuracy on YouTube Faces DB
Network
Two models are evaluated: The Zeiler & Fergus model and an architecture based on the Inception model.
See also
- DeepFace: Closing the Gap to Human-Level Performance in Face Verification
Source: https://stackoverflow.com/questions/26179052/clustering-human-faces-from-a-video