Question
I have run the face detection algorithm built into OpenCV to extract the faces in each frame of a video (sampled at 1 fps). I have also resized every face image to the same size and cropped away a fraction of each image to remove background noise and hair. The problem is that I now have to cluster these face images, with each cluster corresponding to one person. I implemented the algorithm described here: http://bitsearch.blogspot.in/2013/02/unsupervised-face-clustering-with-opencv.html
Basically, the algorithm above uses OpenCV's LBPH face recognizer iteratively to cluster the images. Even in the description on that page the results are not satisfactory, and in my implementation they are worse. Can anyone suggest a better way to cluster faces, perhaps using some other feature and some other clustering algorithm? The number of clusters is unknown.
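For reference, a minimal sketch of the extraction step described above, using OpenCV's built-in Haar cascade; the file name, target size, and crop fraction are illustrative assumptions:

```python
# Sketch of the preprocessing described above: sample a video at roughly 1 fps,
# detect faces with OpenCV's built-in Haar cascade, crop and resize them.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture("video.mp4")              # hypothetical input file
fps = int(round(cap.get(cv2.CAP_PROP_FPS))) or 1
faces, frame_idx = [], 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % fps == 0:                     # roughly one frame per second
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
            face = gray[y:y + h, x:x + w]
            m = int(0.15 * w)                    # crop a fraction to drop hair/background
            face = face[m:h - m, m:w - m]
            faces.append(cv2.resize(face, (96, 96)))  # assumed common size
    frame_idx += 1

cap.release()
```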
Answer 1:
I suggest having a look at
FaceNet: A Unified Embedding for Face Recognition and Clustering
My shortscience summary (go there if you want to see the Math parts rendered correctly):
FaceNet directly maps face images to $\mathbb{R}^{128}$, where distances directly correspond to a measure of face similarity. They use a triplet loss function. The triplet is (face of person A, another face of person A, face of a person who is not A). Later, this is called (anchor, positive, negative).
The embedding is learned with a loss function inspired by LMNN. The idea is to minimize the distance between the two images of the same person and maximize the distance to the other person's image.
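Since the number of clusters is unknown in your setting, one way to use such embeddings is to cluster them directly on Euclidean distance. The sketch below uses scikit-learn's DBSCAN (my choice, not part of the paper), and the random `embeddings` array only stands in for the output of a real embedding network:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# `embeddings` stands in for the (n, 128) array an embedding network would
# produce for n preprocessed face crops; random data is used here only so
# the snippet runs on its own.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(40, 128))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# DBSCAN groups points closer than `eps` and does not need the number of
# clusters in advance. eps/min_samples are assumed values that would need
# tuning on real embeddings.
labels = DBSCAN(eps=0.6, min_samples=3, metric="euclidean").fit_predict(embeddings)
print(labels)  # -1 marks faces treated as noise; every other label is one person
```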
LMNN
Large Margin Nearest Neighbor (LMNN) is learning a pseudo-metric
$$d(x, y) = (x -y) M (x -y)^T$$
where $M$ is a positive semi-definite matrix. The only difference between a pseudo-metric and a metric is that the implication $d(x, y) = 0 \Rightarrow x = y$ does not have to hold.
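A minimal numerical sketch of this pseudo-metric; the toy matrix $M$ below is my own example, chosen only to show that $d(x, y)$ can be $0$ for $x \neq y$:

```python
# Evaluate the LMNN-style pseudo-metric d(x, y) = (x - y) M (x - y)^T
# for a positive semi-definite M (row-vector convention as in the text).
import numpy as np

def pseudo_metric(x, y, M):
    d = x - y
    return float(d @ M @ d)          # (x - y) M (x - y)^T

M = np.diag([1.0, 0.0])              # PSD but not positive-definite
x = np.array([1.0, 2.0])
y = np.array([1.0, 5.0])
print(pseudo_metric(x, y, M))        # 0.0 although x != y
```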
Curriculum Learning: Triplet selection
Show simple examples first, then increase the difficulty. This is done by selecting the triplets.
They use the triplets which are hard: for the positive example, this means the distance between the anchor and the positive example is high; for the negative example, it means the distance between the anchor and the negative example is low.
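A hedged sketch of this selection rule on a batch of embeddings; the paper's actual online / semi-hard mining strategy is more involved than this:

```python
# For each anchor in a batch, take the same-identity sample with the LARGEST
# distance as the positive and the different-identity sample with the SMALLEST
# distance as the negative ("hard" triplets as described above).
import numpy as np

def hard_triplets(embeddings, labels):
    """embeddings: (n, d) array, labels: (n,) identity ids -> list of (a, p, n) index triplets."""
    labels = np.asarray(labels)
    dist = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    triplets = []
    for a in range(len(labels)):
        pos = np.where((labels == labels[a]) & (np.arange(len(labels)) != a))[0]
        neg = np.where(labels != labels[a])[0]
        if len(pos) == 0 or len(neg) == 0:
            continue
        hardest_pos = pos[np.argmax(dist[a, pos])]   # far positive
        hardest_neg = neg[np.argmin(dist[a, neg])]   # close negative
        triplets.append((a, hardest_pos, hardest_neg))
    return triplets
```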
They want to have
$$||f(x_i^a) - f(x_i^p)||_2^2 + \alpha < ||f(x_i^a) - f(x_i^n)||_2^2$$
where $\alpha$ is a margin, $x_i^a$ is the anchor, $x_i^p$ is the positive face example and $x_i^n$ is the negative example. They increase $\alpha$ over time. It is crucial that $f$ maps the images not into all of $\mathbb{R}^{128}$ but onto the unit sphere; otherwise the margin could be satisfied trivially by scaling the embedding, e.g. by using $f' = 2 \cdot f$.
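As a sketch, this constraint can be turned into a hinge-style triplet loss with unit-sphere normalization; the value `alpha=0.2` below is an assumed default:

```python
# Triplet loss corresponding to the margin condition above, with the
# embeddings L2-normalized onto the unit sphere so rescaling f cannot
# trivially satisfy the margin.
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    a = anchor / np.linalg.norm(anchor)      # project onto the unit sphere
    p = positive / np.linalg.norm(positive)
    n = negative / np.linalg.norm(negative)
    d_ap = np.sum((a - p) ** 2)              # ||f(x^a) - f(x^p)||_2^2
    d_an = np.sum((a - n) ** 2)              # ||f(x^a) - f(x^n)||_2^2
    return max(0.0, d_ap - d_an + alpha)     # 0 once the margin is respected
```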
Tasks
- Face verification: Is this the same person?
- Face recognition: Who is this person?
Datasets
- 99.63% accuracy on Labeled Faces in the Wild (LFW)
- 95.12% accuracy on YouTube Faces DB
Network
Two models are evaluated: The Zeiler & Fergus model and an architecture based on the Inception model.
See also
- DeepFace: Closing the Gap to Human-Level Performance in Face Verification
Source: https://stackoverflow.com/questions/26179052/clustering-human-faces-from-a-video