Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Face recognition presents challenges to current approaches.
  • FaceNet is a system that maps face images to a compact Euclidean space.
  • FaceNet uses a deep convolutional network to directly optimize the embedding.
  • Triplets of matching/non-matching face patches are used to train the system.
  • FaceNet achieves state-of-the-art face recognition performance using only 128 bytes per face.
  • FaceNet achieves record accuracy of 99.63% on Labeled Faces in the Wild (LFW) dataset.
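  • FaceNet cuts the error rate by 30% relative to the best previously published result on both the LFW and YouTube Faces DB datasets.
  • Harmonic embeddings and a harmonic triplet loss describe embeddings produced by different networks that remain compatible, so they can be compared directly.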

Paper Content

Introduction

  • System for face verification, recognition and clustering
  • Uses deep convolutional network to learn Euclidean embedding per image
  • Squared L2 distances in embedding space correspond to face similarity
  • Face verification involves thresholding distance between two embeddings
  • Recognition is a k-NN classification problem
  • Clustering is achieved using k-means or agglomerative clustering
  • Data driven method that learns representation from pixels of face
  • Two different deep network architectures used
  • Vast corpus of face verification and recognition works
  • Multiple stages combining deep convolutional network with PCA and SVM
  • Ensemble of networks used for best performance on LFW
  • Triplet loss used to minimize L2 distance between faces of same identity
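The three tasks listed above all reduce to distance computations on embeddings. The following is a minimal sketch of verification by thresholding and recognition by k-NN voting, assuming embeddings are plain lists of floats. The function names, the toy 2-D vectors, and the threshold value are illustrative and not taken from the paper's code.

```python
from collections import Counter


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def same_identity(emb_a, emb_b, threshold):
    """Verification: accept the pair if its squared L2 distance is within the threshold."""
    return squared_l2(emb_a, emb_b) <= threshold


def knn_identity(query, gallery, k=3):
    """Recognition: majority vote among the k nearest labelled embeddings.

    gallery is a list of (embedding, label) pairs.
    """
    nearest = sorted(gallery, key=lambda item: squared_l2(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D gallery (real FaceNet embeddings are 128-dimensional).
gallery = [
    ([1.0, 0.0], "alice"),
    ([0.9, 0.1], "alice"),
    ([0.0, 1.0], "bob"),
    ([0.1, 0.9], "bob"),
]
predicted = knn_identity([0.95, 0.05], gallery, k=3)
```

In practice the threshold would be chosen on held-out pairs rather than fixed by hand.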

Method

  • FaceNet uses a deep convolutional network
  • Two core architectures are discussed: Zeiler&Fergus and Inception
  • End-to-end learning of the whole system is employed
  • Triplet loss is used to reflect the goal of face verification, recognition and clustering
  • Triplet loss encourages faces of one identity to live on a manifold while enforcing distance to other identities
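For reference, the constraint and loss can be written out as follows. The notation is introduced here for illustration: f(x) is the embedding, x^a, x^p and x^n are the anchor, positive and negative images of the i-th triplet, α is the margin, and N is the number of triplets.

```latex
% Constraint the triplet loss tries to enforce for every triplet i
\| f(x_i^a) - f(x_i^p) \|_2^2 + \alpha < \| f(x_i^a) - f(x_i^n) \|_2^2

% Loss minimized over N triplets; [z]_+ denotes max(z, 0)
L = \sum_{i=1}^{N} \Big[ \| f(x_i^a) - f(x_i^p) \|_2^2
      - \| f(x_i^a) - f(x_i^n) \|_2^2 + \alpha \Big]_+
```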

Triplet loss

  • Embedding an image into a d-dimensional Euclidean space and constraining it to live on the d-dimensional hypersphere, i.e. the embedding has unit L2 norm
  • Nearest-neighbor classification: an image of a specific person should be closer to other images of the same person than to images of any other person
  • Triplet selection: selecting hard triplets that are active and can contribute to improving the model
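A minimal sketch of the per-triplet loss may help make these points concrete. It assumes embeddings are plain lists of floats. The helper names are hypothetical, and the default margin of 0.2 is the value the summary reports for α.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss on d(anchor, positive) - d(anchor, negative) + margin.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least the margin.
    """
    d_ap = squared_l2(anchor, positive)
    d_an = squared_l2(anchor, negative)
    return max(0.0, d_ap - d_an + margin)
```

A triplet with zero loss contributes no gradient, which is why the choice of triplets matters.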

Triplet selection

  • Select triplets that violate the triplet constraint
  • Generate triplets online using large mini-batches
  • Ensure a minimal number of exemplars of any one identity is present in each mini-batch
  • Select semi-hard negatives to avoid bad local minima
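Semi-hard selection can be sketched as below. A negative counts as semi-hard when it is farther from the anchor than the positive but still inside the margin. Picking the closest such negative is an assumption made for this sketch; the summary does not say which semi-hard negative is chosen. The function name and toy vectors are illustrative.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pick_semi_hard_negative(anchor, positive, negatives, margin=0.2):
    """Return the index of the closest semi-hard negative, or None.

    Semi-hard means d(a, p) < d(a, n) < d(a, p) + margin.
    """
    d_ap = squared_l2(anchor, positive)
    candidates = []
    for idx, neg in enumerate(negatives):
        d_an = squared_l2(anchor, neg)
        if d_ap < d_an < d_ap + margin:
            candidates.append((d_an, idx))
    if not candidates:
        return None
    # Assumption: prefer the semi-hard negative nearest to the anchor.
    return min(candidates)[1]


# Toy mini-batch (vectors are not normalized, for readability).
anchor, positive = [0.0, 0.0], [0.5, 0.0]
negatives = [[0.3, 0.0], [0.6, 0.0], [0.65, 0.0], [1.0, 0.0]]
chosen = pick_semi_hard_negative(anchor, positive, negatives)
```

Here the first negative is closer than the positive (too hard) and the last is far outside the margin (too easy), so only the middle two qualify.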

Deep convolutional networks

  • Trained CNN using Stochastic Gradient Descent (SGD) and AdaGrad
  • Learning rate started at 0.05 and decreased to finalize model
  • Models randomly initialized and trained on a CPU cluster for 1,000-2,000 hours
  • Margin α set to 0.2
  • Two types of architectures explored in experimental section
  • Rectified linear units used as non-linear activation function

Datasets and evaluation

  • Evaluated method on four datasets
  • Evaluated on face verification task
  • Used squared L2 distance threshold to determine classification of same and different
  • Defined set of true accepts and false accepts
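The evaluation above can be sketched as follows: the validation rate (VAL) is the fraction of same-identity pairs accepted at a threshold, and the false accept rate (FAR) is the fraction of different-identity pairs accepted at that threshold. The function name and sample distances are illustrative.

```python
def val_and_far(same_distances, diff_distances, threshold):
    """Compute (VAL, FAR) at a squared L2 distance threshold.

    same_distances: distances for pairs of the same identity.
    diff_distances: distances for pairs of different identities.
    """
    true_accepts = sum(1 for d in same_distances if d <= threshold)
    false_accepts = sum(1 for d in diff_distances if d <= threshold)
    val = true_accepts / len(same_distances)
    far = false_accepts / len(diff_distances)
    return val, far


# Made-up distances, for illustration only.
val, far = val_and_far(
    same_distances=[0.2, 0.5, 0.9, 1.4],
    diff_distances=[0.8, 1.3, 1.9, 2.4, 3.0],
    threshold=1.0,
)
```

Raising the threshold increases both rates, so results are usually quoted as VAL at a fixed FAR.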

Hold-out test set

  • Hold out set of 1 million images with same distribution as training set
  • Split into 5 disjoint sets of 200k images each
  • FAR and VAL rate computed on 100k x 100k image pairs
  • Standard error reported across 5 splits
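Reporting a mean with its standard error over the five splits can be sketched as below. Using the sample standard deviation is an assumption, and the per-split values are made up for illustration.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return the mean and the standard error of the mean.

    Assumption: the standard error uses the sample standard deviation
    (n - 1 in the denominator).
    """
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Hypothetical validation rates from five disjoint splits.
split_vals = [0.86, 0.88, 0.87, 0.85, 0.89]
mean, sem = mean_and_standard_error(split_vals)
```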

Personal photos

  • Test set has similar distribution to training set
  • Test set has been manually verified to have clean labels
  • Test set consists of 3 personal photo collections with 12k images
  • FAR and VAL rate computed across all 12k x 12k image pairs

Academic datasets

  • LFW is the standard test set for face verification
  • YouTube Faces DB is a newer dataset used for face recognition
  • Both datasets use pairs of images/videos for verification

Experiments

  • Training set has 100M-200M face thumbnails covering about 8M different identities
  • Face detector is run on each image to generate a tight bounding box
  • Input sizes range from 96x96 pixels to 224x224 pixels

Computation accuracy trade-off

  • FLOPS and accuracy have a strong correlation
  • Five models (NN1, NN2, NN3, NNS1, NNS2) discussed in experiments
  • Performance decreases if number of parameters is reduced further

Effect of CNN model

  • Zeiler&Fergus based architecture with 1x1 convolutions and Inception based models both perform comparably
  • Inception based models reduce model size and FLOPS
  • Image size in pixels affects validation rate
  • Embedding dimensionality of model NN1 affects the validation rate on the hold-out set
  • Largest model achieves a dramatic improvement in accuracy over the tiny NNS2
  • Tiny NNS2 runs at 30ms/image on a mobile phone and is still accurate enough for face clustering

Sensitivity to image quality

  • Model is robust across a wide range of image sizes
  • Performance remains good even with JPEG compression of quality 20
  • Performance remains good even with face thumbnails of size 120x120 and 80x80 pixels
  • Training with lower resolution faces could extend this range further

Embedding dimensionality

  • 128 dimensional float vector used for training
  • 128 dimensional byte vector used for large scale clustering and recognition
  • Smaller embeddings possible with minor loss of accuracy
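A minimal sketch of how a float embedding could be stored in one byte per dimension follows. The linear mapping from [-1, 1] to 0..255 is an assumption made for this example; the summary does not say which quantization scheme is used. It relies on the components of a unit-norm embedding lying in [-1, 1].

```python
def quantize(embedding):
    """Map each float in [-1, 1] to an integer in 0..255 and pack as bytes.

    Assumption: simple linear quantization with clamping.
    """
    return bytes(min(255, max(0, round((x + 1.0) * 127.5))) for x in embedding)


def dequantize(packed):
    """Approximate inverse of quantize: bytes back to floats in [-1, 1]."""
    return [b / 127.5 - 1.0 for b in packed]


original = [0.6, -0.8, 0.0, 0.33]
packed = quantize(original)
restored = dequantize(packed)
```

With this mapping the reconstruction error per component is at most half a quantization step, about 0.004.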

Amount of training data

  • Using tens of millions of exemplars instead of only millions gives a 60% relative reduction in error on the personal photo test set.
  • Using hundreds of millions of images gives a small boost, but the improvement tapers off.

Performance on LFW

  • Evaluated model on LFW using standard protocol
  • Nine training splits used to select L2-distance threshold
  • Classification accuracy of 98.87% ± 0.15 when using a fixed center crop
  • Record-breaking 99.63% ± 0.09 (standard error of the mean) when using extra face alignment
  • Error reduced by more than a factor of 7 compared to DeepFace in [17] and by 30% compared to DeepId2+ in [15]
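The threshold selection step in this protocol can be sketched as follows: pick the threshold that maximizes accuracy on the training pairs, then apply it unchanged to the held-out split. The function names, candidate thresholds and pair distances are illustrative.

```python
def accuracy_at(threshold, pairs):
    """Fraction of pairs classified correctly at the given threshold.

    pairs: list of (squared_distance, is_same) tuples; a pair is
    predicted "same" when its distance is within the threshold.
    """
    correct = sum(1 for d, is_same in pairs if (d <= threshold) == is_same)
    return correct / len(pairs)


def select_threshold(train_pairs, candidates):
    """Return the candidate threshold with the highest training accuracy."""
    return max(candidates, key=lambda t: accuracy_at(t, train_pairs))


# Made-up pairs standing in for the training splits and the test split.
train_pairs = [(0.3, True), (0.7, True), (1.0, True),
               (1.4, False), (1.8, False), (2.5, False)]
test_pairs = [(0.9, True), (1.3, False)]

best = select_threshold(train_pairs, candidates=[0.5, 1.2, 2.0])
test_accuracy = accuracy_at(best, test_pairs)
```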

Performance on YouTube Faces DB

  • Used the average similarity of all pairs of the first 100 detected frames in each video, giving a classification accuracy of 95.12%
  • Compared to 91.4% accuracy of 100 frames from [17], error rate reduced by almost half
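The "almost half" claim can be checked with a short calculation on the two accuracies quoted above. The function name is illustrative.

```python
def relative_error_reduction(baseline_accuracy, new_accuracy):
    """Relative reduction in error rate, with accuracies given in percent."""
    baseline_error = 100.0 - baseline_accuracy
    new_error = 100.0 - new_accuracy
    return (baseline_error - new_error) / baseline_error


reduction = relative_error_reduction(91.4, 95.12)
print(f"{reduction:.1%}")  # prints 43.3%
```

The error drops from 8.6% to 4.88%, a relative reduction of roughly 43%.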

Face clustering

  • Compact embedding can be used to cluster photos of people with the same identity.
  • Results of clustering faces are impressive, as shown in Figure 7.
  • Clustering is invariant to occlusion, lighting, pose and age.
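One way to cluster embeddings without knowing the number of identities in advance is agglomerative clustering cut at a distance threshold. The sketch below uses single linkage implemented with union-find. The linkage choice, threshold and toy vectors are assumptions made for this example, not details from the paper.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_by_threshold(embeddings, threshold):
    """Single-linkage agglomerative clustering cut at `threshold`.

    Any two embeddings within the threshold end up in the same cluster,
    directly or through a chain of close neighbours. Returns one integer
    label per embedding, numbered in order of first appearance.
    """
    parent = list(range(len(embeddings)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if squared_l2(embeddings[i], embeddings[j]) <= threshold:
                parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(embeddings))]


faces = [[1.0, 0.0], [0.95, 0.05], [0.0, 1.0], [0.05, 0.95], [-1.0, 0.0]]
labels = cluster_by_threshold(faces, threshold=0.1)
```

The pairwise loop is quadratic in the number of faces, which is fine for a personal photo collection but would need an approximate nearest-neighbour index at larger scale.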

Summary

  • Directly learning a Euclidean embedding with end-to-end training simplifies the setup and improves performance
  • Future work will explore how far the idea can be extended

Appendix: harmonic embedding

  • Introduces concept of harmonic embeddings, which are generated by different models but are compatible
  • Simplifies upgrade paths, allowing for smooth transition without version incompatibilities
  • Figure 8 shows results on the 3G dataset: NN2 outperforms NN1, while comparing NN2 embeddings against NN1 embeddings performs at an intermediate level
  • To learn the harmonic embedding, triplets are generated that mix v1 and v2 embeddings; semi-hard negatives are selected from both v1 and v2 embeddings

Harmonic triplet loss

  • Mix embeddings of v1 and v2 to learn harmonic embedding
  • Triplet loss encourages compatibility between different embedding versions
  • Visualization of triplet combinations
  • Initialize v2 embedding from independently trained NN2
  • Retrain last layer of v2 with compatibility encouraging triplet loss
  • Perturb incorrectly placed v1 embeddings to improve verification accuracy
  • FaceNet outputs distances between pairs of faces of the same person and of different people across pose and illumination combinations