Recent face detection systems are achieving near-human performance. However, most of them are built on slow R-CNN-style [2] pipelines.
Recently, I had to detect faces in several million images (about 14 million). State-of-the-art face detectors operate at around 1-2 frames per second, so detecting faces in my 14 million images with these methods would have taken about 6 months on a single TitanX GPU. So I decided to use the recently published YOLO [1] method to train a network for face detection. Although YOLO is less accurate than detection methods like Fast R-CNN [3] and Faster R-CNN [4], it runs at about 45 frames per second, more than six times the rate achieved by Faster R-CNN.
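The arithmetic is straightforward: at roughly 1 frame per second, 14 million images take about $14{,}000{,}000$ seconds, or roughly 162 days (5-6 months), whereas at 45 frames per second the same set needs only about $311{,}000$ seconds, or roughly 3.6 days of pure GPU time.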
I was able to complete the detection task in about a week, though I had to compromise a bit on the accuracy.
In this post, I first give a brief overview of the YOLO method and then explain the training procedure for faces.
Overview of YOLO
The YOLO method reframes detection as a single regression problem, mapping image pixels directly to bounding box coordinates and class probabilities. It requires just a single neural network evaluation to predict multiple bounding boxes and their class probabilities. The image is first resized to the input size of the network and divided into an $S \times S$ grid. If the center of an object falls into a grid cell, then that grid cell is responsible for detecting that object. Each of the $S \times S$ grid cells predicts $B$ bounding boxes $(x,y,w,h)$ along with the objectness scores for those boxes. Each grid cell also predicts class conditional probabilities (i.e. the probability of each class given that the cell contains an object) for the $C$ classes. So the final output of the network is an $S \times S \times ((4 + 1) \times B + C)$ tensor.
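As a quick sanity check on the output dimensions, here is a small Python helper (the function name is mine, just for illustration; $S = 7$, $B = 2$, $C = 20$ are the PASCAL VOC defaults from the YOLO paper, while a single-class face detector would have $C = 1$):

```python
# Output tensor size of a YOLO-style grid detector.
def yolo_output_size(S: int, B: int, C: int) -> tuple:
    # Each cell predicts B boxes (x, y, w, h, objectness) plus C class probabilities.
    return (S, S, (4 + 1) * B + C)

print(yolo_output_size(7, 2, 20))  # PASCAL VOC setting from the paper: (7, 7, 30)
print(yolo_output_size(7, 2, 1))   # single-class (face) setting: (7, 7, 11)
```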
The objectness score associated with each bounding box is the product of the confidence of the model that the box contains an object and the intersection over union (IOU) between the predicted box and the ground truth box. At test time the class conditional probabilities and the individual box confidence predictions are multiplied to get the class-specific confidence scores for each class. This product encodes both the probability of that class appearing in the box and how well the box fits the object.
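In the notation of the YOLO paper [1], the test-time score for class $i$ in a given box is

$$\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \times \text{IOU}^{\text{truth}}_{\text{pred}}.$$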
The loss function jointly penalizes localization error and confidence prediction error. However, the method suffers from a few limitations: the model struggles with small objects, and it also struggles to generalize to objects in unusual aspect ratios or configurations. The speed, however, somewhat compensates for these limitations.
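For reference, the sum-squared loss from the paper [1], written slightly more compactly, is

$$\begin{aligned} &\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\ &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{obj}}_{ij} (C_i - \hat{C}_i)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{\text{noobj}}_{ij} (C_i - \hat{C}_i)^2 \\ &+ \sum_{i=0}^{S^2} \mathbb{1}^{\text{obj}}_{i} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2 \end{aligned}$$

where $\mathbb{1}^{\text{obj}}_{ij}$ indicates that the $j$-th box predictor in cell $i$ is responsible for a ground truth object; the paper sets $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$ to up-weight localization and down-weight confidence from empty cells.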
Adapting YOLO to face detection
I trained the YOLO detector on the WIDER FACE [5] dataset by making minimal changes to the code. I had to generate labels in the format required by the YOLO code: each image gets a separate label file, with each line representing one ground truth bounding box along with its class (there is only one class, face, in our case). The bounding boxes are in the format $(x_{c}, y_{c}, w, h)$, where $(x_{c},y_{c})$ is the center of the bounding box and $w$ and $h$ are its width and height respectively. In the network definition file, I had to change the number of classes and the dimensions of the output, and in the main yolo.c file in src/, I had to change the source, the destination, and the number of classes accordingly.
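A minimal sketch of that label-generation step (the function name, file name, and box values below are made up for illustration): WIDER annotations give pixel-space $(x_{\text{left}}, y_{\text{top}}, w, h)$ boxes, the YOLO code expects every field normalised to $[0,1]$ by the image dimensions (as spelled out in the comments below), and `class_id=0` follows Darknet's zero-indexed class convention.

```python
# Sketch: write one image's pixel-space face boxes as a Darknet/YOLO label file.
# Assumes boxes come as (x_left, y_top, w, h) in pixels, as in WIDER FACE annotations.
def write_yolo_labels(label_path, boxes, img_w, img_h, class_id=0):
    with open(label_path, "w") as f:
        for x, y, w, h in boxes:
            xc = (x + w / 2.0) / img_w  # box center x, normalised by image width
            yc = (y + h / 2.0) / img_h  # box center y, normalised by image height
            f.write(f"{class_id} {xc:.6f} {yc:.6f} {w / img_w:.6f} {h / img_h:.6f}\n")

# e.g. a 100x80 face at (40, 60) in a 1024x768 image:
write_yolo_labels("0_Parade_marchingband_1_5.txt", [(40, 60, 100, 80)], 1024, 768)
```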
The bounding boxes provided with the WIDER dataset are very small, but I needed larger boxes to incorporate context in the detector. So, after convergence on the WIDER FACE dataset, I fine-tuned the YOLO detector on the FDDB [6] dataset. I had to convert the ellipse annotations provided by the authors of [6] into rectangular ones, though.
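The ellipse-to-rectangle conversion amounts to taking the tight axis-aligned bounding box of a rotated ellipse. A minimal sketch, assuming FDDB's $(a, b, \theta, c_x, c_y)$ annotation order with the major-axis angle $\theta$ in radians (the function name is mine):

```python
import math

# Sketch: tight axis-aligned box around an FDDB ellipse annotation.
# FDDB lists (major_axis_radius, minor_axis_radius, angle, center_x, center_y)
# per face; the angle is assumed to be in radians here.
def ellipse_to_rect(a, b, theta, cx, cy):
    # Half-extents of the axis-aligned bounding box of a rotated ellipse.
    half_w = math.sqrt((a * math.cos(theta)) ** 2 + (b * math.sin(theta)) ** 2)
    half_h = math.sqrt((a * math.sin(theta)) ** 2 + (b * math.cos(theta)) ** 2)
    # Return (x_left, y_top, w, h) in pixels; feed this to the label writer above.
    return (cx - half_w, cy - half_h, 2 * half_w, 2 * half_h)
```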
The final detector achieves good recall, but its most important advantage over other recent detectors is speed (I did this before SSD [7]). I was able to process the 14 million images within a week.
References:
[1] Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640 (2015).
[2] Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580-587. 2014.
[3] Girshick, Ross. "Fast R-CNN." In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440-1448. 2015.
[4] Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in neural information processing systems, pp. 91-99. 2015.
[5] Yang, Shuo, Ping Luo, Chen Change Loy, and Xiaoou Tang. "WIDER FACE: A Face Detection Benchmark." arXiv preprint arXiv:1511.06523 (2015).
[6] Jain, Vidit, and Erik G. Learned-Miller. "FDDB: A benchmark for face detection in unconstrained settings." UMass Amherst Technical Report (2010).
[7] Liu, Wei, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, and Scott Reed. "SSD: Single Shot MultiBox Detector." arXiv preprint arXiv:1512.02325 (2015).
Hi, did you train YOLO v1 or YOLO v2 for face detection?
I trained V1.
What changes did you make to enable training for WIDER? Can you please point to the reference code, and if possible the updated weights? I wish to compare the performance against other networks.
Hi Sagar,
I will not be able to put up the updated weights. However, for training on WIDER, I basically had to generate a label file for each image. Each label file has 1 line for each face in the corresponding image. Each line is of the format:
class xc yc w h
where xc and yc are the coordinates of the centre of the box, normalised by the image width and height respectively, and w and h are the width and height of the box, also normalised by the image width and height respectively.
The code for this is here:
https://goo.gl/poSLl6
Then you will need to change the train_images and backup_directory variables in src/yolo.c, and make some other minor changes such as the class names and class labels.
You can find my config file here:
https://goo.gl/C0p0QA
Hi, thanks for sharing. I am planning to train a face detector on V2; for now I will convert WIDER to YOLO format using your code. Could you also share the FDDB conversion code? If possible, both from ellipse to bounding box and from regular bounding box to YOLO, or maybe from ellipse to YOLO bounding box, depending on your scenario. Thanks a bundle.
Hi Jumabek. I am sorry for replying this late. I somehow missed your comment. If you still need them, the codes are here:
https://github.com/ankanbansal/fddb-for-yolo
We are interested in your work. Please contact us at tech@spotcrowd.com.
ReplyDeleteHi,
First of all, thanks for sharing your experience.
A few questions:
1. How long until convergence on the WIDER FACE dataset (batches, epochs)?
2. Is fine-tuning on FDDB just another training run on top of the weight file?
3. How long did it take to fine-tune?
Regards,
Fábio