This article introduces the fundamentals of YOLOv3, an object detection algorithm recognized for its accuracy and its speed. It relies on anchors (predetermined bounding boxes) to improve detection; these are computed automatically with the k-means clustering method, using Intersection over Union (IoU) as the distance measure.
In image processing, advances in technology have radically changed the approaches and possibilities for automatically understanding the information contained in photos or videos.
Locating one or more objects in an image is now handled particularly well by the latest available technologies, and framing an object inside a rectangular area of an image is largely feasible.
You Only Look Once (YOLO) is an object detection algorithm known for its high accuracy and speed. However, it is the result of long years of research as shown by the chronological timeline of object detection algorithms:
The three original versions were developed by the same team of researchers and are available at this address.
The enthusiasm of the community was such that YOLO ended up being used for military and/or freedom-restricting purposes...
This is in fact what one of the authors laments in a tweet, announcing the end of his involvement in this computer vision project.
In this article, we will focus on the third version of the algorithm: YOLOv3 — An Incremental Improvement and explain how it works.
To use YOLOv3 to its full potential, you need to understand two things in parallel: how detection is done and how anchors are used to improve detections. A bad choice of anchors can cause the algorithm to fail to detect the target object!
By carrying out a preliminary analysis on the bounding boxes of the objects we are trying to find, it turns out that most of these boxes share certain ratios between their height and their width. So, instead of directly predicting a bounding box, YOLOv2 (and v3) predict sets of boxes with particular height/width ratios; these sets of predetermined boxes are the anchor boxes (anchors in the rest of the article).
There are several ways to define these anchors; the most naive is to choose them by hand. Instead, it is preferable to apply a k-means clustering algorithm to find them automatically.
IoU — or Intersection over Union — is a way of measuring the quality of object detection by comparing, in a training dataset, the known position of the object in the image with the prediction made by the algorithm. The IoU is the ratio between the area of the intersection of the two bounding boxes and the area of their union.
The IoU can be between 0 (for a completely failed detection) and 1 (for a perfect detection). In general, the object is considered to be detected from an IoU greater than 0.5.
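As an illustration, here is a minimal sketch of the IoU computation for two axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates (the coordinate convention and the function name are chosen for this example, not imposed by YOLOv3):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Intersection area (zero if the boxes do not overlap)
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    # Union area = sum of both areas minus the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0
```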
If we use a standard k-means, i.e. with the Euclidean distance, large bounding boxes will generate more errors than small ones. In addition, it is desirable to have anchors that lead to strong IoU scores. So we will replace the Euclidean distance by:
dist(box, centroid) = 1 − IoU(box, centroid)
IoU calculations are made under the assumption that the upper left corners of the bounding boxes coincide. The IoU then depends only on the height and width of the boxes. Each bounding box can therefore be represented by a point in the plane, and we apply the clustering algorithm to these points.
Take for example the case of face detection and the Wider Face database:
We take all the faces in this database and, for each face bounding box, we place a point at the coordinates (x, y), where x is the width of the bounding box and y its height, both relative to the total size of the image:
We then apply the k-means algorithm with 9 centroids, which will allow us to determine the dimensions of the 9 anchors for our face detection model. Note that the boundaries between the different clusters are not straight. This is because we are using a distance derived from the IoU.
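A minimal sketch of this anchor computation, assuming the ground-truth boxes are already available as an array of (width, height) pairs normalized by the image size (the function names, the mean-based centroid update and the NumPy usage are choices made for this example, not the official implementation):

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) pairs, assuming the top-left corners coincide."""
    inter_w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = inter_w * inter_h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
          + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """k-means on normalized (w, h) pairs with the distance 1 - IoU."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # Assign each box to its closest centroid, i.e. the one with the largest IoU
        assignment = np.argmax(iou_wh(boxes, centroids), axis=1)
        # Update each centroid as the mean of the boxes assigned to it
        new_centroids = np.array([
            boxes[assignment == j].mean(axis=0) if np.any(assignment == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids

# boxes: an (N, 2) array of ground-truth widths and heights, normalized by the image size
# anchors = np.round(kmeans_anchors(boxes) * 416)   # anchor sizes for a 416x416 input
```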
So for an image with a size of 416×416 pixels, the nine anchors are given by:
To illustrate the importance of anchors, let's take the example of license plates. In green, a neural network was trained with basic anchors and in red, with specific anchors. For each detection, we can see the impact of the anchors in the predictions: with basic anchors, the bounding boxes are too high and too narrow while with specific anchors, the detections are perfectly adjusted.
YOLOv3 is a so-called fully convolutional neural network, and it produces what are called feature maps at its output. The thing to remember for YOLO is that, since there is no constraint on the size of the feature maps, we can feed it images of different sizes!
YOLO version 1 directly predicts bounding box coordinates using a dense layer. Instead of directly predicting these boxes, YOLOv3 uses assumptions about the shape of the objects to be detected. This shape is expressed as a number of pixels in width and a number of pixels in height, since these boxes are considered to be rectangular. These assumptions are the anchors of YOLOv3, whose calculation was described above.
YOLOv3 does not have a dense layer; it is composed only of convolutions. Each convolution is followed by a Batch Normalization layer and then by the LeakyReLU activation function. Batch normalizations are beneficial: convergence is faster and the effects of vanishing and exploding gradients are limited. With these batch normalizations, Dropout layers can be removed without overfitting problems.
In addition, some convolutions have a stride of two to downsample the images and reduce the first two dimensions of the feature maps. The following figure shows two convolutions, the first with a stride of 1, the second with a stride of 2.
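As an illustration, here is a minimal PyTorch sketch of such a convolutional block (convolution → batch normalization → LeakyReLU) and of the effect of the stride; it is a simplified example, not the exact layer configuration of the official implementation:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution followed by batch normalization and a LeakyReLU activation."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=stride, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# A stride of 2 halves the spatial dimensions instead of using pooling
x = torch.randn(1, 3, 416, 416)
print(ConvBlock(3, 32, stride=1)(x).shape)                      # [1, 32, 416, 416]
print(ConvBlock(32, 64, stride=2)(ConvBlock(3, 32)(x)).shape)   # [1, 64, 208, 208]
```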
The network has three outputs to be able to detect objects at three different scales.
Its complete architecture is described in the following figure.
YOLOv3 reduces image size by a factor of 32, called the network stride. The first version of YOLO took 448×448 images, so the output feature map was 14×14 in size.
It is common for the objects you want to detect to be in the center of the image. However, a 14×14 grid does not have a single center. In practice, it is therefore preferable for the output to have an odd size. To remove this ambiguity, the size of the images will be 416×416, to provide a 13×13 feature map with a single center.
To fully understand YOLO, you need to understand its outputs. YOLOv3 has three final layers; the first has its spatial dimensions divided by 32 compared to the initial image, the second by 16 and the third by 8. Thus, starting from an image with a size of 416×416 pixels, the three feature maps at the output of the network will have respective sizes of 13×13, 26×26 and 52×52. In this sense, YOLOv3 predicts at three levels of detail, to detect large, medium, and small objects respectively.
Starting from an image with a size of 416×416 pixels, the same pixel is “followed” through the network and ends up in three cells, one per output scale. For each cell, three bounding boxes are predicted, making a total of 9 that come from the 9 anchors. For each bounding box, an objectness score and class membership scores are predicted. In total, the network proposes 52×52×3 + 26×26×3 + 13×13×3 = 10647 bounding boxes.
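This count can be checked with a few lines of Python (the variable names are ours):

```python
image_size = 416
strides = [8, 16, 32]   # network strides for the small, medium and large object scales
boxes_per_cell = 3      # three anchors per output scale

grid_sizes = [image_size // s for s in strides]            # [52, 26, 13]
total_boxes = sum(g * g * boxes_per_cell for g in grid_sizes)
print(grid_sizes, total_boxes)                             # [52, 26, 13] 10647
```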
Now that we have all these predictions, we need to select the right ones.
Thus, for each detected object, the Non-Maximum Suppression (NMS) algorithm keeps only the best proposal.
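A minimal sketch of non-maximum suppression, reusing the iou function sketched earlier (score thresholding and per-class handling are left out for brevity):

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes and drop those that overlap them too much."""
    # Sort box indices by decreasing confidence score
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)
        kept.append(best)
        # Discard every remaining box whose IoU with the kept box is too high
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return kept
```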
Once our algorithm is trained, it can be used and even coupled to other algorithms. Face detection can be combined with checking the wearing of safety equipment, in this case a mask:
So, to use YOLOv3, you first need training data. But that's not enough: you also have to be prepared to “aim” correctly at the objects you want to detect by using the right anchors!