Single Shot Detectors – You Only Look Once

In this section, we move on to a slightly different kind of object detector called a single shot detector. Single shot detectors pose object detection as a regression problem. One of the main architectures in this category is YOLO (You Only Look Once), which we will now explore in more detail.

The main idea of the YOLO network is to optimize the computation of predictions at all locations in the input image in a single pass, without using any sliding windows. To achieve this, the network outputs a feature map in the form of a grid of N×N cells.

Each cell has B*5+C entries, where B is the number of bounding boxes per cell, C is the number of class probabilities, and 5 is the number of elements for each bounding box: x and y (the center coordinates of the bounding box relative to the cell in which it is located), w (the width of the bounding box relative to the original image), h (the height of the bounding box relative to the original image), and a confidence score (how likely it is that an object is present in the bounding box).
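To make the B*5+C bookkeeping concrete, here is a minimal sketch that computes the number of entries per cell and the resulting output shape. The function names are hypothetical, and the example values (a 7×7 grid, 2 boxes per cell, 20 classes) are assumptions chosen only for illustration.

```python
def cell_vector_length(num_boxes, num_classes):
    """Entries per grid cell: 5 values per box plus one probability per class."""
    return num_boxes * 5 + num_classes


def output_shape(grid_size, num_boxes, num_classes):
    """Shape of the prediction tensor: (N, N, B*5+C)."""
    return (grid_size, grid_size, cell_vector_length(num_boxes, num_classes))


# Assumed example values: N=7, B=2, C=20
print(cell_vector_length(2, 20))  # 30
print(output_shape(7, 2, 20))     # (7, 7, 30)
```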

We define the confidence score as:

confidence = Pr(Object) × IOU(truth, pred)

If there is no object present in the cell, the confidence score will be zero. Otherwise, it will be equal to the IOU between the ground truth box and the predicted box.
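The confidence target described here can be sketched in a few lines. This is an illustrative sketch, not the network's training code: it assumes boxes are given as corner coordinates (x_min, y_min, x_max, y_max), and the helper names `iou` and `confidence_target` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (if any)
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so non-overlapping boxes give an empty intersection
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence_target(object_present, predicted_box, truth_box):
    """Zero when the cell holds no object, otherwise the IOU with the ground truth."""
    if not object_present:
        return 0.0
    return iou(predicted_box, truth_box)


# Two 2x2 boxes overlapping in a 1x1 square: IOU = 1 / (4 + 4 - 1)
print(confidence_target(True, (0, 0, 2, 2), (1, 1, 3, 3)))
print(confidence_target(False, (0, 0, 2, 2), (1, 1, 3, 3)))  # 0.0
```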

Note that each cell of the grid is responsible for predicting a fixed number of bounding boxes.

The figure below depicts what the cell entries look like in the output of a YOLO network, which predicts a tensor of shape (N, N, B*5+C). The last convolutional layer of the network outputs a feature map with the same spatial size as the grid.
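As a rough illustration of how one cell's B*5+C entries might be unpacked, the sketch below splits a flat cell vector into its boxes and class probabilities. The ordering is an assumption made for this example (all boxes first, each stored as x, y, w, h, confidence, followed by the class probabilities); an actual implementation may lay the entries out differently. The function name `split_cell` is hypothetical.

```python
def split_cell(cell, num_boxes, num_classes):
    """Split one cell vector into (boxes, class_probs).

    Assumed layout: [x, y, w, h, conf] repeated num_boxes times,
    then num_classes class probabilities.
    """
    expected = num_boxes * 5 + num_classes
    if len(cell) != expected:
        raise ValueError(f"expected {expected} entries, got {len(cell)}")

    boxes = [tuple(cell[i * 5:(i + 1) * 5]) for i in range(num_boxes)]
    class_probs = list(cell[num_boxes * 5:])
    return boxes, class_probs


# Dummy cell with B=2 and C=3, so 2*5+3 = 13 entries
boxes, class_probs = split_cell(list(range(13)), num_boxes=2, num_classes=3)
print(boxes)        # [(0, 1, 2, 3, 4), (5, 6, 7, 8, 9)]
print(class_probs)  # [10, 11, 12]
```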

The center coordinates and the height and width of the bounding box are normalized to the range [0, 1]. The following figure shows an example of how to calculate these coordinates:
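The same calculation can also be sketched in code. This example assumes a ground truth box given in pixel corner coordinates and follows the convention described earlier: x and y are measured relative to the cell containing the center, while w and h are measured relative to the whole image. The function name `encode_box` and the 448×448 image with a 7×7 grid are assumptions for illustration only.

```python
def encode_box(x_min, y_min, x_max, y_max, img_w, img_h, grid_size):
    """Return (row, col, (x, y, w, h)) with all four box values in [0, 1]."""
    # Box center in pixels
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0

    # Center expressed in grid units, e.g. 2.5 means halfway across column 2
    gx = cx * grid_size / img_w
    gy = cy * grid_size / img_h

    # Cell containing the center (clamped so a center on the far edge stays in range)
    col = min(int(gx), grid_size - 1)
    row = min(int(gy), grid_size - 1)

    # Offsets inside the cell, and size relative to the whole image
    x = gx - col
    y = gy - row
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return row, col, (x, y, w, h)


# Assumed example: 448x448 image, 7x7 grid, so each cell covers 64x64 pixels
print(encode_box(104, 128, 216, 352, 448, 448, 7))
# (3, 2, (0.5, 0.75, 0.25, 0.5))
```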

For each of these cells, the network predicts class probabilities, bounding boxes, and a confidence score for each box.

The actual YOLO network has 24 convolutional layers followed by 2 fully connected layers. The Fast YOLO network, however, has only 9 convolutional layers, as shown:

Another important point is that each object is assigned to exactly one grid cell, the one that contains the object's center, even if the object appears to span multiple cells.

For now, we can imagine that the maximum number of objects that can be detected in an image is the number of grid cells, N×N; later, we will see how to handle multiple objects per grid cell using anchor boxes.
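The one-object-per-cell limitation can be seen with a small sketch. It assigns each object to the cell containing its center and then reports cells that received more than one object, which is exactly the case a single box per cell cannot represent. The helper names and the example centers are hypothetical; the image size (448×448) and grid size (7×7) are assumed for illustration.

```python
def responsible_cell(cx, cy, img_w, img_h, grid_size):
    """Row and column of the grid cell containing the point (cx, cy) in pixels."""
    col = min(int(cx * grid_size / img_w), grid_size - 1)
    row = min(int(cy * grid_size / img_h), grid_size - 1)
    return row, col


def assign_objects(centers, img_w, img_h, grid_size):
    """Map each (row, col) cell to the indices of the objects centered in it."""
    cells = {}
    for idx, (cx, cy) in enumerate(centers):
        cell = responsible_cell(cx, cy, img_w, img_h, grid_size)
        cells.setdefault(cell, []).append(idx)
    return cells


# Three object centers in pixels; the first two fall into the same 64x64 cell
centers = [(160, 240), (170, 250), (400, 50)]
cells = assign_objects(centers, 448, 448, 7)
crowded = {cell: objs for cell, objs in cells.items() if len(objs) > 1}

print(cells)    # {(3, 2): [0, 1], (0, 6): [2]}
print(crowded)  # {(3, 2): [0, 1]}
```

Here, objects 0 and 1 compete for cell (3, 2), so with one object per cell only one of them could be represented.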
