Create YOLOv3 using PyTorch from scratch (Part-1)

This is the start of a series on understanding and implementing the YOLOv3 model using PyTorch.

1. Overview

This is the start of a mini-series about the inner workings of the YOLO version 3 model, and how to construct one from scratch using the PyTorch framework.

1.1. What is YOLO

For those not quite familiar with it, YOLO (You Only Look Once) is a family of object detection neural network models that earned its reputation through fast inference and a simple, end-to-end design. When it first came out in 2015, it was capable of real-time (> 30 fps) inference, while its major competitors at the time (e.g. the R-CNN family) were still running at a few fps or less. There are many follow-up improvements to the original YOLO model, including YOLOv2 and YOLOv3, both from its original creator Joseph Redmon. This series focuses on YOLOv3.

But, in order to get to v3, we will need to cover some necessary points from earlier versions. Some aspects of the model changed during the v1 – v3 evolution, while others remained the same. I’ll mention these points at relevant locations, and add some of my own interpretations, or interpretations from other people, to help make things clearer. Where I’m unsure, I will put down my questions, and if you can offer any help please leave a comment at the end.

1.2. Useful resources

Here are the original papers of the 3 versions of YOLO:

  1. YOLOv1 paper: You Only Look Once: Unified, Real-Time Object Detection.
  2. YOLOv2 paper: YOLO9000: Better, Faster, Stronger.
  3. YOLOv3 paper: YOLOv3: An Incremental Improvement.

Here are some helpful resources that I used as learning materials and references:

  1. C4W3L09 YOLO Algorithm YouTube video, by Andrew Ng
  2. How to implement a YOLO (v3) object detector from scratch in PyTorch: Part 1
  3. YOLOV3 Pytorch implementation by eriklindernoren

I don’t expect to outperform these authors, only to offer yet another understanding and interpretation of this wonderful model. Hopefully some of my own understanding of the model turns out to be helpful to your learning process. Writing it down in a relatively formal format also helps me gain a deeper understanding of the subject. If you spot any mistakes in my posts please leave a comment at the end of the page.

1.3. Structure of the series

I’m planning to break down the entire series into a few parts:

  1. Understand the YOLO model: the later part of this post. This will cover the building blocks inside the model, the model inputs and outputs, and the overall training and inference workflows.
  2. Build the model backbone. The backbone of YOLOv3 is a fully convolutional network called Darknet-53, which, as its name implies, has a total of 53 convolution layers. We will load the config file of the original YOLOv3 and implement it using PyTorch.
  3. Load pre-trained weights. To verify that the Darknet-53 model we built works as intended, we will load the pre-trained YOLOv3 weights and run inference on some images.
  4. Get the tools ready. Before getting into the training part, it is helpful to first get some utilities ready, including the code to compute IOU (Intersection-Over-Union), NMS (Non-Maximum Suppression) and mAP (mean Average Precision).
  5. Training data preparation. This part will write the pre-processing code to load the COCO detection dataset, including the images and annotation labels.
  6. Train the model. Whether we fine-tune, or train a new model on a different type of data from scratch, we need properly working training code. This is, in my opinion, a lot harder than all the previous parts, so I left it to the very end.

I plan to write up and publish the posts one-by-one. So please stay tuned if you find these helpful or interesting. (I don’t know how to set up notifications or subscription system. Apologies.)

2. Understand the YOLO model

2.1. What does YOLO do

In a nutshell, the YOLO model takes an image, or multiple images, and detects objects in the images. The output of the model consists of:

  1. A number of objects it recognizes, which could be 1, or many, or none.
  2. For each recognized object, it outputs:
    1. Its x- and y- coordinates and the width and height of the bounding box enclosing the object. Using this coordinate and size information, one can locate the detected object in the image, and visualize the detection by drawing the bounding box on top of the image, as in the example in Figure 1.
    2. A confidence score measuring how confident the model is about the existence of this detected object. This is a score in the range of 0 – 1, and can be used to filter out less certain predictions.
    3. If the model is trained on data containing multiple types of objects, or, using a more formal phrase, multiple classes of objects, it also outputs a confidence score for the detected object belonging to each class. Typically these classes are mutually exclusive, i.e. an object belongs to only 1 class (a multi-class problem). But a variant of YOLOv2, YOLO9000, was designed to be able to detect objects belonging to more than 1 class (a multi-label problem). E.g. a Norfolk terrier is labeled both as a “terrier” and as a “dog”.

      In our subsequent discussion and implementation, we will be restricted to the multi-class model: the classes are mutually exclusive.

Figure 1: Demonstration of the YOLOv3 detection result. The thin yellow grid lines divide the entire image into 13 * 13 cells. Each cell makes its own predictions. 2 detected objects, the dog and the truck, are shown by their bounding boxes. The center of the dog box is located in cell (8, 3), and the center of the truck box is located in cell (2, 9). Note that I made up these boxes and locations for illustration purposes; the true values for this sample may be somewhat different, but the principle is the same.

So, the model performs multiple tasks, localization and classification, in a single pass through the network, thus its name You Only Look Once. This also makes YOLO a multi-task model.

2.2. Input and output data

Given the above, the input data to the YOLO model are the images. When represented in numerical format, these are N-dimensional arrays/tensors of the shape:

[Bt, C, H, W]


  • Bt: batch-size.
  • C: size of the channel or feature dimension. For images in RGB format, C = 3. The ordering of the 3 color channels doesn’t really matter, as long as you stick to the same ordering during training and inference time.
  • H and W are the height and width of the image, in number of pixels. These need to be standardized to a fixed size, e.g. 416 x 416, or they can be randomly perturbed during the training stage as a data augmentation method, e.g. randomly sampled within a range. But typically one has to set a reasonable upper bound during training time, to save computations.

Also note that this [Bt, C, H, W] ordering follows the PyTorch convention. In TensorFlow the default ordering is [Bt, H, W, C].

When making inferences/predictions, the model outputs a number of detection proposals, because there could be multiple objects in a single image. How these proposals are arranged will be covered in just a minute. But we could already make an educated guess about what each proposal would contain. Based on the previous section, it should provide an array like this:

[x, y, w, h, obj, c1, c2, ..., ck]


  • x, y: give information about the x- and y- coordinates of the detection.
  • w, h: are the width and height information.
  • obj: is the object confidence score.
  • c1 to ck: the confidence score for each class.

Therefore, each proposed object detection is an array/tensor of length 5 + Nc, where Nc is the number of classes to classify the detected object.

For instance, the COCO detection dataset has 80 different classes, so Nc = 80, and each detection is represented by a tensor of size 5 + 80 = 85.

For the Pascal VOC dataset, Nc = 20, and each detection is a tensor of size 5 + 20 = 25.

The ordering of the x, y etc. elements is not crucial as long as it is kept consistent. But I also don’t see any good reason to break this convention, so I will use the same ordering as the original YOLO model.
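To make the layout concrete, here is a minimal sketch in plain Python (no PyTorch needed) that splits one flat detection into its parts. The function name `split_detection` and the sample numbers are made up for illustration; they are not from the original YOLO code.

```python
def split_detection(det, num_classes):
    """Split one flat detection [x, y, w, h, obj, c1..ck] into named parts."""
    if len(det) != 5 + num_classes:
        raise ValueError(f"expected {5 + num_classes} values, got {len(det)}")
    x, y, w, h, obj = det[:5]
    class_scores = list(det[5:])
    return {"box": (x, y, w, h), "obj": obj, "class_scores": class_scores}


# A made-up detection for a 3-class problem (Nc = 3), so its length is 5 + 3 = 8.
det = [0.5, 0.4, 0.2, 0.3, 0.9, 0.1, 0.7, 0.2]
parts = split_detection(det, num_classes=3)

# Index of the class with the highest score.
best_class = max(range(3), key=lambda i: parts["class_scores"][i])
print(parts["box"], parts["obj"], best_class)
```

For COCO the same function would be called with `num_classes=80` on a length-85 detection.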

2.3. Arrange the detections, horizontally

The reasoning is fairly intuitive.

There are typically multiple objects in a single image, so we need to make multiple predictions.

Different objects are located at different places in the image, so we place different detections at different locations.

In the realm of numerically represented images, it is most natural to encode locations as elements in a matrix/array.

That’s how YOLO deals with this: it divides a “prediction matrix/array” into Cy number of rows and Cx number of columns, so a total of Cy * Cx cells, and each cell makes its own predictions, corresponding to different places in the image.

In the example shown in Figure 1, the thin grid lines denote such cells, and the dog target object is bounded by a bounding box, whose center point is located in cell (8,3). Another object, the truck, is located at a different cell (2,9).
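As a small sketch of this cell assignment, the snippet below maps a box center given in pixels to its (row, column) cell, assuming a 416 * 416 image and a 13 * 13 grid. The helper name `cell_of` and the pixel coordinates are my own made-up values, chosen only so that they land in the cells quoted for Figure 1.

```python
def cell_of(center_x, center_y, image_size=416, grid_size=13):
    """Return the (row, col) of the grid cell containing a box center in pixels."""
    stride = image_size // grid_size  # pixels covered by one cell, 32 here
    # Clamp so that a center exactly on the right/bottom edge stays inside the grid.
    col = min(int(center_x // stride), grid_size - 1)
    row = min(int(center_y // stride), grid_size - 1)
    return row, col


# Made-up box centers (x, y) in pixels.
dog_cell = cell_of(110, 270)
truck_cell = cell_of(300, 80)
print(dog_cell, truck_cell)
```

Note the (row, column) ordering of the result, which matches the (cy, cx) style of cell indexing used for Figure 1.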

2.4. Multiple detections in the same cell

A natural question to ask is: what if two objects overlap with each other and fall into the same cell?

One way to solve this problem is to make the cell size smaller, to reduce the chance that multiple objects would land in the same cell. This is particularly effective for large-sized objects.

Additionally, YOLO also allows each cell to predict multiple objects. And this is done slightly differently in YOLOv1 and later versions (up to v3 at least, I haven’t read about v4 or later).

In YOLOv1, each cell makes B predictions, corresponding to B bounding boxes. For instance, for evaluation on the Pascal VOC data, the authors set B = 2. So the total number of predictions is Cy * Cx * B. But the B predictions in each cell have to be of the same class, so the output tensor size is [Cy, Cx, B * 5 + Nc].

Since YOLOv2, the concept of anchor boxes was introduced. These can be understood as prescribed bounding box templates. They come in different sizes and aspect ratios. For instance, one of those used in YOLOv3 has a dimension of 116 * 90, measured in number of pixels. We will go deeper into the size computation shenanigans later, but for now it suffices to know that each cell can produce more than 1 prediction. In YOLOv1, each prediction is associated with a bounding box. In later versions, each prediction is associated with an anchor box.

The number of anchor boxes B in each cell can be changed. YOLOv2 used 5, and YOLOv3 uses 3 at each scale level. Unlike in v1, the B predictions in each cell can be of different classes in v2 and later versions. So, the output tensor size is now [Cy, Cx, B, 5 + Nc].
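The two layouts can be compared with a little arithmetic. The sketch below uses two hypothetical helper functions of my own; the 7 * 7 grid for YOLOv1 on Pascal VOC is taken from the YOLOv1 paper, and the 13 * 13 grid assumes a 416 * 416 input at stride 32.

```python
def yolov1_output_shape(cy, cx, num_boxes, num_classes):
    """YOLOv1: the B boxes in a cell share one set of class scores."""
    return (cy, cx, num_boxes * 5 + num_classes)


def anchor_output_shape(cy, cx, num_anchors, num_classes):
    """YOLOv2 and later: every anchor box carries its own class scores."""
    return (cy, cx, num_anchors, 5 + num_classes)


v1_voc = yolov1_output_shape(7, 7, num_boxes=2, num_classes=20)
v2_voc = anchor_output_shape(13, 13, num_anchors=5, num_classes=20)
v3_coco = anchor_output_shape(13, 13, num_anchors=3, num_classes=80)
print(v1_voc, v2_voc, v3_coco)
```

The YOLOv3 shape here is for a single scale level only.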

2.5. Multi-scale detections

The authors of YOLO admitted that the v1 version struggled at detecting small objects, particularly those that come in groups:

Our model struggles with small objects that appear in groups, such as flocks of birds.

(From the YOLOv1 paper.)

Why is it the case?

The limited number of bounding boxes in each cell, and the relatively small number of predicting cells mentioned previously are part of the reason.

It is also because the network of YOLOv1 makes predictions using the outputs only from the last model layer.

Because the information from the input images has gone through multiple convolution layers and pooling layers, the feature maps become smaller and smaller in the width and height dimensions. What is left at the end of the convolution layers is a highly distilled representation of the image, with fine-grained details largely lost during the process. And small-sized objects are particularly susceptible to such a loss.

So, to counter this, YOLOv2 included a passthrough layer that connects feature maps at a 26 * 26 resolution with those at a 13 * 13 resolution, thus providing some fine-grained features to the detection-making layers. The connection is done in a space-to-depth manner (the reverse of a pixel-shuffle): adjacent spatial features are stacked into the channel dimension. We will not expand on this because YOLOv3 does this differently.

YOLOv3 achieves multi-scale detections by producing predictions at 3 different scale levels:

  1. Large scale: for the detection of large-sized objects. These outputs are taken at the end of the convolution network (see Figure 2 for a schematic), with a stride of 32. I.e. if the input image is 416 * 416 pixels, feature maps at stride-32 have a size of 13 * 13 (416 / 32 = 13).
  2. Mid scale: for detecting mid-size objects. These are taken from the middle of the network, with a stride of 16. I.e. feature maps are 26 * 26.
  3. Small scale: for detecting small-sized objects. These are taken from an even earlier layer in the network, with a stride of 8. I.e. feature maps are 52 * 52.

Again, to complement the mid scale and small scale detections with fine-grained features, passthrough connections are made:

  • For mid scale detections, the feature map from a layer towards the end of the network is taken. This feature map is at stride-32 (13 * 13 in size). It is up-sampled (by interpolation) by a factor of 2, to stride-16 (26 * 26 in size), and then concatenated along the channel dimension with the feature map from the last layer at stride-16 (layer number 61). This concatenated feature map is passed through a few layers of convolutions before outputting a prediction.
  • For small scale detections, the feature map from the side-branch created above at stride-16 is taken, up-sampled to stride-8, and concatenated with the output from layer 36 at stride-8. This concatenated feature map is passed through a few more conv layers before outputting the prediction for small objects.

Figure 2 below gives an illustration of this process. Note that when making the small scale predictions, it is incorporating information at 3 scales: stride 8, 16 and 32.


Figure 2: Structure of the YOLOv3 model. Blue boxes represent convolution layers, labeled with their stride levels. Prediction outputs are shown as red boxes, and there are 3 of them, with different stride levels. Pass-through connections are labeled as “Route”, and the layers from which they are taken are given in parentheses (e.g. Layer 61; indexing starts from 0).

This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map.

(From the YOLOv3 paper.)

So, with 3 scales, the output tensor size is now [Cy1, Cx1, B, 5 + Nc] + [Cy2, Cx2, B, 5 + Nc] + [Cy3, Cx3, B, 5 + Nc].

Where Cy1 = Cx1 = 13, Cy2 = Cx2 = 26 and Cy3 = Cx3 = 52.
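As a quick sanity check of these sizes, the following sketch computes the grid size at each stride and the total number of predictions the model makes for one image. It assumes a 416 * 416 input and 3 anchor boxes per scale; the function names are my own.

```python
def grid_sizes(image_size=416, strides=(32, 16, 8)):
    """Feature map side length at each stride level."""
    return [image_size // s for s in strides]


def total_predictions(image_size=416, strides=(32, 16, 8), num_anchors=3):
    """Total number of [x, y, w, h, obj, c1..ck] vectors over all scales."""
    return sum(g * g * num_anchors for g in grid_sizes(image_size, strides))


sizes = grid_sizes()
total = total_predictions()
print(sizes, total)
```

So for a single 416 * 416 image, the 3 output tensors together hold a little over ten thousand candidate detections, most of which are filtered out afterwards.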

Recall that in YOLOv3, B = 3 prescribed anchor boxes are associated with each scale level. These are:

  1. (116, 90), (156, 198), (373, 326) for large objects,
  2. (30, 61), (62, 45), (59, 119) for mid objects,
  3. (10, 13), (16, 30), (33, 23) for small objects.

These are taken from a K-Means clustering of the bounding boxes in the training data, and the numbers are in units of pixels. This leads to size and location computations, detailed in the following sub-section.

2.6. The localization task

Let’s dive deeper into how YOLO predicts the location of an object.

Firstly, the location of an object is represented by the location of its bounding box, so we need only 4 numbers, in either of these 2 ways:

  1. (x, y) coordinate of a corner point and (x, y) coordinate of the diagonally opposite corner: [x1, y1, x2, y2]. Or
  2. (x, y) coordinate of the center point and (width, height) size: [x, y, w, h].

YOLOv3 takes the 2nd, [x, y, w, h], representation for bounding box predictions. But both formats will be used at different places, so we need some housekeeping code to convert between them. It should be trivial to implement though.
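A minimal sketch of the two conversions is given below, in plain Python on a single box. The function names are my own choice; a real implementation would more likely operate on whole tensors of boxes at once.

```python
def xyxy_to_xywh(box):
    """[x1, y1, x2, y2] (two opposite corners) -> [x_center, y_center, w, h]."""
    x1, y1, x2, y2 = box
    return [(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1]


def xywh_to_xyxy(box):
    """[x_center, y_center, w, h] -> [x1, y1, x2, y2]."""
    x, y, w, h = box
    return [x - w / 2, y - h / 2, x + w / 2, y + h / 2]


corners = [10, 20, 50, 100]
center_form = xyxy_to_xywh(corners)
round_trip = xywh_to_xyxy(center_form)
print(center_form, round_trip)
```

The two functions are inverses of each other, so a round trip returns the original corner coordinates.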

Now about the sizes. I found it beneficial to first get the coordinate systems straight. Because YOLO implicitly uses 3 different coordinate systems, it is easy to get confused about which one is used at different places, particularly during the training stage.

  1. The original, image coordinate: measured in pixels. This is the coordinate system we use to locate a pixel in an image, x- for counting the columns, and y- for rows.
  2. The feature map coordinate: again, x- for counting columns and y- for rows, but here we are counting the feature map cells. Recall that we used the term cells previously; it may be beneficial to stick to this term to distinguish it from the pixels unit used in the image coordinate. This distinction is easily observed in Figure 1, where the image is divided into 13 * 13 cells, but one such cell obviously contains more than 1 pixel. At stride=32, the feature map has a size of 13 * 13. So the feature map coordinate at this particular stride level measures offsets within a 13 * 13 matrix. Similarly, at stride=16, the feature map coordinate measures offsets within a 26 * 26 matrix, and 1 unit of offset at the stride=32 level is twice as large as 1 unit of offset at the stride=16 level, in the image coordinate distance sense. Also note that offsets in the feature map coordinate can have decimal places. E.g. 1.5 means 1 and a half units of offset, with respect to the corner point of the feature map at a certain stride level.
  3. The fractional coordinate: for both x- and y- dimensions, this is a float in the [0, 1] range, measured as a fraction of the image width or height. It could be measuring offsets, like the x- and y- locations, or width/height sizes. For instance, the bounding box labels in training data are often encoded in fractional coordinates. E.g. if a label is in [x_center, y_center, width, height, class] format, and has values of [0.5, 0.51, 0.1, 0.12, 10], then the first 4 floats give the bounding box location/size measured in fractions of that image. (Note that the original COCO annotation files store boxes as [x_min, y_min, width, height] in pixels, so they need to be converted to this form first.)
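To see how the 3 coordinate systems relate, here is a small sketch that takes the fractional label from the example above and converts it first to image coordinates and then to feature map coordinates. It assumes a 416 * 416 image and the stride-32 level; the function names are hypothetical.

```python
def fractional_to_image(box, image_w, image_h):
    """Fractional [x, y, w, h] -> image coordinates in pixels."""
    x, y, w, h = box
    return [x * image_w, y * image_h, w * image_w, h * image_h]


def image_to_feature_map(box, stride):
    """Image coordinates in pixels -> feature map coordinates in cells."""
    return [v / stride for v in box]


label = [0.5, 0.51, 0.1, 0.12]  # fractional [x_center, y_center, width, height]
pixels = fractional_to_image(label, 416, 416)
cells = image_to_feature_map(pixels, stride=32)

# The integer parts of the center give the responsible cell (row, col).
row, col = int(cells[1]), int(cells[0])
print(pixels, cells, (row, col))
```

The box center ends up at about (6.5, 6.63) in feature map coordinates, i.e. inside cell (6, 6), with fractional offsets of about 0.5 and 0.63 from that cell's corner.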

Now let’s look at how YOLOv3 locates a bounding box.

First, this is the equation given in the YOLOv2 paper (same holds for YOLOv3):

\begin{equation} \begin{aligned} b_x = & \sigma(t_x) + c_x \\ b_y = & \sigma(t_y) + c_y \\ b_w = & p_w e^{t_w} \\ b_h = & p_h e^{t_h} \\ \end{aligned} \end{equation}


\(t_y\), \(t_x\) are the raw model outputs about the y- and x- locations of the bounding box, produced by a cell at location \((c_y, c_x)\) in the feature map.

\(\sigma()\) is the sigmoid function. So \(\sigma(t_y)\) and \(\sigma(t_x)\) are floats in the [0,1] range, and are fractional offsets with respect to the corner of the cell at \((c_y, c_x)\).

When added onto the integer cell counts of \(c_y\) and \(c_x\), the resultant bounding box center location \((b_y, b_x)\) is using the feature map coordinate system.

\(t_w\) and \(t_h\) are the raw model outputs about the width and height sizes of the bounding box, again produced by the same cell at \((c_y, c_x)\) in the feature map.

\(p_w\) and \(p_h\) are the width and height of an anchor box, so \(e^{t_w}\) and \(e^{t_h}\) are dimensionless (fractional) and act as positive scaling factors that resize the associated anchor box to match the object being detected. Because \(b_x\) and \(b_y\) are in feature map coordinates, so should \(b_w\) and \(b_h\) be.

But, the prescribed anchor boxes, e.g. the one with a size of (116, 90), are measured in image coordinates using units of pixels. Therefore, we need to convert 116 to a width measure in the feature map coordinate, and similarly for the height of 90. To do so, we need to divide them by the stride of the feature map in question. For instance, for large scale detections, \(p_w\) may be \(116 / 32 = 3.625\) and \(p_h\) may be \(90 / 32= 2.8125\).
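The equations above, together with the anchor size conversion, can be sketched in a few lines of plain Python. The function `decode_box` and its argument names are my own; the raw outputs are all set to 0 in the example so that the result is easy to verify by hand (sigmoid(0) = 0.5 and exp(0) = 1).

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, anchor_wh, stride):
    """Turn raw outputs into a box in feature map coordinates."""
    # Anchor sizes are given in pixels, so convert them to cell units first.
    p_w = anchor_wh[0] / stride
    p_h = anchor_wh[1] / stride
    b_x = sigmoid(t_x) + c_x
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


# Cell at row 8, column 3 of the stride-32 feature map, with the (116, 90) anchor.
box = decode_box(0, 0, 0, 0, c_x=3, c_y=8, anchor_wh=(116, 90), stride=32)
print(box)
```

With all raw outputs at 0, the predicted box sits at the center of its cell and has exactly the anchor's size, 3.625 by 2.8125 cells.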

But why “may be”? Recall that a single cell has 3 anchor boxes in YOLOv3, so it could have been the anchor box of (156, 198), or the (373, 326) one, in the case of large scale detections. During inference time, all 3 anchor boxes are used. During training time, only the anchor box that most closely matches the ground truth label will be picked. If the ground truth label object is not a “large” object in the first place, maybe none of the 3 are picked. We will come back to the training process in a later post.

Hopefully I’m not over-complicating things, but I found it helpful to map out these different coordinate systems to better understand YOLO’s localization mechanisms.

Figure 3 below gives an illustration of the relationships between the 3 coordinate systems, and the conversions of some variables between them.


Figure 3: Relationships between the 3 coordinate systems used in YOLOv3. Ellipses show some variables in the coordinate system of the corresponding color, and arrows denote operations applied to convert from 1 coordinate system to another.

Having got the predictions in [bx, by, bw, bh] format, we only need to multiply them by the current stride level to get back to the image coordinate, measured in pixels, and the results can then be plotted. (NOTE that if you have re-sized the image, for instance, to the standard 416 * 416 size, there is an extra step of re-scaling needed.)
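These last two steps can be sketched as follows. The re-scaling function assumes the image was simply stretched to the square network size, with no padding added; if padding (letterboxing) was used, the padding offsets would have to be removed first. Function names and the 832 * 624 original image size are made up for illustration.

```python
def feature_map_to_image(box, stride):
    """[bx, by, bw, bh] in cells -> pixels of the network input image."""
    return [v * stride for v in box]


def rescale_to_original(box, network_size, orig_w, orig_h):
    """Undo a plain stretch-resize (an assumption) to the network input size."""
    x, y, w, h = box
    sx = orig_w / network_size
    sy = orig_h / network_size
    return [x * sx, y * sy, w * sx, h * sy]


box_cells = [3.5, 8.5, 3.625, 2.8125]  # a stride-32 prediction, in cells
box_pixels = feature_map_to_image(box_cells, stride=32)
box_orig = rescale_to_original(box_pixels, 416, orig_w=832, orig_h=624)
print(box_pixels, box_orig)
```

The box is still in [x_center, y_center, w, h] form at this point, so it needs one more conversion to corner form before most plotting functions can draw it.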

So that is, very superficially, how YOLOv3 predicts object locations. Exactly how it produces the correct numbers of [tx, ty, tw, th] such that simple coordinate transformations lead to a correct bounding box is beyond me. And frankly, I don’t think we understand neural networks well enough to clearly decipher this black-box. All we can say is that when we feed the model with a large enough amount of correctly formulated training data, it somehow learns to build the correct associations, with a certain degree of generalizability beyond the data it has seen.

2.7. Confidence and classification scores

In addition to localization, YOLO also predicts a confidence score for the existence of an object, and a probability of the object belonging to each of the Nc possible classes, should there be an object in the first place.

Formally, the equation for confidence score prediction is:

\[ P_o = \sigma(t_o) \]

where \(t_o\) is the raw model output. Using the

[x, y, w, h, obj, c1, c2, ..., ck]

arrangement, it is the obj term.

Again, \(\sigma()\) is the sigmoid function, and its output can be interpreted as a probability prediction.

The probability prediction for classes is conditioned on the existence of an object, so the final class score is the product:

Pr(Class_i | Object) * Pr(Object)

Note that this is the equation (1) in the YOLOv1 paper, and it is stated that:

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

Pr(Class_i | Object) * Pr(Object) * IOU(truth, pred) = Pr(Class_i) * IOU(truth, pred)

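I’m wondering if there is some error in this statement: we don’t really have the truth term to compare against at test time. One possible reading is that the box confidence is itself trained to approximate Pr(Object) * IOU(truth, pred), which is how the YOLOv1 paper defines confidence, so at test time the model’s confidence output stands in for that product. Based on the YOLOV3 Pytorch implementation by eriklindernoren, I assume that no explicit IOU term is computed at test/inference time.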

Again, under the exclusive classes assumption, class prediction is just a multi-class classification task; the only difference is that the class probability is multiplied by the objectness probability.
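Here is a minimal sketch of the score computation at inference time, without any IOU term. It assumes each raw class output goes through its own sigmoid, following the independent logistic classifiers described in the YOLOv3 paper; a softmax over the class outputs would be the alternative for strictly exclusive classes. The function names and raw values are made up.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)


def class_scores(t_obj, class_logits):
    """Pr(Class_i | Object) * Pr(Object) for every class."""
    p_obj = sigmoid(t_obj)
    # Assumption: one sigmoid per class, not a softmax over classes.
    return [p_obj * sigmoid(t) for t in class_logits]


scores = class_scores(t_obj=2.0, class_logits=[-1.0, 3.0, 0.0])
best = max(range(len(scores)), key=lambda i: scores[i])
print(scores, best)
```

Under the exclusive classes assumption, the predicted class is simply the one with the highest score.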

I’ll talk more about objectness and class predictions in the later post on training the model.

3. Summary

The above described how a well-trained YOLOv3 model, in its expected configuration, makes detections. Let’s summarize the main points.

  1. YOLOv3 makes predictions at 3 scale levels: for large objects, mid objects, and small objects.
  2. The 3-scale predictions come from feature maps at 3 stride levels: stride-32 for large objects, stride-16 for mid objects, and stride-8 for small objects.
  3. At each scale level, there are 3 prescribed anchor boxes:

    1. (116, 90), (156, 198), (373, 326) for large objects,
    2. (30, 61), (62, 45), (59, 119) for mid objects,
    3. (10, 13), (16, 30), (33, 23) for small objects.

    These are taken from a K-Means clustering on the bounding boxes from training data, and all sizes are measured in the image coordinate, in pixels.

  4. At each scale level, the model predicts the x- and y- location (bx, by) of a bounding box as offsets with respect to the corner of the feature map, and the width and height of the bounding box, (bw, bh), again measured in units of cells of the corresponding feature map.
  5. To get back to the image coordinate measured in pixels, we multiply the [bx, by, bw, bh] values by the stride level of the feature map: 32 for large scale objects, 16 for mid, and 8 for small objects.
  6. In addition to locations, the model also predicts a confidence score about the existence of an object inside the bounding box we just formulated. And, for each of the prescribed Nc classes, a conditional probability of the object belonging to that class.

That’s the part of the work involving the YOLO model itself. There are some extra housekeeping steps afterwards:

  1. Thresholding the confidence scores: the model may produce a large number of low confidence predictions, which could be filtered out using a threshold on the confidence scores.
  2. Removing duplicates by Non-maximum suppression: the model may produce multiple similar predictions around the true location of an object, with slightly different confidence levels. Without further information, it is reasonable to assume that the one with highest confidence score is the “correct” prediction, while other highly overlapping bounding boxes are “duplicated predictions”. Such duplicates are typically removed using a Non-maximum suppression method. We will implement this in a later post.
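These two housekeeping steps can be sketched as below. This is a simple greedy, class-agnostic version written in plain Python for readability, not the implementation from any particular repository; the threshold values of 0.5 and the sample boxes are made-up illustrations. Boxes are in [x1, y1, x2, y2] form.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes in [x1, y1, x2, y2] form."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(detections, conf_thres=0.5, iou_thres=0.5):
    """detections: list of (box, score). Returns the detections that survive."""
    # Step 1: drop low-confidence predictions, then sort by score, highest first.
    candidates = sorted(
        (d for d in detections if d[1] >= conf_thres),
        key=lambda d: d[1],
        reverse=True,
    )
    # Step 2: keep a box only if it does not overlap too much with a kept one.
    kept = []
    for box, score in candidates:
        if all(iou(box, k[0]) <= iou_thres for k in kept):
            kept.append((box, score))
    return kept


dets = [
    ([10, 10, 110, 110], 0.9),
    ([12, 12, 112, 112], 0.8),   # near-duplicate of the first box
    ([200, 200, 300, 300], 0.7),
    ([50, 50, 60, 60], 0.2),     # below the confidence threshold
]
kept = filter_and_nms(dets)
print(kept)
```

In practice NMS is often run separately for each class, so that overlapping objects of different classes do not suppress each other.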

Figure 4 shows a schematic layout of the YOLOv3 model, summarizing many of the key points in this post.


Figure 4: Schematic layout of the YOLOv3 model predictions. Predictions at 3 scales are represented by 3 cubes, with different row/column numbers, but the same size in the channel dimension. In each scale, a cell in the feature map is highlighted. Each cell consists of predictions from 3 anchor boxes, represented by 3 different colors. The identified object, the dog, is hypothetically predicted by the green anchor box, at a certain cell, in the large scale prediction cube.

That’s pretty much it for the 1st part of the series. Hopefully this makes things a bit clearer if you are interested in the YOLO model. We’ll revisit some of these points in later posts, with code implementations. The next part will be building the Darknet-53 network. So please stay tuned.

Author: guangzhi

Created: 2022-06-22 Wed 22:36

