Loss function#
Training the model requires a loss function that balances objectness classification and bounding box regression. The standard approach is a multi-task formulation that combines the two loss terms:

$$
L = L_{\rm object} + \lambda \, L_{\rm box}
$$

Here, \(\lambda\) is a hyperparameter that balances the two terms (\(\lambda = 1\) is often a good choice).
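As a minimal end-to-end sketch (assuming PyTorch, one objectness logit and four box coordinates per grid cell; both terms are detailed in the subsections below):

```python
import torch
import torch.nn.functional as F

def detection_loss(obj_logits, pred_boxes, obj_targets, target_boxes, lam=1.0):
    """Multi-task loss: objectness classification plus box regression on positive cells.

    obj_logits:   (N, H, W)    per-cell objectness logits
    pred_boxes:   (N, H, W, 4) predicted box coordinates
    obj_targets:  (N, H, W)    binary labels (1 = cell contains an object)
    target_boxes: (N, H, W, 4) ground-truth box coordinates
    """
    l_object = F.binary_cross_entropy_with_logits(obj_logits, obj_targets.float())
    mask = obj_targets.bool()
    l_box = (F.smooth_l1_loss(pred_boxes[mask], target_boxes[mask])
             if mask.any() else pred_boxes.sum() * 0.0)
    return l_object + lam * l_box
```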
Objectness loss#
The objectness loss (\(L_{\rm object}\)) measures how well the model predicts whether each grid cell contains an object. This is a binary classification problem, so the loss can be computed using binary cross-entropy between the predicted logit and the target binary label.
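A minimal sketch in PyTorch (an assumption; the section does not prescribe a framework), with one objectness logit per grid cell and binary targets of the same shape:

```python
import torch
import torch.nn.functional as F

def objectness_loss(obj_logits, obj_targets):
    """Binary cross-entropy between per-cell objectness logits and 0/1 targets.

    obj_logits:  (N, H, W) raw logits, one per grid cell
    obj_targets: (N, H, W) binary labels (1 = cell contains an object)
    """
    return F.binary_cross_entropy_with_logits(obj_logits, obj_targets.float())
```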
Note
Since most grid cells in an image are empty, the dataset is imbalanced. To mitigate this, one can down-weight the contribution of negative cells or use the focal loss, which emphasizes harder examples and reduces the effect of abundant easy negatives.
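As one concrete option (a sketch, not necessarily the exact formulation used by any particular detector), a sigmoid focal loss with the usual \(\gamma\) and \(\alpha\) hyperparameters can be written as follows; torchvision also ships an equivalent `torchvision.ops.sigmoid_focal_loss`:

```python
import torch
import torch.nn.functional as F

def focal_objectness_loss(obj_logits, obj_targets, gamma=2.0, alpha=0.25):
    """Sigmoid focal loss: down-weights the abundant easy negative cells."""
    targets = obj_targets.float()
    p = torch.sigmoid(obj_logits)
    ce = F.binary_cross_entropy_with_logits(obj_logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```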
Regression loss#
The regression loss (\(L_{\rm box}\)) measures how accurately the model predicts the bounding box coordinates. This is a regression problem, so the loss can be computed using the smooth L1 distance between the predicted and target coordinates. Only grid cells that contain an object contribute to this loss; empty cells are ignored.
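A minimal sketch, assuming predicted and target boxes of shape `(N, H, W, 4)` and a binary mask marking the cells that contain an object:

```python
import torch
import torch.nn.functional as F

def box_loss(pred_boxes, target_boxes, obj_mask):
    """Smooth L1 loss computed over positive (object-containing) cells only.

    pred_boxes, target_boxes: (N, H, W, 4) box coordinates
    obj_mask:                 (N, H, W) binary mask (1 = cell contains an object)
    """
    mask = obj_mask.bool()
    if not mask.any():                 # no positive cells in this batch
        return pred_boxes.sum() * 0.0  # zero loss that keeps the graph intact
    return F.smooth_l1_loss(pred_boxes[mask], target_boxes[mask])
```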
While the smooth L1 loss has traditionally been used for bounding box regression, it measures only the difference between predicted and target coordinates, not the quality of overlap between boxes. In dense prediction tasks, this can lead to suboptimal localization: a box may have low coordinate error but still misalign significantly with the ground truth. To address this, modern detectors increasingly use IoU-based losses, which directly optimize the overlap between predicted and ground-truth boxes.
Intersection over Union (IoU)#
The IoU between two boxes \(A\) and \(B\) is defined as:

$$
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

It ranges from 0 (no overlap) to 1 (perfect alignment). A simple IoU loss can be defined as:

$$
L_{\rm IoU} = 1 - \mathrm{IoU}
$$

This loss directly encourages maximizing overlap, but it has two limitations when boxes do not overlap (IoU = 0): it provides no gradient, and it does not penalize differences in box size or position.
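A minimal sketch of the IoU computation and loss for axis-aligned boxes, assuming the `(x1, y1, x2, y2)` corner format:

```python
import torch

def box_iou(a, b, eps=1e-7):
    """Element-wise IoU between boxes a and b, both (..., 4) in (x1, y1, x2, y2) format."""
    # Intersection rectangle
    x1 = torch.maximum(a[..., 0], b[..., 0])
    y1 = torch.maximum(a[..., 1], b[..., 1])
    x2 = torch.minimum(a[..., 2], b[..., 2])
    y2 = torch.minimum(a[..., 3], b[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    union = area_a + area_b - inter
    return inter / (union + eps)

def iou_loss(pred, target):
    return 1.0 - box_iou(pred, target)
```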
Generalized IoU (GIoU)#
The GIoU loss addresses these limitations by adding a penalty term based on the smallest enclosing box that contains both the predicted and target boxes. It is defined as:

$$
L_{\rm GIoU} = 1 - \mathrm{IoU} + \frac{|C \setminus (A \cup B)|}{|C|}
$$

where \(C\) is the smallest enclosing box that contains both the predicted box \(A\) and the target box \(B\), and \(A \cup B\) is their union. GIoU provides meaningful gradients even when the boxes do not overlap, encouraging predictions to move toward the target.
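A standalone sketch under the same box-format assumptions as above, where the extra term is the fraction of the enclosing box \(C\) not covered by the union:

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """Element-wise GIoU loss for boxes in (x1, y1, x2, y2) format."""
    # Intersection and union
    x1 = torch.maximum(pred[..., 0], target[..., 0])
    y1 = torch.maximum(pred[..., 1], target[..., 1])
    x2 = torch.minimum(pred[..., 2], target[..., 2])
    y2 = torch.minimum(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)

    # Smallest enclosing box C
    cw = torch.maximum(pred[..., 2], target[..., 2]) - torch.minimum(pred[..., 0], target[..., 0])
    ch = torch.maximum(pred[..., 3], target[..., 3]) - torch.minimum(pred[..., 1], target[..., 1])
    area_c = cw * ch

    giou = iou - (area_c - union) / (area_c + eps)
    return 1.0 - giou
```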
Distance IoU (DIoU)#
The DIoU loss improves on the IoU loss by also penalizing the distance between the predicted and target box centers. It is defined as:

$$
L_{\rm DIoU} = 1 - \mathrm{IoU} + \frac{d^2}{c^2}
$$

Here, \(d\) is the distance between the box centers and \(c\) is the diagonal length of the smallest enclosing box that contains both predicted and target boxes, so the penalty \(d^2 / c^2\) is a normalized center distance. The DIoU loss encourages not only better overlap but also better alignment of the box centers.
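A standalone sketch under the same assumptions, with the squared center distance \(d^2\) divided by the squared enclosing-box diagonal \(c^2\):

```python
import torch

def diou_loss(pred, target, eps=1e-7):
    """Element-wise DIoU loss for boxes in (x1, y1, x2, y2) format."""
    # IoU term
    x1 = torch.maximum(pred[..., 0], target[..., 0])
    y1 = torch.maximum(pred[..., 1], target[..., 1])
    x2 = torch.minimum(pred[..., 2], target[..., 2])
    y2 = torch.minimum(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared distance d^2 between the box centers
    d2 = ((pred[..., 0] + pred[..., 2]) - (target[..., 0] + target[..., 2])) ** 2 / 4 \
       + ((pred[..., 1] + pred[..., 3]) - (target[..., 1] + target[..., 3])) ** 2 / 4

    # Squared diagonal c^2 of the smallest enclosing box
    cw = torch.maximum(pred[..., 2], target[..., 2]) - torch.minimum(pred[..., 0], target[..., 0])
    ch = torch.maximum(pred[..., 3], target[..., 3]) - torch.minimum(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2

    return 1.0 - iou + d2 / (c2 + eps)
```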
Complete IoU (CIoU)#
The CIoU loss extends the DIoU loss by also considering aspect ratio consistency between the predicted and target boxes. It is defined as:

$$
L_{\rm CIoU} = 1 - \mathrm{IoU} + \frac{d^2}{c^2} + \alpha \, \rho
$$

where \(\rho\) measures the difference in aspect ratio between the boxes and \(\alpha\) is a weighting factor. The CIoU loss encourages predictions that are well-aligned in position, size, and shape.
Tip
The CIoU loss is often regarded as the most effective IoU-based loss. However, the optimal choice may vary depending on the dataset.
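If a recent torchvision is available, these losses do not need to be hand-rolled: `torchvision.ops` provides `generalized_box_iou_loss`, `distance_box_iou_loss`, and `complete_box_iou_loss` for boxes in `(x1, y1, x2, y2)` format (availability depends on your torchvision version). A quick usage sketch:

```python
import torch
from torchvision.ops import complete_box_iou_loss  # requires a recent torchvision

pred = torch.tensor([[10.0, 10.0, 50.0, 50.0]], requires_grad=True)
target = torch.tensor([[12.0, 15.0, 48.0, 55.0]])

loss = complete_box_iou_loss(pred, target, reduction="mean")
loss.backward()
print(loss.item())
```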
Distribution focal loss (optional)#
Traditionally, bounding box regression consists of predicting four continuous values: the distances from a reference point (usually a cell center) to the four sides of a bounding box (left, top, right, bottom). The Distribution Focal Loss (DFL) reformulates this task by predicting a discrete probability distribution for each side of a bounding box. Instead of predicting four continuous values per grid cell, the regression branch outputs four histograms across a set of bins that cover the possible range of distances. By doing so, the regression task is reformulated as a classification problem over discretized intervals. This often leads to higher localization accuracy, especially for small objects.
Predicted distribution#
Assume distances lie within the range \([y_{\min}, y_{\max}]\). We discretize this range using \(K + 1\) points \(y_0 < y_1 < \cdots < y_K\), with \(y_0 = y_{\min}\) and \(y_K = y_{\max}\).
The model predicts a logit vector \(s = (s_0, s_1, \ldots, s_K)\) for each side of a bounding box. Applying the softmax yields a probability distribution over the discretized range:

$$
p_k = \frac{\exp(s_k)}{\sum_{j=0}^{K} \exp(s_j)}, \qquad k = 0, 1, \ldots, K
$$

At inference time, we recover a continuous distance by taking the expectation of the predicted distribution:

$$
\hat{y} = \sum_{k=0}^{K} p_k \, y_k
$$
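A minimal decoding sketch, assuming \(K + 1\) evenly spaced points between `y_min` and `y_max` (the default range values here are placeholders) and logits of shape `(..., K + 1)`:

```python
import torch

def decode_distance(logits, y_min=0.0, y_max=16.0):
    """Turn per-side logits over K + 1 bins into a continuous distance.

    logits: (..., K + 1) raw scores for one side of the box
    """
    num_points = logits.shape[-1]                                        # K + 1
    y = torch.linspace(y_min, y_max, num_points, device=logits.device)   # discretization points
    p = torch.softmax(logits, dim=-1)                                    # predicted distribution
    return (p * y).sum(dim=-1)                                           # expectation = decoded distance
```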
Loss function#
Given a ground-truth distance \(y \in [y_{\min}, y_{\max}]\), we first locate the discretized interval \([y_{\ell}, y_{\ell + 1}]\) that contains \(y\). This is done by finding the index \(\ell\) such that \(y_{\ell} \le y < y_{\ell + 1}\).
We then define a target distribution that assigns all the probability mass to the two points surrounding \(y\), weighted by their proximity to the target value:

$$
p^{\ast}_{\ell} = \frac{y_{\ell + 1} - y}{y_{\ell + 1} - y_{\ell}}, \qquad
p^{\ast}_{\ell + 1} = \frac{y - y_{\ell}}{y_{\ell + 1} - y_{\ell}}, \qquad
p^{\ast}_{k} = 0 \ \text{otherwise}
$$

The loss is then computed as the cross-entropy between the predicted and target distributions:

$$
L_{\rm DFL} = -\sum_{k=0}^{K} p^{\ast}_{k} \log p_{k}
            = -\left( p^{\ast}_{\ell} \log p_{\ell} + p^{\ast}_{\ell + 1} \log p_{\ell + 1} \right)
$$
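A minimal sketch of the loss, again assuming evenly spaced discretization points so that the interval index and the two proximity weights have closed forms:

```python
import torch
import torch.nn.functional as F

def dfl_loss(logits, target, y_min=0.0, y_max=16.0):
    """Distribution focal loss for one side of a box.

    logits: (..., K + 1) raw scores over the discretization points
    target: (...)        ground-truth distances in [y_min, y_max]
    """
    K = logits.shape[-1] - 1
    t = (target - y_min) / (y_max - y_min) * K     # target position on the 0..K grid
    left = t.floor().clamp(0, K - 1).long()        # index l of the left point
    right = left + 1                               # index l + 1 of the right point
    w_right = t - left.float()                     # proximity weight of the right point
    w_left = 1.0 - w_right                         # proximity weight of the left point

    log_p = F.log_softmax(logits, dim=-1)
    loss = -(w_left * log_p.gather(-1, left.unsqueeze(-1)).squeeze(-1)
             + w_right * log_p.gather(-1, right.unsqueeze(-1)).squeeze(-1))
    return loss.mean()
```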
Tip
The Distribution Focal Loss is often combined with IoU-based losses, since IoU optimizes overlap while DFL sharpens coordinate precision.