Loss function#

Training the model requires a loss function that balances objectness classification and bounding box regression. The standard approach is a multi-task formulation that combines a loss term for each task.

\[ \mathcal{L} = L_{\rm object} + \lambda\, L_{\rm box} \]

Here, \(\lambda\) is a hyperparameter that balances the two terms (\(\lambda = 1\) is often a good choice).

Objectness loss#

The objectness loss (\(L_{\rm object}\)) measures how well the model predicts whether each grid cell contains an object center. This is a binary classification problem, so the loss can be computed using binary cross-entropy between the predicted logit and the target label.
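As a minimal sketch, assuming a PyTorch model whose head emits one objectness logit per grid cell (the tensor names and shapes below are illustrative, not fixed by the text):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: one logit and one binary label per grid cell.
obj_logits = torch.randn(8, 20, 20)   # (batch, H, W) raw logits
obj_targets = torch.zeros(8, 20, 20)  # 1.0 where a cell contains an object center
obj_targets[:, 10, 10] = 1.0          # e.g. one object center per image

# Binary cross-entropy computed directly on logits for numerical stability.
loss_object = F.binary_cross_entropy_with_logits(obj_logits, obj_targets)
```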

Note

Since most grid cells in an image are empty, the dataset is imbalanced. To mitigate this, one can down-weight the contribution of negative cells or use the focal loss, which emphasizes harder examples and reduces the effect of abundant easy negatives.
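A sketch of the binary focal loss in this setting, following Lin et al. (2017); the defaults \(\alpha = 0.25\) and \(\gamma = 2\) are the paper's common choices, not values prescribed here:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights the abundant, easy negatives."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```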

Regression loss#

The regression loss (\(L_{\rm box}\)) measures how accurately the model predicts the bounding box coordinates. This is a regression problem, so the loss can be computed using the smooth L1 distance between the predicted and target coordinates. Only grid cells that contain an object contribute to this loss; empty cells are ignored.
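A minimal sketch of the masking logic, assuming flattened per-cell predictions:

```python
import torch
import torch.nn.functional as F

def box_regression_loss(pred_boxes, target_boxes, obj_mask):
    # pred_boxes, target_boxes: (N, 4) per-cell box coordinates
    # obj_mask: (N,) boolean, True only for cells containing an object center
    if not obj_mask.any():
        return pred_boxes.sum() * 0.0  # no positive cells: zero loss, graph kept intact
    return F.smooth_l1_loss(pred_boxes[obj_mask], target_boxes[obj_mask])
```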

While the smooth L1 loss has traditionally been used for bounding box regression, it measures only the difference between predicted and target coordinates, not the quality of overlap between boxes. In dense prediction tasks, this can lead to suboptimal localization: a box may have low coordinate error but still misalign significantly with the ground truth. To address this, modern detectors increasingly use IoU-based losses, which directly optimize the overlap between predicted and ground-truth boxes.

Intersection over Union (IoU)#

The IoU between two boxes is defined as:

\[ \text{IoU} = \frac{\text{area of overlap}}{\text{area of union}}. \]

It ranges from 0 (no overlap) to 1 (perfect alignment). A simple IoU loss can be defined as:

\[ \mathcal{L}_{\rm IoU} = 1 - \text{IoU}. \]

This function directly encourages maximizing overlap, but it has two limitations when the boxes do not overlap (IoU = 0): it provides no gradient, and it does not penalize differences in box size or position.
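A sketch of the IoU loss for axis-aligned boxes in corner format \((x_1, y_1, x_2, y_2)\) (the format is an assumption; the helper below is reused by the later sketches):

```python
import torch

def box_iou(a, b, eps=1e-7):
    """Elementwise IoU between boxes in corner format; a and b have shape (N, 4)."""
    # Intersection rectangle
    x1 = torch.maximum(a[:, 0], b[:, 0])
    y1 = torch.maximum(a[:, 1], b[:, 1])
    x2 = torch.minimum(a[:, 2], b[:, 2])
    y2 = torch.minimum(a[:, 3], b[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    union = area_a + area_b - inter
    return inter / (union + eps)

def iou_loss(pred, target):
    return (1.0 - box_iou(pred, target)).mean()
```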

Generalized IoU (GIoU)#

The GIoU loss addresses these limitations by adding a penalty term based on the smallest enclosing box that contains both the predicted and target boxes. It is defined as:

\[ \mathcal{L}_{\rm GIoU} = 1 - \text{IoU} + \frac{|C| - |A \cup B|}{|C|} \]

where \(C\) is the smallest enclosing box that contains both the predicted box \(A\) and the target box \(B\), and \(|\cdot|\) denotes area. GIoU provides meaningful gradients even when the boxes do not overlap, encouraging predictions to move toward the target.
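A self-contained sketch under the same corner-format assumption:

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    # Intersection and union
    x1 = torch.maximum(pred[:, 0], target[:, 0])
    y1 = torch.maximum(pred[:, 1], target[:, 1])
    x2 = torch.minimum(pred[:, 2], target[:, 2])
    y2 = torch.minimum(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # Smallest enclosing box C
    cw = torch.maximum(pred[:, 2], target[:, 2]) - torch.minimum(pred[:, 0], target[:, 0])
    ch = torch.maximum(pred[:, 3], target[:, 3]) - torch.minimum(pred[:, 1], target[:, 1])
    area_c = cw * ch
    giou = iou - (area_c - union) / (area_c + eps)
    return (1.0 - giou).mean()
```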

Distance IoU (DIoU)#

The DIoU loss improves on the plain IoU loss by penalizing the distance between the centers of the predicted and target boxes. It is defined as:

\[ \mathcal{L}_{\rm DIoU} = 1 - \text{IoU} + \frac{d^2}{c^2} \]

where \(d\) is the Euclidean distance between the box centers, and \(c\) is the diagonal length of the smallest enclosing box that contains both the predicted and target boxes. The DIoU loss encourages not only greater overlap but also closer alignment of the box centers.
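A sketch reusing `box_iou` from the IoU section above:

```python
import torch

def diou_loss(pred, target, eps=1e-7):
    iou = box_iou(pred, target, eps)  # helper from the IoU sketch above
    # Squared distance d^2 between box centers
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    tcx, tcy = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    d2 = (pcx - tcx) ** 2 + (pcy - tcy) ** 2
    # Squared diagonal c^2 of the smallest enclosing box
    cw = torch.maximum(pred[:, 2], target[:, 2]) - torch.minimum(pred[:, 0], target[:, 0])
    ch = torch.maximum(pred[:, 3], target[:, 3]) - torch.minimum(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    return (1.0 - iou + d2 / c2).mean()
```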

Complete IoU (CIoU)#

The CIoU loss extends the DIoU loss by also penalizing aspect-ratio inconsistency between the predicted and target boxes. It is defined as:

\[ \mathcal{L}_{\rm CIoU} = 1 - \text{IoU} + \frac{d^2}{c^2} + \alpha \rho \]

where \(\rho\) measures the difference in aspect ratio between the boxes, and \(\alpha\) is a weighting factor. In the original formulation, \(\alpha\) grows with the IoU, so the aspect-ratio term matters most once the boxes already overlap well. The CIoU loss encourages predictions that are well aligned in position, size, and shape.
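One common instantiation is that of Zheng et al. (2020), where the aspect-ratio term (here \(\rho\), usually written \(v\)) is a squared difference of arctangents and the weight \(\alpha\) adapts to the IoU; a sketch reusing `box_iou` from above:

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    iou = box_iou(pred, target, eps)  # helper from the IoU sketch above
    # Center-distance term over the enclosing-box diagonal (as in DIoU)
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    tcx, tcy = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    d2 = (pcx - tcx) ** 2 + (pcy - tcy) ** 2
    cw = torch.maximum(pred[:, 2], target[:, 2]) - torch.minimum(pred[:, 0], target[:, 0])
    ch = torch.maximum(pred[:, 3], target[:, 3]) - torch.minimum(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term and its adaptive weight (Zheng et al., 2020)
    pw = (pred[:, 2] - pred[:, 0]).clamp(min=eps)
    ph = (pred[:, 3] - pred[:, 1]).clamp(min=eps)
    tw = (target[:, 2] - target[:, 0]).clamp(min=eps)
    th = (target[:, 3] - target[:, 1]).clamp(min=eps)
    v = (4 / math.pi ** 2) * (torch.atan(tw / th) - torch.atan(pw / ph)) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)
    return (1.0 - iou + d2 / c2 + alpha * v).mean()
```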

Tip

The CIoU loss is often regarded as the most effective IoU-based loss. However, the optimal choice may vary depending on the dataset.

Distribution focal loss (DFL)#

Traditionally, bounding box regression consists of predicting four continuous values: the distances from a reference point (usually a cell center) to the four sides of a bounding box (left, top, right, bottom). The Distribution Focal Loss reformulates this task by predicting a discrete probability distribution for each side of a bounding box. Instead of predicting four continuous values per grid cell, the regression branch outputs four histograms across a set of bins that cover the possible range of distances. By doing so, the regression task is reformulated as a classification problem over discretized intervals. This often leads to higher localization accuracy, especially for small objects.

Predicted distribution#

Assume distances lie within a bounded range \([0, D]\). We divide this range into \(K\) bins, each of width \(d = D / K\). Then, the \(i\)-th bin corresponds to the interval:

\[ \text{Bin } i: [i \cdot d, (i + 1) \cdot d). \]

With \(K\) bins, the model predicts a logit vector \(s = (s_0, s_1, \ldots, s_{K-1})\) for each side of a bounding box. After applying the softmax, we obtain a probability distribution over the bins.

\[ (\forall i \in \{0, 1, \ldots, K-1\})\qquad p_i = \frac{\exp({s_i})}{\sum_{j=0}^{K-1} \exp({s_j})} \]

At inference time, we recover a continuous distance \(\hat{y}\) by taking the expectation of the predicted distribution and scaling by the bin width \(d\).

\[ \hat{y} = d \sum_{i=0}^{K-1} p_i \cdot i \]
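A sketch of the decoding step, with illustrative values of \(K\) and \(D\):

```python
import torch

# Hypothetical setup: K bins of width d covering distances in [0, D]
K, D = 16, 16.0
d = D / K

logits = torch.randn(4, K)             # one logit vector per box side
p = logits.softmax(dim=-1)             # probability distribution over bins
bins = torch.arange(K, dtype=p.dtype)  # bin indices 0..K-1
y_hat = d * (p * bins).sum(dim=-1)     # expected distance per side
```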

Loss function#

Given a target distance \(y\), we identify the bin \(\ell = \lfloor y / d \rfloor\) that contains \(y\). The target distribution places all mass on the two neighboring bins \(\ell\) and \(\ell + 1\), weighted according to the fractional offset \(\alpha = y / d - \ell\) of the target within bin \(\ell\).

\[\begin{split} (\forall i \in \{0, 1, \ldots, K-1\})\qquad q_i = \begin{cases} 1 - \alpha & \text{if } i = \ell \\ \alpha & \text{if } i = \ell + 1 \\ 0 & \text{otherwise} \end{cases} \end{split}\]

The loss is then computed as the cross-entropy between the predicted and target distributions.

\[ \mathcal{L}_{\rm DFL} = - \sum_{i=0}^{K-1} q_i \log(p_i) = - \big[ (1 - \alpha) \log(p_\ell) + \alpha \log(p_{\ell+1}) \big] \]
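A sketch of the loss, clamping the target so that bin \(\ell + 1\) stays in range (an implementation detail not fixed by the text):

```python
import torch
import torch.nn.functional as F

def dfl_loss(logits, target, d=1.0):
    """Distribution focal loss; logits (N, K), target (N,) continuous distances."""
    K = logits.shape[-1]
    # Express the target in bin units and keep bin l + 1 inside the valid range
    t = (target / d).clamp(min=0, max=K - 1 - 1e-6)
    left = t.floor().long()   # bin index l
    alpha = t - left.float()  # fractional offset within the bin
    logp = F.log_softmax(logits, dim=-1)
    logp_l = logp.gather(-1, left.unsqueeze(-1)).squeeze(-1)
    logp_r = logp.gather(-1, (left + 1).unsqueeze(-1)).squeeze(-1)
    return -((1 - alpha) * logp_l + alpha * logp_r).mean()
```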

Tip

The Distribution Focal Loss is often combined with an IoU-based loss, since the IoU term optimizes overlap while the DFL term sharpens coordinate precision. The hyperparameter \(\gamma\) balances the two.

\[ \mathcal{L}_{\rm box} = \mathcal{L}_{\rm CIoU} + \gamma\, \mathcal{L}_{\rm DFL} \]
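Glue code tying the earlier sketches together (all tensor names are illustrative):

```python
# pred_boxes, target_boxes: (N, 4) boxes for positive cells (corner format)
# side_logits: (4 * N, K) bin logits, one row per box side; side_dists: (4 * N,)
gamma = 1.0  # assumed value; tune per dataset
loss_box = ciou_loss(pred_boxes, target_boxes) + gamma * dfl_loss(side_logits, side_dists, d=d)
```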