Model Architecture#

The multi-object localization model consists of two main components: a convolutional backbone for feature extraction and a convolutional head for bounding box prediction. The backbone is a pretrained convolutional network such as MobileNetV3 or SSDLite. As in the previous part of the project, the backbone can be run once per image, and the extracted features (together with the corresponding targets) can be stored on disk as PyTorch tensors. This allows the localization head to be trained efficiently without repeatedly forwarding images through the backbone.
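For concreteness, a minimal sketch of this feature-caching step is shown below. It assumes a torchvision MobileNetV3 feature extractor and a dataset yielding `(image, target)` pairs; the function name, file layout, and backbone choice are illustrative assumptions, not part of the project code.

```python
import torch
from torchvision.models import mobilenet_v3_large, MobileNet_V3_Large_Weights

# Hypothetical frozen feature extractor; any pretrained backbone would be used the same way.
backbone = mobilenet_v3_large(weights=MobileNet_V3_Large_Weights.DEFAULT).features.eval()

@torch.no_grad()
def cache_features(dataset, out_dir):
    for idx, (image, target) in enumerate(dataset):
        # Forward a single image through the frozen backbone and drop the batch dimension.
        features = backbone(image.unsqueeze(0)).squeeze(0)  # (C, S, S) feature map
        # Store features and targets together so the head can be trained without the backbone.
        torch.save({"features": features, "target": target}, f"{out_dir}/{idx:06d}.pt")
```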

Localization head#

The localization head operates directly on the spatial feature map produced by the backbone. This preserves spatial information and allows the network to produce predictions for multiple grid cells simultaneously. The head is structured as a small convolutional network with two parallel branches.

  • Classification branch. This branch predicts whether each grid cell contains the center of an object.

    • It consists of a series of convolutional layers with ReLU activations in between.

    • All kernel sizes are 1×1 to maintain the spatial dimensions of the feature map.

    • The final output is a 1×S×S tensor that represents the logit scores for binary classification.

  • Regression branch. This branch predicts the coordinates of a bounding box for each grid cell.

    • It also consists of a series of convolutional layers with ReLU activations in between.

    • All kernel sizes are 1×1 to maintain the spatial dimensions of the feature map.

    • The final output is a 4×S×S tensor, with each channel corresponding to one of the four bounding box coordinates.

Both branches operate in parallel on the same backbone feature map but produce distinct outputs. Because the head is convolutional, the same set of parameters is shared across the entire grid, enabling the model to localize an arbitrary number of objects without knowing in advance how many are present.
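As a concrete illustration, here is a minimal PyTorch sketch of such a two-branch head. The number of intermediate layers and channels is an assumption, not a prescribed configuration.

```python
import torch
from torch import nn

class LocalizationHead(nn.Module):
    """Two parallel branches of 1×1 convolutions over the backbone feature map."""

    def __init__(self, in_channels: int, hidden_channels: int = 256):
        super().__init__()
        # Classification branch: 1×1 convolutions with ReLU in between, one logit per cell.
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_channels, hidden_channels, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(hidden_channels, 1, kernel_size=1),
        )
        # Regression branch: same structure, four encoded box coordinates per cell.
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_channels, hidden_channels, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(hidden_channels, 4, kernel_size=1),
        )

    def forward(self, features: torch.Tensor):
        # features: (N, C, S, S) backbone feature map
        return self.cls_branch(features), self.reg_branch(features)  # (N, 1, S, S) and (N, 4, S, S)
```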

Figure: model architecture.

Bounding box parametrization#

The regression branch does not predict raw pixel coordinates directly. Instead, it outputs bounding boxes in a different parametrization that is easier to learn. During training, these predictions may either be compared directly to ground-truth targets (encoded into the same parametrization), or decoded back into pixel coordinates for losses that operate on box geometry. During inference, decoding is always required to obtain final bounding box predictions. In what follows, we describe the parametrization used by anchor-free detectors (FCOS, YOLOv6, …), since this project does not make use of anchor boxes.

| Notation | Description |
|---|---|
| \((W, H)\) | Image size in pixels. |
| \((S, S)\) | Grid resolution. |
| \((W/S, H/S)\) | Cell size in pixels (stride). |
| \((i, j)\) | Row and column indices of the cell containing an object, where \(0 \le i, j < S\). |
| \((x_{min}, y_{min}, x_{max}, y_{max})\) | Pixel coordinates of the top-left and bottom-right corners of a bounding box. |

Encoding#

In anchor-free detectors, a bounding box is encoded by the distances from the center of an assigned grid cell to the four sides of the box. For a cell \((i, j)\), its center in grid coordinates is

\[ g_x = j + 0.5 \qquad g_y = i + 0.5. \]

To express the distances in grid units, the bounding box coordinates are divided by the cell size \((W/S, H/S)\), and the distances from the cell center to each side of the box are then computed:

\[\begin{split} \begin{aligned} l &= g_x - \frac{x_{min}}{W}\cdot S &\qquad t &= g_y - \frac{y_{min}}{H}\cdot S \\ r &= \frac{x_{max}}{W}\cdot S - g_x &\qquad b &= \frac{y_{max}}{H}\cdot S - g_y \end{aligned} \end{split}\]

When the cell center lies inside the bounding box, all four values are nonnegative. If the center falls outside, some distances may become negative. The figure below illustrates how these distances are defined.

Figure: box encoding.
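A minimal sketch of this encoding is given below, assuming the boxes are an \((N, 4)\) tensor of pixel coordinates and each box has already been assigned to a grid cell \((i, j)\); the function name and tensor layout are illustrative assumptions.

```python
import torch

def encode_boxes(boxes, cells, image_size, grid_size):
    """Encode (xmin, ymin, xmax, ymax) pixel boxes as (l, t, r, b) distances in grid units."""
    W, H = image_size
    S = grid_size
    i, j = cells[:, 0].float(), cells[:, 1].float()
    gx, gy = j + 0.5, i + 0.5                       # cell centers in grid coordinates
    xmin, ymin, xmax, ymax = boxes.unbind(dim=1)
    l = gx - xmin / W * S
    t = gy - ymin / H * S
    r = xmax / W * S - gx
    b = ymax / H * S - gy
    return torch.stack([l, t, r, b], dim=1)         # (N, 4); negative if the center lies outside
```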

Decoding#

For each grid cell, the regression branch outputs the encoded distances \((\hat{l}, \hat{t}, \hat{r}, \hat{b})\) without applying an activation function. As noted earlier, these predictions can either be compared directly against encoded ground-truth targets or decoded into pixel coordinates, depending on the loss function used. The decoding process simply reverses the above transformations to recover bounding box coordinates.

\[\begin{split} \begin{aligned} \hat{x}_{min} &= \frac{g_x - \hat{l}}{S} \cdot W &\qquad \hat{y}_{min} &= \frac{g_y - \hat{t}}{S} \cdot H \\ \hat{x}_{max} &= \frac{g_x + \hat{r}}{S} \cdot W &\qquad \hat{y}_{max} &= \frac{g_y + \hat{b}}{S} \cdot H \end{aligned} \end{split}\]

Note

If the ground-truth bounding boxes are already normalized by the image size during preprocessing (i.e., \(x_{min}, x_{max} \in [0, 1]\) and \(y_{min}, y_{max} \in [0, 1]\)), then the divisions by \(W\) and \(H\) in the encoding formulas are not necessary. Likewise, the multiplications by \(W\) and \(H\) in the decoding formulas can be omitted if the bounding boxes are desired in normalized coordinates.
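Continuing the sketch above, decoding reverses the same steps; again, the names and tensor shapes are assumptions for illustration.

```python
import torch

def decode_boxes(ltrb, cells, image_size, grid_size):
    """Recover (xmin, ymin, xmax, ymax) pixel boxes from predicted (l, t, r, b) distances."""
    W, H = image_size
    S = grid_size
    i, j = cells[:, 0].float(), cells[:, 1].float()
    gx, gy = j + 0.5, i + 0.5                       # cell centers in grid coordinates
    l, t, r, b = ltrb.unbind(dim=1)
    xmin = (gx - l) / S * W
    ymin = (gy - t) / S * H
    xmax = (gx + r) / S * W
    ymax = (gy + b) / S * H
    return torch.stack([xmin, ymin, xmax, ymax], dim=1)  # (N, 4)
```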

Implementation advice#

Keeping data transformations separate from the model makes it easier to experiment with different parametrizations and loss functions. The regression branch always outputs encoded values, and the model itself performs no encoding or decoding. If the chosen loss function operates on encoded values, the ground-truth boxes can be encoded either during preprocessing or inside the loss function. If it operates on bounding box coordinates instead, the model predictions can be decoded on the fly within the loss function. By maintaining this clear separation, you can revisit design choices without altering the network.
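As one possible way to keep this separation, the sketch below decodes predictions inside the loss function, reusing the hypothetical `decode_boxes` helper from above. The plain L1 loss on box corners is only a placeholder for whatever geometric loss you choose.

```python
import torch
from torch.nn import functional as F

def box_regression_loss(pred_ltrb, target_boxes, cells, image_size, grid_size):
    # The model still outputs encoded (l, t, r, b) values; decoding happens here, on the fly.
    pred_boxes = decode_boxes(pred_ltrb, cells, image_size, grid_size)
    # Placeholder geometric loss; an IoU-based loss could be substituted without touching the model.
    return F.l1_loss(pred_boxes, target_boxes)
```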