Training#
Once the data pipeline and network architecture are in place, training CSRNet becomes a supervised regression problem. Given an input image, the network is asked to predict a density map whose values match the ground-truth density at each pixel. The next sections explain how to set up the training process.
Loss function#
Let \(I\) be a preprocessed image and \(D\) its ground-truth density map, both defined on a grid of size \(H' \times W'\) that depends on your preprocessing choices. Let \(f_\theta\) be the CSRNet model with parameters \(\theta\). Given an input image \(I\), CSRNet predicts a density map \(\hat{D} = f_\theta(I)\) of the same size. The primary training objective is a pixel-wise regression loss between \(\hat{D}\) and \(D\). The most common choice is the mean squared error (MSE).
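Using the notation above, the MSE over a single image averages the squared error over all pixels of the density map:

\[
\mathcal{L}(\theta) = \frac{1}{H' W'} \sum_{i=1}^{H'} \sum_{j=1}^{W'} \left( \hat{D}_{ij} - D_{ij} \right)^2.
\]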
This loss is small when the predicted density at every pixel is close to the ground-truth density. It also encourages the sum of the predicted density map to match the actual crowd count, since the ground-truth density map is constructed such that its sum equals the number of people in the image. You do not need a separate count-based loss, as the count is implicitly constrained through the density maps.
Tip
In PyTorch, the MSE loss is already implemented as nn.MSELoss().
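As a minimal sketch (with random tensors standing in for a real predicted and ground-truth density map), computing the loss looks like:

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()  # averages the squared error over all pixels by default

# Stand-in tensors with shape (batch, channels, H', W').
pred_density = torch.rand(1, 1, 64, 64)
true_density = torch.rand(1, 1, 64, 64)

loss = criterion(pred_density, true_density)  # scalar tensor
```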
Two-stage training procedure#
CSRNet's front-end reuses a pretrained VGG-16 model, while the back-end and head are initialized randomly. Because the two parts start from such different initializations, it pays to use a training procedure similar to transfer learning, which speeds up training and improves final performance. This involves the following two stages.
Stage 1 - Train only the back-end and head. Freeze the weights of the front-end (VGG-16 layers) so they are not updated during backpropagation. Train only the back-end and head for a few epochs (e.g., 10-20) using a relatively high learning rate (e.g., 1e-4). This allows the model to learn how to map generic VGG features to density maps.
Stage 2 - Fine-tune the entire network. Unfreeze the front-end and continue training the entire CSRNet model (front-end, back-end, and head) for more epochs (e.g., 20-40) using a lower learning rate (e.g., 1e-5). This fine-tuning step allows the VGG-16 layers to adapt to the specific task of crowd counting, improving overall performance.
Tip
In PyTorch, a simple way to freeze and unfreeze layers is to set the requires_grad attribute of their parameters.
In Stage 1, call freeze_frontend(model) BEFORE creating an optimizer, so that the front-end parameters are excluded. In Stage 2, call unfreeze_frontend(model) BEFORE creating a new optimizer, so that all parameters are included.
```python
def freeze_frontend(model):
    # Exclude the VGG-16 front-end from backpropagation updates.
    for p in model.frontend.parameters():
        p.requires_grad = False

def unfreeze_frontend(model):
    # Make the front-end trainable again for end-to-end fine-tuning.
    for p in model.frontend.parameters():
        p.requires_grad = True
```
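One way to pair this with the two-stage procedure is to rebuild the optimizer at the start of each stage from only the trainable parameters, so that frozen front-end weights are excluded in Stage 1. A sketch, assuming a model with a `frontend` submodule as above (Adam is one reasonable choice; the learning rates mirror the values suggested earlier):

```python
import torch

def make_optimizer(model, lr):
    # Only parameters that still require gradients are handed to the
    # optimizer, so frozen front-end weights are skipped in Stage 1.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)

# Stage 1: freeze_frontend(model); optimizer = make_optimizer(model, lr=1e-4)
# Stage 2: unfreeze_frontend(model); optimizer = make_optimizer(model, lr=1e-5)
```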
Troubleshooting#
A few practical details can significantly affect training stability.
Consistency of resolution. Make sure that the resolution of your target density maps matches the resolution of CSRNet's output. Because the front-end downsamples the input, the targets must be downsampled to match, in a way that preserves their sum (and hence the implied count).
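As a sketch, assuming the target maps are NumPy arrays and the network output is 1/8 of the input resolution (the downsampling factor of CSRNet's VGG-16 front-end), the targets can be downsampled by block summation:

```python
import numpy as np

def downsample_density(density, factor=8):
    """Downsample a density map by summing non-overlapping factor x factor
    blocks, preserving the total count. Assumes H and W are divisible by
    `factor` (a simplification for this sketch)."""
    h, w = density.shape
    return density.reshape(h // factor, factor, w // factor, factor).sum(axis=(1, 3))
```

Unlike bilinear resizing, block summation leaves the sum of the map unchanged, so the target still integrates to the true crowd count.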
Batch size vs. image size. High-resolution images and density maps can be memory-intensive. If you encounter out-of-memory errors, consider reducing the crop size or lowering the batch size.
Gaussian scale vs. resolution. The Gaussian kernel used to smooth the density maps is defined in pixel units. If you change the image resolution (e.g., train at a higher or lower size), the kernel size and standard deviation should be scaled proportionally to preserve a similar spread around each head.
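As a small illustration (the helper name and values are hypothetical, not from the text), proportional scaling of the Gaussian parameters might look like:

```python
def scale_gaussian(sigma, kernel_size, scale):
    """Scale Gaussian parameters proportionally to a resolution change.
    `scale` is new_size / old_size; the kernel size is forced to stay odd
    so the Gaussian remains centered on each head annotation."""
    new_sigma = sigma * scale
    new_size = max(3, int(round(kernel_size * scale)) | 1)
    return new_sigma, new_size
```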
Debugging with visualization. Early in training, visualize predicted density maps overlaid on the input image. Even if they are noisy, they should roughly highlight crowded regions and gradually evolve toward sharper peaks at head locations.
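A minimal NumPy sketch of preparing such an overlay (the image is assumed to be a float RGB array in [0, 1], and the predicted map 1/8 of its resolution; the result can be passed to e.g. matplotlib's imshow):

```python
import numpy as np

def overlay_density(image, density, alpha=0.5):
    """Blend a red heat overlay of a predicted density map onto an image.
    `image`: (H, W, 3) float array in [0, 1]; `density`: (H // s, W // s)
    predicted map for some integer downsampling factor s."""
    # Nearest-neighbour upsample the density map back to image resolution.
    s = image.shape[0] // density.shape[0]
    up = np.kron(density, np.ones((s, s)))
    # Normalize to [0, 1] for display.
    up = (up - up.min()) / (up.max() - up.min() + 1e-8)
    heat = np.zeros_like(image)
    heat[..., 0] = up  # red channel carries the predicted density
    return (1 - alpha) * image + alpha * heat
```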
Summary#
In summary, training CSRNet should follow the same fine-tuning template used elsewhere in the course.
Preprocess the data into (image, density_map) pairs, making sure they are geometrically aligned.
Initialize CSRNet with a VGG-based front-end using pretrained weights.
Train only the back-end and head until the loss begins to stabilize.
Unfreeze the front-end and continue training the entire model end-to-end with a smaller learning rate.
Regularly evaluate on a validation set using count-based metrics.
With this setup, CSRNet becomes a straightforward convolutional regressor that can be trained using the same tools and practices you have already seen, while leveraging its specialized architecture for high-quality crowd density estimation.