Preprocessing#
CSRNet is designed to predict a density map, so that training can be formulated as a pixel-wise regression problem. Preprocessing must therefore convert the annotated head locations of each image into a density map suitable as a regression target. At the same time, real-world datasets like ShanghaiTech contain images with varying dimensions. Preprocessing must ensure that images are resized to a consistent resolution for efficient batching, and that their pixel values are normalized for stable optimization. Crucially, any geometric operation applied to an image must be reflected in the head locations used to generate the corresponding density map. The next sections will explain how to implement this preprocessing pipeline.

Density Map#
The ShanghaiTech dataset provides crowd images and, for each image, a list of head locations. These locations are stored as 2D coordinates in pixel units:

\[
\mathcal{P} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\},
\]

where \(N\) is the number of people in the image. CSRNet requires these points to be converted into a continuous density map, where each pixel value indicates the estimated density of people at that location.
Definition#
The density map for a crowd image is a non-negative function \(D(x, y)\) defined over the image plane. It is constructed by placing a Gaussian kernel at each annotated head location and summing the contributions from all kernels. If \(\mathcal{P}\) is the set of head locations, the density map is defined as

\[
D(x, y) = \sum_{(x_i, y_i) \in \mathcal{P}} \mathcal{G}_\sigma(x - x_i, y - y_i).
\]

Here, \(\mathcal{G}_\sigma(x, y)\) is a 2D Gaussian kernel with standard deviation \(\sigma\), normalized so that its integral equals 1:

\[
\mathcal{G}_\sigma(x, y) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right).
\]

This formulation ensures that the sum of the density map equals the number of people in the image:

\[
\iint D(x, y) \, dx \, dy = \sum_{(x_i, y_i) \in \mathcal{P}} \iint \mathcal{G}_\sigma(x - x_i, y - y_i) \, dx \, dy = N.
\]
This is also true locally: summing the density map over a region yields the number of people in that region.
Note
The choice of \(\sigma\) affects the spread of the Gaussian kernels and thus the smoothness of the density map. In this project, you will use a fixed \(\sigma\) value for all heads. This keeps the implementation simple and is fully sufficient for reproducing the main behavior of CSRNet.
Generation#
To generate the density map for a given image, you can follow these steps.
1. Create a zero-valued density map of the required size (usually the same size as the input image).
2. For each head location \((x_i, y_i)\) in \(\mathcal{P}\), add the value 1 at the corresponding pixel in the density map.
3. Convolve the density map with the Gaussian kernel \(\mathcal{G}_\sigma\) to create a smooth distribution of values.
4. Check that the sum of all pixel values in the density map equals the number of heads \(N\).
The result is a smooth density map suitable as a regression target.
Note
In step 2, you can use the index_put_ method of PyTorch tensors. This method places values at specific indices in a tensor in a single operation. Setting the accumulate argument to True ensures that if multiple heads fall on the same pixel, their contributions are added together rather than overwritten, as in the sketch below.
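A minimal sketch of steps 1 and 2, assuming head locations are stored as an \((N, 2)\) float tensor of \((x, y)\) coordinates (the variable names are illustrative):

import torch

# Hypothetical head annotations as (x, y) pixel coordinates.
heads = torch.tensor([[50.3, 40.7], [50.4, 40.9], [120.0, 80.2]])

H, W = 224, 224
impulses = torch.zeros(H, W)

# Rows index y and columns index x; clamp guards against rounding onto the border.
cols = heads[:, 0].round().long().clamp(0, W - 1)
rows = heads[:, 1].round().long().clamp(0, H - 1)

# accumulate=True sums values when several heads round to the same pixel.
impulses.index_put_((rows, cols), torch.ones(len(heads)), accumulate=True)

assert impulses.sum() == len(heads)

Note that the first two heads above round to the same pixel, so the accumulate behavior is exercised: without it, the impulse tensor would sum to 2 instead of 3.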
Gaussian Kernel#
In step 3, the convolution with the Gaussian kernel can be implemented using PyTorch’s Conv2d layer.
First, you generate a Gaussian kernel tensor of shape (1, 1, k, k), where k is an odd integer.
ksize = 7
sigma = 2
kernel = make_gaussian_kernel(ksize, sigma)  # shape (1, 1, ksize, ksize)

Then, you create a Conv2d layer with the following parameters.

gauss = torch.nn.Conv2d(1, 1, ksize, padding='same', bias=False)

Finally, you assign the Gaussian kernel to the Conv2d layer and disable gradient computation.

gauss.weight.data = kernel
gauss.weight.requires_grad = False
The function to create the Gaussian kernel tensor is provided below.
import torch

def make_gaussian_kernel(ksize: int, sigma: float, device="cpu"):
    """
    Create a 2D Gaussian kernel of shape (1, 1, ksize, ksize), normalized to sum to 1.
    ksize should be odd (e.g., 7, 9, 15).
    """
    assert ksize % 2 == 1, "Kernel size should be odd."
    # Coordinate grid centered at 0
    radius = ksize // 2
    xs = torch.arange(-radius, radius + 1, device=device)
    ys = torch.arange(-radius, radius + 1, device=device)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")  # shape (ksize, ksize)
    # 2D Gaussian
    kernel = torch.exp(-(xx**2 + yy**2) / (2 * sigma**2))  # shape (ksize, ksize)
    kernel = kernel / kernel.sum()  # sum to 1
    return kernel.reshape(1, 1, ksize, ksize)
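As a quick check of step 4, you can apply the layer to the impulse tensor from the sketch in the note above (impulses, H, and W come from there):

with torch.no_grad():
    density = gauss(impulses.reshape(1, 1, H, W)).squeeze()  # shape (H, W)

print(density.sum())  # close to the number of heads; Gaussians near the border lose some mass to zero padding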
Transform pipeline#
To prepare data for CSRNet, you will design a joint preprocessing pipeline that takes as input a raw image and its head annotations. These inputs undergo the following transformations.
Resizing. The image is rescaled so that its shorter side matches a chosen length (e.g., 256). The head coordinates are scaled accordingly to maintain their relative positions.
Center Cropping. The resized image is then center-cropped to a fixed size (e.g., 224x224). The head coordinates are adjusted by subtracting the crop offsets.
Image Normalization. The cropped image is converted to a floating-point tensor and normalized using the mean and standard deviation of the pretrained backbone (typically ImageNet statistics: mean [0.485, 0.456, 0.406], std [0.229, 0.224, 0.225]).
Head Filtering. After cropping, some head locations may fall outside the image boundaries. These points are removed from the list of head locations before generating the density map.
Impulse Tensor. The remaining head locations are rounded to the nearest pixel indices. A float tensor of zeros is created with the same spatial dimensions as the cropped image. For each head location, a value of 1 (impulse) is placed at the corresponding pixel in the tensor. If multiple heads fall on the same pixel, their contributions are summed.
Density Map. The previous tensor is convolved with the Gaussian kernel to produce the final density map. The sum of all pixel values in the density map must equal the number of heads in the image.
The output of this pipeline is a pair of aligned tensors: a resized, center-cropped, normalized image, and a density map generated from the transformed head locations. Because all geometric transformations are applied jointly to both image and head locations, the density map is perfectly aligned with the content of the processed image. Thus, training CSRNet reduces to standard pixel-wise regression on these pairs.
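To make the pipeline concrete, below is a minimal sketch of the joint geometric transform, assuming PIL images, torchvision, and an \((N, 2)\) float tensor of \((x, y)\) head coordinates (the function name and defaults are illustrative, not a reference implementation):

import torch
import torchvision.transforms.functional as TF

def preprocess(image, heads, short_side=256, crop=224):
    """Jointly transform a PIL image and its (N, 2) head coordinates."""
    # 1. Resize so the shorter side matches `short_side`; scale coordinates too.
    w, h = image.size
    scale = short_side / min(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    image = TF.resize(image, [new_h, new_w])
    heads = heads * scale

    # 2. Center-crop and shift coordinates by the crop offsets.
    top, left = (new_h - crop) // 2, (new_w - crop) // 2
    image = TF.crop(image, top, left, crop, crop)
    heads = heads - torch.tensor([left, top], dtype=heads.dtype)

    # 3. Drop heads that now fall outside the cropped region.
    heads = heads[((heads >= 0) & (heads < crop)).all(dim=1)]

    # 4. Convert to a float tensor and normalize with ImageNet statistics.
    x = TF.normalize(TF.to_tensor(image),
                     mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    return x, heads

The returned coordinates can then be rounded into an impulse tensor and smoothed with the Gaussian layer from the previous section to produce the density map.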
Checklist#
Before moving on to implementing the CSRNet model architecture, you should verify that your preprocessing pipeline behaves as expected. At minimum, check the following.
Image.
Each sample returns an image tensor of fixed size (3, H, W).
Pixel values are properly normalized (roughly centered around 0, not in [0, 255]).
Density map.
Each sample returns a density tensor of size (1, H, W).
The sum of the density map is (close to) the number of head annotations for that image.
Alignment.
If you plot the density map as a heatmap over the preprocessed image, the peaks coincide with visible heads.
After resizing/cropping, head locations are still in the correct positions (no obvious shifts or flips).
Batch collation.
Check that the DataLoader can batch multiple samples without a custom collate function (images and densities stack cleanly).
If all these checks pass on a few random samples, your preprocessing is likely correct and ready to be used.
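A possible helper that automates these checks, assuming the dataset returns (image, density) tensor pairs (the function name and thresholds are illustrative):

import torch
from torch.utils.data import DataLoader

def sanity_check(dataset, num_samples=4):
    # Per-sample checks: shapes, normalization, and density sums.
    for i in range(num_samples):
        image, density = dataset[i]
        assert image.shape[0] == 3 and density.shape[0] == 1
        assert image.shape[1:] == density.shape[1:], "image/density size mismatch"
        assert image.min() < 0, "pixels look un-normalized (no negative values)"
        print(f"sample {i}: estimated count = {density.sum().item():.2f}")
    # Batch collation: fixed-size tensors should stack without a custom collate_fn.
    images, densities = next(iter(DataLoader(dataset, batch_size=num_samples)))
    print("batch shapes:", tuple(images.shape), tuple(densities.shape))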