Model Architecture#

CSRNet is a fully convolutional network that maps an RGB image to a single-channel density map. It is built from three main components: a front-end based on VGG-16, a back-end composed of dilated convolutions, and a head that converts features into a density map. Together, these blocks produce a dense prediction that can be interpreted as a per-pixel crowd density.

Overview#

At a high level, CSRNet processes an input image in four stages:

  • The front-end applies the first ten convolutional layers of VGG-16 (with three max-pooling layers) to produce a feature map whose spatial resolution is 1/8 of the input. Fully connected layers are discarded.

  • The back-end takes this feature map and passes it through a sequence of 3×3 convolutional layers with dilation. These layers enlarge the effective receptive field while keeping the spatial resolution fixed.

  • The head consists of two operations.

    • A final 1×1 convolution reduces the number of channels to 1, yielding a low-resolution density map.

    • A bilinear upsampling step (factor 8) enlarges this map back to the original image size, so the predicted density has the same height and width as the input.

From the outside, CSRNet therefore behaves like an image-to-image model:

\[ \textsf{RGB Image } (3 \times H \times W) \to \textsf{Density Map } (1 \times H \times W). \]

Internally, however, most computation happens at 1/8 resolution.
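
The following is a minimal PyTorch sketch of this pipeline, assuming the layer configuration detailed in the sections below and using F.interpolate for the bilinear upsampling step (the exact upsampling operator is an implementation choice, not prescribed here):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import vgg16, VGG16_Weights

    class CSRNet(nn.Module):
        def __init__(self):
            super().__init__()
            # Front-end: first 10 convolutional layers of VGG-16
            # (indices 0 to 22 of `features`), pretrained on ImageNet.
            self.frontend = vgg16(weights=VGG16_Weights.DEFAULT).features[:23]
            # Back-end: six dilated 3x3 convolutions (see "Back-End" below).
            chans = [512, 512, 512, 512, 256, 128, 64]
            layers = []
            for c_in, c_out in zip(chans[:-1], chans[1:]):
                layers += [nn.Conv2d(c_in, c_out, 3, padding=2, dilation=2),
                           nn.ReLU(inplace=True)]
            self.backend = nn.Sequential(*layers)
            # Head: 1x1 convolution down to a single density channel.
            self.head = nn.Conv2d(64, 1, kernel_size=1)

        def forward(self, x):                    # x: (B, 3, H, W)
            h, w = x.shape[-2:]
            x = self.frontend(x)                 # (B, 512, H/8, W/8)
            x = self.backend(x)                  # (B, 64, H/8, W/8)
            x = self.head(x)                     # (B, 1, H/8, W/8)
            # Bilinear upsampling (factor 8) back to the input resolution.
            return F.interpolate(x, size=(h, w), mode="bilinear",
                                 align_corners=False)

For instance, CSRNet()(torch.randn(1, 3, 384, 512)) returns a tensor of shape (1, 1, 384, 512).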

Front-End#

The front-end reuses the convolutional part of VGG-16 as a generic feature extractor. In the original VGG-16, the convolutional layers are grouped into five blocks, each ending with a max-pooling layer. CSRNet keeps the first ten convolutional layers and only the first three pooling operations; the remaining convolutions, the last two pooling layers, and the fully connected classifier are discarded. Concretely, the front-end consists of the following sequence of layers.

  • Block 1 (64 channels): conv → ReLU → conv → ReLU → maxpool (downsampling by 2).

  • Block 2 (128 channels): conv → ReLU → conv → ReLU → maxpool (downsampling by 2).

  • Block 3 (256 channels): conv → ReLU → conv → ReLU → conv → ReLU → maxpool (downsampling by 2).

  • Block 4 (512 channels): conv → ReLU → conv → ReLU → conv → ReLU (no pooling).

The front-end thus reduces the spatial resolution by a factor of 8 (due to three max-pooling layers), while increasing the number of channels from 3 (RGB) to 512. Overall, the front-end takes a tensor of shape (B, 3, H, W) and returns a tensor of shape (B, 512, H/8, W/8).

Note

The front-end of CSRNet corresponds to the layers with indices 0 to 22 (inclusive) in the pretrained VGG-16 model from torchvision.models, as can be verified in the layer summary below (generated with torchinfo for a 224 × 224 input). Everything from the fourth max-pooling layer (index 23) onward is discarded.

==========================================================================================
Layer (type (var_name))                  Output Shape              Param #
==========================================================================================
VGG (VGG)                                [1, 1000]                 --
├─Sequential (features)                  [1, 512, 7, 7]            --
│    └─Conv2d (0)                        [1, 64, 224, 224]         1,792
│    └─ReLU (1)                          [1, 64, 224, 224]         --
│    └─Conv2d (2)                        [1, 64, 224, 224]         36,928
│    └─ReLU (3)                          [1, 64, 224, 224]         --
│    └─MaxPool2d (4)                     [1, 64, 112, 112]         --
│    └─Conv2d (5)                        [1, 128, 112, 112]        73,856
│    └─ReLU (6)                          [1, 128, 112, 112]        --
│    └─Conv2d (7)                        [1, 128, 112, 112]        147,584
│    └─ReLU (8)                          [1, 128, 112, 112]        --
│    └─MaxPool2d (9)                     [1, 128, 56, 56]          --
│    └─Conv2d (10)                       [1, 256, 56, 56]          295,168
│    └─ReLU (11)                         [1, 256, 56, 56]          --
│    └─Conv2d (12)                       [1, 256, 56, 56]          590,080
│    └─ReLU (13)                         [1, 256, 56, 56]          --
│    └─Conv2d (14)                       [1, 256, 56, 56]          590,080
│    └─ReLU (15)                         [1, 256, 56, 56]          --
│    └─MaxPool2d (16)                    [1, 256, 28, 28]          --
│    └─Conv2d (17)                       [1, 512, 28, 28]          1,180,160
│    └─ReLU (18)                         [1, 512, 28, 28]          --
│    └─Conv2d (19)                       [1, 512, 28, 28]          2,359,808
│    └─ReLU (20)                         [1, 512, 28, 28]          --
│    └─Conv2d (21)                       [1, 512, 28, 28]          2,359,808
│    └─ReLU (22)                         [1, 512, 28, 28]          --
│    └─MaxPool2d (23)                    [1, 512, 14, 14]          --
│    └─Conv2d (24)                       [1, 512, 14, 14]          2,359,808
│    └─ReLU (25)                         [1, 512, 14, 14]          --
│    └─Conv2d (26)                       [1, 512, 14, 14]          2,359,808
│    └─ReLU (27)                         [1, 512, 14, 14]          --
│    └─Conv2d (28)                       [1, 512, 14, 14]          2,359,808
│    └─ReLU (29)                         [1, 512, 14, 14]          --
│    └─MaxPool2d (30)                    [1, 512, 7, 7]            --
├─AdaptiveAvgPool2d (avgpool)            [1, 512, 7, 7]            --
├─Sequential (classifier)                [1, 1000]                 --
│    └─Linear (0)                        [1, 4096]                 102,764,544
│    └─ReLU (1)                          [1, 4096]                 --
│    └─Dropout (2)                       [1, 4096]                 --
│    └─Linear (3)                        [1, 4096]                 16,781,312
│    └─ReLU (4)                          [1, 4096]                 --
│    └─Dropout (5)                       [1, 4096]                 --
│    └─Linear (6)                        [1, 1000]                 4,097,000
==========================================================================================
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
Total mult-adds (Units.GIGABYTES): 15.48
==========================================================================================
Input size (MB): 0.60
Forward/backward pass size (MB): 108.45
Params size (MB): 553.43
Estimated Total Size (MB): 662.49
==========================================================================================

Tip

In PyTorch, to select specific layers from a pretrained model, you can proceed as follows.

  • Import the necessary modules.

    from torchvision.models import vgg16, VGG16_Weights
    
  • Initialize the model with pretrained weights.

    model = vgg16(weights=VGG16_Weights.DEFAULT)
    
  • Extract the layers in the desired index range from the feature extractor.

    frontend = model.features[0:n+1]  # layers from index 0 to n (inclusive); for CSRNet, n = 22
    
  • You can print the resulting model to visualize the architecture and confirm the layer indices.

    print(frontend)
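
  • As a sanity check, you can pass a dummy batch through the extracted front-end and confirm the 1/8 resolution and 512 channels (here with n = 22, as in CSRNet).

    import torch

    x = torch.randn(1, 3, 224, 224)   # dummy RGB batch
    print(frontend(x).shape)          # torch.Size([1, 512, 28, 28])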
    

Back-End#

After the front-end, CSRNet must increase the receptive field so that each pixel in the density map can “see” a large portion of the input image. To avoid downsampling the feature map further, CSRNet uses dilated convolutions in the back-end. These are standard convolutions with gaps between the kernel elements, which lets them cover a larger area without reducing spatial resolution: a k × k kernel with dilation d spans d(k − 1) + 1 pixels per side, so a 3×3 kernel with dilation 2 covers a 5×5 region, and with dilation 3 it covers 7×7.
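
A quick way to verify that such a layer preserves spatial resolution (a standalone snippet; the input shape is arbitrary):

    import torch
    import torch.nn as nn

    # 3x3 kernel with dilation 2: effective 5x5 footprint;
    # padding 2 keeps the spatial size unchanged.
    conv = nn.Conv2d(512, 512, kernel_size=3, padding=2, dilation=2)
    x = torch.randn(1, 512, 48, 64)   # e.g. a 384 x 512 input after the front-end
    print(conv(x).shape)              # torch.Size([1, 512, 48, 64])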

The back-end consists of six 3×3 convolutional layers with ReLU activations. Each layer uses dilation 2 and padding 2 to maintain the spatial resolution. The first three layers produce 512 channels, while the last three reduce the number of channels to 256, 128, and finally 64.

Tip

In PyTorch, the back-end can be implemented as an nn.Sequential containing several nn.Conv2d layers interleaved with nn.ReLU activations (six of each).
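
For example, a minimal sketch following the channel progression described above (512 → 512 → 512 → 256 → 128 → 64, each layer with dilation 2 and padding 2):

    import torch.nn as nn

    backend = nn.Sequential(
        nn.Conv2d(512, 512, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
        nn.Conv2d(512, 512, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
        nn.Conv2d(512, 512, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
        nn.Conv2d(512, 256, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
        nn.Conv2d(256, 128, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
        nn.Conv2d(128, 64, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
    )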

Summary#

CSRNet is a straightforward but carefully designed fully convolutional network. The VGG-16 front-end, initialized with ImageNet weights, extracts multi-scale visual features and reduces spatial resolution by a factor of 8. The dilated convolution back-end then enlarges the receptive field without further downsampling, allowing each prediction to aggregate information from a wide region of the scene while preserving a reasonably detailed spatial grid. A final 1×1 convolution converts these features into a single-channel density map, which is then upsampled back to the input resolution.