Evaluation#
CSRNet is trained as a density-map regressor, but in practice we care primarily about how accurately it counts people in each image. Evaluation therefore focuses on comparing the predicted count (obtained from the predicted density map) to the actual count (derived from the ground-truth density map). The following sections explain how to go from CSRNet predictions to image-level counts, and how to measure counting accuracy over a dataset. At the end, a short section discusses optional, more detailed diagnostics.
Counting metrics#
For a given preprocessed image, let \(\hat{D}\) be the predicted density map produced by CSRNet, with height \(H'\) and width \(W'\). The natural way to obtain a predicted count \(\hat{C}\) from this density map is to sum over all pixels:
\[ \hat{C} = \sum_{x=1}^{H'} \sum_{y=1}^{W'} \hat{D}(x, y) \]
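In code, this is a single spatial sum over the network output. A minimal PyTorch sketch, assuming the model returns a tensor of shape `(batch, 1, H', W')`:

```python
import torch

def counts_from_density(density: torch.Tensor) -> torch.Tensor:
    """Integrate density maps into per-image counts.

    density: predicted maps of shape (batch, 1, H', W').
    Returns a (batch,) tensor with one predicted count per image.
    """
    # Summing over the channel and spatial dimensions approximates the integral.
    return density.sum(dim=(1, 2, 3))
```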
The ground-truth count \(C\) can be computed in the same way from the ground-truth density map. Over a dataset of \(T\) images, with \(\hat{C}_i\) and \(C_i\) denoting the predicted and ground-truth counts for image \(i\), the two most common metrics are the following.
Mean Absolute Error (MAE)
\[ \text{MAE} = \frac{1}{T} \sum_{i=1}^{T} |\hat{C}_i - C_i| \]
Mean Squared Error (MSE)
\[ \text{MSE} = \frac{1}{T} \sum_{i=1}^{T} (\hat{C}_i - C_i)^2 \]
MAE measures the average counting error per image and is typically the primary metric reported for crowd counting. Because it squares each error, MSE is more sensitive to occasional large mistakes, so it is often reported alongside MAE as a complementary measure of robustness.
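Given per-image counts, both metrics reduce to a few lines of code. A minimal sketch, assuming `pred_counts` and `true_counts` are equal-length sequences of floats:

```python
def counting_metrics(pred_counts, true_counts):
    """Return (MAE, MSE) over paired per-image counts."""
    t = len(pred_counts)
    mae = sum(abs(p - c) for p, c in zip(pred_counts, true_counts)) / t
    mse = sum((p - c) ** 2 for p, c in zip(pred_counts, true_counts)) / t
    return mae, mse
```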
Evaluation loop#
Evaluation is done over a validation or test set using the following procedure (a code sketch follows the list):

1. Put the model in evaluation mode and disable gradient computation.
2. Initialize accumulators for the absolute and squared counting errors.
3. For each batch, run CSRNet on the input images to obtain predicted density maps.
4. Sum predicted and ground-truth density maps over their spatial dimensions to obtain per-image counts.
5. Update the error accumulators with these counts.
6. After processing all batches, divide both accumulators by the number of images to obtain MAE and MSE.
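A minimal PyTorch sketch of this loop follows; `model`, `val_loader`, and `device` are assumptions standing in for whatever your training setup provides, and the loader is assumed to yield `(images, ground-truth density maps)` pairs:

```python
import torch

@torch.no_grad()  # disable gradient computation during evaluation
def evaluate(model, val_loader, device):
    """Return (MAE, MSE) over a validation or test set."""
    model.eval()  # evaluation mode: fixes batch-norm statistics, disables dropout
    abs_err = sq_err = 0.0
    n_images = 0

    for images, gt_density in val_loader:
        images = images.to(device)
        gt_density = gt_density.to(device)

        pred_density = model(images)

        # Sum each map over all non-batch dimensions to get per-image counts.
        pred_counts = pred_density.flatten(1).sum(dim=1)
        true_counts = gt_density.flatten(1).sum(dim=1)

        diff = pred_counts - true_counts
        abs_err += diff.abs().sum().item()
        sq_err += (diff ** 2).sum().item()
        n_images += images.size(0)

    return abs_err / n_images, sq_err / n_images
```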
Optional diagnostics#
Although MAE and MSE are the main quantitative metrics, it is also worth monitoring training more closely and inspecting CSRNet's spatially resolved predictions. Three simple diagnostics are particularly useful.
Pixel-wise MSE as a diagnostic. While training CSRNet, you can compute the average MSE between predicted and ground-truth density maps over a validation set. A steadily decreasing pixel-wise MSE indicates that the model is learning to produce density maps that match the ground truth. If the pixel-wise MSE stagnates or oscillates, it may indicate issues with learning rate or data preprocessing.
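A sketch of this diagnostic, assuming the ground-truth maps yielded by `val_loader` match the resolution of the model output:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pixelwise_mse(model, val_loader, device):
    """Average per-pixel squared error between predicted and ground-truth maps."""
    model.eval()
    total = 0.0
    n_batches = 0
    for images, gt_density in val_loader:
        pred = model(images.to(device))
        # reduction="mean" averages the squared error over every pixel in the batch.
        total += F.mse_loss(pred, gt_density.to(device)).item()
        n_batches += 1
    return total / n_batches
```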
MAE on the validation set. While training CSRNet, you can compute MAE on a validation set at the end of each epoch. This metric relates directly to the counting task and can help you detect overfitting, trigger early stopping, or tune hyperparameters.
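One way to act on this signal is a simple patience-based early-stopping check around the `evaluate` function sketched above; `train_one_epoch` is a hypothetical helper standing in for your training step, and `num_epochs`, `patience`, and the checkpoint path are assumptions:

```python
import torch

best_mae = float("inf")
epochs_without_improvement = 0
patience = 10  # assumed patience budget; tune for your setup

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer, device)  # hypothetical helper
    val_mae, _ = evaluate(model, val_loader, device)

    if val_mae < best_mae:
        best_mae = val_mae
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "csrnet_best.pth")  # keep the best checkpoint
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # validation MAE has stopped improving
```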
Visual overlays. For a few validation images, plot the predicted density map as a heatmap overlaid on the input image, and compare it to the true head locations. This helps verify that the model concentrates density in the right regions and does not produce obviously misplaced or overly diffuse responses.
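A matplotlib sketch of such an overlay, assuming `image` is an `(H, W, 3)` array, `pred_density` a 2D array (possibly at a lower resolution than the image), and `head_points` an optional `(N, 2)` array of ground-truth `(x, y)` head locations:

```python
import matplotlib.pyplot as plt

def show_overlay(image, pred_density, head_points=None):
    """Overlay a predicted density heatmap on the input image."""
    plt.imshow(image)
    # `extent` stretches the (possibly downsampled) density map over the image.
    plt.imshow(pred_density, cmap="jet", alpha=0.5,
               extent=(0, image.shape[1], image.shape[0], 0))
    if head_points is not None:
        plt.scatter(head_points[:, 0], head_points[:, 1], s=8, c="white",
                    label="ground-truth heads")
        plt.legend(loc="lower right")
    plt.axis("off")
    plt.show()
```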
Together, these diagnostics give a much clearer picture of how the model behaves than the aggregate metrics alone.
Summary#
Evaluating CSRNet centers on a simple idea: the integral of the predicted density map should match the true number of people. By summing the network’s output over its spatial dimensions and comparing the resulting counts to ground-truth counts, you can treat evaluation as a scalar regression problem and use standard metrics such as MAE and MSE. Optional diagnostics and visualizations provide additional insight into how CSRNet distributes density across the image, but the core criterion for success remains accurate counting across the dataset.