Evaluation#

The central purpose of a crowd counting model is to estimate how many people appear in an image. Although P2PNet is trained with a localization-aware loss, the primary evaluation metric remains the count per image. Localization quality (how precisely heads are placed) is useful but secondary. The following sections explain how to turn P2PNet predictions into image-level counts, and how to measure counting accuracy over a dataset. Localization metrics are also discussed for completeness.

Turning predictions into counts#

For each input image, P2PNet predicts a two-dimensional offset and a confidence score for a fixed number \(M\) of anchor locations spread uniformly across the image. Let us denote by \(\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_M\) the scores (classification logits before sigmoid) predicted for one image. Two natural strategies for turning these scores into a single scalar count are explained below.
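To make these outputs concrete, the snippet below sketches how the raw predictions for a single image might be decoded into candidate points and confidence scores. The tensor names and shapes (`anchors`, `pred_offsets`, `pred_logits`) are illustrative assumptions, not a fixed P2PNet interface.

```python
import torch

# Hypothetical raw outputs for one image (names and shapes are illustrative assumptions)
M = 4096                                  # number of anchor locations
anchors = torch.rand(M, 2) * 512          # fixed (x, y) anchor coordinates, in pixels
pred_offsets = torch.randn(M, 2)          # predicted 2-D offsets, one per anchor
pred_logits = torch.randn(M)              # classification logits, one per anchor

# Predicted head locations: each anchor shifted by its predicted offset
pred_points = anchors + pred_offsets      # shape (M, 2)

# Confidence that each candidate is a real head
pred_scores = torch.sigmoid(pred_logits)  # shape (M,), values in (0, 1)
```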

Hard counting#

The hard-count strategy treats P2PNet as a detector. We choose a confidence threshold \(\theta \in (0, 1)\), such as 0.5, and retain only candidates whose sigmoid scores are at least \(\theta\). If we denote by \(1_{[\cdot \geq \theta]}\) the indicator that a candidate passes the threshold, the predicted count is

\[ \hat{C}_{\text{hard}} = \sum_{i=1}^M 1_{[\sigma(\hat{c}_i) \geq \theta]}, \]

where \(\sigma(\cdot)\) is the sigmoid function.

This count is an integer and corresponds exactly to what one would obtain by plotting the predicted points on the image and counting how many are visible after thresholding. The choice of threshold has a noticeable effect. A high threshold tends to under-count (many true heads are rejected), while a low threshold tends to over-count (many spurious detections are accepted). It is good practice to fix the threshold once (e.g., using a validation set) and keep it unchanged when reporting final results.
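A minimal sketch of the hard count, assuming the logits for one image are stored in a tensor `pred_logits` as above:

```python
import torch

def hard_count(pred_logits: torch.Tensor, threshold: float = 0.5) -> int:
    """Count the candidates whose sigmoid score reaches the threshold."""
    scores = torch.sigmoid(pred_logits)             # logits -> probabilities in (0, 1)
    return int((scores >= threshold).sum().item())  # integer count of retained candidates
```

The same boolean mask `scores >= threshold` can be reused to select the corresponding predicted points, either for visualization or for the localization metrics discussed later.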

Soft counting#

The soft-count strategy uses the scores directly as fractional votes. Instead of thresholding, we sum all scores after applying the sigmoid function, which transforms logits into probabilities between 0 and 1:

\[ \hat{C}_{\text{soft}} = \sum_{i=1}^M \sigma(\hat{c}_i) \]

If the probabilities are well calibrated, this sum can be interpreted as the expected number of true heads according to the model. The resulting count is real-valued; one may either leave it as such, or round it to the nearest integer for reporting. Soft counts are often smoother and less sensitive to the particular threshold chosen, but they depend more heavily on the calibration of the confidence scores.
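Under the same assumptions, the soft count is essentially a one-liner:

```python
import torch

def soft_count(pred_logits: torch.Tensor) -> float:
    """Sum the sigmoid scores of all candidates as fractional votes."""
    return torch.sigmoid(pred_logits).sum().item()  # real-valued count; round if desired
```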

Counting metrics#

Once a predicted count has been obtained for each image, evaluation becomes straightforward. For an image with ground-truth count \(C\) (the number of annotated heads) and predicted count \(\hat{C}\), the counting error is simply the difference \(\hat{C} - C\). Over a test set of \(T\) images, two basic metrics are widely used.

  • Mean Absolute Error (MAE):

    \[ \text{MAE} = \frac{1}{T} \sum_{t=1}^T |\hat{C}_t - C_t|. \]
  • Mean Squared Error (MSE):

    \[ \text{MSE} = \frac{1}{T} \sum_{t=1}^T (\hat{C}_t - C_t)^2. \]

MAE measures how many people are miscounted on average. MSE penalizes large errors more heavily due to the squaring. A model with small MAE but very large MSE tends to perform well on easy images but occasionally fails catastrophically on difficult ones. Both metrics are useful: MAE is easier to interpret, while MSE is more sensitive to outliers.
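Both metrics can be computed in a few lines once the per-image counts are collected. The sketch below assumes the predicted and ground-truth counts for the whole test set are stored in two tensors of equal length.

```python
import torch

def counting_metrics(pred_counts: torch.Tensor, gt_counts: torch.Tensor) -> tuple[float, float]:
    """Return (MAE, MSE) over a set of per-image counts."""
    diff = pred_counts.float() - gt_counts.float()
    mae = diff.abs().mean().item()      # average number of miscounted people
    mse = (diff ** 2).mean().item()     # penalizes large errors more heavily
    return mae, mse

# Example with five images: errors of +3, -3, -30, 0, +5
pred = torch.tensor([101.0, 47.0, 230.0, 12.0, 75.0])
gt = torch.tensor([98.0, 50.0, 260.0, 12.0, 70.0])
print(counting_metrics(pred, gt))       # (8.2, 188.6)
```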

Localization metrics#

Although counting metrics capture the main objective, P2PNet is explicitly designed to localize individuals. It is therefore natural to ask how good these localizations are, beyond their contribution to the count. In this context, the standard metrics are precision, recall, and F1 score. The general procedure is as follows.

  • For each image, apply the chosen threshold to obtain a set of predicted head locations.

  • Match predictions to ground-truth points using the Hungarian algorithm with a cost matrix based purely on Euclidean distance (confidence scores are not used).

  • A prediction is considered correct if it lies within a fixed radius of its matched ground-truth point.

  • Count true positives (matched predictions), false positives (unmatched predictions), and false negatives (unmatched ground truths).

  • Aggregate these counts over the dataset and compute precision, recall, and F1.

Let us denote the above statistics by TP, FP, and FN, aggregated over the test set. Then, the localization metrics are defined as follows.

\[ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \qquad \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \qquad \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}. \]

Because implementing these metrics requires point matching and careful handling of edge cases, they are considered optional in this project. However, they provide a valuable perspective: two models with similar MAE may have very different localization behavior, which only precision/recall-type metrics can reveal.
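For students who attempt them, the sketch below shows one way to obtain TP, FP, and FN for a single image, using scipy's Hungarian solver (`linear_sum_assignment`) on a Euclidean cost matrix. The matching radius `sigma` is a hypothetical parameter that you would fix for your dataset.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_points(pred_pts: np.ndarray, gt_pts: np.ndarray, sigma: float = 8.0):
    """Return (TP, FP, FN) for one image via Hungarian matching on Euclidean distance."""
    if len(pred_pts) == 0 or len(gt_pts) == 0:
        return 0, len(pred_pts), len(gt_pts)
    cost = cdist(pred_pts, gt_pts)               # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)     # optimal one-to-one assignment
    tp = int((cost[rows, cols] <= sigma).sum())  # matched pairs within the radius
    fp = len(pred_pts) - tp                      # predictions without a valid match
    fn = len(gt_pts) - tp                        # ground-truth points left unmatched
    return tp, fp, fn
```

Summing TP, FP, and FN over all test images and plugging them into the formulas above gives the dataset-level precision, recall, and F1.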

Evaluation loop#

A complete evaluation loop typically follows this structure:

  • Switch the model to evaluation mode, disable gradient computation, and reset all metrics.

  • For each batch of images and ground-truth points:

    • run P2PNet to obtain offsets and logits;

    • decode offsets into predicted positions and apply sigmoid to logits;

    • for each image in the batch, apply a confidence threshold to decide which candidates to keep;

    • compute the predicted count (hard or soft) and update the MAE/MSE metrics;

    • (optional) extract the kept points and use the matching procedure to obtain TP, FP, and FN.

  • After the loop, finalize the MAE and MSE calculations by dividing accumulated sums by the number of images. If localization metrics were computed, compute precision, recall, and F1.

Make sure you never mix training and evaluation code: do not update model parameters inside the evaluation loop, and do not apply training-time data augmentations.
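Putting the pieces together, here is a minimal sketch of such a loop in PyTorch. The model call and its output format (`offsets` of shape (B, M, 2), `logits` of shape (B, M)), as well as the data loader yielding images together with lists of ground-truth point arrays, are assumptions about your own code rather than a fixed interface.

```python
import torch

@torch.no_grad()                                   # disable gradient computation
def evaluate(model, data_loader, device, threshold: float = 0.5, soft: bool = False):
    """Return (MAE, MSE) over a data loader of (images, ground-truth points) batches."""
    model.eval()                                   # evaluation mode: no dropout, frozen batch norm
    abs_err, sq_err, n_images = 0.0, 0.0, 0

    for images, gt_points in data_loader:
        offsets, logits = model(images.to(device)) # assumed outputs: (B, M, 2) and (B, M)
        scores = torch.sigmoid(logits)             # logits -> probabilities
        # offsets are only needed for the optional localization metrics

        for scores_i, gt_i in zip(scores, gt_points):
            if soft:
                pred_count = scores_i.sum().item()                 # soft count
            else:
                pred_count = (scores_i >= threshold).sum().item()  # hard count
            err = pred_count - len(gt_i)
            abs_err += abs(err)
            sq_err += err ** 2
            n_images += 1

    return abs_err / n_images, sq_err / n_images   # MAE, MSE
```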

Summary#

To evaluate P2PNet as a crowd counting model, you first convert its grid-based anchor predictions into a scalar count per image, either by thresholding the sigmoid scores and counting the retained candidates (hard count) or by summing the sigmoid scores directly (soft count). Comparing these counts to the true number of heads using MAE and MSE yields quantitative measures of counting accuracy over a dataset.

For students who wish to go further, P2PNet’s explicit head locations make it natural to compute localization metrics such as precision, recall, and F1 by matching predicted points to ground truth. Together, counting and localization evaluations provide a comprehensive picture of how well the model performs both as a counter and as a locator of individuals in crowded scenes.