Choice 2: Head localization
In this project, you will build a convolutional neural network that estimates the number of individuals in a crowded scene by directly localizing their head positions. Your implementation will follow the approach described in “Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework” by Song et al. (2021). This approach formulates crowd counting as a direct localization problem, predicting head positions through a combination of feature extraction and point regression. The model is trained using a loss function that matches predicted points to ground-truth annotations using the Hungarian algorithm, allowing it to effectively learn to identify and count individuals in dense crowds.
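
The matching step mentioned above can be illustrated with a minimal sketch. It assumes NumPy arrays as inputs and uses SciPy's `linear_sum_assignment` to solve the assignment problem. The function name `match_points`, the cost form (weighted distance minus confidence), and the default weight `tau` are illustrative assumptions, not the reference implementation; check the paper for the exact cost and hyperparameters.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_points(pred_points, pred_scores, gt_points, tau=0.05):
    """One-to-one matching between predicted points and annotated heads.

    pred_points: (N, 2) array of predicted (x, y) coordinates
    pred_scores: (N,) array of confidence scores in [0, 1]
    gt_points:   (M, 2) array of annotated head coordinates
    tau:         weight on the distance term (assumed value, tune as needed)

    Returns two index arrays (pred_idx, gt_idx) of equal length, where
    prediction pred_idx[k] is matched to ground-truth point gt_idx[k].
    """
    # Pairwise Euclidean distances between predictions and targets, shape (N, M).
    diff = pred_points[:, None, :] - gt_points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # A pair is cheap when the prediction is close to the target and confident.
    cost = tau * dist - pred_scores[:, None]

    # Hungarian-style solver: minimizes the total cost of a one-to-one assignment.
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return pred_idx, gt_idx
```

Predictions left unmatched by this step would typically be treated as negatives (background) in the classification part of the loss, while matched pairs also contribute a regression term on their coordinates.
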
| Difficulty | Suggested Tutorials |
|---|---|
| Medium | |
Requirements: A GPU with at least 8GB of memory is recommended. Training on a CPU is possible but will be significantly slower.

Grading
The project will be graded based on the following criteria. Points for each activity are awarded based on quality and completeness (partial credit possible).
| Activity | Points (max) |
|---|---|
| Build a preprocessing pipeline for the ShanghaiTech dataset | 4 |
| Implement the P2PNet architecture on top of VGG16 | 4 |
| Implement the loss function with Hungarian assignment | 2 |
| Train the backend while keeping the frontend frozen | 2 |
| Fine-tune the entire network end-to-end | 2 |
| Evaluate model performance using MAE | 3 |
| Presentation (clarity & demo) | 3 |
| Total | 0-20 |
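
For the evaluation activity, the following is a minimal sketch of how a per-image count and the mean absolute error (MAE) over a test set might be computed. The helper names `count_from_scores` and `count_mae` are hypothetical, and the 0.5 confidence threshold is an assumption that you may need to adjust for your model.

```python
import numpy as np


def count_from_scores(scores, threshold=0.5):
    """Predicted count for one image: number of points above the threshold.

    The threshold value is an assumption, not a prescribed setting.
    """
    return int(np.sum(np.asarray(scores) > threshold))


def count_mae(pred_counts, gt_counts):
    """Mean absolute error between predicted and annotated counts."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    return float(np.mean(np.abs(pred - gt)))
```

In use, you would call `count_from_scores` once per test image, collect the results in a list, and pass that list to `count_mae` together with the number of annotated heads per image.
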