Part 1: Single Object
In the first part of the project, you will focus on the simplified task of detecting whether an image contains a single object of a chosen category and, if so, predicting a bounding box around it. This restricted problem removes the complexity of handling multiple objects per image, making it easier to implement. The goal is to build a complete pipeline that takes images as input and produces both a binary decision (object present or not) and bounding box coordinates. This stage also lays the foundation for the later parts of the project.
Overview
The workflow for single-object localization combines classification and regression in a single model. The classification branch predicts whether the target object class is present (a binary label). The regression branch predicts the bounding box coordinates, which are meaningful only when the object is present. To implement this, you will proceed as follows:
Construct a dataset based on Pascal VOC by retaining only images with at most one instance of a chosen class.
Preprocess the images and annotations into a format suitable for training.
Use a pretrained MobileNetV3 to extract compact feature representations from the images.
Train a small localization head with two branches: one for classification, one for bounding box regression.
Optimize a combined loss function: binary cross-entropy for classification plus Smooth L1 for regression, with the regression term applied only to images where the object is present.
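The combined loss in the last step can be sketched in plain Python to make the masking explicit. This is a minimal illustration, not the project's reference implementation: the function names, the `box_weight` factor, the `beta` threshold, and the choice to average the regression term over positive images only are all assumptions made for this example. In practice you would use the vectorized equivalents from your deep learning framework.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """BCE for one predicted probability p and one label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 averaged over the coordinates of one box."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # Quadratic near zero, linear for large errors.
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def combined_loss(probs, labels, pred_boxes, true_boxes, box_weight=1.0):
    """Classification loss on all images, box loss on positives only.

    box_weight is a hypothetical balancing factor (assumption).
    """
    n = len(labels)
    cls = sum(binary_cross_entropy(p, y) for p, y in zip(probs, labels)) / n

    # Mask: only images that contain the object contribute to regression.
    positives = [i for i, y in enumerate(labels) if y == 1]
    if positives:
        reg = sum(
            smooth_l1(pred_boxes[i], true_boxes[i]) for i in positives
        ) / len(positives)
    else:
        reg = 0.0  # batch without positives: no box supervision

    return cls + box_weight * reg
```

Because of the mask, the predicted box for an image without the object has no effect on the loss, so the regression branch is never penalized for its output on negative images.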
This approach gives you a first complete pipeline for object localization, from data preparation to model training. By completing this stage, you will build the intuition required for more advanced settings.
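As a starting point for the dataset construction step, the sketch below parses one Pascal VOC annotation file and keeps the image only if it contains at most one instance of the chosen class. The function name, the returned dictionary layout, the all-zero placeholder box for negative images, and the normalization of coordinates to [0, 1] are assumptions made for this example, not requirements of the project.

```python
import xml.etree.ElementTree as ET


def parse_single_object(xml_text, target_class):
    """Return a training record, or None if the image must be discarded.

    Hypothetical helper: keeps images with zero or one instance of
    target_class and drops images with two or more.
    """
    root = ET.fromstring(xml_text)
    width = float(root.findtext("size/width"))
    height = float(root.findtext("size/height"))

    # Collect the boxes of every object of the chosen class.
    boxes = []
    for obj in root.findall("object"):
        if obj.findtext("name") != target_class:
            continue
        bb = obj.find("bndbox")
        boxes.append(
            [float(bb.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax")]
        )

    if len(boxes) > 1:
        return None  # more than one instance: outside the scope of Part 1

    filename = root.findtext("filename")
    if not boxes:
        # Negative example: the box is a placeholder ignored by the loss.
        return {"filename": filename, "label": 0, "box": [0.0] * 4}

    xmin, ymin, xmax, ymax = boxes[0]
    # Normalizing by image size is an assumption of this sketch.
    box = [xmin / width, ymin / height, xmax / width, ymax / height]
    return {"filename": filename, "label": 1, "box": box}
```

Applying this function to every annotation file and dropping the `None` results yields a list of records with a binary label and one box each, which matches the two outputs the model is trained to predict.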