# Image Scraping

In this project, you are tasked with creating your own multiclass image dataset. Manual download is time-consuming and error-prone. Instead, you will automate dataset collection using [icrawler](https://github.com/hellock/icrawler), a Python library that supports scraping images from multiple web sources (e.g., Bing, Google).

## Icrawler

Icrawler provides a simple interface for downloading images based on keywords. It has 6 built-in crawlers, listed below.
- **Google**: Scrapes images from Google Images.
- **Bing**: Scrapes images from Bing Images.
- **Baidu**: Scrapes images from Baidu Images.
- **Flickr**: Scrapes images from Flickr.
- **General greedy crawl**: Scrapes all images from a website.
- **UrlList**: Scrapes all images given a list of URLs.

The search engine crawlers (Google, Bing, Baidu) share a common API. Here is a usage example.

```python
# uv add icrawler --OR-- pip install icrawler
from icrawler.builtin import GoogleImageCrawler 

filters = dict(size='large', license='commercial,modify')

google_crawler = GoogleImageCrawler(storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(keyword='cat', filters=filters, max_num=100)
```

The filter options provided by Google, Bing and Baidu are different. Check the [documentation](https://icrawler.readthedocs.io/en/latest/builtin.html) for more details on the filters and other options. 

### Scraping logic

To keep your code organized, encapsulate the scraping logic in a function like the one below. It takes a class label, a list of search queries, the output directory, the total number of images per class, and optional filters. It then creates a subfolder for the class and invokes a crawler once per query.

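```python
import os
from icrawler.builtin import BingImageCrawler


def fetch_class_images(label, queries, root='.data/images', total=50, filters=None):
    """Download about `total` images for `label`, split evenly across `queries`."""
    save_dir = os.path.join(root, label)
    os.makedirs(save_dir, exist_ok=True)
    # Integer division drops any remainder; request at least one image per query.
    per_query = max(1, total // len(queries))

    crawler = BingImageCrawler(
        feeder_threads=1,
        parser_threads=1,
        downloader_threads=4,
        storage={'root_dir': save_dir}
    )

    for q in queries:
        # 'auto' continues the file numbering after the images already in save_dir,
        # so later queries do not overwrite earlier downloads.
        crawler.crawl(keyword=q, max_num=per_query, filters=filters, file_idx_offset='auto')
```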
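
Because the per-query budget is computed as `total // len(queries)`, any remainder is dropped (for example, 10 images over 3 queries yields 9). If that matters, the budget could be distributed as in the sketch below. The helper name `split_budget` is hypothetical and not part of icrawler.

```python
def split_budget(total: int, n_queries: int) -> list[int]:
    """Split `total` into `n_queries` parts that differ by at most one."""
    base, extra = divmod(total, n_queries)
    # The first `extra` queries each receive one additional image.
    return [base + 1 if i < extra else base for i in range(n_queries)]


print(split_budget(10, 3))  # [4, 3, 3]
```

Each value could then be passed as `max_num` for the corresponding query.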

```{tip}
Search engines will limit the number of returned images per query. This number is usually 1000 for Google and Bing. To crawl more than 1000 images with a single keyword, you can invoke the `.crawl` method multiple times with different date ranges in the filter options.
```
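
As a rough sketch of the date-range idea in the tip above, the helper below builds one window per calendar year. It assumes the Google crawler's `date` filter accepts a pair of `(year, month, day)` tuples, as described in the icrawler documentation; check the documentation for the crawler you use, since the accepted date values differ between engines. The name `yearly_windows` is hypothetical.

```python
def yearly_windows(start_year: int, end_year: int):
    """Return one ((y, 1, 1), (y, 12, 31)) date window per year, inclusive."""
    return [((y, 1, 1), (y, 12, 31)) for y in range(start_year, end_year + 1)]


windows = yearly_windows(2020, 2022)

# Hypothetical usage with a GoogleImageCrawler created as in the earlier example
# (not executed here, since it requires network access):
# for window in windows:
#     google_crawler.crawl(keyword='cat', filters={'date': window},
#                          max_num=1000, file_idx_offset='auto')
```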

## Fetching Images for All Classes

Prepare a dictionary mapping class labels to search terms. Then loop through each entry and call the scraping function. This will populate `.data/images/<label>` with roughly equal numbers of images per query. The actual number of images per class may vary depending on the search terms and filters used.

```python
classes = {
    'car': ['car', 'vintage car'],
    'flower': ['flower', 'wildflower'],
    'building': ['architecture', 'building']
}

filters = dict(size='small')  # Small images are ok, since they will be resized anyway

for label, queries in classes.items():
    # Pass the filters explicitly; otherwise the function falls back to filters=None
    fetch_class_images(label, queries, total=10, filters=filters)  # Adjust total as needed
```

```text
2025-05-13 20:26:03,195 - INFO - icrawler.crawler - start crawling...
2025-05-13 20:26:03,196 - INFO - icrawler.crawler - starting 1 feeder threads...
2025-05-13 20:26:03,197 - INFO - feeder - thread feeder-001 exit
2025-05-13 20:26:03,198 - INFO - icrawler.crawler - starting 1 parser threads...
2025-05-13 20:26:03,199 - INFO - icrawler.crawler - starting 4 downloader threads...
2025-05-13 20:26:03,609 - INFO - parser - parsing result page https://www.bing.com/images/async?q=car&first=0
2025-05-13 20:26:03,696 - ERROR - downloader - Response status code 403, file https://www.hdwallpapers.in/download/2024_lamborghini_revuelto_car_4k_5k_hd_cars-5120x2880.jpg
2025-05-13 20:26:03,872 - INFO - downloader - image #1	https://www.autocar.co.uk/sites/autocar.co.uk/files/images/car-reviews/first-drives/legacy/ferrari-vision-gt-front-three-quarter.jpg
2025-05-13 20:26:03,878 - INFO - downloader - image #2	https://images.carexpert.com.au/resize/3000/-/app/uploads/2023/04/mini-hatch-1.jpg
...
2025-05-13 20:26:20,318 - INFO - parser - downloaded image reached max num, thread parser-001 is ready to exit
2025-05-13 20:26:20,318 - INFO - parser - thread parser-001 exit
```


After running this code, you should have a folder structure similar to the one below.

```bash
.data
├── images
│   ├── car/
│   ├── flower/
│   └── building/
```

## Quality Control

After downloading, you may want to verify and clean your raw images.

 - **Visual inspection:** glance through a subset in each class to confirm relevance.

 - **Duplicate removal:** use an image hashing tool to detect and delete near‑duplicates.

 - **Count check:** ensure each class has the intended number of images.
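
The count check can be automated with the standard library alone. The sketch below assumes the `.data/images/<label>` layout produced earlier. The helper name `count_images_per_class` and the extension list are illustrative choices, not part of icrawler.

```python
from pathlib import Path

# Extensions treated as images; adjust to match what your crawl produced.
IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png', '.gif', '.bmp', '.webp'}


def count_images_per_class(root='.data/images'):
    """Return a dict mapping each class subfolder name to its image count."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
    return counts


# Example usage once the images have been downloaded:
# for label, n in count_images_per_class().items():
#     print(f'{label}: {n} images')
```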

### Removing duplicate images

You can use the `imagehash` library to compute a hash of the image and check for duplicates. The following code snippet shows a function that compares two images by checking if their hashes are too similar. You can use this function to build your own duplicate removal logic, which boils down to checking each image against all others in the same class and removing those that are too similar.

```python
from PIL.Image import Image  # uv add pillow --OR-- pip install pillow
import imagehash             # uv add imagehash --OR-- pip install imagehash


def similar_images(image1: Image, image2: Image, threshold: int = 5) -> bool:
    """
    Compare two images using perceptual hashing.
    """
    hash_function = imagehash.phash

    hash1 = hash_function(image1)
    hash2 = hash_function(image2)

    # Subtracting two hashes gives the number of differing bits (Hamming distance).
    diff = hash1 - hash2

    # If the difference is small, consider them similar.
    return diff < threshold
```
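
One possible way to build the removal logic is a greedy pass that keeps the first image it sees and flags any later image that is too close to one already kept. The sketch below works on precomputed hashes, so each image is hashed once instead of once per comparison. The function name `find_near_duplicates` is hypothetical. The default `distance` assumes hash objects whose subtraction yields the number of differing bits, as with `imagehash`; the demo uses plain integers so it runs without any image files.

```python
def find_near_duplicates(hashes, threshold=5, distance=lambda a, b: a - b):
    """Return the keys in `hashes` that are near-duplicates of an earlier key.

    `hashes` maps a file path to its hash. The first occurrence is kept.
    """
    kept, duplicates = [], []
    for path, h in hashes.items():
        if any(distance(h, hashes[k]) < threshold for k in kept):
            duplicates.append(path)
        else:
            kept.append(path)
    return duplicates


def bit_distance(a, b):
    """Hamming distance between two integers, standing in for real image hashes."""
    return bin(a ^ b).count('1')


demo = {'a.jpg': 0b11110000, 'b.jpg': 0b11110001, 'c.jpg': 0b00001111}
print(find_near_duplicates(demo, threshold=5, distance=bit_distance))  # ['b.jpg']

# With real images (assumes pillow and imagehash are installed), the hashes could be
# built per class folder and the flagged files deleted after a manual review:
# hashes = {p: imagehash.phash(PIL.Image.open(p)) for p in sorted(class_dir.iterdir())}
# for p in find_near_duplicates(hashes): p.unlink()
```

This pass compares each image against the kept set, so it is quadratic in the worst case, which is usually acceptable for a few hundred images per class.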

## Creating a Dataset

Once you have your images, you can create a dataset. The format expected by most libraries (e.g., PyTorch) is a folder structure where each class has its own subfolder, with the images inside. This is consistent with the folder structure created during scraping. Refer to the [Dataset From Files](../../tutorials/cnn/cnn-2.ipynb) tutorial for more details on how to load the images into a dataset object, split them into training and validation sets, and apply any necessary transformations.
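
To make the folder convention concrete, here is a standard-library sketch that turns the class subfolders into `(path, class_index)` pairs and splits them into training and validation sets. It is only an illustration of the idea and not the loading code from the tutorial. The names `list_samples` and `train_val_split`, the alphabetical class ordering, and the 80/20 split are all assumptions.

```python
import random
from pathlib import Path


def list_samples(root='.data/images'):
    """Return ([(path, class_index), ...], class_names) from class subfolders."""
    root = Path(root)
    # Class indices follow the alphabetical order of the subfolder names.
    class_names = sorted(d.name for d in root.iterdir() if d.is_dir())
    samples = [
        (path, idx)
        for idx, name in enumerate(class_names)
        for path in sorted((root / name).iterdir())
        if path.is_file()
    ]
    return samples, class_names


def train_val_split(samples, val_fraction=0.2, seed=0):
    """Shuffle reproducibly, then hold out `val_fraction` of the samples."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Example usage once the images have been downloaded:
# samples, class_names = list_samples()
# train, val = train_val_split(samples)
```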