Image Scraping#

In this project, you are tasked with creating your own multiclass image dataset. Manual download is time-consuming and error-prone. Instead, you will automate dataset collection using icrawler, a Python library that supports scraping images from multiple web sources (e.g., Bing, Google).

Icrawler#

Icrawler provides a simple interface for downloading images based on keywords. It has 6 built-in crawlers, listed below.

Google: Scrapes images from Google Images.
Bing: Scrapes images from Bing Images.
Baidu: Scrapes images from Baidu Images.
Flickr: Scrapes images from Flickr.
General greedy crawl: Scrapes all images from a website.
UrlList: Scrapes all images given a list of URLs.

The search engine crawlers (Google, Bing, Baidu) share a common API. Here is a usage example.

# uv add icrawler --OR-- pip install icrawler
from icrawler.builtin import GoogleImageCrawler 

filters = dict(size='large', license='commercial,modify')

google_crawler = GoogleImageCrawler(storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(keyword='cat', filters=filters, max_num=100)

The filter options provided by Google, Bing and Baidu are different. Check the documentation for more details on the filters and other options.

Scraping Logic#

For a better organization of your code, encapsulate the scraping logic in a function, like the one below. This function takes a class label, a list of search queries, the output directory, the total images per class, and filters. It then creates a subfolder and invokes a crawler once per query.

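import os
from icrawler.builtin import BingImageCrawler

def fetch_class_images(label, queries, root='.data/images', total=50, filters=None):
    """Download roughly `total` images for `label`, split evenly across `queries`."""
    if not queries:
        raise ValueError('queries must contain at least one search term')

    # Each class gets its own subfolder: <root>/<label>
    save_dir = os.path.join(root, label)
    os.makedirs(save_dir, exist_ok=True)

    # Integer division drops any remainder; request at least one image per query
    per_query = max(1, total // len(queries))

    crawler = BingImageCrawler(
        feeder_threads=1,
        parser_threads=1,
        downloader_threads=4,
        storage={'root_dir': save_dir}
    )

    for q in queries:
        # file_idx_offset='auto' continues the file numbering after images already in save_dir
        crawler.crawl(keyword=q, max_num=per_query, filters=filters, file_idx_offset='auto')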

Tip

Search engines will limit the number of returned images per query. This number is usually 1000 for Google and Bing. To crawl more than 1000 images with a single keyword, you can invoke the .crawl method multiple times with different date ranges in the filter options.
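The date-range idea from the tip can be sketched as follows. This is a minimal sketch, not a definitive recipe: the helper names yearly_date_ranges and crawl_by_year are hypothetical, and it assumes the Google crawler accepts a date filter given as a tuple of (year, month, day) tuples, so check the icrawler documentation for the exact filter format before relying on it.

```python
def yearly_date_ranges(first_year, last_year):
    """Return one ((y, 1, 1), (y, 12, 31)) range per year, both ends inclusive."""
    return [((year, 1, 1), (year, 12, 31)) for year in range(first_year, last_year + 1)]


def crawl_by_year(keyword, root_dir, first_year, last_year, per_range=1000):
    """Hypothetical helper: run one crawl per yearly date range."""
    # Imported here so the date helper above can be used without icrawler installed.
    from icrawler.builtin import GoogleImageCrawler

    crawler = GoogleImageCrawler(storage={'root_dir': root_dir})
    for date_range in yearly_date_ranges(first_year, last_year):
        crawler.crawl(
            keyword=keyword,
            filters={'date': date_range},  # assumed filter format, see lead-in
            max_num=per_range,
            file_idx_offset='auto',        # keep numbering across crawls
        )
```

Calling crawl_by_year('cat', 'your_image_dir', 2020, 2022) would issue three crawls, one per year. The same image can appear in several ranges, so duplicate removal is still worthwhile afterwards.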

Fetching Images for All Classes#

Prepare a dictionary mapping class labels to search terms. Then loop through each entry and call the scraping function. This will populate .data/images/<label> with roughly equal numbers of images per query. The actual number of images per class may vary depending on the search terms and filters used.

classes = {
    'car': ['car', 'vintage car'],
    'flower': ['flower', 'wildflower'],
    'building': ['architecture', 'building']
}

filters = dict(size='small')  # Small images are fine, since they will be resized anyway

for label, queries in classes.items():
    fetch_class_images(label, queries, total=10, filters=filters)  # Adjust total as needed

2025-05-13 20:26:03,195 - INFO - icrawler.crawler - start crawling...
2025-05-13 20:26:03,196 - INFO - icrawler.crawler - starting 1 feeder threads...
2025-05-13 20:26:03,197 - INFO - feeder - thread feeder-001 exit
2025-05-13 20:26:03,198 - INFO - icrawler.crawler - starting 1 parser threads...
2025-05-13 20:26:03,199 - INFO - icrawler.crawler - starting 4 downloader threads...
2025-05-13 20:26:03,609 - INFO - parser - parsing result page https://www.bing.com/images/async?q=car&first=0
2025-05-13 20:26:03,696 - ERROR - downloader - Response status code 403, file https://www.hdwallpapers.in/download/2024_lamborghini_revuelto_car_4k_5k_hd_cars-5120x2880.jpg
2025-05-13 20:26:03,872 - INFO - downloader - image #1	https://www.autocar.co.uk/sites/autocar.co.uk/files/images/car-reviews/first-drives/legacy/ferrari-vision-gt-front-three-quarter.jpg
2025-05-13 20:26:03,878 - INFO - downloader - image #2	https://images.carexpert.com.au/resize/3000/-/app/uploads/2023/04/mini-hatch-1.jpg
2025-05-13 20:26:03,882 - INFO - downloader - image #3	https://carwow-uk-wp-3.imgix.net/18015-MC20BluInfinito-scaled-e1666008987698.jpg
2025-05-13 20:26:03,952 - INFO - downloader - image #4	https://www.motoringresearch.com/wp-content/uploads/2022/10/2_SPECTRE-UNVEILED-–-THE-FIRST-FULLY-ELECTRIC-ROLLS-ROYCE_FRONT-3_4-1.jpg
2025-05-13 20:26:04,002 - INFO - downloader - image #5	https://wallpapercave.com/wp/wp11815133.jpg
2025-05-13 20:26:04,108 - INFO - downloader - downloaded images reach max num, thread downloader-001 is ready to exit
2025-05-13 20:26:04,108 - INFO - downloader - thread downloader-001 exit
2025-05-13 20:26:04,285 - INFO - downloader - downloaded images reach max num, thread downloader-003 is ready to exit
2025-05-13 20:26:04,286 - INFO - downloader - thread downloader-003 exit
2025-05-13 20:26:04,394 - INFO - downloader - downloaded images reach max num, thread downloader-004 is ready to exit
2025-05-13 20:26:04,395 - INFO - downloader - thread downloader-004 exit
2025-05-13 20:26:05,036 - INFO - downloader - downloaded images reach max num, thread downloader-002 is ready to exit
2025-05-13 20:26:05,037 - INFO - downloader - thread downloader-002 exit
2025-05-13 20:26:05,206 - INFO - icrawler.crawler - Crawling task done!
2025-05-13 20:26:05,207 - INFO - icrawler.crawler - start crawling...
2025-05-13 20:26:05,207 - INFO - icrawler.crawler - starting 1 feeder threads...
2025-05-13 20:26:05,208 - INFO - feeder - thread feeder-001 exit
2025-05-13 20:26:05,208 - INFO - icrawler.crawler - starting 1 parser threads...
2025-05-13 20:26:05,210 - INFO - icrawler.crawler - starting 4 downloader threads...
2025-05-13 20:26:05,519 - INFO - parser - parsing result page https://www.bing.com/images/async?q=vintage car&first=0
2025-05-13 20:26:05,761 - INFO - downloader - image #1	https://i.pinimg.com/originals/94/b2/90/94b29081a29e34434943acc39a6e87c5.jpg
2025-05-13 20:26:06,001 - INFO - downloader - image #2	https://i.pinimg.com/originals/cd/80/26/cd802678c48cfea5f659a9385947e8a1.jpg
2025-05-13 20:26:06,101 - INFO - downloader - image #3	https://png.pngtree.com/background/20230522/original/pngtree-1920s-vintage-car-wallpaper-picture-image_2685021.jpg
2025-05-13 20:26:06,107 - INFO - downloader - image #4	https://cdn2.adrianflux.co.uk/wp-fluxposure/uploads/2017/05/vintage-British-classic-car-1.jpg
2025-05-13 20:26:06,110 - INFO - downloader - image #5	http://wallup.net/wp-content/uploads/2015/01/cadillac-vintage-car.jpg
2025-05-13 20:26:06,144 - INFO - downloader - downloaded images reach max num, thread downloader-004 is ready to exit
2025-05-13 20:26:06,145 - INFO - downloader - thread downloader-004 exit
2025-05-13 20:26:06,777 - INFO - downloader - downloaded images reach max num, thread downloader-003 is ready to exit
2025-05-13 20:26:06,777 - INFO - downloader - thread downloader-003 exit
2025-05-13 20:26:06,841 - INFO - downloader - downloaded images reach max num, thread downloader-002 is ready to exit
2025-05-13 20:26:06,842 - INFO - downloader - thread downloader-002 exit
2025-05-13 20:26:07,239 - INFO - downloader - downloaded images reach max num, thread downloader-001 is ready to exit
2025-05-13 20:26:07,240 - INFO - downloader - thread downloader-001 exit
2025-05-13 20:26:08,116 - INFO - parser - downloaded image reached max num, thread parser-001 is ready to exit
2025-05-13 20:26:08,117 - INFO - parser - thread parser-001 exit
2025-05-13 20:26:08,120 - INFO - parser - downloaded image reached max num, thread parser-001 is ready to exit
2025-05-13 20:26:08,121 - INFO - parser - thread parser-001 exit
2025-05-13 20:26:08,227 - INFO - icrawler.crawler - Crawling task done!
2025-05-13 20:26:08,231 - INFO - icrawler.crawler - start crawling...
2025-05-13 20:26:08,231 - INFO - icrawler.crawler - starting 1 feeder threads...
2025-05-13 20:26:08,232 - INFO - feeder - thread feeder-001 exit
2025-05-13 20:26:08,232 - INFO - icrawler.crawler - starting 1 parser threads...
2025-05-13 20:26:08,234 - INFO - icrawler.crawler - starting 4 downloader threads...
2025-05-13 20:26:08,561 - INFO - parser - parsing result page https://www.bing.com/images/async?q=flower&first=0
2025-05-13 20:26:08,742 - INFO - downloader - image #1	https://images4.alphacoders.com/568/thumb-1920-568620.jpg
2025-05-13 20:26:08,892 - INFO - downloader - image #2	https://images.pexels.com/photos/531675/pexels-photo-531675.jpeg?cs=srgb&dl=beautiful-bloom-blooming-531675.jpg
2025-05-13 20:26:09,128 - INFO - downloader - image #3	https://static.vecteezy.com/system/resources/previews/022/267/874/large_2x/rose-flower-pictures-beautiful-roses-love-rose-flower-beautiful-flowers-wallpapers-ai-generated-free-photo.jpg
2025-05-13 20:26:09,168 - INFO - downloader - image #4	https://img.freepik.com/premium-photo/most-beautiful-flower-world-close-up-generative-ai_691560-9322.jpg
2025-05-13 20:26:09,519 - INFO - downloader - image #5	https://wallpapertag.com/wallpaper/full/f/e/3/840612-best-pictures-of-beautiful-flowers-wallpapers-2560x1600-hd-for-mobile.jpg
2025-05-13 20:26:09,630 - INFO - downloader - downloaded images reach max num, thread downloader-001 is ready to exit
2025-05-13 20:26:09,632 - INFO - downloader - thread downloader-001 exit
2025-05-13 20:26:09,827 - INFO - downloader - downloaded images reach max num, thread downloader-003 is ready to exit
2025-05-13 20:26:09,828 - INFO - downloader - thread downloader-003 exit
2025-05-13 20:26:09,836 - INFO - downloader - downloaded images reach max num, thread downloader-004 is ready to exit
2025-05-13 20:26:09,837 - INFO - downloader - thread downloader-004 exit
2025-05-13 20:26:10,028 - INFO - downloader - downloaded images reach max num, thread downloader-002 is ready to exit
2025-05-13 20:26:10,029 - INFO - downloader - thread downloader-002 exit
2025-05-13 20:26:10,243 - INFO - icrawler.crawler - Crawling task done!
2025-05-13 20:26:10,244 - INFO - icrawler.crawler - start crawling...
2025-05-13 20:26:10,245 - INFO - icrawler.crawler - starting 1 feeder threads...
2025-05-13 20:26:10,246 - INFO - feeder - thread feeder-001 exit
2025-05-13 20:26:10,247 - INFO - icrawler.crawler - starting 1 parser threads...
2025-05-13 20:26:10,249 - INFO - icrawler.crawler - starting 4 downloader threads...
2025-05-13 20:26:10,549 - INFO - parser - parsing result page https://www.bing.com/images/async?q=wildflower&first=0
2025-05-13 20:26:10,958 - INFO - downloader - image #1	http://www.publicdomainpictures.net/pictures/10000/velka/flowers.jpg
2025-05-13 20:26:11,025 - ERROR - downloader - Response status code 403, file https://hgtvhome.sndimg.com/content/dam/images/hgtv/stock/2018/3/1/iStock-french-marigold-636948806.jpg
2025-05-13 20:26:11,077 - INFO - downloader - image #2	http://wallpapercave.com/wp/tPw30bf.jpg
2025-05-13 20:26:11,401 - INFO - downloader - image #3	https://images5.alphacoders.com/934/thumb-1920-934279.jpg
2025-05-13 20:26:11,427 - INFO - downloader - image #4	https://a-z-animals.com/media/2023/02/iStock-1293887522-1024x683.jpg
2025-05-13 20:26:11,729 - INFO - downloader - image #5	http://images.alphacoders.com/462/46291.jpg
2025-05-13 20:26:11,807 - INFO - downloader - downloaded images reach max num, thread downloader-001 is ready to exit
2025-05-13 20:26:11,808 - INFO - downloader - thread downloader-001 exit
2025-05-13 20:26:12,572 - INFO - downloader - downloaded images reach max num, thread downloader-002 is ready to exit
2025-05-13 20:26:12,573 - INFO - downloader - thread downloader-002 exit
2025-05-13 20:26:12,610 - INFO - downloader - downloaded images reach max num, thread downloader-004 is ready to exit
2025-05-13 20:26:12,611 - INFO - downloader - thread downloader-004 exit
2025-05-13 20:26:12,627 - INFO - downloader - downloaded images reach max num, thread downloader-003 is ready to exit
2025-05-13 20:26:12,628 - INFO - downloader - thread downloader-003 exit
2025-05-13 20:26:13,265 - INFO - icrawler.crawler - Crawling task done!
2025-05-13 20:26:13,267 - INFO - icrawler.crawler - start crawling...
2025-05-13 20:26:13,268 - INFO - icrawler.crawler - starting 1 feeder threads...
2025-05-13 20:26:13,269 - INFO - feeder - thread feeder-001 exit
2025-05-13 20:26:13,269 - INFO - icrawler.crawler - starting 1 parser threads...
2025-05-13 20:26:13,272 - INFO - icrawler.crawler - starting 4 downloader threads...
2025-05-13 20:26:13,433 - INFO - parser - downloaded image reached max num, thread parser-001 is ready to exit
2025-05-13 20:26:13,434 - INFO - parser - thread parser-001 exit
2025-05-13 20:26:13,686 - INFO - parser - parsing result page https://www.bing.com/images/async?q=architecture&first=0
2025-05-13 20:26:13,742 - INFO - parser - downloaded image reached max num, thread parser-001 is ready to exit
2025-05-13 20:26:13,744 - INFO - parser - thread parser-001 exit
2025-05-13 20:26:13,794 - ERROR - downloader - Response status code 403, file https://blog.architizer.com/wp-content/uploads/Heydar-ALiyev-Center-in-Baku_cropped.jpg
2025-05-13 20:26:13,819 - INFO - downloader - image #1	https://media.architecturaldigest.com/photos/5d3f6c8084a5790008e99f37/master/w_1600%2Cc_limit/GettyImages-1143278588.jpg
2025-05-13 20:26:13,945 - INFO - downloader - image #2	https://cdn.futura-sciences.com/buildsv6/images/wide1920/e/1/3/e13e8d94f6_50147478_dr-chau-chak-wing-building-martin-snicer-flickr.jpg
2025-05-13 20:26:13,958 - INFO - downloader - image #3	https://www.immerse.education/wp-content/uploads/2022/10/what-are-the-7-different-types-of-architecture.jpg
2025-05-13 20:26:14,132 - INFO - downloader - image #4	https://amazingarchitecture.com/storage/3820/london_timber_concert_pavilion_exploration_study_as_architects.jpg
2025-05-13 20:26:14,168 - INFO - downloader - image #5	https://media.architecturaldigest.com/photos/64526b147ae2e42df7436f42/16:9/w_2560%2Cc_limit/GettyImages-1442286876.jpg
2025-05-13 20:26:14,746 - INFO - downloader - downloaded images reach max num, thread downloader-004 is ready to exit
2025-05-13 20:26:14,747 - INFO - downloader - thread downloader-004 exit
2025-05-13 20:26:14,751 - INFO - downloader - downloaded images reach max num, thread downloader-001 is ready to exit
2025-05-13 20:26:14,752 - INFO - downloader - thread downloader-001 exit
2025-05-13 20:26:15,341 - INFO - downloader - downloaded images reach max num, thread downloader-002 is ready to exit
2025-05-13 20:26:15,342 - INFO - downloader - thread downloader-002 exit
2025-05-13 20:26:15,746 - INFO - downloader - downloaded images reach max num, thread downloader-003 is ready to exit
2025-05-13 20:26:15,746 - INFO - downloader - thread downloader-003 exit
2025-05-13 20:26:16,174 - INFO - parser - downloaded image reached max num, thread parser-001 is ready to exit
2025-05-13 20:26:16,175 - INFO - parser - thread parser-001 exit
2025-05-13 20:26:16,283 - INFO - icrawler.crawler - Crawling task done!
2025-05-13 20:26:16,284 - INFO - icrawler.crawler - start crawling...
2025-05-13 20:26:16,284 - INFO - icrawler.crawler - starting 1 feeder threads...
2025-05-13 20:26:16,284 - INFO - feeder - thread feeder-001 exit
2025-05-13 20:26:16,285 - INFO - icrawler.crawler - starting 1 parser threads...
2025-05-13 20:26:16,286 - INFO - icrawler.crawler - starting 4 downloader threads...
2025-05-13 20:26:16,655 - INFO - parser - parsing result page https://www.bing.com/images/async?q=building&first=0
2025-05-13 20:26:17,153 - INFO - downloader - image #1	https://wallpaperaccess.com/full/1802075.jpg
2025-05-13 20:26:18,031 - ERROR - downloader - Response status code 404, file https://wallpaperboat.com/wp-content/uploads/2019/11/building-02.jpg
2025-05-13 20:26:18,113 - INFO - downloader - image #2	https://images.ctfassets.net/wdjnw2prxlw8/6HRSjw4NJnvoQEKuDF9BsM/25bd19e9dfbfa9be0137096f74c454fa/glass_buildings_view_from_street_to_sky.jpg
2025-05-13 20:26:18,147 - INFO - downloader - image #3	https://i.pinimg.com/originals/ba/bd/b9/babdb9a770dd5ee06fea15d1ad2cde54.jpg
2025-05-13 20:26:18,254 - INFO - downloader - image #4	https://i.pinimg.com/originals/00/09/9e/00099e7e6776dbd8d137cc5c39905fda.jpg
2025-05-13 20:26:18,307 - INFO - downloader - image #5	https://designthoughts.org/wp-content/uploads/2022/11/Modern-comercial-building-design.jpg
2025-05-13 20:26:18,330 - INFO - downloader - downloaded images reach max num, thread downloader-002 is ready to exit
2025-05-13 20:26:18,330 - INFO - downloader - thread downloader-002 exit
2025-05-13 20:26:18,392 - INFO - downloader - downloaded images reach max num, thread downloader-004 is ready to exit
2025-05-13 20:26:18,393 - INFO - downloader - thread downloader-004 exit
2025-05-13 20:26:18,714 - INFO - downloader - downloaded images reach max num, thread downloader-001 is ready to exit
2025-05-13 20:26:18,715 - INFO - downloader - thread downloader-001 exit
2025-05-13 20:26:19,481 - INFO - downloader - downloaded images reach max num, thread downloader-003 is ready to exit
2025-05-13 20:26:19,482 - INFO - downloader - thread downloader-003 exit
2025-05-13 20:26:20,303 - INFO - icrawler.crawler - Crawling task done!

2025-05-13 20:26:20,318 - INFO - parser - downloaded image reached max num, thread parser-001 is ready to exit
2025-05-13 20:26:20,318 - INFO - parser - thread parser-001 exit

After running this code, you should have a folder structure similar to the one below.

.data
└── images
    ├── car/
    ├── flower/
    └── building/

Quality Control#

After downloading, you may want to verify and clean your raw images.

Visual inspection: glance through a subset in each class to confirm relevance.
Duplicate removal: use an image hashing tool to detect and delete near‑duplicates.
Count check: ensure each class has the intended number of images.
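The count check can be automated with a few lines of standard-library code. The sketch below assumes the .data/images/<label> layout used in this tutorial; the function name count_images_per_class and the list of accepted extensions are illustrative choices, not part of icrawler.

```python
from pathlib import Path

# Extensions treated as images (an assumption; extend as needed).
IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png', '.gif', '.bmp', '.webp'}


def count_images_per_class(root='.data/images'):
    """Return {class_name: number_of_image_files} for each subfolder of root."""
    root_path = Path(root)
    counts = {}
    if not root_path.is_dir():
        return counts
    for class_dir in sorted(root_path.iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
    return counts
```

Printing the result after scraping makes it easy to spot classes that fell short, for example because some downloads failed with 403 or 404 responses.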

Removing duplicate images#

You can use the imagehash library to compute a hash of the image and check for duplicates. The following code snippet shows a function that compares two images by checking if their hashes are too similar. You can use this function to build your own duplicate removal logic, which boils down to checking each image against all others in the same class and removing those that are too similar.

from PIL.Image import Image  # uv add pillow --OR-- pip install pillow
import imagehash             # uv add imagehash --OR-- pip install imagehash


def similar_images(image1: Image, image2: Image, threshold: int = 5) -> bool:
    """
    Compare two images using perceptual hashing.
    """
    hash_function = imagehash.phash

    hash1 = hash_function(image1)
    hash2 = hash_function(image2)
    
    diff = hash1 - hash2
    
    # If the difference is small, consider them similar.
    return diff < threshold
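One possible way to build the per-class duplicate removal described above is sketched here. It hashes every image once and then keeps an image only if it is not too close to any image already kept. The function names compute_hashes and find_near_duplicates are hypothetical, and the default threshold of 5 simply mirrors the function above; treat this as a starting point rather than a tuned implementation.

```python
from pathlib import Path


def compute_hashes(class_dir):
    """Map each readable image file in class_dir to its perceptual hash."""
    # Imported here so the comparison logic below has no hard dependencies.
    from PIL import Image
    import imagehash

    hashes = {}
    for path in sorted(Path(class_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            with Image.open(path) as img:
                hashes[path] = imagehash.phash(img)
        except OSError:
            # Skip files that Pillow cannot read as images.
            continue
    return hashes


def find_near_duplicates(hashes, threshold=5):
    """Return the keys whose hash is within threshold of an earlier, kept key.

    Hash values only need to support subtraction that yields a distance,
    which is how imagehash hashes behave.
    """
    kept, duplicates = [], []
    for key, h in hashes.items():
        if any(h - hashes[k] < threshold for k in kept):
            duplicates.append(key)
        else:
            kept.append(key)
    return duplicates
```

A typical use would be to call find_near_duplicates(compute_hashes('.data/images/car')), review the returned paths, and then delete them with path.unlink(). The comparison is quadratic in the number of images, which is acceptable for the small class sizes used here.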

Creating a Dataset#

Once you have your images, you can create a dataset. The format expected by most libraries (e.g., PyTorch) is a folder structure where each class has its own subfolder, with the images inside. This is consistent with the folder structure created during scraping. Refer to the Dataset From Files tutorial for more details on how to load the images into a dataset object, split them into training and validation sets, and apply any necessary transformations.
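To make the folder convention concrete, the sketch below lists (path, class index) pairs from the scraped folders and splits them into training and validation sets using only the standard library. The names list_samples and train_val_split are hypothetical, and the 20% validation fraction is an arbitrary default; a dataset class from your deep learning library would normally replace this.

```python
import random
from pathlib import Path


def list_samples(root='.data/images'):
    """Return ([(path, class_index), ...], class_names) for a class-per-folder layout."""
    root_path = Path(root)
    # Sorted class names give a stable label-to-index mapping.
    classes = sorted(d.name for d in root_path.iterdir() if d.is_dir())
    samples = [
        (path, index)
        for index, name in enumerate(classes)
        for path in sorted((root_path / name).iterdir())
        if path.is_file()
    ]
    return samples, classes


def train_val_split(samples, val_fraction=0.2, seed=0):
    """Shuffle a copy of samples reproducibly and split off a validation set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]
```

Fixing the seed keeps the split identical between runs, so validation results stay comparable while you iterate on the model.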