The objective of the project is to detect humans in a given YouTube video (e.g. the Dior - Eau de Parfum commercial) by drawing bounding boxes around them on each frame.
The retained solution uses the ImageAI library and specifically its video detection class. This library enables quick usage of several pre-trained Deep Learning models for object detection such as RetinaNet, which is found to perform best for this task (notably better than YOLOv3 and its lightweight variant tiny-YOLOv3). Note that such pre-trained models are released by ImageAI at https://github.com/OlafenwaMoses/ImageAI/releases.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. Running it will generate a copy of the "Dior - Eau de Parfum" video in which detected humans are annotated on each frame.
There are two options available to get started:
- Recommended: use Anaconda 3 and follow Installing with conda
- Alternatively, install Python 3.7.6 and pip in a virtual environment and follow Installing with pip
In both cases, the installation steps must be run from the root of a local copy of this repository:
git clone https://github.com/pauldmk/human_detection_video.git
cd human_detection_video
Create an environment with all requirements:
conda env create -f requirement.yml
Activate this environment:
conda activate video_detection_aive
Activate the previously created virtual environment and install requirements using pip:
pip install -r requirements.txt
The code is run with a single command:
python src/video_detection.py
It performs the following steps:
- Downloads the pre-trained model for object detection (RetinaNet with a ResNet50 backbone by default).
- Downloads a local copy of the video.
- Performs object detection, using the GPU if the machine has a CUDA-enabled GPU available (otherwise it runs on the CPU).
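The steps above can be sketched with ImageAI's VideoObjectDetection class. This is only an illustrative sketch, not the exact contents of src/video_detection.py: the file paths, video names, and the frame rate below are assumptions.

```python
# Sketch of the detection pipeline, assuming ImageAI is installed and the
# RetinaNet weights and the input video have already been downloaded locally.
# MODEL_PATH, INPUT_VIDEO and OUTPUT_VIDEO are hypothetical names.
MODEL_PATH = "models/resnet50_coco.h5"
INPUT_VIDEO = "data/dior_eau_de_parfum.mp4"
OUTPUT_VIDEO = "data/dior_eau_de_parfum_detected"  # ImageAI appends the extension

def annotate_humans():
    # Import deferred so the module can be loaded without ImageAI installed.
    from imageai.Detection import VideoObjectDetection

    detector = VideoObjectDetection()
    detector.setModelTypeAsRetinaNet()
    detector.setModelPath(MODEL_PATH)
    detector.loadModel()

    # Restrict detection to the "person" class so only humans are annotated.
    people_only = detector.CustomObjects(person=True)
    detector.detectCustomObjectsFromVideo(
        custom_objects=people_only,
        input_file_path=INPUT_VIDEO,
        output_file_path=OUTPUT_VIDEO,
        frames_per_second=25,               # assumed frame rate
        minimum_percentage_probability=60,  # detection threshold, in percent
    )

if __name__ == "__main__":
    annotate_humans()
```

Restricting to the `person` class avoids drawing boxes around the other COCO classes (cars, bags, etc.) that the pre-trained model can also detect.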
RetinaNet with a ResNet50 backbone is found to perform best, and annotates the full video in about an hour on a basic CPU.
Several detection thresholds were tried. A threshold of 60% gives visually satisfying results, but the best value is highly use-case dependent. Even with this manually tuned threshold, some frames feature false positives, as well as false negatives under challenging conditions (unusual human posture, hidden body parts, distant shots) that a lower detection threshold would catch. On the upside, the annotation is of high overall quality, and even outperforms my human eye on some frames (e.g. at 0:17, a blurred person in the background).
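The precision/recall trade-off behind the threshold choice can be illustrated with a toy filter over hypothetical per-frame detections (the labels and confidence values below are made up for illustration):

```python
# Hypothetical detections for one frame: (label, confidence in percent).
detections = [
    ("person", 92.0),
    ("person", 61.5),
    ("person", 43.0),   # e.g. a distant, partially hidden person
    ("car", 70.0),
]

def keep_people(dets, min_probability):
    """Keep only 'person' detections at or above the given threshold."""
    return [(label, conf) for label, conf in dets
            if label == "person" and conf >= min_probability]

# Stricter threshold: the hard case at 43% is missed (false negative).
print(keep_people(detections, 60))  # → [('person', 92.0), ('person', 61.5)]
# Looser threshold: it is caught, at the cost of more false positives overall.
print(keep_people(detections, 40))
```

Lowering `min_probability` recovers difficult detections but lets more spurious boxes through, which is why the 60% value was tuned visually rather than fixed in advance.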