This project presents a systematic evaluation of deep learning architectures for sign language sentence recognition using the large-scale How2Sign dataset. The primary objective is to investigate and quantify the performance progression from a simple keypoint-based baseline to a sophisticated, hyperparameter-optimized video-based model. All experiments focus exclusively on frontal-view RGB video clips.
<hr>
The investigation follows an iterative, three-phase approach to model development, featuring a two-stage training strategy for the advanced video-based models.
A baseline was established using pre-processed 2D pose estimation keypoints to test the efficacy of abstract geometric data.
- Architecture: A stacked LSTM network (sketched in Keras below): `LSTM(64) -> Dropout(0.5) -> LSTM(64) -> Dropout(0.5) -> Dense(32)`.
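The architecture above, expressed as a minimal Keras sketch. The 30-frame, 274-feature input shape matches the preprocessing described later in this document; the `num_classes` value, the ReLU activation on `Dense(32)`, the final softmax layer, and the Adam optimizer are assumptions rather than settings confirmed by the notebooks.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, FEAT_DIM = 30, 274   # 30 padded frames, 274 keypoint coordinates per frame
num_classes = 100             # placeholder: number of target sentence classes

baseline = models.Sequential([
    layers.Input(shape=(SEQ_LEN, FEAT_DIM)),
    layers.LSTM(64, return_sequences=True),          # first LSTM passes the full sequence on
    layers.Dropout(0.5),
    layers.LSTM(64),                                  # second LSTM returns only the final state
    layers.Dropout(0.5),
    layers.Dense(32, activation="relu"),              # activation assumed
    layers.Dense(num_classes, activation="softmax"),  # classification head assumed
])
baseline.compile(optimizer="adam",
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
```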
To leverage richer visual information, an advanced model was built to process raw video frames. This architecture combines a CNN for spatial feature extraction with an LSTM for temporal modeling.
- CNN Base: A pre-trained MobileNetV2 with frozen weights.
- Training Strategy: A two-stage process was employed:
- Feature Extraction Training: Initially, only the LSTM and Dense classification layers were trained, allowing the new layers to learn to interpret the general-purpose features produced by the frozen CNN base.
- Fine-Tuning: Once initial training was stable, the top layers of the MobileNetV2 base were unfrozen and the entire model was trained for a few additional epochs at a very low learning rate, letting the feature extractor adapt to the specifics of sign language data.
- Manual Hyperparameters: This phase used a manually tuned configuration of `LSTM(128)` and `Dropout(0.5)` (see the model and two-stage training sketch after this list).
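A sketch of the CNN-LSTM and its two-stage training, assuming a `TimeDistributed` MobileNetV2 base applied to 30 frames of 64x64 RGB input (frames already scaled with `mobilenet_v2.preprocess_input`). The stage-specific learning rates, the number of unfrozen top layers, the epoch counts, and the `train_ds`/`val_ds` dataset objects are illustrative placeholders rather than values taken from the notebooks.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, H, W, C = 30, 64, 64, 3
num_classes = 100  # placeholder

# Frozen MobileNetV2 feature extractor, applied to every frame via TimeDistributed.
base = tf.keras.applications.MobileNetV2(
    input_shape=(H, W, C), include_top=False, weights="imagenet", pooling="avg")
base.trainable = False

inputs = layers.Input(shape=(SEQ_LEN, H, W, C))
x = layers.TimeDistributed(base)(inputs)   # per-frame spatial features: (batch, 30, 1280)
x = layers.LSTM(128)(x)                    # temporal modeling (manual configuration)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)
model = models.Model(inputs, outputs)

# Stage 1: feature-extraction training — only the LSTM/Dense layers are trainable.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),  # learning rate assumed
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)

# Stage 2: fine-tuning — unfreeze the top of the base and retrain at a very low rate.
base.trainable = True
for layer in base.layers[:-30]:            # number of unfrozen layers is an assumption
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),  # very low learning rate
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```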
To systematically discover an optimal configuration for the CNN-LSTM, Bayesian Optimization was performed using the Optuna framework. The search space was defined as:
- `learning_rate`: log-uniform distribution from `1e-5` to `1e-3`.
- `lstm_units`: integer from `64` to `256`.
- `dropout_rate`: uniform distribution from `0.2` to `0.5`.
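This search space maps directly onto an Optuna objective, sketched below. `build_and_train` is a hypothetical helper standing in for the notebook's model-building and training code (it should return the validation metric to maximize), and the trial count is an assumption.

```python
import optuna

def objective(trial):
    # Sample one candidate configuration from the search space defined above.
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    lstm_units = trial.suggest_int("lstm_units", 64, 256)
    dropout_rate = trial.suggest_float("dropout_rate", 0.2, 0.5)

    # Hypothetical helper: builds and trains the CNN-LSTM with these values
    # and returns its validation accuracy.
    return build_and_train(learning_rate, lstm_units, dropout_rate)

study = optuna.create_study(direction="maximize")  # Optuna's default TPE sampler (Bayesian)
study.optimize(objective, n_trials=30)             # trial count is an assumption
print(study.best_params)
```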
<hr>
Several key preprocessing decisions were made to ensure model compatibility and computational efficiency.
- Sequence Standardization: All video clips were standardized to a fixed length of 30 frames: shorter clips were zero-padded and longer clips were truncated, since fixed-length sequences are required for batched training of the recurrent models (see the sketch after this list).
- Frame Resolution: All video frames were downsampled to 64x64 pixels using OpenCV. This drastically reduces the computational load while retaining essential visual features.
- Keypoint Feature Vector: For the baseline model, a 274-dimensional feature vector was engineered for each frame by concatenating the (X, Y) coordinates from pose, face, and hand keypoints, discarding confidence scores.
- CNN Feature Extraction (Transfer Learning): For the advanced models, the pre-trained MobileNetV2 (without its top layer) was used as a frozen feature extractor. This leverages existing knowledge of visual patterns, significantly accelerating training and improving performance.
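A minimal sketch of the frame loading, downsampling, and sequence-standardization steps described above, assuming OpenCV and NumPy; the helper names are illustrative, and clips are assumed to contain at least one frame.

```python
import cv2
import numpy as np

SEQ_LEN = 30           # fixed number of frames per clip
FRAME_SIZE = (64, 64)  # target resolution for the video-based models

def load_frames(video_path, frame_size=FRAME_SIZE):
    """Read a clip with OpenCV and downsample every frame to frame_size."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, frame_size))
    cap.release()
    return np.asarray(frames, dtype=np.float32)

def standardize_sequence(frames, seq_len=SEQ_LEN):
    """Truncate longer clips and zero-pad shorter ones to exactly seq_len frames."""
    frames = frames[:seq_len]
    if len(frames) < seq_len:
        pad = np.zeros((seq_len - len(frames), *frames.shape[1:]), dtype=frames.dtype)
        frames = np.concatenate([frames, pad], axis=0)
    return frames
```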
<hr>
- Dataset: How2Sign (official train/validation/test splits).
- Hardware: NVIDIA A100 GPU.
- Frameworks: Google Colab, TensorFlow, Keras, Optuna.
<hr>
The project is organized into the following key notebooks, which correspond to the experimental phases. Files prefixed with `0_` are utility and pipeline-verification ("smoke test") scripts.
- `1_1_Train_Baseline_LSTM.ipynb`: Implements and trains the baseline LSTM model.
- `2_2_Train_Manual_CNN_LSTM.ipynb`: Implements and trains the manually tuned CNN-LSTM model.
- `3_1_Optimize_Hyperparameters_Optuna.ipynb`: Executes the Optuna hyperparameter search.
- `3_2_Train_Final_Optimized_Model.ipynb`: Trains and evaluates the final model using the best-found hyperparameters.