Traffic Sign Narration System
A Holistic Approach to Driver Safety
Developed as my UCL Capstone project, this system leverages machine learning and advanced image processing to detect, interpret and narrate traffic signs in real time.
1.0 Introduction
1.1 Problem Definition
The skills needed to drive motorised vehicles are acquired through experience and situational learning...

Figure 1: UK Driving Statistics for 2022
This is concerning given the growing amount of roadside advertising and other environmental clutter; studies indicate that young drivers pay more attention to roadside advertising than to Highway Code signs. Technological breakthroughs such as Tesla's Autopilot show that driver-awareness technology is proven and available in the automotive industry. Nevertheless, given the high cost of such vehicles and evidence that young drivers are substantially more likely to buy used vehicles, this technology is far from reaching the demographics that need it most.
1.2 Aims & Objectives
The aim of this project is to deliver a low-cost solution that improves driver awareness by providing early audio alerts for oncoming traffic signs. The objective is to build a machine learning algorithm that recognises, interprets and narrates the meaning of essential traffic signs in real time.
2.0 Literature Review
A comprehensive review of current methodologies was undertaken. The literature explored traditional techniques such as Colour Recognition and Shape Recognition, and compared these with Deep Learning approaches. While Colour and Shape Recognition offer simplicity, they are prone to errors in adverse conditions. In contrast, Deep Learning, particularly using Convolutional Neural Networks, provides robust performance in diverse real-time scenarios.
3.0 Methodology

Figure 2: Training and Validation Flowchart

Figure 3: Final Narration Algorithm Flowchart
3.1 Dataset Selection
The dataset chosen must contain a sufficient number of images, be globally distributed and include at least 59 regulatory/warning traffic sign classes to meet the project criteria. This project was constrained to using open-source traffic sign datasets. The Mapillary Traffic Sign Dataset (MTSD) was the most suitable, comprising over 320,000 labelled traffic signs across 52,000 fully annotated and 48,000 partially annotated images. Although the dataset contains 401 classes, over 65% of the signs are labelled as "other" (a miscellaneous category); these were removed to improve accuracy.
The global distribution of images is illustrated in Figure 4 below.

Figure 4: Distribution of MTSD Images
3.2 Baseline Performance
YOLOv5 was selected as the training algorithm due to its fast prediction times for real-time detection. The MTSD was divided into three subsets—65% for training, 10% for validation, and 25% for testing—with annotations converted to a YOLOv5-compatible format. Baseline training ran for 1400 epochs, achieving a maximum mAP of 23.2% and an F1 score of 0.32. The loss functions for bounding box, object, and class predictions indicated the necessity for further dataset preprocessing.
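As a rough illustration of the annotation conversion, the sketch below converts one MTSD-style JSON annotation into a YOLOv5 label file. The JSON key names (width, height, objects, bbox, label) are assumptions about the annotation layout rather than a confirmed specification.

```python
import json
from pathlib import Path

def mtsd_to_yolo(annotation_path, class_ids, out_dir):
    """Convert one MTSD-style JSON annotation to a YOLOv5 label file.

    YOLOv5 expects one line per object:
        <class_id> <x_centre> <y_centre> <width> <height>
    with all coordinates normalised to the range 0-1.
    """
    ann = json.loads(Path(annotation_path).read_text())
    img_w, img_h = ann["width"], ann["height"]          # image size in pixels
    lines = []
    for obj in ann["objects"]:
        label = obj["label"]
        if label not in class_ids:                      # skip classes not kept for training
            continue
        box = obj["bbox"]                               # pixel corner coordinates
        x_c = (box["xmin"] + box["xmax"]) / 2 / img_w   # normalised centre x
        y_c = (box["ymin"] + box["ymax"]) / 2 / img_h   # normalised centre y
        w = (box["xmax"] - box["xmin"]) / img_w         # normalised width
        h = (box["ymax"] - box["ymin"]) / img_h         # normalised height
        lines.append(f"{class_ids[label]} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
    out_file = Path(out_dir) / (Path(annotation_path).stem + ".txt")
    out_file.write_text("\n".join(lines))
```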
3.3 Class Reduction
3.3.1 Removing Unnecessary Classes
To notify drivers of only essential information, non-critical classes (such as informational or complementary signs) were removed. The MTSD uses a naming convention ("Sign Category-Meaning-Location Variant") that allowed systematic identification and removal of unnecessary classes. A manual review further eliminated signs irrelevant to the primary user group. In total, 237 classes were removed, leaving 164 classes.
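To show how the naming convention supports systematic filtering, the sketch below keeps only regulatory and warning signs. The separator and category strings are assumptions for illustration, not the project's actual keep-list.

```python
# Illustrative filter based on the "Sign Category-Meaning-Location Variant" convention.
# The "--" separator and category names below are assumptions for demonstration.
KEEP_CATEGORIES = {"regulatory", "warning"}

def is_essential(label: str) -> bool:
    """Return True if a class label belongs to a category worth narrating."""
    category = label.split("--")[0]          # e.g. "regulatory--stop--g1" -> "regulatory"
    return category in KEEP_CATEGORIES

all_classes = ["regulatory--stop--g1", "information--parking--g2", "warning--curve-left--g1"]
essential = [c for c in all_classes if is_essential(c)]   # drops the information sign
```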
3.3.2 Grouping Regional Labels
To prevent confusion from similar signs with regional variants, classes were grouped together. This approach reduced the total number of classes to 93 and achieved a more balanced dataset distribution.
3.3.3 Removal of Low Instance Classes
Finally, to address imbalances, classes with fewer than 100 instances were removed, reducing the dataset to 63 classes. This step not only improved training efficiency but also reduced potential bias.
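A compact sketch of the grouping and low-instance filtering is shown below. The 100-instance cutoff comes from the text; the grouping rule (dropping the trailing regional-variant token) is an assumption about how similar signs were merged.

```python
from collections import Counter

def group_label(label: str) -> str:
    """Merge regional variants by dropping the trailing variant token,
    e.g. "warning--curve-left--g1" and "warning--curve-left--g2"
    both map to "warning--curve-left"."""
    return "--".join(label.split("--")[:2])

def reduce_classes(instance_labels, min_instances=100):
    """Group regional variants, then drop any class with too few instances.

    instance_labels: one label per annotated sign in the training split.
    Returns the set of kept classes and the per-class instance counts.
    """
    grouped = [group_label(lbl) for lbl in instance_labels]
    counts = Counter(grouped)
    kept = {cls for cls, n in counts.items() if n >= min_instances}
    return kept, counts
```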

Figure 5: Distribution after Class Reduction
3.4 Image Resizing and Cropping
YOLOv5 requires images to be resized to a 640x640-pixel square. However, the original MTSD images average 3497 pixels in width and 2442 pixels in height, with significant variance. To address this, a hybrid cropping algorithm was developed. The algorithm first uses bounding boxes to define initial crop areas, restores 20% of the original dimensions if the crop is too small, and adds a 50-pixel tolerance to ensure no critical features are omitted. The images below illustrate the improvements before and after cropping.

Figure 6: Boxplots Before Cropping

Figure 7: Comparison of Cropping Techniques
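As a rough illustration of the hybrid cropping logic described above, the sketch below computes a crop window from the sign bounding boxes. The minimum-size rule and the order in which padding is applied are approximations of the described algorithm, not its exact implementation.

```python
def hybrid_crop(img_w, img_h, boxes, tol=50, min_frac=0.20):
    """Compute a crop window around the annotated signs.

    boxes: list of (xmin, ymin, xmax, ymax) in pixels.
    The crop starts as the union of all sign boxes plus a 50-pixel tolerance;
    if it ends up smaller than 20% of the original image in either dimension,
    it is grown back to that minimum size.
    """
    xmin = min(b[0] for b in boxes) - tol
    ymin = min(b[1] for b in boxes) - tol
    xmax = max(b[2] for b in boxes) + tol
    ymax = max(b[3] for b in boxes) + tol

    # Restore a minimum of 20% of the original width/height if the crop is too small.
    min_w, min_h = min_frac * img_w, min_frac * img_h
    if xmax - xmin < min_w:
        centre = (xmin + xmax) / 2
        xmin, xmax = centre - min_w / 2, centre + min_w / 2
    if ymax - ymin < min_h:
        centre = (ymin + ymax) / 2
        ymin, ymax = centre - min_h / 2, centre + min_h / 2

    # Clamp to the image bounds before returning integer pixel coordinates.
    xmin, ymin = max(0, int(xmin)), max(0, int(ymin))
    xmax, ymax = min(img_w, int(xmax)), min(img_h, int(ymax))
    return xmin, ymin, xmax, ymax
```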
3.5 Background Elimination
Post-cropping analysis revealed the presence of irrelevant background elements like sky, trees, and roads. To reduce this noise, a background elimination method using RGB thresholding was applied. The process involved eroding, dilating, and Gaussian blurring to enhance edge boundaries, followed by defining colour thresholds for red, blue, yellow, black, and white. The image below demonstrates the effect of this method.

Figure 8: RGB Thresholding Effect
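A minimal OpenCV sketch of this style of background elimination follows. The BGR threshold ranges and kernel sizes are placeholders rather than the project's tuned settings.

```python
import cv2
import numpy as np

def eliminate_background(bgr_image):
    """Suppress background pixels, keeping only traffic-sign colours."""
    kernel = np.ones((3, 3), np.uint8)
    img = cv2.erode(bgr_image, kernel, iterations=1)    # thin out small noise
    img = cv2.dilate(img, kernel, iterations=1)         # restore sign edges
    img = cv2.GaussianBlur(img, (5, 5), 0)              # smooth edge boundaries

    # Illustrative BGR thresholds for red, blue, yellow, black and white regions.
    ranges = [
        ((0, 0, 100), (80, 80, 255)),        # red
        ((100, 0, 0), (255, 80, 80)),        # blue
        ((0, 100, 100), (80, 255, 255)),     # yellow
        ((0, 0, 0), (50, 50, 50)),           # black
        ((200, 200, 200), (255, 255, 255)),  # white
    ]
    mask = np.zeros(img.shape[:2], np.uint8)
    for lo, hi in ranges:
        mask = cv2.bitwise_or(mask, cv2.inRange(img, np.array(lo), np.array(hi)))

    # Keep only pixels that fall inside one of the sign-colour ranges.
    return cv2.bitwise_and(bgr_image, bgr_image, mask=mask)
```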
3.6 Hyperparameter Optimisation
YOLOv5 provides 27 tuneable hyperparameters, which are critical to model performance. Manual tuning was impractical, so a genetic algorithm was implemented to optimise these parameters. The process involved selecting an initial set of hyperparameters, generating 10 variations per generation, training each for 10 epochs, and averaging the top 5 performers to form a new parent set. This iterative method continued until convergence, ultimately improving the mAP to 83.43%. The image below visualises this hyperparameter evolution process.

Figure 9: Hyperparameter Evolution Flowchart
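To make the evolution loop concrete, the standalone sketch below follows the strategy outlined above: mutate the parent, evaluate each child briefly, and average the top performers into the next parent. Here train_and_score is a placeholder for a short YOLOv5 training run that returns its mAP, not a real API call, and the fixed generation count stands in for the convergence check.

```python
import random

def mutate(parent, sigma=0.2):
    """Randomly perturb each hyperparameter by up to +/- sigma (relative)."""
    return {k: v * (1 + random.uniform(-sigma, sigma)) for k, v in parent.items()}

def evolve(parent, train_and_score, generations=20, children=10, top_k=5, epochs=10):
    """Genetic-style search: mutate, evaluate briefly, average the best performers."""
    for _ in range(generations):                      # in practice: loop until convergence
        population = [mutate(parent) for _ in range(children)]
        scored = sorted(population,
                        key=lambda hp: train_and_score(hp, epochs),
                        reverse=True)
        best = scored[:top_k]
        # New parent = element-wise mean of the top performers.
        parent = {k: sum(hp[k] for hp in best) / top_k for k in parent}
    return parent
```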
3.7 Narration Algorithm
The final stage integrated the detection model with an audio narration algorithm to create a fully functional prototype. The algorithm implements the following steps, sketched in code after the list:
- Exclude predictions below an 80% confidence threshold.
- Prioritise the sign with the highest confidence when multiple signs are detected.
- Prevent repeated narration by suppressing alerts for the same sign within a 10-second interval.
These measures ensure that the audio notifications are clear and do not distract the driver.
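The sketch below applies the three rules above; the confidence threshold and repeat window come from the text, while the use of pyttsx3 for text-to-speech and the helper names are assumptions rather than the prototype's actual components.

```python
import time
import pyttsx3  # offline text-to-speech; an assumed narration back-end

CONFIDENCE_THRESHOLD = 0.80   # exclude weak predictions
REPEAT_WINDOW = 10.0          # seconds before the same sign may be narrated again

engine = pyttsx3.init()
last_spoken = {}              # sign class -> time it was last narrated

def narrate(detections, meanings):
    """detections: list of (class_name, confidence); meanings: class -> spoken phrase."""
    strong = [d for d in detections if d[1] >= CONFIDENCE_THRESHOLD]
    if not strong:
        return
    sign, _conf = max(strong, key=lambda d: d[1])         # highest-confidence sign wins
    now = time.time()
    if now - last_spoken.get(sign, 0.0) < REPEAT_WINDOW:  # suppress repeated alerts
        return
    last_spoken[sign] = now
    engine.say(meanings.get(sign, sign))
    engine.runAndWait()
```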
4.0 Results and Discussion
4.1 Baseline Results
Baseline training produced an mAP of 23.2% and an F1 score of 0.32. The loss functions indicated that the model was prone to overfitting and that further preprocessing was required.
4.2 Class Reduction Outcomes
After class reduction, the training mAP improved to 34.07% and the F1 score reached 0.412. The process also resulted in a significant reduction in training time.
4.3 Impact of Image Cropping
The hybrid cropping algorithm yielded marked improvements with a training mAP of 76.1% and an F1 score of 0.75. Loss values dropped to within state-of-the-art thresholds.
4.4 Background Elimination Results
Although background elimination improved the loss metrics, it negatively affected the mAP by removing some essential sign details.
4.5 Hyperparameter Optimisation Results
The genetic algorithm reduced variations in hyperparameters, with the Anchor Threshold showing the greatest impact. The mAP was improved to 83.43% through this iterative process.
4.6 Final Training
Retraining the model from scratch using the optimised hyperparameters resulted in a final mAP of 84.4% and an F1 score of 0.81. Loss values for bounding box, object and class predictions were all within acceptable limits.

Figure 14: Final Training Results – Graphs of mAP and F1 scores across epochs.
5.0 Model Testing
5.1 Robustness Testing
The model was tested for robustness under varying conditions.
5.1.1 Obstruction Test
The model correctly identified traffic signs with up to 50% obstruction.

Figure 10: Obstruction Test – Traffic sign detection under 50% obstruction.
5.1.2 Brightness Test
Reducing brightness by up to 90% decreased background noise and improved detection accuracy.

Figure 11: Brightness Test – Detection performance under 90% brightness reduction.
5.1.3 Weather Test
The model demonstrated resilience in fog, rain and snowy conditions, although confidence levels varied.

Figure 12: Weather Test – Detection performance under rainy conditions.
5.2 Large Dataset Testing
A large, partially annotated dataset was used to further verify model performance, with results visualised through a binary confusion matrix.

Figure 13: Binary Confusion Matrix after sample dataset testing.
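As an illustration, a binary "sign detected vs. sign present" confusion matrix can be tallied per image as follows. The matching criterion (any prediction above the confidence threshold on an image that contains a sign) is a simplification of the actual evaluation.

```python
def binary_confusion(samples, threshold=0.80):
    """samples: iterable of (has_sign, max_confidence) per image.
    Returns (true_pos, false_pos, false_neg, true_neg)."""
    tp = fp = fn = tn = 0
    for has_sign, conf in samples:
        detected = conf >= threshold
        if has_sign and detected:
            tp += 1
        elif not has_sign and detected:
            fp += 1
        elif has_sign and not detected:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn
```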
6.0 Deployment of Prototype
The final stage involved integrating the detection model with the narration algorithm. The prototype applies a confidence threshold, prioritises the most critical sign and suppresses duplicate notifications within a 10-second window. Testing with a laptop webcam and simulated GPU support yielded an estimated 15.5 FPS, making the system suitable for real-time applications.
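A stripped-down sketch of the real-time loop is shown below, assuming the trained weights are loaded through the standard torch.hub YOLOv5 interface and reusing the narrate() helper sketched in Section 3.7. The weights path and FPS bookkeeping are illustrative, not the prototype's exact setup.

```python
import time
import cv2
import torch

# Load the custom-trained weights; the path is an assumption.
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")

cap = cv2.VideoCapture(0)              # laptop webcam
frames, start = 0, time.time()
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame[..., ::-1])  # convert BGR frame to RGB for the model
    # Each prediction row holds xmin, ymin, xmax, ymax, confidence, class, name.
    detections = [(row["name"], row["confidence"])
                  for _, row in results.pandas().xyxy[0].iterrows()]
    narrate(detections, meanings={})   # narrate() from the Section 3.7 sketch
    frames += 1

fps = frames / (time.time() - start)   # rough throughput estimate
cap.release()
```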
7.0 Conclusion
Mean Average Precision: 84.4%
F1 Score: 0.81
This project successfully developed a traffic sign narration system that enhances driver safety by delivering a state-of-the-art detection model. Through rigorous dataset modification, advanced image processing and hyperparameter tuning, the final model achieved a Mean Average Precision of 84.4% and an F1 score of 0.81. Robust testing and efficient prototype deployment demonstrate the system's strong potential to significantly improve driver awareness and contribute to safer roads.
Full Research Report
Want a more detailed look? Here is my full research report: