How to detect objects more effectively with YOLO?

Introduction: Object Detection, a Revolution at the Heart of Computer Vision

Object detection is more than just a technological feat; it's the cornerstone of modern computer vision. By merging localization and classification, it allows for the precise identification and positioning of elements within images or video streams. This capability has become indispensable across a multitude of sectors: public safety, industrial automation, medical diagnostics, advanced robotics, and even for the autonomous cars navigating our roads with increasing confidence. Imagine accelerated medical diagnoses, intelligent production lines, and enhanced security. All this, and much more, is made possible by object detection. In this article, we will delve deep into one of the most iconic algorithms in this field: YOLO.

Image showing a prediction with the YOLO model on a photo of Zinedine Zidane and Carlo Ancelotti during the Champions League final. Both individuals are framed and labeled with confidence scores, indicating "person 0.88" for Zinedine Zidane and "person 0.85" for Carlo Ancelotti. Zinedine Zidane's tie is also detected and labeled with a confidence score of "tie 0.61".

YOLO (You Only Look Once): Innovation for Performance

YOLO, an acronym for "You Only Look Once," is an object detection algorithm that marked a breakthrough. Its distinctive feature? Analyzing images in a single pass, which drastically reduces processing time without sacrificing performance. This combination of efficiency and accuracy has propelled YOLO to the ranks of reference models in computer vision. With constant evolution driven by an active community and players like Ultralytics, YOLO continues to push boundaries, as evidenced by research into advanced versions like YOLOv12.

From Redmon's Initiative to Ultralytics' Impetus: The History of YOLO Maintenance

YOLO's evolution is closely tied to its successive leading figures. Initially championed by Joseph Redmon, creator of YOLOv1 to YOLOv3 under the Darknet framework, the project took a major turn when Redmon withdrew from computer vision research in early 2020 for ethical reasons. This withdrawal didn't spell the end for YOLO, but rather the beginning of a new era of diverse contributions. While researchers like Alexey Bochkovskiy and his team brilliantly ensured continuity with YOLOv4, extending the Darknet legacy, the ecosystem was ripe for an approach focused on greater accessibility. It was in this context that Ultralytics, under the leadership of Glenn Jocher, truly catalyzed the democratization of YOLO. By re-implementing and extending YOLO's fundamental concepts natively in PyTorch with models like YOLOv5, then YOLOv8, Ultralytics not only delivered cutting-edge performance but, more importantly, drastically lowered the barrier to entry. Their framework simplified the training, customization, and deployment of YOLO models, making them accessible to a global community of developers, researchers, and businesses.

At the Heart of YOLO: A Balance Between Speed and Visual Intelligence

In computer vision, interpreting and locating objects in images rely on sophisticated algorithms. Two main approaches have long coexisted, each with its strengths and weaknesses.

CNNs (Convolutional Neural Networks): Fast and Efficient Analysis

CNNs excel in their speed of analysis. They operate by applying filters to the image to progressively identify simple elements (edges, shapes, etc.) and then more complex combinations (eyes, wheels, etc.). This method, highly efficient and fast due to optimized calculations, forms the basis of many vision algorithms, including early versions of YOLO. However, CNNs can struggle to grasp complex relationships between distant objects or the overall context of an image as accurately as other techniques. For example, to recognize a "soccer ball," a CNN might find it difficult to link it to the presence of a soccer goal located on the other side of the image.

Diagram illustrating the architecture of Convolutional Neural Networks (CNNs). The diagram shows the processing steps of an input image through convolutional, pooling, and fully connected layers, leading to an output. The steps are labeled as "Feature Extraction" for convolution and pooling, and "Classification" for the fully connected layers.

Transformers and the Attention Mechanism: A New Era for Visual Interpretation

More recently, Transformers, with their attention mechanism, have generated great interest. Attention allows the algorithm to assess the relative importance of different parts of an image (or text) to each other. This gives it the ability to connect distant information and focus on what's essential to interpret complex scenes with high accuracy. Revisiting the "soccer ball" example, a Transformer can give more importance to the presence of a soccer goal located on the other side of the image. The major drawback of Transformers is their high demand for computing power, often making them slower than CNNs, which can be problematic for real-time applications. Among Transformer models, DETR is currently one of the best computer vision models. In comparison, the medium version of the YOLOv12 series is 3 times smaller than DETR.

Diagram illustrating the architecture of the Vision Transformer (ViT). The diagram shows the process of an input image through "Patch + Position embedding" stages, a linear projection of flattened patches, a transformer encoder, and an MLP head for classification. The detail of the transformer encoder is also shown, including normalization layers, MLP layers, and multi-head attention mechanisms.

YOLO: A Successful Marriage of CNNs and Attention Mechanism

YOLO's creators sought to combine the advantages of these two approaches. While retaining the efficient CNN architecture for speed, they integrated lighter, optimized attention mechanisms. The goal is not to systematically use the "heavy" attention of Transformers, but to apply it in a targeted and ingenious way. For instance, thanks to the Localized Attention (i.e., "Area Attention") architecture, YOLO learns to focus its attention on specific areas of the image by segmenting it and identifying regions that require more in-depth analysis, without having to calculate relationships between all pixels. Other technical optimizations, such as "Flash Attention" or network simplification, help make the use of attention compatible with YOLO's characteristic speed.

In summary, YOLO strikes a clever compromise by leveraging the speed of CNNs for global analysis and integrating elements of contextual intelligence inspired by Transformers. This allows it to improve detection accuracy in complex scenes without sacrificing its speed, much like a fast athlete who also develops observational and strategic skills.

YOLO: An Algorithm with Unlimited Concrete Applications

YOLO's versatility and speed have opened doors in numerous sectors, transforming complex challenges into tangible solutions.

Industry and Quality Control

In the manufacturing sector, YOLO automates quality control by spotting defects on production lines, improving efficiency, and reducing costs.

Image showing an industrial production line with plastic bottles on a conveyor. The bottles are framed and labeled, with one bottle marked as "DEFECTED BOTTLE" in red and others as "BOTTLE" in green.

Autonomous Vehicles

Autonomous driving demands flawless environmental perception. YOLO detects pedestrians, vehicles, signage, and obstacles, ensuring safer and more reliable navigation.

Image showing a traffic scene with detected and framed vehicles. A bus, cars, and a truck are visible on a road with colored lane markings. The vehicles are framed with detection boxes.

Surveillance and Security

YOLO revolutionizes surveillance by analyzing video streams to identify suspicious behavior, intrusions, or abandoned objects, enhancing the security of property and people.

Precision Agriculture

In agriculture, YOLO helps identify plant diseases, detect weeds, or monitor livestock, optimizing yields and promoting sustainable farming.

Image showing a plant leaf with identified and framed areas of bacterial spots. The bacterial spot areas are labeled with confidence percentages, indicating "Bacterial Spot" with various scores. A healthy part of the leaf is also framed and labeled "Healthy Part 0.83".

Health and Medical Diagnosis

Speed and accuracy are vital in medicine. YOLO assists practitioners by detecting anomalies in medical images (X-rays, MRIs), such as tumors or fractures, for quicker diagnoses.

Medical image showing an MRI scan of the brain with identified and framed tumor areas. The tumor areas are labeled with confidence percentages, indicating "Tumor 100%" and "Tumor 98%".

Humanitarian Aid and Crisis Management

During disasters, a rapid damage assessment is crucial. YOLO, by analyzing aerial or satellite imagery (e.g., for rural housing censuses in disadvantaged countries), helps coordinate relief efforts and efficiently deliver aid. This is one of the projects led by Aqsone Lab, participating in the Disaster Vulnerability Challenge proposed by the Zindi platform.

Satellite image showing a rural area in Malawi with fields, trees, and roads. Several houses are detected and framed in blue, with confidence scores indicated.

The Ultralytics Framework: A Versatile Toolkit for Vision

The Ultralytics framework is designed to simplify and accelerate the development of computer vision solutions. By providing a comprehensive set of pre-built tools, pre-trained models, and modular features, it allows researchers and developers to focus on the specific aspects of their projects rather than building foundations from scratch. Let's explore some of the key tools and benefits offered by this powerful framework in detail.

Data Augmentation

To ensure a model's robustness, training on a large and varied volume of data is essential. To increase this diversity without acquiring new images, data augmentation techniques are used. These include classic transformations (cropping, flipping, rotation, brightness and contrast modification) as well as more sophisticated methods like Mosaic (combining four images into one) or MixUp (weighted mixing of images and their labels). Applying these techniques exposes the model to a wider range of situations, thereby improving its ability to generalize.

Image showing a transformation of an original image into several variants, including standard transformations like random cropping, flips, rotations, and adjustments to brightness and contrast.

Test-Time Augmentation (TTA)

During the inference (prediction) phase, Test-Time Augmentation (TTA) can improve performance. It involves applying multiple transformations to the test image, obtaining predictions for each transformed version, and then aggregating these predictions (e.g., by majority vote or averaging). Although more time-consuming, this method can significantly increase the model's precision and recall.

A Complete Ecosystem for Computer Vision

The Python library provided by Ultralytics is not limited to object detection. Their framework extends YOLO's capabilities to:

Object Tracking: Follow identified objects across video frames.
Instance Segmentation: Identify each object instance at the pixel level, providing a precise delineation of its shape.
Pose Estimation: Detect and track key points of the human body (skeleton).
Image Classification: Categorize an entire image (no bounding boxes).
Oriented Object Detection (OBB): Detect objects with an oriented bounding box, crucial for aerial imagery where angle is important.

Conclusion: YOLO, an Engine of Continuous Innovation in Computer Vision

YOLO has established itself as an indispensable object detection algorithm, combining speed, accuracy, and remarkable adaptability. Its innovative architecture and constant evolution, driven by a dynamic scientific community and industrial players, allow it to excel in an impressive variety of fields. Continuous advancements, such as those explored with models like YOLOv12-turbo (see performance graph on the COCO dataset, a reference dataset for object detection), promise new capabilities.

Graph comparing performance in mAP (%) on the COCO dataset versus latency (in milliseconds) for different versions of the YOLO model, including YOLOv12-turbo.

‍

The flexibility offered by frameworks like Ultralytics, with its advanced features for segmentation, pose estimation, or oriented detection, makes YOLO much more than a simple detector: it's a truly versatile toolkit for computer vision. As artificial intelligence continues to advance, YOLO and its derivatives are undoubtedly destined to redefine standards further, paving the way for ever more intelligent and powerful applications.

To Go Further (Useful Resources)

YOLO Research Papers

YOLOv1 (Redmon et al., 2016) : "You Only Look Once: Unified, Real-Time Object Detection" https://arxiv.org/abs/1506.02640
YOLOv2 / YOLO9000 (Redmon et Farhadi, 2017) : "YOLO9000: Better, Faster, Stronger" https://arxiv.org/abs/1612.08242
YOLOv3 (Redmon et Farhadi, 2018) : "YOLOv3: An Incremental Improvement" https://arxiv.org/abs/1804.02767
YOLOv4 (Bochkovskiy, Wang, et Liao, 2020) : "YOLOv4: Optimal Speed and Accuracy of Object Detection" https://arxiv.org/abs/2004.10934
YOLOv6 (Meituan Technical Team, 2022-2023) : "YOLOv6 v3.0: A Full-Scale Reloading" https://arxiv.org/abs/2301.05586
YOLOv7 (Wang, Bochkovskiy, et Liao, 2022) : "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors" https://arxiv.org/abs/2207.02696
YOLOv8 (Ultralytics) : https://docs.ultralytics.com/models/yolov8/
YOLOv9 (Wang, Kuo, et Liao, 2024) : "YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information" https://arxiv.org/abs/2402.13616
YOLOv10 (Wang et al., Tsinghua University, 2024) : "YOLOv10: Real-Time End-to-End Object Detection" https://arxiv.org/abs/2405.14458
YOLOv12 (Yunjie Tian et al., 2025) : "YOLOv12: Attention-Centric Real-Time Object Detectors" https://arxiv.org/abs/2502.12524

Major Datasets in Object Detection

COCO (Common Objects in Context): https://cocodataset.org/
Pascal VOC: http://host.robots.ox.ac.uk/pascal/VOC/
ImageNet: https://www.image-net.org/
Open Images Dataset: https://storage.googleapis.com/openimages/web/index.html

Key Concepts and Associated Technologies

Convolutional Neural Networks (CNNs):
- Wikipedia Article (EN): https://en.wikipedia.org/wiki/Convolutional_neural_network
- Explanation (EN): https://zilliz.com/glossary/convolutional-neural-network
Transformers and Attention Mechanism in Vision:
- "Attention Is All You Need" (Vaswani et al., 2017 - foundational paper for Transformers in NLP): https://arxiv.org/abs/1706.03762
- "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (ViT, Dosovitskiy et al., 2020): https://arxiv.org/abs/2010.11929
- DINOv2 (Meta AI): https://dinov2.metademolab.com/
Framework:
- Ultralytics Official Website: https://ultralytics.com/

Image Links from the Article

‍