YOLOv4: The Subtleties of High-Speed Object Detection

Bochkovskiy, Wang & Liao: balancing speed, accuracy and weight

One of the more impressive things you can do is look at things and understand exactly what you’re seeing. Your brain is constantly receiving two feeds of photon interpretations and somehow you’re able to say “this is a pineapple and this is pizza” — and you then know never to combine the two.

Object recognition is impressive because it’s 1) abstract (babies have no idea what’s going on) and 2) damn hard to learn. As AI improves, we’re constantly grappling with this challenge, which consists of detection (finding where an object is) and classification (deciding what it is).

That being said, it’s a much easier task to identify open parking spots in a still image at a leisurely pace than it is to identify “CAR” “BICYCLE” “RED LIGHT” at 40mph.

One of the more promising object-focused CNNs is “You Only Look Once”, a neat open-source model written mostly in C/C++ and assembled with Python.
Here’s a video of “YOLOv3” in action that speaks for itself:


My favorite part is 1:35 when it labels the bird “sports ball” and then “dog” before the camera shifts and you can see its beak & side profile. That says quite a lot about how image nets work, really. It’s been fed a ton of images that include birds, dogs and balls, so it knows balls are round, dogs are horizontally rectangular, and birds have beaks.

Bochkovskiy, Wang & Liao released an updated version this April that boasts significantly improved performance with minimal training time; I think their write-up deserves a lot more attention, so I wanted to dive deep into what makes it stand out over the competition.
The results of version 4 are hard to ignore; here’s its Average Precision at different framerates using the MS COCO dataset:

[Figure: YOLOv4 Average Precision vs. frames per second on MS COCO]

Many CNN object detectors are used for slow recommendation systems. When you’re asking “what type of clothing is in this stock photo”, you can afford a net that takes its time to reach maximum precision.

But a great many vision-related tasks require real-time processing, like driving a car. Faster performance on video feeds allows AI to take over more “human” and “reactive” tasks, so we chase it eternally.

The team’s goal for v4 was thus to optimize performance at higher speeds while keeping the model lightweight — ideally, they wanted it to train and run on a single GPU rather than requiring heavier machinery.

This is a tall order; normally there’s some sort of tradeoff between those qualities. But with some clever architectural tricks, the team managed to improve speed and accuracy while keeping the model light.

Speed vs. Precision

Measuring accuracy on object detection is a bit more complex than statistical predictions, but the principles are largely the same. Instead of pure precision & recall judgments on True Positives and such, nets like YOLO often use AP50, or “Average Precision at .50 Intersection Over Union”.


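IoU, or the Jaccard Index, is actually fairly intuitive to look at: if one box is the net’s predicted location for a rabbit and another is the rabbit’s actual bounding box, the IoU is the area where the two boxes overlap divided by the total area they cover together. AP50 counts a prediction as correct when that ratio is at least 0.5.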
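
To make that concrete, here is a minimal Python sketch of IoU for two axis-aligned boxes. The (x1, y1, x2, y2) corner format and the function name `iou` are my own choices for illustration, not something taken from the YOLOv4 code.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes don't overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 2, 2)
actual = (1, 1, 3, 3)
print(iou(predicted, actual))  # 1 / 7, roughly 0.143
```

With an IoU of about 0.14, this prediction would not count as a hit under the 0.5 threshold that AP50 uses.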

Average Precision is a clean way of synthesizing “accuracy”: it can be defined as the area under the precision-recall curve. This streamlined single evaluation metric makes it simpler to compare different model performance.
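
As a rough illustration of "area under the precision-recall curve", the sketch below ranks detections by confidence and adds up the precision at every true positive. This is the plain, un-interpolated form and handles a single class; benchmark tooling typically interpolates the curve and averages over classes, so treat the function name and the input format as assumptions of mine rather than the official evaluation.

```python
def average_precision(detections, num_ground_truth):
    """Area under the precision-recall curve (un-interpolated, one class).

    detections: list of (confidence, is_true_positive) pairs. For AP50, a
    true positive is a box whose IoU with a real object is at least 0.5.
    """
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    true_positives = 0
    area = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            precision = true_positives / rank
            # Each true positive raises recall by 1 / num_ground_truth,
            # so it adds a strip of that width to the area.
            area += precision / num_ground_truth
    return area


detections = [(0.9, True), (0.8, False), (0.7, True)]
print(average_precision(detections, num_ground_truth=2))  # (1 + 2/3) / 2, roughly 0.833
```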

Blueprints and Bags

The YOLOv4 model has several distinct “types” of layers, from bottom to top:

  1. Input: Feeds image into network
  2. Backbone: Extracts features from image
  3. Neck: Collects feature maps from different layers
  4. Head: Outputs predicted bounding boxes & classes for objects

Selecting the pieces of model architecture is tricky. There’s a lot to balance in the composite task of “recognition”.

Detectors need:

  1. Higher input size (resolution) — to detect multiple smaller objects
  2. More layers for a higher receptive field — to cover the increased input size (bigger image needs more processing)
  3. More parameters — to better detect differently-sized objects in one image

The receptive field, essentially “what the net is looking at within the image and how”, has many size-related considerations. It needs to be big enough to see the entire object, as well as the context around the object to differentiate it from the background.

The best way to boost the receptive field is adding more layers, but casually tacking on extra convolutional layers can inflate the computational time, so some workarounds are required.
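
To see how layers grow the receptive field, here is a small sketch using the standard recurrence for stacked convolutions: each layer widens the field by (kernel size - 1) times the current spacing between outputs, and a stride multiplies that spacing. The function and its (kernel_size, stride) input format are my own illustration and ignore dilation; nothing here is taken from the paper.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) after a stack of conv layers.

    layers: list of (kernel_size, stride) pairs, first layer first.
    """
    field = 1  # a single input pixel sees only itself
    jump = 1   # distance in input pixels between neighbouring outputs
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


print(receptive_field([(3, 1)] * 3))              # 7
print(receptive_field([(3, 1), (3, 2), (3, 1)]))  # 9
```

Three plain 3x3 layers see a 7x7 patch, while putting a stride of 2 in the middle stretches that to 9x9 with the same number of layers, which is one reason downsampling is a cheaper way to widen the view than simply stacking more convolutions.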

The team experiments with two ImageNet-pretrained networks, CSPResNeXt-50 (“Resnext”) and CSPDarknet53 (“Darknet”), as the backbone. The neck combines an SPP (spatial pyramid pooling) block with PAN layers, with a SAM attention block among the add-ons (we’ll get to PAN and SAM later), while the head is YOLOv3, in a strangely recursive fashion. Bootstrapping is rarely a bad idea.

Bags of Tricks

To optimize their network, the team identified two categories of improvement techniques:

  1. Bag of Freebies: augmentation & other training-only methods = accuracy gain at no inference cost
  2. Bag of Specials: plug-in modules & post-processing methods = significant accuracy increase for a small inference cost

“Freebies” include some interesting image augmentation and other methods that solve semantic bias & class imbalances in the dataset. I’m a big fan of their new “mosaic” data augmentation process:

The idea behind image augmentation is to randomly create noise in your data, which helps prevent overfitting and makes for a more noise-resilient, robust detector overall. This mosaic feature splices parts of 4 images together at irregular crop thresholds to make one composite image that probably shouldn’t exist in reality.

Imagine you’re the robot here: The mosaic 1) messes with your perspective and 2) presents you with more diverse object types & sizes in the same image. A net trained with “busier” and more diverse examples should perform better in a busier environment — it will be harder to surprise.
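
Here is a bare-bones sketch of the splicing idea, using plain Python lists as images to stay dependency-free. It is my own simplification, not the Darknet implementation: the function name, the split-point range and the cropping rule are all assumptions, and a real pipeline would also have to shift and clip every bounding box to match its tile.

```python
import random


def mosaic(images, out_h, out_w, rng=random):
    """Splice crops of four images into one out_h x out_w composite.

    images: four 2-D lists of pixel values, each at least out_h x out_w.
    Returns the composite and the (row, col) where the four tiles meet.
    """
    # Irregular split point, kept away from the edges so every tile is non-empty.
    split_y = rng.randint(out_h // 4, 3 * out_h // 4)
    split_x = rng.randint(out_w // 4, 3 * out_w // 4)

    # (top, bottom, left, right) of each tile: top-left, top-right,
    # bottom-left, bottom-right.
    tiles = [
        (0, split_y, 0, split_x),
        (0, split_y, split_x, out_w),
        (split_y, out_h, 0, split_x),
        (split_y, out_h, split_x, out_w),
    ]

    canvas = [[0] * out_w for _ in range(out_h)]
    for image, (top, bottom, left, right) in zip(images, tiles):
        tile_h, tile_w = bottom - top, right - left
        # Random crop of the source image, exactly the size of its tile.
        crop_y = rng.randint(0, len(image) - tile_h)
        crop_x = rng.randint(0, len(image[0]) - tile_w)
        for row in range(tile_h):
            canvas[top + row][left:right] = image[crop_y + row][crop_x:crop_x + tile_w]
    return canvas, (split_y, split_x)


# Four flat "images" filled with 1, 2, 3 and 4 so the tiles are easy to spot.
sources = [[[value] * 8 for _ in range(8)] for value in (1, 2, 3, 4)]
composite, meeting_point = mosaic(sources, 8, 8, random.Random(0))
for row in composite:
    print(row)
```

Printing the composite shows four unevenly sized blocks of 1s, 2s, 3s and 4s meeting at a random point, which is the "shouldn’t exist in reality" picture in miniature.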

Consider it like this: detecting and classifying pigeons, people and benches is much harder if you suddenly add bears. Most things are, to be fair.

Special Tactics

The team also introduces a 2-stage augmentation technique called “Self-Adversarial Training” (the paper files it under the Freebies, since it only changes training): In stage one, the net alters the original image instead of altering its weights, creating a “deception” image that appears not to contain the desired object. In stage two, the net is trained to detect the object on that altered image anyway.

This is the deep learning equivalent of your calculus professor tossing in a question on Babylonian religious rites to keep you on your toes. If the network’s getting good at detecting bicycles, then telling it to ignore bicycles in the Tour de France gallery will get it to pay more attention to things it otherwise may have overlooked.
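
The two stages can be shown on a toy model. The sketch below swaps the real detector for a tiny logistic "is the object here?" scorer, and uses a sign-of-gradient step to alter the image; both are assumptions of mine for illustration, since the write-up does not spell out how the image is perturbed. All names and constants are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def object_score(weights, bias, image):
    """Toy 'detector': probability that the object is in the image."""
    return sigmoid(sum(w * x for w, x in zip(weights, image)) + bias)


def self_adversarial_step(weights, bias, image, epsilon=0.1, lr=0.5):
    # Stage 1: weights stay frozen, the image changes. For this linear model
    # the score's gradient w.r.t. each pixel has the sign of its weight, so
    # stepping against that sign hides the object from the model.
    deceptive = [x - epsilon * math.copysign(1.0, w) if w != 0 else x
                 for w, x in zip(weights, image)]

    # Stage 2: ordinary training step on the altered image with the original
    # label (object present = 1). Cross-entropy gradient is (score - 1) * x.
    error = object_score(weights, bias, deceptive) - 1.0
    new_weights = [w - lr * error * x for w, x in zip(weights, deceptive)]
    new_bias = bias - lr * error
    return new_weights, new_bias, deceptive


weights, bias = [0.8, -0.5, 0.3], 0.0
image = [1.0, 0.2, 0.5]

before = object_score(weights, bias, image)
new_weights, new_bias, deceptive = self_adversarial_step(weights, bias, image)
hidden = object_score(weights, bias, deceptive)
after = object_score(new_weights, new_bias, deceptive)
print(before, hidden, after)
```

The score drops when the model attacks its own input, then recovers once the model has trained on the altered image, which is the behaviour the two stages are after.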


The team also modifies existing modules. The “Spatial Attention Module” or SAM reweights a feature map so the net concentrates on its most informative regions, at very little extra computation. The team changes their SAM from spatial-wise attention (one weight per location) to point-wise attention (one weight per individual feature value).

The “Path Aggregation Network”, PAN, is intended to optimize connections between low and high layers, ensuring the most important information from each feature level is passed along. The team alters PAN’s additive shortcut connection to concatenation, so features from the two paths are kept side by side instead of being summed into one.
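
The difference between the two shortcut styles is easy to see on toy feature maps. This is a sketch with nested Python lists standing in for tensors (channels, then rows, then columns); the function names are mine and nothing here comes from the YOLOv4 code.

```python
def add_shortcut(low, high):
    """Additive shortcut: sum matching channels element by element."""
    if len(low) != len(high):
        raise ValueError("addition needs the same number of channels")
    return [[[a + b for a, b in zip(row_l, row_h)]
             for row_l, row_h in zip(chan_l, chan_h)]
            for chan_l, chan_h in zip(low, high)]


def concat_shortcut(low, high):
    """Concatenation shortcut: stack the channels, keeping both sets intact."""
    return low + high


low_level = [[[1, 2], [3, 4]]]       # 1 channel, 2x2
high_level = [[[10, 20], [30, 40]]]  # 1 channel, 2x2

print(add_shortcut(low_level, high_level))     # [[[11, 22], [33, 44]]]
print(concat_shortcut(low_level, high_level))  # 2 channels, originals untouched
```

Addition keeps the channel count fixed but blends the two sources, while concatenation doubles the channels here and leaves it to the next layer to decide how to mix them.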

Tuning & Training

The team did a great deal of work on hyperparameter combinations, using genetic algorithms as well as hand-picked augmentation techniques. Here are a few examples of augmentation at work for the object classifier:

[Figure: examples of augmentation techniques applied to classifier training images]

By comparing individual and combined performances of several augmentation techniques, the team figured out that mixing them together produced the best result: top-1 and top-5 accuracy rose by 1.9 and 1.2 percentage points for CSPResNeXt-50, and by 1.5 and 1.2 points for CSPDarknet53.
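
Those two accuracy figures are usually called top-1 and top-5 accuracy: is the right class the net’s first guess, or at least among its five best guesses? A minimal sketch, with a function name and toy scores that are purely illustrative:

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Three samples, three classes; each row holds the net's score per class.
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 0]

print(top_k_accuracy(scores, labels, k=1))  # 1 of 3 correct
print(top_k_accuracy(scores, labels, k=2))  # 2 of 3 correct
```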

The theory holds up well: a more diverse dataset with more surprises mixed in does better overall.
Any extra accuracy without extra time cost is very welcome; keep in mind that 1.5% could be the difference between a truck labeling you “pedestrian” or “turkey”.

For detecting objects, there are a number of extra Freebies to consider:

  • S: eliminates grid sensitivity by multiplying the sigmoid by a factor greater than 1
  • M: Mosaic
  • IT: multiple IoU thresholds for anchor boxes (tolerance/strictness)
  • GA: Genetic algorithms for hyperparameter tuning in first 10% of training periods
  • LS: Class label smoothing for sigmoid optimization
  • CBN: Cross mini-Batch Normalization, collecting statistics across the whole batch instead of a single minibatch
  • CA: Cosine annealing scheduler to alter learning rate during sinusoid training
  • DM: Dynamic minibatch size, increased automatically during small-resolution training with random training shapes
  • OA: Using optimized anchors for 512x512 input resolution
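
Two of the items in that list, label smoothing (LS) and cosine annealing (CA), fit in a few lines each. The sketch below uses the classic textbook forms of both; the exact variants and constants used in YOLOv4’s training are not given here, so the function names, the epsilon of 0.1 and the learning rates are assumptions for illustration only.

```python
import math


def smooth_labels(one_hot, epsilon=0.1):
    """Move epsilon of the probability mass from the true class to all classes."""
    num_classes = len(one_hot)
    return [y * (1.0 - epsilon) + epsilon / num_classes for y in one_hot]


def cosine_annealed_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate that glides from lr_max down to lr_min along a half cosine."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


print(smooth_labels([0, 1, 0, 0]))  # true class near 0.925, the others near 0.025

for step in (0, 250, 500, 750, 1000):
    print(step, cosine_annealed_lr(step, total_steps=1000, lr_max=0.01))
```

Smoothing keeps the net from becoming overconfident about any single class, and the cosine schedule takes large steps early and progressively smaller ones as training winds down.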

The team then compared CSPResNeXt-50 and CSPDarknet53 as backbones. Resnext is better at classification, while Darknet is better at detection. However, they respond differently to BoF & Mish improvements: Resnext gains classification accuracy but loses detection power, while Darknet gains both, so Darknet was chosen as YOLOv4’s final backbone.
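
For reference, Mish is an activation function defined as x * tanh(softplus(x)). A small standard-library sketch (the helper names are mine):

```python
import math


def softplus(x):
    # Numerically stable form of log(1 + e^x).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: smooth, and slightly negative for negative inputs."""
    return x * math.tanh(softplus(x))


for x in (-2.0, 0.0, 1.0, 5.0):
    print(x, mish(x))
```

Unlike ReLU, Mish lets small negative values through instead of zeroing them, while behaving almost like the identity for large positive inputs.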

They also looked into different minibatch sizes, but these had almost no effect on model accuracy (probably due to the comprehensive image augmentation).

Conclusion: Significant Progress

There’s nothing better than a big, vindicating graph to show off your hard work:

YOLOv4 comparison against leading detection/classification networks. Y axes show AP and AP50; X axes show FPS on different GPUs.

Note the “real-time” area of >30 FPS marked in blue: these are the nets applicable to detecting objects in motion. It’s clear to see that YOLOv4 is the king of this category.

And to top it off, this performance was achieved with relatively normal hardware, training and running off of one industry-standard GPU. They’ve certainly achieved their stated goal of making a quick, accurate object prediction net available for personal use.

What makes this case interesting is the careful consideration of many architectural variables. In their paper, the team does a great job of detailing why existing recognition nets are good and bad in certain areas, and how they altered existing modules to keep computation light while extracting more features at a faster pace.

I highly recommend checking out their GitHub; they make it relatively easy to install dependencies and get it running with your own videos. I’ve been digging into their code to see if I can plug in one of those Ghost modules I wrote about to cut runtime even further.

Written by

data scientist, machine learning engineer. passionate about ecology, biotech and AI. https://www.linkedin.com/in/mark-s-cleverley/
