If you want to modify top-of-the-line deep learning models to run twice as fast without dropping accuracy, I’d say you have a ghost of a chance.
Making image-based neural networks quicker is a big deal, especially when you consider where and how they’re applied. A self-driving car employs complex object detection and recognition nets to determine “are those pixels a piece of asphalt or a pedestrian?”
The faster (and cleaner) the networks at play, the better.
While exorcising my convolutional neural network last week, I stumbled across a paper by some clever fellows from Beijing & Sydney Universities proclaiming a new style of convolutional layers for use in CNNs.
Current production-standard models offer high performance, but have to spend a decent chunk of time calculating Floating Point OPerations (FLOPs). Han et al. determined that existing image-analysis networks have a lot of redundancy, and runtime can be slashed by reducing the computational load with some efficient linear operations.
For image classification & object detection, “GhostNet” yields similar or better performance 33% to 46% faster than state-of-the-art nets.
Imagine you’ve got a math problem that involves finding a number of different answers to several equations. Han’s team found a way to calculate a few of the answers, then duplicate & alter those answers to solve the rest of the problem in half the time. They did this using (in their own words) ghosts.
Absolutely bonkers, I thought. I should plug these ghosts into my own CNN and see what happens. But first, I need to understand why it works.
Let’s dive deep into the mechanics behind network architecture optimization. Things are about to get spooky.
In case you aren’t too sure how a CNN works, don’t worry, most of us aren’t either. The basic idea is that you run an image through convolutional layers, which run filters with different sizes (3x3, 7x7 pixels etc) and algorithms over the pixels in the image to get an ‘average’ idea of what each pixel sort of looks like. This helps your model recognize “features” in the image — it allows your net to generalize.
For example, in a cat-or-not-cat image classification net, convolutional layers can start to recognize tails, ears, eyes and such, even in different shapes and angles.
CNNs string layers into blocks to create “feature maps”: convolved versions of the original image from which typical components of the image can be extracted. Each feature map will look different based on various hyperparameters (kernel size, algorithm, stride etc).
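To make that a bit more concrete, here's a minimal NumPy sketch of a single filter sliding over a toy image. The function name feature_map, the 6x6 image and the hand-picked vertical-edge kernel are my own illustrations, not anything from the paper; a real CNN learns its filter values and runs many filters per layer.

```python
import numpy as np

def feature_map(image, kernel, stride=1):
    """Slide one k x k filter over a 2-D image (no padding) and return the feature map."""
    k = kernel.shape[0]
    h_out = (image.shape[0] - k) // stride + 1
    w_out = (image.shape[1] - k) // stride + 1
    out = np.empty((h_out, w_out))
    for r in range(h_out):
        for c in range(w_out):
            window = image[r * stride:r * stride + k, c * stride:c * stride + k]
            out[r, c] = np.sum(window * kernel)  # weighted sum of the pixels under the filter
    return out

image = np.zeros((6, 6))
image[:, 3:] = 1.0                                # dark left half, bright right half
vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)  # responds where brightness changes left to right

print(feature_map(image, vertical_edge))            # 4 x 4 map, non-zero only around the edge
print(feature_map(image, vertical_edge, stride=2))  # same filter, stride 2: a 2 x 2 map
```

Same image, same filter, different stride, different map. Swap the kernel and you get yet another one, which is how a single layer ends up with dozens of maps.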
Here’s some maps from the first block of a famously high-performing CNN, “ResNet-50” (all images are from the paper above):
Han’s team (who I’ll refer to as “the team”) figured out that some of these maps are extremely similar (noted in the colored pairs): you can get from one to the other by running simple linear operations on each pixel. They determined that some feature maps are simply clones, or “ghosts” of others, creating significant redundancy in existing architectures.
To test this out, the team used depth-wise convolutional filters (size d x d) in ResNet-50’s first block, and fed the above colored feature map pairs in as input & output. This allowed them to learn the mapping (linear operation) between them and calculate the Mean Squared Error for each pair. The MSEs are quite small, which indicates strong correlations between the paired maps and supports the idea of redundancy.
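Here's a rough way to replay that experiment on toy data. Two loud assumptions: I swap the team's trained depth-wise filter for a closed-form least-squares fit of a single d x d filter, and my "feature maps" are random arrays rather than real ResNet-50 activations. The helper names (neighbourhoods, apply_filter, fit_linear_map) are hypothetical.

```python
import numpy as np

def neighbourhoods(src, d):
    """Return an (h*w, d*d) matrix: one row per pixel, holding its zero-padded d x d neighbourhood."""
    pad = d // 2
    padded = np.pad(src, pad)
    h, w = src.shape
    cols = [padded[i:i + h, j:j + w].ravel() for i in range(d) for j in range(d)]
    return np.stack(cols, axis=1)

def apply_filter(src, kernel):
    """Apply one d x d filter to a 2-D map ('same' output size)."""
    return (neighbourhoods(src, kernel.shape[0]) @ kernel.ravel()).reshape(src.shape)

def fit_linear_map(src, dst, d=3):
    """Least-squares fit of the d x d filter that best maps src onto dst; returns (kernel, mse)."""
    patches = neighbourhoods(src, d)
    kernel, *_ = np.linalg.lstsq(patches, dst.ravel(), rcond=None)
    mse = float(np.mean((patches @ kernel - dst.ravel()) ** 2))
    return kernel.reshape(d, d), mse

rng = np.random.default_rng(0)
src = rng.standard_normal((16, 16))
ghost = apply_filter(src, rng.standard_normal((3, 3)))  # a map that really is a linear "ghost" of src
stranger = rng.standard_normal((16, 16))                # a map with no relation to src

print(fit_linear_map(src, ghost)[1])     # tiny MSE: the pair is redundant
print(fit_linear_map(src, stranger)[1])  # much larger MSE: no cheap mapping exists
```

A small MSE means one map can be cheaply rebuilt from the other, which is exactly the kind of pair the team flags as a ghost.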
The team hypothesized that you don’t need to spend time generating so many unique feature maps in each block: you can just calculate fewer “intrinsic” maps and run quick linear operations on them to approximate the rest of the “ghost” maps.
Figures 4 and 5 below compare the maps in the Ghost version of VGG-16, another high-performing net, with the original.
Here the compression parameter s = 2, so it convolves the image into half as many maps (the red-bounded 16) and then haunts the layer, transforming the red 16 with linear operations to create the green-bounded 16. Considering the above pair MSE calculations, the ghosted half should produce about as much insight as the original full-convolution 32 while using fewer FLOPs. Even for computers, ghosting is a lot easier.
Ghosting your net
After establishing that there’s plenty of redundancy in normal CNNs, the team designed a “ghost module”: an alternative convolutional layer that runs linear transformations on fewer convoluted feature maps.
The mathematics are a little complex; a full explanation can be found in the paper, but I’ll summarize the logic behind FLOP reduction:
Calculating the number of Floating Point OPerations for a layer:

Input data:
X ∈ R^(c*h*w)
c = number of input channels
h, w = height, width of input image

Output data:
Y = X [conv] f + b = convolutional layer generating n feature maps
[conv] = convolution operation
b = bias
Y ∈ R^(h'*w'*n) = output feature map with n channels
h', w' = height, width of output image
f ∈ R^(c*k*k*n) = the layer's convolution filters
k*k = size of convolutional filters. 3x3 etc
c = number of channels

Number of FLOPs in layer = n*h'*w'*c*k*k
= num_filters * output_size * num_channels * kernel_size
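As a sanity check on that formula, here's a tiny sketch. The function name and the example layer (64 filters of size 3x3 over 64 input channels, producing a 224x224 output) are my own picks for illustration; each multiply-add is counted once, as in the formula.

```python
def conv_flops(n, h_out, w_out, c, k):
    """FLOPs for one standard conv layer: n * h' * w' * c * k * k."""
    return n * h_out * w_out * c * k * k

# 64 filters, 224 x 224 output, 64 input channels, 3 x 3 kernels
print(conv_flops(n=64, h_out=224, w_out=224, c=64, k=3))  # 1849688064
```

That's nearly two billion operations for one layer, before you stack a dozen more on top.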
Most networks have many filters and channels (n, c), so the resulting product is often huge: hundreds of millions, sometimes billions, of operations just for one layer. The Ghost module is one way to reduce this load:
In the Ghost module, fewer truly convolved maps are duplicated & transformed to serve as extra pseudo-convolved maps, generating the same amount of output with much less effort. The positions of the Φ indicate that the order is preserved while cloning; when ghosting 16 into 32, the original #7 transforms into the ghost #7.
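Here's how I picture the forward pass, as a bare-bones NumPy sketch rather than the team's actual code. Assumptions: s = 2 (one ghost per intrinsic map), odd kernel sizes, random weights instead of trained ones, and no bias, batch norm or activation. All function names are mine.

```python
import numpy as np

def conv2d_same(x, filters):
    """Ordinary convolution, 'same' size. x: (c, h, w); filters: (m, c, k, k) -> (m, h, w)."""
    m, c, k, _ = filters.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, w = x.shape
    out = np.zeros((m, h, w))
    for i in range(k):
        for j in range(k):
            # mix all c input channels into each of the m output maps
            out += np.einsum("mc,chw->mhw", filters[:, :, i, j], xp[:, i:i + h, j:j + w])
    return out

def depthwise_same(x, kernels):
    """Depth-wise convolution: one d x d kernel per map. x: (m, h, w); kernels: (m, d, d)."""
    m, d, _ = kernels.shape
    pad = d // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, w = x.shape
    out = np.zeros((m, h, w))
    for i in range(d):
        for j in range(d):
            out += kernels[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    return out

def ghost_module(x, primary_filters, ghost_kernels):
    """Intrinsic maps from a real convolution, ghosts from cheap per-map ops, then concatenated."""
    intrinsic = conv2d_same(x, primary_filters)        # the expensive half
    ghosts = depthwise_same(intrinsic, ghost_kernels)  # ghost #i is made from intrinsic #i
    return np.concatenate([intrinsic, ghosts], axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))                    # a tiny 3-channel "image"
primary = rng.standard_normal((16, 3, 3, 3))          # only 16 real filters
cheap = rng.standard_normal((16, 3, 3))               # 16 depth-wise ghost kernels
print(ghost_module(x, primary, cheap).shape)          # (32, 8, 8)
```

The depth-wise step never mixes channels, which is why it's so much cheaper than the full convolution that produced the intrinsic maps.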
The team writes that affine and wavelet transformations can theoretically offer better insight/performance compared to plain linear operations, but the latter are computationally much cheaper and convolution is already quite efficient. Changing kernel size from module to module could also yield a performance boost, but can spike GPU load, so they deemed it more efficient to keep a stable filter size while using depth-wise convolution.
Ghost modules are stacked into “Ghost Bottleneck” blocks to use in CNNs. The blocks are essentially two stacked modules, similar to ResNet’s basic residual block structure; the first expands the number of channels and the second reduces the channels to match the “shortcut” path size, where the shortcut connects the input and output of each bottleneck.
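To see the channel bookkeeping, here's a compact sketch of a stride-1 bottleneck. It's my own simplification, not the team's implementation: the primary convolution is 1x1, s = 2, the ghost kernels are 3x3, batch norm is left out, and I put a ReLU after the first module only (that's how I read the paper, so treat it as an assumption). The names pointwise_ghost_module and ghost_bottleneck are hypothetical.

```python
import numpy as np

def pointwise_ghost_module(x, w_primary, k_ghost):
    """x: (c, h, w). w_primary: (m, c) 1x1 conv weights. k_ghost: (m, 3, 3) depth-wise kernels."""
    intrinsic = np.einsum("mc,chw->mhw", w_primary, x)
    h, w = x.shape[1:]
    padded = np.pad(intrinsic, ((0, 0), (1, 1), (1, 1)))
    ghosts = np.zeros_like(intrinsic)
    for i in range(3):
        for j in range(3):
            ghosts += k_ghost[:, i, j][:, None, None] * padded[:, i:i + h, j:j + w]
    return np.concatenate([intrinsic, ghosts], axis=0)  # 2 * m channels out

def ghost_bottleneck(x, expand, reduce):
    """Stride-1 ghost bottleneck; expand and reduce are (w_primary, k_ghost) pairs."""
    expanded = np.maximum(pointwise_ghost_module(x, *expand), 0.0)  # widen the channels, then ReLU
    reduced = pointwise_ghost_module(expanded, *reduce)             # narrow back down
    return reduced + x                                              # shortcut: shapes must match

rng = np.random.default_rng(0)
c_in, c_exp = 16, 48
x = rng.standard_normal((c_in, 8, 8))
expand = (rng.standard_normal((c_exp // 2, c_in)), rng.standard_normal((c_exp // 2, 3, 3)))
reduce = (rng.standard_normal((c_in // 2, c_exp)), rng.standard_normal((c_in // 2, 3, 3)))
print(ghost_bottleneck(x, expand, reduce).shape)  # (16, 8, 8)
```

Channels go 16 to 48 and back to 16, so the output can be added straight onto the input.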
Here’s the layer-by-layer structure of their “GhostNet” standalone neural network. It’s like a haunted mansion, but with more math.
The team notes that they followed MobileNetV3’s general plan, changing all bottleneck blocks to ghost bottlenecks. At the end they employ a global average pool & standard convolution layer, transforming the feature maps into 1280-dimensional vectors for classification (is it a cat? is the cat haunted?).
“#exp” indicates the expansion size: the number of channels the first ghost module in the bottleneck expands to (the “expansion ratio” is num_output_channels:num_input_channels for that module). Layers where “SE” = 1 indicate a “Squeeze and Excite” module is applied to the residual layer of the bottleneck. Yes, that’s really what they’re called. They’re quite popular.
Testing Spooky Models
The publicly-available benchmark CIFAR-10, ImageNet ILSVRC 2012 and MS COCO datasets were fed into various architectures to judge performance. The team measured their own “GhostNet” and haunted versions of the state-of-the-art VGG-16, ResNet-50, and MobileNetV3 networks against the specter-free originals.
In tables 3 and 4, they measure the ghosted VGG-16 with different kernel/filter sizes (d) and compression factors (s) against the original VGG-16 for image classification on CIFAR-10.
The team decided that 3 is a good kernel size, in between 1x1’s lack of spatial understanding and 5/7’s increased load and overfitting. A compression factor of 2 is also (for the models they tested) generally healthy.
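To get a feel for what s and d do to the workload, here's a back-of-the-envelope sketch of the theoretical FLOP savings. The ghost-module cost below (n/s maps by ordinary convolution, plus s - 1 cheap d x d operations per intrinsic map) is the formula as I read it in the paper; the function names and layer sizes are made up for illustration, and this says nothing about accuracy, which is what the tables actually measure.

```python
def conv_flops(n, h_out, w_out, c, k):
    """FLOPs for a standard conv layer producing n maps."""
    return n * h_out * w_out * c * k * k

def ghost_flops(n, h_out, w_out, c, k, s, d):
    """FLOPs for a ghost module producing n maps (assumes s divides n)."""
    m = n // s                                   # intrinsic maps from the real convolution
    primary = conv_flops(m, h_out, w_out, c, k)
    cheap = (s - 1) * m * h_out * w_out * d * d  # each intrinsic map spawns s - 1 ghosts
    return primary + cheap

def speedup(n, h_out, w_out, c, k, s, d):
    """Theoretical FLOP ratio: standard layer over ghost module."""
    return conv_flops(n, h_out, w_out, c, k) / ghost_flops(n, h_out, w_out, c, k, s, d)

layer = dict(n=60, h_out=32, w_out=32, c=64, k=3)  # hypothetical layer; 60 divides by 2, 3, 4, 5

for s in (2, 3, 4, 5):
    print("s =", s, "speedup =", round(speedup(s=s, d=3, **layer), 2))
for d in (1, 3, 5, 7):
    print("d =", d, "speedup =", round(speedup(s=2, d=d, **layer), 2))
```

The ratio lands a little under s, and growing d only nibbles at it, so on paper bigger s always looks better. The accuracy columns in the tables are what keep s = 2 and d = 3 honest.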
The really neat thing is that these modules are actually modular. Plugging them into existing cutting-edge networks makes those networks run twice as fast with the same performance, or (for the quicker networks) runs about as fast with better performance. Ghost modules generally seem to boost either performance or speed.
In table 6, we see that compressed versions of ResNet (such as SSS-ResNet-50) trade accuracy for speed, but the ghost version keeps pace with higher performance.
GhostNet’s most impressive performance is against small-scale networks. GhostNet outperforms MobileNetV3 (which was used as its blueprint) in MS COCO object detection, boasting 33–46% speed boosts while matching or beating its precision. It carries this performance through ImageNet classification as well.
This is, in short, bloody impressive. The purpose of ghosting is to reduce computations (FLOPs) while maintaining accuracy. Fewer feature maps should mean less accuracy, but not if there’s redundancy that can be simulated at low cost.
It’s a similar idea to image augmentation, but instead of generating extra “good enough” training images to gain more understanding, it generates extra “good enough” feature maps in each layer to get the same understanding with less work.
Overall, ghost modules perform extremely well, maintaining or beating accuracy while halving operational load.
They have the code up on GitHub (done in TensorFlow), so I’m going to have a good time haunting my own neural networks. Give the modules a try if you need some excitement in your network.