We’re really good at looking at things. Image recognition and object classification are extremely complex tasks, but your brain learned what certain objects are early in life and “generalized”.
So you can see a cat you haven’t seen before at a strange angle, yet still recognize the furry beast as such: you have a level of “abstract” understanding of which patterns of photons hitting your retinal cells constitute a feline.
This is wicked tough for a computer to do, by the way.
You’ll need thousands of cat pictures to train a “cat or not-cat” convolutional neural network, and you’d better hope the data is diverse (cats that don’t look like cats at odd lighting or angles) or it’ll be surprised by new data.
A general object-recognition net like YOLO trains on datasets such as MS-COCO, which has ~330,000 images and roughly 1.5 million object instances. This takes an awful lot of computing power, and plenty of research is done on clever techniques and convolutional layer shortcuts to cut down on training cost and runtime.
But here’s the thing: these networks process images quite differently from how your brain recognizes sights. Convolutional networks convert images to dense numeric arrays, representing pixel values of brightness or color. Each pixel-value is fed into a corresponding node in the network’s input layer, and the connections & convolutional layers handle it from there.
But the brain doesn’t run on scalar numbers, as far as we understand. You’ll pick up on brightness or color values, but in your head various neurons are activating, sending signals down axons to activate other neurons, forming and reinforcing complex synaptic connections.
In other words, your brain is a graph with binary nodes. Our current models are graphs with scalar (non-binary) nodes. Perhaps we’re barking up the wrong tree?
Enter the Sparse Distributed Representation, a bit-array that some reckon is a more abstract rendition of information. The fellows at Numenta and elsewhere put these to work behind Hierarchical Temporal Memory neural networks, which are graphs with binary nodes and complex synaptic connections.
By passing input data through a datatype-specific encoder, you can convert any information to a binary vector. Only a few bits (~2%) are activated, giving the representation high sparsity.
The positions and activation of each bit contain information from the input, so similar input data will have similar overlapping bit positions active.
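To make the overlap idea concrete, here is a minimal sketch of a toy scalar encoder. It is not the htm.core encoder: the function names, the 1,000-bit size, and the 0 to 100 input range are all assumptions picked for illustration. Each value activates a contiguous run of 20 bits (2% of the array), so nearby values share active bits and distant values share none.

```python
def encode_scalar(value, minimum=0.0, maximum=100.0, size=1000, active_bits=20):
    """Toy encoder (hypothetical): return the set of active bit indices."""
    # Clamp to the supported range, then map the value to a start index.
    clamped = min(max(value, minimum), maximum)
    fraction = (clamped - minimum) / (maximum - minimum)
    start = round(fraction * (size - active_bits))
    return set(range(start, start + active_bits))


def overlap(a, b):
    """Number of bit positions active in both representations."""
    return len(a & b)


a = encode_scalar(50.0)
b = encode_scalar(50.5)   # close to a
c = encode_scalar(90.0)   # far from a

print("active bits:", len(a), "of 1000")
print("overlap 50.0 vs 50.5:", overlap(a, b))
print("overlap 50.0 vs 90.0:", overlap(a, c))
```

The overlap count is the similarity measure here: the closer two inputs are, the more active positions their bit-arrays have in common.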
Cat has more in common with Dog than Dog does with Fish (mammalian features overlap), but the existence of catfish means Cats are slightly closer to Fish than Dogs are.
But images are trickier. Instead of converting one number (within a defined min/max range) to an SDR, we’re converting 40,000 scalar pixels for a 200x200 picture, or 120,000 if it’s in color.
There’s a neat example of SDRs for images on the HTM core GitHub, where they use a simple non-temporal HTM net on the MNIST handwriting classification task. It hits 95% accuracy, but tests with more complex datasets (Fashion MNIST, for example) or with large color images don’t go as smoothly.
They reckon that this is simply because they encode the handwritten digit images with the following method:
import numpy as np

def encode(data, out):  # encode the (image) data
    # @param data - raw data
    # @param out  - return SDR with encoded data
    out.dense = data >= np.mean(data)  # convert greyscale image to binary B/W.
    # TODO improve. have a look in htm.vision etc
To encode an image’s pixel-value vector (data) to an SDR (out.dense), they compare each pixel in the array to the array’s mean pixel value, computed with NumPy. If the pixel is at or above that mean threshold, the corresponding bit is activated.
The one-line encoding method lacks nuance: it can only create an SDR the exact length of the input vector and will only work well on images with extremely clear objects. This works when you have size 28x28 grayscale images, but doesn’t do so well on, say, pictures of cats.
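A small sketch shows both limitations on a made-up 3x3 "image". The helper name `encode_by_mean` and the pixel values are assumptions for illustration. It mirrors the thresholding step quoted above but returns a plain boolean array instead of writing into an SDR object.

```python
import numpy as np


def encode_by_mean(image):
    """Threshold every pixel against the image's own mean (illustrative helper)."""
    flat = np.asarray(image, dtype=float).ravel()
    return flat >= flat.mean()


# Hypothetical grayscale image: a bright blob on a dark background.
image = np.array([[0, 10, 200],
                  [5, 220, 240],
                  [0, 0, 15]])

bits = encode_by_mean(image)
print("input pixels:", image.size, "output bits:", bits.size)
print("active bits:", int(bits.sum()))
print("sparsity:", bits.mean())
```

The output always has exactly one bit per pixel, and the fraction of active bits is whatever the image happens to produce rather than a sparsity you get to choose.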
The internet, and to a lesser extent all technology, is focused around cats, so let’s look for a better image encoder for such purposes.
A really interesting approach I found is the HTM community edition’s Eye and Channel encoders, which deal with numeric-value pixel arrays using the power of redundancy.
ChannelEncoder assigns a random numeric range, called a “bin”, to each bit of the output SDR. Each bit will become active if the corresponding input (pixel value) falls within its unique range.
It relies on the concept of topology in the input data:
“Adjacent pixels are likely to show the same thing”.
Let’s say we’ve got our cat image:
If a certain bit in the output SDR isn’t able to encode the area it “should” be encoding (perhaps it’s trying to encode the cat’s eye, but the pixel randomly falls outside its bin-range), that portion of the image is really just a bunch of neighboring pixels that contain the same “cat eye” information.
So there’s still a good chance that neighboring SDR-bits will be able to encode the adjacent pixels of the cat’s eye. This is the strength of redundant encoding, which is only possible in data with topology, something inherent to all images.
So the Channel makes every output SDR bit receptive to a unique, random range of inputs. This has a lot more representational power than identical input ranges, due to the power of redundancy.
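Here is a rough NumPy sketch of that idea. It is not the community edition’s ChannelEncoder: the class name `RandomBinEncoder`, its parameters, and the choice of 8 bits per pixel are assumptions made for illustration. Every output bit gets its own randomly placed value range and fires only when its pixel lands inside it.

```python
import numpy as np


class RandomBinEncoder:
    """Illustrative sketch: every output bit gets its own random value range."""

    def __init__(self, input_shape, bits_per_pixel=8, sparsity=0.02,
                 value_range=(0.0, 255.0), seed=0):
        rng = np.random.default_rng(seed)
        low, high = value_range
        # Bin width is a fixed fraction of the value range, so roughly
        # `sparsity` of the bits fire for uniformly distributed pixel values.
        self.width = sparsity * (high - low)
        shape = tuple(input_shape) + (bits_per_pixel,)
        self.bin_low = rng.uniform(low, high - self.width, size=shape)

    def encode(self, pixels):
        # Add a trailing axis so each pixel is compared against all of its bins.
        values = np.asarray(pixels, dtype=float)[..., np.newaxis]
        return (values >= self.bin_low) & (values < self.bin_low + self.width)


def overlap(a, b):
    """Number of bits active in both SDRs."""
    return int(np.count_nonzero(a & b))


rng = np.random.default_rng(1)
image = rng.integers(0, 256, size=(32, 32))      # stand-in for a grayscale image
nudged = np.clip(image + 2, 0, 255)              # small brightness change
unrelated = rng.integers(0, 256, size=(32, 32))  # completely different image

encoder = RandomBinEncoder(image.shape)
sdr = encoder.encode(image)
print("sparsity:", sdr.mean())
print("overlap with nudged image:", overlap(sdr, encoder.encode(nudged)))
print("overlap with unrelated image:", overlap(sdr, encoder.encode(unrelated)))
```

In this sketch, a slightly brightened copy of the image keeps a good share of its active bits, while an unrelated image shares almost none.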
This process lets the ChannelEncoder maintain semantic similarity, which means that two similar inputs will have similar SDRs. A small change in the input image will cause some outputs to change, while a larger change will cause nearly all outputs to change.
What’s interesting is that this bit sensitivity (semantic similarity) is determined by the “bin sizes” or span of the randomly assigned ranges mentioned above. Bin sizes, in turn, are determined by sparsity.
Imagine that we’re collecting rainfall with real “bins”. If a bin gets a volume of water that falls in its unique range, its corresponding bit becomes “active”.
If we line up our bins next to each other on the ground, rain falls somewhat randomly due to trees and wind. However, wider bins will collect more raindrops on average.
Smaller bins mean fewer active bits, which means more sparsity. Since sparsity is inversely related to the amount of semantic similarity, we want to achieve the previously mentioned ~2% sparsity that most SDRs have.
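The link between bin width and the fraction of active bits can be checked with a quick simulation. This is a sketch under simple assumptions (uniformly random inputs on a 0 to 255 scale, one random bin per bit), and `measured_sparsity` is a hypothetical helper, not part of any HTM library.

```python
import numpy as np


def measured_sparsity(bin_width, n_bits=20_000, value_range=255.0, seed=0):
    """Fraction of bits that fire when each bit sees one random input value."""
    rng = np.random.default_rng(seed)
    values = rng.uniform(0.0, value_range, size=n_bits)
    # Each bit's bin starts at a random point and spans `bin_width`.
    bin_low = rng.uniform(0.0, value_range - bin_width, size=n_bits)
    active = (values >= bin_low) & (values < bin_low + bin_width)
    return active.mean()


for fraction in (0.02, 0.10, 0.30):
    width = fraction * 255.0
    print(f"bin width {width:6.2f} -> active fraction {measured_sparsity(width):.3f}")
```

Under these assumptions the active fraction tracks the bin width as a share of the input range, so a target of about 2% active bits translates directly into bins that span about 2% of the possible pixel values.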
When we look at an image, we don’t take it all in at once. We focus on a small field of view while subconsciously processing our peripheral input, and our eyes naturally move around the scene.
The HTM.core example we’re working through makes use of the OpenCV community’s Bioinspired module, specifically the Retina processor, to craft an “Eye” encoder that follows this same logic.
The Eye encoder mimics the action of the parvocellular (P-cell) and magnocellular (M-cell) pathways. Parvo cells detect color and shape information in static images. Magno cells detect temporal changes of photons/pixels across multiple images.
It’s important to note that control of the eye’s movement isn’t yet written into the algorithm: this is a motor function. A computer will naturally read an image’s pixels from top left to bottom right, but our eyes travel in certain patterns based on what features/objects we recognize.
I worked through some of the test code from the GitHub repo:
Keep in mind that without specification of “where to look”, we’re just aiming for the center of the image and making small random movements to track nearby positions. However, this nets some pretty interesting results nonetheless:
Running the above code launches several windows which loop through 10 random eye movements (the count is set in the loop range). For each position the retina scans, it generates a parvo and magno SDR, as well as a visual representation of what it’s looking at.
The REM steps are important to judge performance, since we want to encode semantic similarity as above: similar input data generates similar SDRs.
Compare the magnocellular and parvocellular visualizations: Parvo seems to be focusing more on detail and color, while magno is more soft and smoothed to detect gradient changes over time.
We can also examine the statistics of both cells:
Look at all those parameters just waiting to be tuned with swarm methods. Nice output for initial testing.
The sparsity is crucial here. Parvo is a little on the high end, but magno is right where we want it.
We should be able to tune both of these further by adjusting the parameters, including bin size — since that’s directly related to semantic similarity.
Two eyes, two cells
Parvo cells focus on analyzing static images, so their output is a “busier”, less sparse SDR that tries to encode detail of the part_of_cat it’s looking at. “What am I looking at?”
Magno cells are less detailed, and constantly compare the general “structure” of the current field of view to prior images. “Am I looking at the same thing as before, or something else?”
Consider the potential applications of a dual-cell object recognition network, tasked with identifying objects in a busy city intersection.
The Retina moves around the image with some predefined, perhaps ML-trained behavior algorithms.
The parvo cell scans to recognize “what’s happening in this part of the image”, and tries to classify objects.
The magno cell aids the parvo cell by saying “we’re probably looking at the same object as the last few frames of video” or “this feels like a different object based on temporal comparisons”.
Further biological coding
The next big question to ask is “where should the Retina focus in each image?” which is a complex issue. Depending on the type of image data you’re looking at, you’ll want to scan different regions — portraits are concerned with the center and middle vertical third, while landscape-type nature photos could have important information in the corners.
I can imagine a brute-force approach where you divide any image into N tessellating squares, and scan M rapid eye movements within the center of each square.
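As a sketch of that brute-force idea, the snippet below generates fixation points only; it does not drive the Eye encoder. The function name, the `jitter` parameter, and the choice of a regular grid (so N is `squares_per_side` squared, and the cells are rectangles when the image isn’t square) are all assumptions.

```python
import random


def fixation_points(width, height, squares_per_side, moves_per_square,
                    jitter=0.1, seed=0):
    """Return (x, y) fixation points clustered around the center of each cell."""
    rng = random.Random(seed)
    cell_w = width / squares_per_side
    cell_h = height / squares_per_side
    points = []
    for row in range(squares_per_side):
        for col in range(squares_per_side):
            # Center of the current grid cell.
            center_x = (col + 0.5) * cell_w
            center_y = (row + 0.5) * cell_h
            for _ in range(moves_per_square):
                # Small random offset, as a fraction of the cell size.
                dx = rng.uniform(-jitter, jitter) * cell_w
                dy = rng.uniform(-jitter, jitter) * cell_h
                points.append((center_x + dx, center_y + dy))
    return points


# 4 x 4 = 16 cells, 3 eye movements per cell, on a 200x200 image.
points = fixation_points(200, 200, squares_per_side=4, moves_per_square=3)
print(len(points), "fixation points, first three:", points[:3])
```

Each point could then be handed to whatever positions the retina, with M set by `moves_per_square`.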
A more interesting approach would be “training” a net to learn human eye behavior based on “type” of images. You’d hook up an eye movement tracker cam and have humans spend ~5 seconds looking at various images. Whether or not to label the images is another question, but an unsupervised method would have more general applicability.
Overall, the Channel and Eye encoders are extremely interesting implementations of nuanced binary image encoding. Their development logic is closely aligned with biological reality, which I believe to be the most far-sighted path forward for AI.