Embodied Learning is Sparse
unpacking Viviane Clay’s findings in reinforcement learning
I’ve written a few articles on Hierarchical Temporal Memory neural networks, which encode data into Sparse Distributed Representations to make noise-resistant predictions that consider multiple time-steps from the past input feed.
The key pioneer of HTM technologies, Numenta, holds weekly research meetings to discuss interesting new theories and advancements in machine learning and neuroscience.
Recently they invited Viviane Clay, a PhD student from the Institute of Cognitive Science in Osnabrück, to discuss her fascinating experiments in embodied reinforcement learning.
Her experimental reasoning can be summarized as follows:
- Observing nature can give us clues on how to perform a task efficiently
- We can observe how humans learn to gain insight into artificial intelligence
The full video is long but quite interesting, so perhaps this quick summary will convince you to watch the whole thing.
How we learn changes how we think
Clay starts by contrasting Piaget’s stages of mental development in humans with the single stage of learning in machines:
Humans learn different systems, paradigms and representations of the world by interacting with it. We take in huge amounts of data over years before we start writing essays, and early stages of thought are preoccupied with things like object permanence and cause-and-effect, concepts that gradually build upon and reinforce each other.
In contrast, our machine learning systems are built, trained on a large set of data, and either explicitly told what is right and wrong or left to categorize the data on their own. While humans are embodied, machines generally lack a body or any multi-faceted way of interacting with their environment.
Our current AI networks are great, but can often be tripped up by noise. Clay highlights the importance of context in object recognition:
Adding a small amount of noise to a picture of a panda produces an image we’d still categorize as an herbivorous bear, but completely throws off the classification network.
Similarly, the network on the right is misled by local features: textures or patterns in non-squirrel parts of the image may be similar to sea lions or rocking chairs.
Neural networks generally tend to over-rely on local features (such as texture and pattern) instead of global features, like shape or context.
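The panda example can be reproduced in miniature. The sketch below is not the network from the talk: it assumes a toy linear classifier and uses the fast gradient sign method, one well-known way to build such noise (whether the slide used it is an assumption). The names `score` and `perturb`, the weights, and the pixel values are all hypothetical. It shows how a change too small to notice in any single pixel can add up across many pixels and flip the decision.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def score(weights, pixels):
    """Linear classifier score; positive means "panda" in this toy setup."""
    return sum(w * p for w, p in zip(weights, pixels))


def perturb(weights, pixels, epsilon):
    """Move each pixel by at most epsilon against the score's gradient.

    For a linear score the gradient with respect to the pixels is just
    the weight vector, so the step is -epsilon * sign(weight).
    """
    return [p - epsilon * sign(w) for w, p in zip(weights, pixels)]


# Hypothetical 1000-pixel "image" and weights, chosen only for illustration.
weights = [0.01 if i % 2 == 0 else -0.01 for i in range(1000)]
pixels = [0.51 if w > 0 else 0.49 for w in weights]

noisy = perturb(weights, pixels, epsilon=0.02)

print(score(weights, pixels) > 0)  # is the original classified as "panda"?
print(score(weights, noisy) > 0)   # same check after the perturbation
print(max(abs(a - b) for a, b in zip(pixels, noisy)))  # largest pixel change
```

Each pixel moves by only 0.02 on a 0-to-1 scale, yet the score changes sign because the tiny shifts all push in the same direction.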
Humans don’t make these same mistakes. We also learn quite differently from machines.
The difference in performance can then be logically attributed to the difference in learning; thus, to boost performance we need to make embodied, multi-stage AI.
Viviane’s experiments center on a maze-exploring 3D game environment, where the RL agent network navigates through the floors of a tower, interacting with various objects. A blue sphere increases the remaining time, and keys can be picked up to unlock doors that lead further through the level or up to the next floor of the tower.
The learning agent network receives two inputs: an RGB image of the game camera, and a vector representing level variables (time remaining, number of keys collected, etc).
The agent then can make decisions on how to rotate the camera and move around the environment. This is a fairly difficult task. The network is shallow (2 convolutional layers, 1 pooling, 2 dense) with 256 neurons and no real temporal memory, and rewards (time spheres, keys, progression doors) are quite sparse, showing up in less than 2% of image frames.
However, some structure and decision-making still emerges. Breaking down the various input:decision relationships with t-SNE dimensionality reduction shows some interesting clusters:
Red-circled dots are frames with level doors in the image. It’s important to note that these door-frames form clusters within existing action clusters, because a level door in the top right corner of the camera necessitates a different action than the same door in the top left corner.
This may seem minor, but it indicates that the network is learning locational invariance — correctly identifying a policy based on the same object in different spots within the input image.
To assist the navigational learner agent, Clay also designed supplementary autoencoder and classifier networks. The autoencoder takes the encoded state of the learning net (after it starts “thinking” about the RGB image as numbers within its neurons) and tries to decode & reconstruct the image. The classifier simply tries to classify “which objects are currently in this frame?”.
Clay breaks down the 256-neuron encoding with a heatmap showing the percent of total frames in which each neuron fired:
The most compelling part of these initial networks is the natural appearance of sparsity in the embodied agent network: very few neurons in each layer are ever active. Compare this to the neurons in the autoencoder and classifier.
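The per-neuron firing percentage behind a heatmap like this can be sketched as follows. This is a minimal illustration rather than Clay’s code: the function name `firing_percentages`, the firing threshold of zero (which assumes ReLU-style activations), and the toy data are all assumptions.

```python
def firing_percentages(activations, threshold=0.0):
    """Return, for each neuron, the percent of frames in which it fired.

    activations: list of frames, each a list of per-neuron activation values.
    A neuron counts as "fired" when its activation exceeds `threshold`
    (an assumption; zero suits ReLU-style units).
    """
    if not activations:
        return []
    n_frames = len(activations)
    counts = [0] * len(activations[0])
    for frame in activations:
        for i, value in enumerate(frame):
            if value > threshold:
                counts[i] += 1
    return [100.0 * c / n_frames for c in counts]


# Toy data: 4 frames, 5 neurons (hypothetical values).
frames = [
    [0.9, 0.0, 0.0, 0.2, 0.0],
    [0.7, 0.0, 0.0, 0.0, 0.0],
    [0.8, 0.0, 0.1, 0.0, 0.0],
    [0.6, 0.0, 0.0, 0.0, 0.0],
]
percentages = firing_percentages(frames)
print(percentages)  # [100.0, 0.0, 25.0, 25.0, 0.0]
```

In a sparse encoding, most entries of this list sit at or near zero while a handful are high, which is the pattern the heatmap makes visible.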
No regularization was performed on the network to encourage or enforce sparsity; it repeatedly arrived at this pattern by hunting for the optimal course of actions. This bodes well for Numenta’s theory of biological computing via Sparse Distributed Representations.
Jeff Hawkins noted that sparsity may be inherently desirable for this agent’s task, because it requires the context of an object to make optimal decisions (see the above t-SNE clustering where a door’s location/context influences the action/decision).
Encoding each image in a sparse representation allows more “room” to encode extra information (context) such as object position or multiple overlapping objects.
However, nothing inherently explains why sparsity developed beyond “it’s beneficial”, which amounts to an evolutionary, trial-and-error explanation.
The timeline of sparsity in each network is also interesting to look at. Clay graphs the Gini indexes of each network’s neuron activation percentages over time:
The Gini index here measures how unequally activity is spread across neurons (a proxy for sparsity) and shows the sharp-then-gradual development of a few neurons encoding most of the navigational decisions.
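For readers unfamiliar with the measure, here is a small sketch of the standard Gini coefficient applied to per-neuron activation percentages. That Clay used exactly this formula is an assumption; the function name `gini` and the sample values are illustrative only. A value of 0 means every neuron is equally active, and values approaching 1 mean a few neurons account for almost all activity.

```python
def gini(values):
    """Standard Gini coefficient of a list of non-negative numbers.

    Uses the sorted-rank form:
        G = 2 * sum(rank_i * x_i) / (n * sum(x)) - (n + 1) / n
    where rank_i runs from 1 to n over values in ascending order.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Hypothetical activation percentages for four neurons.
dense = [50.0, 50.0, 50.0, 50.0]   # every neuron equally active
sparse = [0.0, 0.0, 0.0, 80.0]     # one neuron does all the work

print(gini(dense))
print(gini(sparse))
```

With only four neurons the maximum possible value is 0.75, which the second example reaches; with 256 neurons the ceiling is much closer to 1.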
Curiosity and Recognition
In the next segment, Clay tests whether the navigational agent can learn without explicit object-reward supervision and move into the preoperational stage of learning.
She hooks the original agent network up to a state-prediction network, which receives the same image and vector input as the nav-net and tries to predict the next state.
The nav-net is rewarded/encouraged to take actions that aren’t predicted: to “explore unforeseen scenarios”. This is an interesting approach to encoding curiosity in a reinforcement learning agent, and prevents the navigator from simply staring at a wall continuously.
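A common way to turn “take actions that aren’t predicted” into a number is to use the predictor’s error as an intrinsic reward. The sketch below assumes mean squared error between the predicted and the actual next state; the talk summary does not specify the error measure, so that choice, the function name `curiosity_reward`, and the `scale` parameter are all assumptions.

```python
def curiosity_reward(predicted_state, actual_state, scale=1.0):
    """Intrinsic reward: mean squared error of the next-state prediction.

    A well-predicted transition (e.g. staring at a wall) earns almost
    nothing; a surprising one earns more.
    """
    if len(predicted_state) != len(actual_state):
        raise ValueError("states must have the same length")
    if not actual_state:
        return 0.0
    squared_error = sum(
        (p - a) ** 2 for p, a in zip(predicted_state, actual_state)
    )
    return scale * squared_error / len(actual_state)


# Hypothetical state vectors.
wall_reward = curiosity_reward([0.2, 0.2, 0.2], [0.2, 0.2, 0.2])
novel_reward = curiosity_reward([0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0])

print(wall_reward)   # 0.0
print(novel_reward)  # 0.5
```

Because the predictor keeps training on what the agent sees, a scene that was surprising once stops paying out, which pushes the agent toward new parts of the maze.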
A task that children at preoperational stages carry out is “fast mapping” of new objects to semantic labels: learning what an apple is after seeing a couple of apples.
In Viviane’s maze-exploring task, the learner ideally identifies objects such as doors or time spheres. Without the explicit supervision of the autoencoder and classifier networks, though, she takes a different approach.
The idea is that we can infer what a “door” looks like to the network by checking for overlapping neuron spikes over multiple frames with doors in them.
She begins by manually identifying frames where objects appear, recording the connections between neurons that activate in response to those frames. Combining the visualizations of those connections allows for a clearer picture to emerge:
From the aggregate connections of N frames containing a door, she takes the top M neuron connections (and the weights of those connections) as the measurable concept of that object. Since a door shows up in each component image, the connections that show up the most are probably encoding for “door”.
We can then compare new frames to each “object concept” to classify on the fly by comparing connections and weights for overlap.
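The aggregate-then-compare idea can be sketched roughly as follows. This is a simplified stand-in, not Clay’s implementation: it assumes a “connection” is a pair of neurons active in the same frame, that its weight is the number of labeled frames in which the pair appears, and that a new frame is scored by the weighted share of the concept’s pairs it contains. The function names and neuron indices are hypothetical.

```python
from collections import Counter
from itertools import combinations


def coactive_pairs(active_neurons):
    """All unordered pairs of neurons active together in one frame."""
    return set(combinations(sorted(active_neurons), 2))


def build_concept(example_frames, top_m):
    """Keep the top_m pairs that co-occur most often across labeled frames."""
    counts = Counter()
    for active in example_frames:
        counts.update(coactive_pairs(active))
    return dict(counts.most_common(top_m))


def concept_overlap(concept, active_neurons):
    """Weighted fraction of the concept's pairs present in a new frame."""
    total = sum(concept.values())
    if total == 0:
        return 0.0
    present = coactive_pairs(active_neurons)
    matched = sum(w for pair, w in concept.items() if pair in present)
    return matched / total


# Three hypothetical labeled "door" frames, each a set of active neuron ids.
door_frames = [{3, 17, 42}, {3, 17, 99}, {3, 17, 42, 7}]
door_concept = build_concept(door_frames, top_m=3)

print(door_concept)
print(concept_overlap(door_concept, {3, 17, 42, 120}))  # contains every pair
print(concept_overlap(door_concept, {5, 8}))            # contains none
```

A frame would then be labeled with whichever object concept gives the highest overlap, possibly subject to a minimum threshold (also an assumption).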
What’s interesting is that this manual example approach works with remarkably few labeled images.
Most object recognition networks need thousands, if not millions of input images to achieve desirable performance.
However, Clay’s “fast mapping” approach reaches respectable performance with as few as 3 examples, making it an interesting answer to the question of few-shot learning.
She visualizes the object recognition’s sparse patterns:
What I found particularly interesting about Clay’s experiments is the natural occurrence of sparsity and possible connections to embodiment.
The navigational RL agent network developed sparsity levels similar to temporal memory networks simply by interacting with the environment, although it lacked any specific constraints or memory.
In comparison, the classifier and autoencoder networks were built with the same architecture, yet yielded very different neuron firing frequencies.
To me, this suggests that some tasks — such as navigation with vision and a separate body — may lend themselves to more sparse encoding of information.
The 3D camera environment, where the input image is a 3rd-person view separate from the actual “runner” interacting with the maze, is also an interesting concept, because it adds another layer of depth to the concept of embodiment.
This is brought up early on in the research meeting, and Jeff hypothesizes that the learner would perform just as well in a 1st-person view with only one rotational direction to consider.
However, Kevin Hunter notes that having a 3rd-person rotational camera combined with a controllable “body” in view is similar to how humans have a field of view that can include our own controllable body.
He likens the concept to a baby holding its fist in front of it, then moving its head and gaining an understanding that the fist is a separate object from the “viewing self” behind its eyes, yet a controllable part of the “body self” overall.
So there may be an unforeseen connection between multi-faceted sensorimotor pathways and effective learning within complex environments.
Embodied AI may be able to develop patterns of behavior, in ways we have yet to predict, as an agent optimizes its concurrent use of different output controls. Adding memory, such as in HTM networks, may also lead to interesting results.
Clay’s next steps include extracting different types of objects (open/closed doors, keys in different screen positions), and hooking up the fast-mapping output back to the net to aid in decision-making.