Self-flying Drones and Reinforcement Learning

“just like the simulations”

Mark Cleverley
7 min read · Jun 21, 2020

I tried flying a small quadcopter drone once. I crashed it into a cat (the cat was fine). They’re pretty tricky.
I came across a paper (Hwangbo et al.) the other day, in which some clever robotics/AI researchers taught a neural network how to fly a quadcopter. Like every other time I've been outperformed by a robot, I wanted to know exactly how I was defeated.


The results were remarkable. Once the drone is trained, you tell it “Go there” and it goes there with alarming efficiency. It stabilizes near-perfectly after being thrown randomly, and the researchers mention that performance is mainly limited by hardware and room size.

This isn’t Skynet-level thinking quite yet, but they’ve achieved a functional drone that can be given a target location and, with no further input or control, will reach that location with above-human speed and precision.

This is no small task, yet the ‘pilot’ learned very differently from human pilots. It was given no prior information on aerodynamics or angular drag; the researchers simply ran their algorithm through a bunch of simulated flights, and the would-be pilot improved a little with every crash.

This was achieved through reinforcement learning: an area of machine learning where a robot ‘agent’ interacts with its environment, receives a positive or negative reward, and adjusts its behavior accordingly.

Trial and Error

RL actors, or agents, learn like kids do. They wander around their environment interacting with things seemingly at random, something good or bad happens, and they (probably) associate “good” or “bad” with that action for next time. Do this enough and you’ll start to approach logical behavior.
The basic components are an environment with measurable states and encodable rewards.


The simplest example is a “point-grid” agent. The environment is a 2D (x,y) plane, with a point indicating the robot’s position and another point marking the target destination. Each unique combination of those two points is a state.
The agent can take a step in any direction, but doesn’t know what to do — it might as well be blind.

Let’s say the initial state is bot at (1,1) and destination at (2,3). It takes a random action and steps down to (1,0). Suddenly the bot receives the digital equivalent of a rolled-up newspaper to the nose.
The reward function calculated the bot’s pre- and post-step distances to the target, and since the action took the bot further away from its goal, the bot gets a penalty instead of a reward.
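
That reward rule fits in a few lines of Python. This is only a minimal sketch, not code from the paper: the function names, the choice of Euclidean distance, and the +1 / -1 / 0 reward values are all assumptions made for illustration.

```python
import math

def distance(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def step_reward(before, after, target):
    """Compare the bot's distance to the target before and after a step.

    Returns +1 if the step moved the bot closer, -1 if it moved it
    farther away, and 0 if the distance is unchanged. These values are
    illustrative assumptions; any reward scale would do.
    """
    d_before = distance(before, target)
    d_after = distance(after, target)
    if d_after < d_before:
        return 1
    if d_after > d_before:
        return -1
    return 0

# The example from the text: bot at (1, 1), target at (2, 3), step down to (1, 0).
print(step_reward((1, 1), (1, 0), (2, 3)))  # prints -1: the rolled-up newspaper
```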

“Spare the rod and spoil the robot”. The agent records the initial state, action taken and reward in a Q-table: a table where rows are all possible states and columns are all possible actions.

What has the bot learned from this penalty?
“At (1,1), when I want to get to (2,3), stepping down is bad.”

At the Q-table’s row[ initial(1,1), destination(2,3) ] and column[ down ], the bot records a negative number. This number varies depending on the complexity of the environment and how you want the bot to learn.
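
One way to picture that bookkeeping is a dictionary keyed by (state, action). The sketch below is a simplified illustration, not the paper's method: the names `q_table` and `record`, the penalty of -1, and the `learning_rate` of 0.5 are all assumptions.

```python
from collections import defaultdict

ACTIONS = ("up", "down", "left", "right")

# Q-table: maps (state, action) -> learned value. Unseen entries default to 0.0.
# A state is the pair (bot position, destination), as in the text.
q_table = defaultdict(float)

def record(q_table, state, action, reward, learning_rate=0.5):
    """Nudge the stored value for (state, action) toward the observed reward.

    learning_rate is an assumed hyperparameter: 1.0 would overwrite the old
    value, smaller values blend new experience with what was already stored.
    """
    key = (state, action)
    q_table[key] += learning_rate * (reward - q_table[key])

state = ((1, 1), (2, 3))  # bot at (1, 1), destination at (2, 3)
record(q_table, state, "down", reward=-1)

# Show the whole row for this state: only "down" has been tried so far.
for action in ACTIONS:
    print(action, q_table[(state, action)])
```

Only the "down" entry changes (to -0.5 here); the other three stay at zero until the bot tries them. Full Q-learning also adds a discounted estimate of the next state's best value to the reward, which this one-step sketch leaves out.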

But the bot doesn’t know what to do at (1,0) either. Let’s say it randomly steps up. It gets a little reward and thinks “Ah, if I’m at (1,0) stepping up is good”. It’s now back where it started, and it has a bit of knowledge from the previous penalty. It doesn’t know if left, right or up will take it closer to (2,3) from (1,1), but it knows that “from here down = bad”, so it probably won’t go down this time.

This is learning, in the same way that an eager toddler picking up a grumpy, sharp-toothed hamster is learning. Run the simulation enough and the Q-table begins to gain significant knowledge of the rewards of possible actions in various states, and you have a bot that can move towards a destination without being able to ‘see’ like we do.
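
Here is roughly what "run the simulation enough" could look like as code. It is a sketch using the standard tabular Q-learning update, with several assumptions of my own: a 5x5 grid, a reward equal to the change in Manhattan distance, and hand-picked values for `alpha`, `gamma` and `epsilon`. All function names are hypothetical.

```python
import random
from collections import defaultdict

ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
GRID = 5  # assumed grid size: coordinates run from 0 to 4

def move(pos, action):
    """Apply an action, clamping the bot inside the grid."""
    dx, dy = ACTIONS[action]
    return (min(max(pos[0] + dx, 0), GRID - 1),
            min(max(pos[1] + dy, 0), GRID - 1))

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def train(target, episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Fill a Q-table by trial and error from random starting positions."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(episodes):
        pos = (rng.randrange(GRID), rng.randrange(GRID))
        for _ in range(50):  # cap on steps per episode
            if pos == target:
                break
            state = (pos, target)
            if rng.random() < epsilon:   # sometimes act at random (explore)
                action = rng.choice(list(ACTIONS))
            else:                        # otherwise use what is known (exploit)
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt = move(pos, action)
            # +1 for getting closer, -1 for moving away, 0 for bumping a wall.
            reward = manhattan(pos, target) - manhattan(nxt, target)
            best_next = max(q[((nxt, target), a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            pos = nxt
    return q

def greedy_path(q, start, target, max_steps=20):
    """Follow the highest-valued action from each state until the target."""
    pos, path = start, [start]
    while pos != target and len(path) <= max_steps:
        state = (pos, target)
        pos = move(pos, max(ACTIONS, key=lambda a: q[(state, a)]))
        path.append(pos)
    return path

q = train(target=(2, 3))
print(greedy_path(q, (1, 1), (2, 3)))
```

After training, following the best-valued action from (1,1) should walk the bot to (2,3) without any wasted steps, even though it never "sees" the grid.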

Higher Learning

The quadcopter model is more complex than a 2D grid agent, but the cycle is the same:

  1. Act
  2. Reinforce
  3. Learn

For the training simulation, Hwangbo’s team used two paired neural networks: a “Value” network that estimates how good the current state is relative to the target state, and a “Policy” network that actually flies the drone. The Value net trains the Policy net towards favorable actions.

In place of Q-tables, the base layer of each network receives state information including the drone’s orientation, position, and velocity. Value has a one-node output layer — a reward/penalty — and Policy has four nodes to control each rotor’s engine independently.
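
To make the shapes concrete, here is a tiny pure-Python sketch of two such networks: same input, one output for Value, four for Policy. The 18-number state layout (a flattened 3x3 rotation matrix plus position, linear velocity and angular velocity), the two hidden layers of 64 tanh units, and the random untrained weights are all assumptions for illustration, and the helper names are hypothetical.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def make_mlp(sizes, seed):
    """Build a list of layers, e.g. sizes=[18, 64, 64, 4]."""
    rng = random.Random(seed)
    return [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def forward(mlp, x):
    """Run the input through the network: tanh on hidden layers, linear output."""
    for i, (weights, biases) in enumerate(mlp):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(mlp) - 1:
            x = [math.tanh(v) for v in x]
    return x

STATE_DIM = 18  # assumed: 9 (rotation matrix) + 3 position + 3 velocity + 3 angular velocity
HIDDEN = 64     # assumed hidden layer width

value_net = make_mlp([STATE_DIM, HIDDEN, HIDDEN, 1], seed=0)   # one number: how good is this state?
policy_net = make_mlp([STATE_DIM, HIDDEN, HIDDEN, 4], seed=1)  # four numbers: one per rotor

# A drone sitting level (identity rotation), 1 unit below its target, at rest.
state = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0,  # orientation
         0.0, 0.0, -1.0,                               # position relative to target
         0.0, 0.0, 0.0,                                # linear velocity
         0.0, 0.0, 0.0]                                # angular velocity

print(len(forward(value_net, state)), len(forward(policy_net, state)))  # prints: 1 4
```

The weights here are random, so the outputs mean nothing yet; training is what turns those four numbers into sensible rotor commands.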

Figure: Value and Policy neural network structure. The base layers of both networks encode physical state data.

The exploration strategy is particularly interesting. For each learning iteration, the team generated a series of branching trajectories (each a series of states) and compared the “tail costs” or values of each trajectory’s final state.
To make sure most of the space gets explored, each iteration starts with a random state — angles, velocity, position, target — and the bot sets off on an initial trajectory by following its current Policy.

Figure: Visualizing flight trajectories with noise-induced branching. Each dot is a ‘time-step’ where the state is updated and can be fed to the Value network.

But only doing what you know won’t get you anywhere, so the team introduced Gaussian noise at ‘junctions’ to branch off randomly before following the policy again. The end states of these branches can then be compared against each other to determine “which policy offered better state values?” and the Policy network is adjusted accordingly.
There’s a remarkable amount of exploration strategy optimization to be done when modeling real-world tasks, but keeping a relatively long branch length allowed the networks to limit bias and avoid getting stuck in local minima.
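
The branching idea can be shown with a toy problem. The sketch below is not the paper's algorithm: it swaps the quadcopter for a point sliding along a line, the Policy network for a fixed hand-written controller, and the Value network for a simple squared-distance cost. Every name and number in it is an assumption for illustration.

```python
import random

DT = 0.05  # assumed simulation time step

def step(state, thrust):
    """Advance a 1D point mass by one time step."""
    pos, vel = state
    vel = vel + thrust * DT
    pos = pos + vel * DT
    return (pos, vel)

def policy(state):
    """Stand-in for the Policy network: a deliberately mediocre controller."""
    pos, vel = state
    return -1.0 * pos - 0.5 * vel

def tail_cost(state):
    """Stand-in for the Value network: squared distance from rest at the target (0)."""
    pos, vel = state
    return pos * pos + vel * vel

def rollout(state, steps, first_action_noise=0.0):
    """Follow the policy for `steps` steps, adding noise to the first action only."""
    for i in range(steps):
        action = policy(state) + (first_action_noise if i == 0 else 0.0)
        state = step(state, action)
    return state

rng = random.Random(0)
start = (rng.uniform(-1, 1), rng.uniform(-1, 1))  # random initial state
junction = rollout(start, steps=10)               # initial trajectory, up to a junction

# Noise-free continuation from the junction, for comparison.
baseline = tail_cost(rollout(junction, steps=40))

# Branch off with Gaussian noise on one action, then follow the policy again.
branches = []
for _ in range(8):
    noise = rng.gauss(0.0, 1.0)
    end_state = rollout(junction, steps=40, first_action_noise=noise)
    branches.append((tail_cost(end_state), noise))

best_cost, best_noise = min(branches)
print("baseline tail cost:", baseline)
print("best branch tail cost:", best_cost, "with noise", best_noise)
```

If a branch ends with a lower tail cost than the noise-free baseline, that hints the policy should have pushed a bit more in the direction of that noise at the junction, which is the kind of signal used to adjust the Policy network.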

The strategy makes decent sense: If your current trajectory isn’t working, change course and see what happens. What really helped is that the mad lads wrote their own simulation software in C++ and optimized the value calculations with some terrifically complex Singular Value Decomposition functions.

This paired-network approach allows the pilot-bot to effectively ‘learn’ what to do in certain states. Over many iterations, it figures out things like “If the target is above me, increasing all four rotors’ thrust is a good move” and “Oh no I’m upside down, time to increase two adjacent rotors and decrease the other two”.
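
Those two moves are easy to check with a little arithmetic. The sketch below assumes an X-shaped rotor layout, a unit arm length, and sign conventions picked for the example; it ignores yaw entirely. None of it comes from the paper, and the function name is hypothetical.

```python
def body_effects(front_left, front_right, rear_right, rear_left, arm=1.0):
    """Total thrust and roll/pitch torques for an X-layout quadcopter (assumed layout).

    Example conventions: positive roll torque means the left side lifts,
    positive pitch torque means the front lifts.
    """
    total = front_left + front_right + rear_right + rear_left
    roll = arm * ((front_left + rear_left) - (front_right + rear_right))
    pitch = arm * ((front_left + front_right) - (rear_left + rear_right))
    return total, roll, pitch

hover = (1.0, 1.0, 1.0, 1.0)

# "Target is above me": raise all four rotors equally.
climb = tuple(t + 0.25 for t in hover)

# "I'm tilted": raise the two left rotors and lower the two right ones.
tilt = (1.25, 0.75, 0.75, 1.25)

print(body_effects(*hover))  # prints (4.0, 0.0, 0.0)
print(body_effects(*climb))  # prints (5.0, 0.0, 0.0): more lift, no rotation
print(body_effects(*tilt))   # prints (4.0, 1.0, 0.0): same lift, pure roll torque
```

Raising all four rotors changes only the total thrust, while trading thrust between adjacent pairs leaves the total alone and rotates the body. The Policy network has to discover both facts from crashes alone.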

Here’s a clip from the researchers’ video demonstrating the different flight paths during training.


“We used 512 initial trajectories and 1024 branching trajectories with noise depth of 2 which corresponds to 1.0 million time steps per iteration. Although this number seems high, it took less than ten seconds per iteration due to parallelization of rollouts.”

Six hours of decent computation power and the simulation’s looking good. But can the robot pilot fly in real life?

Yes. Yes it can.


The team loaded the Policy net onto an Ascending Technologies Hummingbird and tracked state updates with a high-speed motion capture system. The paper mentions that the model didn’t account for anything beyond four-rotor thrust (no aerodynamic drag, no air resistance close to the ground), yet it still flew like a dream, and stabilized well beyond expectations.

While extolling the virtues of autonomous flying robots, it’s important to acknowledge the limitations. This model didn’t account for things that Skynet might encounter in the real world, like wind, rain or geese.

However, the principle’s the same. If you wanted to teach this bot to avoid geese, you need only simulate some geese & give the drone a video camera that can recognize a goose mid-flight. The rest is computation.

Here’s the full video; demonstrations begin around 3 minutes in.

First Forays into RL

I was quite excited by their new simulation software, but it’s designed for Ubuntu and installing cmake & g++ on Mac is not a good time.
Fortunately, there’s a fantastic Python framework for developing and deploying RL algorithms called Garage, which is as easy to install as pip install garage. Using it is somewhat complex, however, so if you’re looking to learn more about RL, check out OpenAI’s Spinning Up tutorial.


