Reinforcement Learning: A promising AI?

The article explores Reinforcement Learning (RL) and its applications, including games (AlphaGo, AlphaStar), autonomous cars, and robotics. It presents an RL demonstrator built in the Aqsone Lab, highlighting learning by trial and error, virtual simulation, and an example of a virtual moon landing, along with the challenges and ambitions in various sectors.


Of all the subjects in Artificial Intelligence, Reinforcement Learning (RL) is probably the most promising in terms of its capabilities across applications, but also the most difficult to implement.

These are the methods that allowed DeepMind's AlphaGo to beat the best Go players; Go is a board game with simple rules, but one whose combinations of moves are extremely numerous. DeepMind later followed AlphaGo with AlphaStar, which has become almost unbeatable in the world-famous game StarCraft 2.




This type of method is also found in autonomous cars as well as in industrial robotics.
In this article, we will see how to implement an RL algorithm, along with its strengths and weaknesses, through a demonstrator created within the Aqsone Lab.


Learning through mistakes… and through success!



Training a self-driving car AI without virtual simulation would require losing millions of cars!

This form of machine learning, Reinforcement Learning, consists of placing an autonomous agent in an environment, with a precise objective about which it has no prior information. The agent learns, via a system of sanctions and rewards predefined by the developer, to act as well as possible so as to fulfill this objective with an optimal score. We immediately see the benefit of a simulated environment for the first phases of training: not only do we avoid excessive hardware failures, we can also parallelize situations and accelerate the learning phase.
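As a minimal sketch, the agent–environment loop described above can be written in a few lines of Python. The environment here is a hypothetical toy, not a real simulator: the agent starts at position 0 on a line and must reach position 5, with a reward of +10 at the goal and a sanction of -1 per step.

```python
import random

class ToyEnv:
    """A minimal 1-D environment: the agent starts at position 0 and must
    reach position 5. Reaching the goal earns +10 (the reward); every
    other step costs -1 (the sanction)."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):          # action: -1 (left) or +1 (right)
        self.pos += action
        done = self.pos == 5
        reward = 10 if done else -1
        return self.pos, reward, done

def run_episode(env, policy, max_steps=50):
    state, total = env.reset(), 0
    for _ in range(max_steps):
        state, reward, done = env.step(policy(state))
        total += reward
        if done:
            break
    return total

random.seed(0)
env = ToyEnv()
random_score = run_episode(env, lambda s: random.choice([-1, 1]))  # untrained agent
greedy_score = run_episode(env, lambda s: 1)                       # agent that "learned" the goal
```

The gap between `random_score` and `greedy_score` is exactly what the reward signal lets the agent discover over many simulated episodes, in parallel if needed.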

In the real world, every living being does this learning naturally: the environment is predefined, and each action performed constantly provides feedback that allows this living being to adapt accordingly.

The key to Reinforcement Learning lies in the relevance of the simulation of the environment, but also in the relevance of the sanctions and rewards policy.

Illustration of the different components of RL

Consider the above example of an ant (our agent). It finds itself in an environment with different obstacles (e.g. a spider) and different opportunities (e.g. food). These elements are encoded in the ant's policy, with negative and positive scores. The objective is to reach a final position where the score is maximal (home).

The agent has the possibilities (actions) of going in any direction, and its state will be defined by its position at time t. Note that the environment may change (the spider moves).

The agent explores different paths, learns from each path taken and ultimately manages to find an optimal path (passing through the leaf, the bread and the house).

This illustration is very simple, but there are more complex models that can simulate concrete cases.
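To make the ant's trial-and-error search concrete, here is a sketch using tabular Q-learning, one of the classic RL algorithms, on a hypothetical 1-D corridor standing in for the illustration above: cell 3 plays the spider (-5), cell 5 the food (+2), cell 6 the home (+10, end of episode). All numeric values are illustrative.

```python
import random

REWARDS = {3: -5, 5: 2, 6: 10}   # spider, food, home
GOAL, ALPHA, GAMMA, EPS = 6, 0.5, 0.9, 0.2
Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in (-1, 1)}

def step(state, action):
    nxt = min(max(state + action, 0), GOAL)
    return nxt, REWARDS.get(nxt, -1), nxt == GOAL

def greedy(state):
    return max((-1, 1), key=lambda a: Q[(state, a)])

random.seed(1)
for _ in range(500):                               # training episodes
    s = 0
    for _ in range(100):                           # step budget per episode
        a = random.choice([-1, 1]) if random.random() < EPS else greedy(s)
        nxt, r, done = step(s, a)
        target = r + (0 if done else GAMMA * max(Q[(nxt, -1)], Q[(nxt, 1)]))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])  # temporal-difference update
        s = nxt
        if done:
            break

# After training, the learned greedy policy should walk straight home.
path, s = [0], 0
for _ in range(20):
    s, _, done = step(s, greedy(s))
    path.append(s)
    if done:
        break
```

The agent explores at random at first (the `EPS` fraction of moves), accumulates sanctions near the spider, and gradually concentrates on the path that maximizes the discounted score.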




The best policies in the best environments

Here is the catch: in a concrete application there are millions of ways to define policies and environments. This is where business knowledge comes into play, to clearly define the limits, the objectives and how to achieve them (i.e. the processes, or the policy), and then correctly model them in a virtual environment.

The other advantage of the virtual environment is that its realism can be improved iteratively, by integrating increasingly complex processes, which allows the agent to become more and more efficient in the real environment. Fortunately, there are many solutions for modeling environments, whether for robotics applications or for optimizing a process involving physical machines; we can cite in particular the OpenModelica simulator.

The long-term objective is ambitious: to achieve results beyond human capabilities!





The objective of RL: to achieve superhuman performance (note: "Rush Hour" video montage)


An example: The Aqsone lander

Among the most widespread application cases, we can cite Atari-type games, with a fairly basic environment (few possible actions and an environment that changes little). We decided to work on the OpenAI Gym moon landing game, which is more complex than the Atari games.

The objective is to place a lunar landing module between two flags, like Apollo on the Moon. The module is controlled using directional arrows to control the thrusters (left, right and down). The moon landing must be done smoothly, otherwise the module will crash.

Here are all the parameters of our model:

  • Agent: Moon landing device with 3 thrusters (bottom, left and right) as well as 2 landing gears equipped with shock absorbers.
  • Environment: Lunar surface, whose gravity is modeled.
  • Possible actions: Activation of each thruster.
  • Policy:
    • If the lander lands between the flags, it earns points.
    • If it leaves the area, it loses points.
    • If it lands outside the flags or breaks its landing gear, it loses points.
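As a sketch, this policy can be encoded as a terminal scoring function. The numeric values and flag positions below are purely illustrative, not OpenAI Gym's actual reward shaping:

```python
def lander_score(x, landed, crashed, out_of_bounds, gear_intact,
                 flag_left=-0.3, flag_right=0.3):
    """Toy terminal score mirroring the three policy rules above.
    x is the lander's horizontal position; all thresholds are illustrative."""
    if out_of_bounds:                            # left the area
        return -100
    if crashed or (landed and not gear_intact):  # crash, or broken landing gear
        return -100
    if landed and flag_left <= x <= flag_right:  # touchdown between the flags
        return +100
    if landed:                                   # soft landing, outside the flags
        return -50
    return 0                                     # episode not finished yet
```

In the real Gym environment the reward is denser (shaped at every time step, e.g. penalizing fuel use and distance to the pad), which speeds up learning considerably compared to a purely terminal score like this one.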

In order to add to the challenge and show the super-human capacity of a Reinforcement Learning algorithm, we decided to modify the design of the OpenAI Gym moon lander and give it the shape of the word “Aqsone”. Thus, the center of gravity is no longer aligned with the central thruster, which means using all the thrusters more subtly than with the standard lander.




Algorithm choice and environment settings

For this project, we used a reinforcement method called Actor-critic, which has the advantage of requiring little computing power.

The Actor-Critic method combines an Actor network, which represents the agent, and a Critic network. The Critic network determines the value associated with a given situation. For example, a situation where the lander is on its back will have a very low value, because it will almost certainly lead to a crash in the short term.

In addition to computing the total score at the end of the game (the policy-based side of the method), the Actor-Critic method also evaluates an intermediate score for each situation (the value-based side). The actor can thus know at any given moment whether it is in a "good" or "bad" situation.
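A minimal sketch of this actor/critic interplay, reduced to a hypothetical two-action toy problem rather than the lander (all parameters are illustrative): the critic tracks a baseline value of the situation, and the actor shifts its action preferences in proportion to the critic's "surprise" (the TD error).

```python
import math, random

# Toy problem: action 0 pays ~1.0 on average, action 1 pays ~0.0.
random.seed(0)
prefs = [0.0, 0.0]   # actor parameters: one preference per action
value = 0.0          # critic parameter: estimate of the expected reward
ALPHA_ACTOR, ALPHA_CRITIC = 0.1, 0.1

def softmax(p):
    e = [math.exp(x) for x in p]
    return [x / sum(e) for x in e]

for _ in range(2000):
    probs = softmax(prefs)
    action = 0 if random.random() < probs[0] else 1
    reward = random.gauss(1.0 if action == 0 else 0.0, 0.1)

    td_error = reward - value            # critic's surprise: was this better than expected?
    value += ALPHA_CRITIC * td_error     # critic update

    for a in (0, 1):                     # actor update: policy-gradient step
        grad = (1 - probs[a]) if a == action else -probs[a]
        prefs[a] += ALPHA_ACTOR * td_error * grad
```

After training, the actor's probability of choosing the better action is close to 1. The same feedback loop, with neural networks in place of these scalar parameters, drives the lander's learning.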





For more technical details on this method, you can check out this article on Towards Data Science.

After a few tests, we understood that the new design of the lander had significantly increased its mass, which forced us to increase the power of the thrusters.

After adjusting the size of the network and defining the policy precisely, learning can start.




Let's go!

The lander begins by making many mistakes but gradually learns the consequences of each of its actions. The algorithm progressively records the sequences of actions that worked best, that is to say which earned it the most points.

After several thousand iterations, the final result is impressive: the Reinforcement Learning model very quickly alternates the actions on the different thrusters (see video), at a speed unattainable by a human, to achieve an almost perfect landing.

We tried to land the module manually, to compare our performance to that of the machine. The result? We never managed to land our new module correctly between the flags, which illustrates the effectiveness of Reinforcement Learning algorithms.


These results showcase the extraordinary capabilities of Reinforcement Learning algorithms. However, we must not forget how much the relevance of the environment simulation, and of the policy put in place, matters before placing the model in a real environment.
The range of application cases in Reinforcement Learning is very wide, ranging from robotics (physical world) to the optimization of marketing methods (virtual world). We encourage you to contact us to discuss your issues and define together whether Reinforcement Learning is the most appropriate technology to address them.

