This article explores Reinforcement Learning (RL) and its applications, including games (AlphaGo, AlphaStar), autonomous cars, and robotics. It then presents an RL demonstrator built in the Aqsone Lab, covering learning through trial and error, virtual simulation, and a virtual moon-landing example, and highlights the challenges and ambitions in various sectors.

Of all the Artificial Intelligence topics, Reinforcement Learning (RL) is probably the most promising in terms of its capabilities across various applications, but also the most difficult to implement.

These are the methods that allowed DeepMind's AlphaGo to beat the best players at Go, a board game with simple rules but an extremely large number of possible moves. DeepMind has since followed AlphaGo with AlphaStar, which has become almost unbeatable at the world-famous game StarCraft 2.

This type of method is also found in autonomous cars as well as in industrial robotics. In this article, we will see how to implement an RL algorithm, and examine its strengths and weaknesses, through a demonstrator created within the Aqsone Lab.

Learning through mistakes... and through success!

Training an autonomous car AI without virtual simulation would require wrecking millions of cars!

Reinforcement Learning is a form of machine learning that consists in placing an autonomous agent in an environment, with a specific objective about which it has no prior information. Through a system of sanctions and rewards predefined by the developer, this agent learns to do its best to fulfill the objective with an optimal score. You can immediately see the value of a simulated environment for the first phases of training: not only do we avoid material breakage, but we can also parallelize situations and accelerate the learning phase.
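To make this loop concrete, here is a minimal sketch of the agent-environment interaction using the OpenAI Gym interface (the environment name is just an illustration, and we assume the classic Gym API where step() returns four values):

```python
import gym

# Minimal sketch of the agent-environment loop described above.
# "CartPole-v1" is only an example; any Gym environment exposes the same interface.
env = gym.make("CartPole-v1")

observation = env.reset()   # the agent's initial view of the environment
total_reward = 0.0

done = False
while not done:
    # A trained agent would pick an action from its learned policy;
    # here we sample randomly just to show the interface.
    action = env.action_space.sample()

    # The environment applies the action and gives feedback:
    # a new observation, a reward (the "sanction or reward"), and a done flag.
    observation, reward, done, info = env.step(action)
    total_reward += reward

print("Episode score:", total_reward)
env.close()
```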

In the real world, every living being carries out this kind of learning naturally: the environment is given, and each action constantly produces feedback that allows the living being to adapt accordingly.

The key to Reinforcement Learning lies in how faithfully the environment is simulated, but also in how well the sanction and reward policy is designed.

Take the example above of an ant (our agent). It is in an environment with various obstacles (for example a spider) and various opportunities (such as food). These elements are defined in the ant's policy, with negative and positive scores. The objective is to reach a final position where the score is maximum (the house).

The agent can take actions to move in any direction, and its state is defined by its position at any given moment. Note that the environment can change (the spider moves).

The agent explores different paths, learns from each path taken, and eventually finds an optimal one (passing by the leaf, the bread, and finally the house).
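A toy version of this ant scenario can be written with tabular Q-learning. The grid layout, reward values, and hyperparameters below are our own illustrative assumptions, not taken from the article:

```python
import numpy as np

# Toy Q-learning sketch of the ant example. Grid size, cell positions,
# reward values and hyperparameters are illustrative assumptions.
SIZE = 4
SPIDER, FOOD, HOME = (1, 1), (2, 2), (3, 3)    # obstacle, opportunity, goal
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def reward(state):
    if state == SPIDER: return -10   # sanction: the spider
    if state == FOOD:   return +2    # small reward along the way
    if state == HOME:   return +10   # maximum score: the house
    return -0.1                      # small step cost encourages short paths

Q = np.zeros((SIZE, SIZE, len(ACTIONS)))
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for episode in range(2000):
    state = (0, 0)
    while state != HOME:
        # epsilon-greedy: mostly exploit what was learned, sometimes explore new paths
        a = np.random.randint(4) if np.random.rand() < epsilon else int(np.argmax(Q[state]))
        nxt = (min(max(state[0] + ACTIONS[a][0], 0), SIZE - 1),
               min(max(state[1] + ACTIONS[a][1], 0), SIZE - 1))
        # Q-learning update: nudge the estimate toward reward + discounted best future value
        Q[state][a] += alpha * (reward(nxt) + gamma * np.max(Q[nxt]) - Q[state][a])
        state = nxt
```

After enough episodes, following the highest-valued action in each cell traces the ant's optimal path to the house.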

This illustration is very simple, but there are more complex models that make it possible to simulate concrete cases.

The best policies in the best of environments

There is one catch: in a concrete application there are millions of ways to define policies and environments. This is where business knowledge comes into play, to properly define the limits, the objectives, and how to reach them (i.e. the processes, or the policy), in order to then model them correctly in a virtual environment.

The other advantage of the virtual environment is that its realism can be improved iteratively, by integrating more and more complex processes, which allows the agent to become more and more efficient in a real environment. Fortunately, there are many solutions for modeling environments, whether for robotics applications or for optimizing a process involving physical machines, for example. The OpenModelica simulator is one notable option.

The long-term objective is ambitious: to achieve results beyond human capabilities!

An example: the Aqsone lunar lander

Among the most common application cases are Atari-type games, with a fairly basic setup (few possible actions and an environment that changes little). We decided to work on the OpenAI Gym moon-landing game, which is more complex than the Atari games.

The objective is to place a lunar landing module between two flags, like Apollo on the Moon. The module is controlled using directional arrows that fire the thrusters (left, right and bottom). The landing must be gentle, otherwise the module crashes.

Here are all the parameters of our model (a short Gym sketch follows the list):

  • Agent: Lander with 3 thrusters (bottom, left and right) as well as 2 landing legs equipped with shock absorbers.
  • Environment: Lunar surface, whose gravity is modelled.
  • Possible actions: Activation of each thruster.
  • Policy:
    • If the lander touches down between the flags, it earns points.
    • If it leaves the zone, it loses points.
    • If it lands outside the flags or breaks its landing legs, it loses points.
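The standard version of this environment ships with Gym as LunarLander-v2; a short sketch of its interface, which our modified lander keeps (the comments summarize the standard environment's spaces and reward structure):

```python
import gym

# The standard lander this demonstrator is based on; the "Aqsone"-shaped
# variant described below keeps the same interface.
env = gym.make("LunarLander-v2")

print(env.observation_space)  # Box of 8 floats: position, velocity, angle, leg contacts
print(env.action_space)       # Discrete(4): do nothing, left, main, or right thruster

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
# "reward" encodes the policy above: points for approaching the pad and landing,
# penalties for drifting out of the zone, crashing, or wasting fuel.
```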

In order to increase the challenge and show the super-human capacity of a Reinforcement Learning algorithm, we decided to modify the design of the OpenAI Gym lunar lander and give it the shape of the word “Aqsone”. As a result, the center of gravity is no longer aligned with the central thruster, which requires using all the thrusters more subtly than with the standard lander.

Algorithm choice and environment parameters

For this project, we used a reinforcement method called Actor-Critic, which has the advantage of requiring little computing power.

The Actor-Critic method brings together an Actor network, which represents the agent, and a Critic network. The critic network determines the value associated with a given situation. For example, a situation where the lander is on its back will have a very low associated value, because it will almost certainly lead to a crash in the short term.

In addition to calculating the total score at the end of the game (policy-based method), the Actor-Critic method also assesses an intermediate score for each situation (value-based method). The actor can then know whether it is in a “good” or “bad” situation at any given moment.

For more technical details on this method, you can consult this article on Towards Data Science: https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f
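To make the two networks concrete, here is a minimal PyTorch sketch. The layer sizes and the shared trunk are our own choices for illustration, not necessarily those used in the demonstrator:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Minimal actor-critic: a shared trunk feeding two heads.

    The actor head outputs a probability for each of the 4 thruster actions;
    the critic head outputs a single number, the estimated value of the
    situation (e.g. very low when the lander is on its back).
    """
    def __init__(self, obs_dim=8, n_actions=4, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, n_actions)   # policy head
        self.critic = nn.Linear(hidden, 1)          # value head

    def forward(self, obs):
        h = self.trunk(obs)
        action_probs = torch.softmax(self.actor(h), dim=-1)
        state_value = self.critic(h)
        return action_probs, state_value
```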

After a few tests, we realized that the new lunar lander design had significantly increased its mass, which forced us to increase the power of the thrusters.
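In the Gym Box2D source, the thruster strengths are module-level constants read by the environment's step() function, so one way to strengthen them is to override those constants before creating the environment. Treat the constant names, their default values, and the scaling factor below as assumptions rather than the demonstrator's exact settings:

```python
from gym.envs.box2d import lunar_lander

# step() reads these module-level constants, so overriding them strengthens
# the thrusters to compensate for the heavier "Aqsone" hull.
# Defaults (13.0 and 0.6) and the x2 factor are assumptions for illustration.
lunar_lander.MAIN_ENGINE_POWER = 13.0 * 2.0
lunar_lander.SIDE_ENGINE_POWER = 0.6 * 2.0
```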

After setting the size of the network and defining the policy precisely, learning can start.

How the training unfolds

Let's go!

The lunar lander starts by making many mistakes but gradually learns the consequences of each of its actions. The algorithm progressively reinforces the series of actions that worked best, that is, the ones that earned it the most points.
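A condensed sketch of one training episode with an actor-critic update is shown below, reusing the `env` and `ActorCritic` definitions from the earlier sketches; the discount factor, learning rate, and episode count are illustrative:

```python
import torch

gamma = 0.99
model = ActorCritic()   # sketched earlier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for episode in range(5000):
    obs, done = env.reset(), False
    log_probs, values, rewards = [], [], []
    while not done:
        probs, value = model(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()   # early episodes: mostly mistakes
        obs, reward, done, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        values.append(value)
        rewards.append(reward)

    # Discounted returns: how many points each action ultimately earned.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Advantage: did the action do better than the critic expected?
    advantage = returns - torch.cat(values).squeeze(-1)
    actor_loss = -(torch.stack(log_probs) * advantage.detach()).sum()
    critic_loss = advantage.pow(2).sum()

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```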

After several thousand iterations, the final result is impressive: the Reinforcement Learning model alternates very quickly between the various thrusters (see video), at a speed unattainable by a human, to achieve an almost perfect landing.

We tried to land the module manually, to compare our performance with that of the machine. The result? We never managed to land our new module correctly between the flags. This illustrates the effectiveness of Reinforcement Learning algorithms.

We cannot help but marvel at the extraordinary capabilities of Reinforcement Learning algorithms. However, before placing a model in a real environment, we must not forget how much depends on a relevant simulation of that environment and on the policy put in place.
The range of Reinforcement Learning application cases is very broad, ranging from robotics (physical world) to the optimization of marketing methods (virtual world). We encourage you to contact us to discuss your use cases and to determine together whether Reinforcement Learning is the most appropriate technology to address them.
