AS021 » Reinforcement Learning on FPGA for Robotics
Reinforcement learning (RL) allows systems to learn from their own experience, by trial and error. In robotics, where dynamics are complex and environments are unknown, RL is a powerful way to obtain control and decision policies. This project aims to implement different RL methods on the FPGA inside an Intel SoC, in order to take advantage of the parallelism that programmable logic offers.
The way robots follow trajectories, move their joints, or avoid obstacles has traditionally been based on mathematical models and prior characterization of their dynamics, their sensors and actuators, and the environment where they operate. Even so, their movements are not always accurate enough, and their decisions are not always the best.
On the other hand, humans (and many animals in general) learn how to move by trying, correcting their movements as errors are made. In the beginning, babies do not know how to walk and fall down a lot, but as they keep trying they fall less, learning the correct way to distribute their weight when they move and the appropriate trajectory of the feet when walking. From the point of view of the controller (the brain), it interacts with the system (the body) and the environment (gravity, floor, etc.), and learns the correct movements (actions) to perform in each situation.
Reinforcement Learning (RL) is a computational approach that aims to imitate this learning method. Given a numerical reward that represents how well the system is behaving, the system tries to maximize it, not only in the present but also into the future. This maximization is achieved by learning the correct action to take in each possible state of the environment; in other words, a policy is built that dictates how to behave in each of the situations the agent will face.
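To make "maximizing reward not only in the present but also into the future" concrete, RL commonly uses the discounted return, where rewards further in the future are weighted down by a discount factor γ < 1. A minimal illustrative Python sketch (the helper name `discounted_return` is ours, not part of the project's hardware design):

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted return G = r0 + gamma*r1 + gamma^2*r2 + ...
    Accumulated backwards from the last reward for numerical simplicity."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three consecutive rewards of 1.0 with gamma = 0.9:
# G = 1 + 0.9 + 0.81 = 2.71, so later rewards still matter, just less.
g = discounted_return([1.0, 1.0, 1.0], gamma=0.9)
```

A larger γ makes the agent more far-sighted; γ = 0 would make it greedy for the immediate reward only.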
RL is a widely used machine learning method today, and the idea of this project is to implement it on the FPGA contained in an Intel SoC, taking advantage of the benefits an FPGA offers. It could contribute to accelerating the learning process and involves the development of specialized hardware for this task, focused on robotics. In addition, it will be tested with real mobile robots.
Reinforcement Learning Background
In the Reinforcement Learning problem, the agent receives one state of the environment and takes one action at each time step. This action can be based on its own experience or on a decision to explore. When the action is applied to the environment, it returns a numeric reward that represents how good the action taken was, and the environment moves to a new state, as shown in the figure below.
The idea is to generate the best policy: taking, according to the current state, the action that will generate the best reward in the future. In this way, the objective is to maximize the reward. The policy can be modeled as a table that maps each state to a Q value for each possible action, or as a neural network that converts the current state into the same Q values. The action with the biggest Q value is the best one to take, and taking it is called "exploiting". Even so, the agent does not always take the best action, because it needs to "explore" what happens if it takes unknown actions in the current state; this exploration is what ensures the maximum long-term reward.
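The exploit/explore trade-off described above is often implemented as an ε-greedy rule: with probability ε pick a random action (explore), otherwise pick the action with the biggest Q value (exploit). A small Python sketch, assuming a Q-table row given as a list of per-action values (the function name `select_action` is illustrative, not from the project):

```python
import random

def select_action(q_row, epsilon, rng=random):
    """epsilon-greedy selection over one Q-table row (one value per action)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                      # explore: random action
    return max(range(len(q_row)), key=q_row.__getitem__)      # exploit: argmax

q_row = [0.1, 0.7, 0.3]                 # Q values of 3 actions in the current state
action = select_action(q_row, epsilon=0.0)   # epsilon = 0 -> pure exploitation -> action 1
```

In practice ε is often decayed over training, so the agent explores heavily at first and exploits more as its Q estimates improve.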
In Q-learning (one of the most widely used RL methods), the update rule is:

Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') - Q(s, a) ]

where α is the learning rate, γ is the discount factor, r is the reward received, and s' is the next state.
This project aims to develop a "Reinforcement Learning Core", with tabular and neural network versions on FPGA, implementing it on an Intel Cyclone V SoC device (DE10-Nano Kit). RL can be used to learn policies from low-level control up to decision-making processes, so it can be applied in almost all robotic sub-systems. The initial platform to test the project is a balancing robot (like the one shown below), and the idea is to develop the following capabilities in it using Reinforcement Learning:
The following image presents the general block diagram of the project:
Reinforcement Learning Core
This block will implement the update of the Q-table, the fully connected neural network, or the convolutional neural network, in order to generate the correct policy. The specific method will depend on the test target and the problem at hand. This core can be a peripheral of the HPS mapped on the FPGA, or an independent module that shares information through the main memory.
Actuator Drivers
These blocks will contain the drivers for the possible robotic actuators: DC motors (PWM), smart servomotors (half-duplex TTL and RS-485), etc. They will be HPS peripherals mapped on the FPGA.
Sensor Drivers
These blocks will contain the drivers for the possible robotic sensors: accelerometers, lasers, etc. They will be HPS peripherals mapped on the FPGA. Some sensors can also be connected directly to the HPS using its own peripherals.
Hard Processor System (HPS)
The HPS will run the Embedded Linux available for that platform. Its role is to manage the beginning and the end of training, while the whole training process itself is performed in the RL core. The HPS will also monitor the training process, generating learning curves and reporting statistics.
Intel FPGA-SoC devices offer a set of advantages for the development of artificial intelligence algorithms. Their flexibility lets you control every aspect of your design, and their parallelism increases throughput when there are large amounts of data to process. In addition, the ARM Cortex-A9 based HPS is the perfect complement for running high-level software and general-purpose operating systems without losing performance.
FPGA-SoC devices are also a powerful alternative for implementing computationally expensive applications like neural networks, which are the basis of many artificial intelligence methods. Besides, their low power consumption makes them an ideal technology for mobile robotics applications.