AP004 » Neuroevolved Binary Networks Accelerated by FPGA
With the explosive interest in the utilization of Neural Networks (NN), several approaches have emerged to make them faster, more accurate or more power efficient; one technique used to simplify inference models is the utilization of binary representations for weights, activations, inputs and/or outputs. This competition entry presents a novel approach to train Binary Neural Networks (BNN) from scratch using neuroevolution as its base technique (gradient-descent free), executed on Intel FPGA platforms to achieve better results than general-purpose GPUs.
Traditional NNs use different variants of gradient descent to train fixed topologies. As an extension of that optimization technique, BNN research has focused on the application of such algorithms to discrete environments, with weights and/or activations represented by binary values (-1, 1). The authors have identified that the most frequent obstacle in the approach taken by multiple BNN publications to date is the utilization of gradient descent, given that the procedure was originally designed to deal with continuous values, not with discrete spaces. Even when it has been shown that precision reduction (Float32 -> Float16 -> Int16) can train NNs to comparable accuracy, the problem resides in the adaptation of a method originally designed for continuous contexts to a different set of values, which creates instabilities at training time.
To tackle that problem, it is imperative to take a completely different approach to how BNNs are trained, which is the main proposition of this project: a new methodology to obtain neural networks that use binary values in weights, activations and operations, and is completely gradient free. A brief summary of the capabilities of this implementation:
• Use weights and activations as unsigned short int values (16 bits)
• Use only logic operations (AND, XOR, OR...), with no need for Arithmetic Logic Units (ALU)
• Calculate the distance between individuals with the Hamming distance
• Use evolutionary algorithms to drive the space search and network topology updates.
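The list above can be sketched in a few lines of C. This is illustrative only: the function names and the XNOR-style activation are assumptions for demonstration, not BiSUNA's exact operators; it shows how 16-bit encodings need nothing beyond logic operations, and how the Hamming distance compares two individuals.

```c
#include <assert.h>
#include <stdint.h>

/* Hamming distance between two 16-bit encodings: XOR marks the
 * differing bits, then a loop counts them. */
int hamming16(uint16_t a, uint16_t b)
{
    uint16_t diff = a ^ b;
    int count = 0;
    while (diff) {
        count += diff & 1u;
        diff >>= 1;
    }
    return count;
}

/* A binary "activation": weight and input interact through XNOR
 * (1 where bits agree), so no multiplier or ALU arithmetic is needed. */
uint16_t binary_activation(uint16_t input, uint16_t weight)
{
    return (uint16_t)~(input ^ weight);
}
```

On an FPGA both routines map directly to LUTs, which is the point of the bullet list: no DSP or floating point circuitry is involved.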
These substantial changes simplify the computing architecture needed to execute the algorithm, which matches natively the Logic Units in the FPGA; they also allow us to design processing elements that effectively adapt to the problem to be solved while remaining power efficient in terms of the units needed to deploy, because agents with unoptimized structures are automatically disregarded.
The proposed algorithm, Binary SUNA (SUNA with binary extensions), will be used to solve standard reinforcement learning challenges, which will be connected to an FPGA to solve them more efficiently, given that the architecture will match the evolved network at multiple stages, especially during training and inference. Performance gains between CPU, GPU and FPGA will be compared.
The convergence of three important areas of technology (advancements in semiconductor technology, data analysis and applied mathematics) has enabled the escalation of deep learning as a noteworthy research topic, with neural networks demonstrating their potential and effectiveness to solve complex problems not feasible beforehand.
Nevertheless, there are inherent limitations in its design, especially with non-differentiable functions or those with multiple local optima. Also, applications of such algorithms to FPGAs have been relegated to inference model execution, with limited research done at training time, an area where general-purpose GPUs have taken most of the market share, even though the former can be more power efficient and have higher throughput than the latter for multiple applications.
On the other hand, evolutionary algorithms, as an alternative optimization technique, continuously apply the same bio-inspired operations to a substantial number of individuals, called "agents", at the same time; this function takes most of the computation time of the execution and has multiple key features:
• Calculations are embarrassingly parallelizable if the application is well designed.
• Sharing updates and communication has low overhead, with examples of linear growth as more processors become available.
• Data required for computations can be cached more efficiently given the lack of backpropagation.
• Evolution has been proven by nature to perform wide exploratory behaviour in complex environments.
Therefore, by iteratively re-applying the procedure to multiple agents in parallel using dedicated processing units, it is possible to achieve more performance at scale; traditional evolutionary algorithms require multiple floating point operations for controlling the agent population, even though the processing time for them is very small.
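The gradient-free loop described above can be sketched as follows. This is a hedged illustration, not BiSUNA's actual operators: the toy fitness (number of set bits) stands in for the RL reward, and the mutate/select scheme is a minimal example of keeping only the higher-reward individuals.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define POP_SIZE 8

/* Toy fitness: number of set bits. BiSUNA's real reward comes from the
 * RL environment; this stand-in keeps the sketch self-contained. */
int fitness(uint16_t genome)
{
    int count = 0;
    while (genome) {
        count += genome & 1u;
        genome >>= 1;
    }
    return count;
}

/* One possible gradient-free loop: mutate copies of the current best
 * agent with single bit flips, and keep a child only when it is no
 * worse than the individual it would replace. */
void evolve(uint16_t pop[POP_SIZE], int generations, unsigned seed)
{
    srand(seed);
    for (int g = 0; g < generations; g++) {
        int best = 0;
        for (int i = 1; i < POP_SIZE; i++)
            if (fitness(pop[i]) > fitness(pop[best]))
                best = i;
        for (int i = 0; i < POP_SIZE; i++) {
            if (i == best)
                continue;
            uint16_t child = pop[best] ^ (uint16_t)(1u << (rand() % 16));
            if (fitness(child) >= fitness(pop[i]))
                pop[i] = child;
        }
    }
}
```

Note that every agent in the inner loop is independent of the others, which is exactly the embarrassingly parallel property listed above.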
For this work, we employed the algorithm named BiSUNA, Binary Spectrum-diverse Unified Neuroevolutionary Architecture, an extension of publication  that uses only binary operations/values in the creation of non-sequential neural networks in charge of solving Reinforcement Learning (RL) problems. A high level overview of how BiSUNA works can be visualized in Fig 1.
Fig 1. Basic structure of BiSUNA evolutionary steps
It is well known that floating point circuitry requires a substantial amount of hardware resources; this proposal combines BNN characteristics with the parallelization properties of evolutionary routines, taking a step forward to use only binary operations to drive the search space and taking full advantage of what FPGAs offer in their architecture.
Intel’s FPGAs are excellently suited for neuroevolutionary computations because of the following features: with the Cyclone V’s distributed LUTRAMs, the algorithm is able to store each agent’s bit encoding very efficiently, which in turn allows an efficient pipelined process to exploit parallelism.
Another desired characteristic implemented was “Agent Tiling”, which enables quick data access by each execution element: agent data tiles are first packed and transferred from off-chip DRAM to on-chip BRAM, allowing quick on-chip access later. Lastly, thanks to the OpenCL 1.0 support offered by the target board, it was possible to bring the C++ code already tested in BiSUNA  and transform it into C kernels that are executed on the OpenVINO Starter Kit platform.
Implementing the BiSUNA neuroevolutionary algorithm , a battery of tests with multiple OpenAI reinforcement learning environments  took place: Mountain Car, Duplicated Input, Roulette, NChain and Copy. These represent just a small sample of BiSUNA’s possible applications, where FPGA acceleration allows researchers to test more examples and verify its generalization abilities in new areas where neural networks have not been applied yet. Those interested in DNNs will be able to use this framework to train an adapted model that interacts with dynamic conditions and learns from its rewards.
Taking a bottom-up approach, we first explain how the OpenCL kernel operates (Fig 2); more details can be found in section “Function Description”.
BiSUNA uses a fixed number of “agents” to explore the RL environment, which are formulated as possible solutions to the problem. Therefore, every agent must perform three basic operations: Process Input, Process Primers/Control and Process Remaining/Output, all of which refer to neurons in the mesh. It is also possible to visualize that each agent must copy the information situated in global memory to its local region to improve access time.
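As an illustration of that per-agent flow, the sketch below copies an agent from "global" memory into a local buffer and then runs the three stages using logic operations only. The struct layout, neuron count and stage logic are assumptions for demonstration, not the kernel's actual data layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N_NEURONS 8

typedef struct {
    uint16_t state[N_NEURONS];
    uint16_t weights[N_NEURONS];
} Agent;

/* Copy the agent from "global" memory into a local buffer, then run
 * the three stages with logic operations only. */
uint16_t process_agent(const Agent *global_agent, uint16_t input)
{
    Agent local;
    memcpy(&local, global_agent, sizeof(Agent)); /* global -> local */

    /* Process Input: XNOR of the observation with the input weight. */
    local.state[0] = (uint16_t)~(input ^ local.weights[0]);

    /* Process Primers/Control: propagate through the inner neurons. */
    for (int i = 1; i < N_NEURONS - 1; i++)
        local.state[i] = (uint16_t)(local.state[i - 1] | local.weights[i]);

    /* Process Remaining/Output: produce the agent's action encoding. */
    return (uint16_t)(local.state[N_NEURONS - 2] ^ local.weights[N_NEURONS - 1]);
}
```

In the real kernel the local copy is what makes the pipeline effective: each stage reads on-chip memory instead of round-tripping to DRAM.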
Fig 2. Block Diagram of the Processing elements, buffer caching and communication inside the OpenCL Kernel.
Following an FPGA design pattern, it was decided to use a Pipelined Single Task execution model, which provided development flexibility while offering a good balance with resource utilization. It is important to mention that, at every clock cycle, an agent moves its execution from one stage to the next, given that the Initiation Interval of most loops was carefully considered to be 1 or at most ~1.
Zooming out from kernel processing, Fig 3 shows the functional steps the FPGA has to perform: Initialization is related to the population and configuration, both properties tightly related. That function allocates enough memory to keep Neuron and Connection data, which in the previous explanation was named “Global Memory”. When all evaluations finish, they are stored back into a location named “Population State” that is written back to the CPU as output.
Fig 3. Block diagram of the parameters and algorithm flow inside the FPGA.
Moving one abstraction layer up, Fig 4 shows the basic communication process that BiSUNA uses to offload work to the FPGA. First, the CPU enables an environment controller that is in charge of communicating with the RL puzzle, keeping track of the number of iterations, adaptation and agent organization. The data flow manager takes charge of codifying the population and identifying which inputs/outputs correspond to each individual; everything uses the PCI-E port, once again with global memory represented as DRAM in this diagram.
Fig 4. High level FPGA communication flow and population execution.
BiSUNA currently has an open source CPU implementation ; the next natural step was the evaluation of the acceleration hardware alternative that best fits the algorithm’s characteristics. After a thorough analysis of current technologies, OpenCL was a clear choice by virtue of its hardware-agnostic philosophy, C/C++ compatibility and wide adoption. This selection also enabled targeted platform development, where FPGAs offer a unique set of features covering multiple areas critical for the usability of the algorithm: raw compute power, power efficiency and functional adaptation.
1.- Programmable Compute Power: Paper  showed that a good design on an FPGA can even outperform state-of-the-art execution on GPUs; in one specific case, “Stratix 10 INT6 performance is more than 50% better than the Titan X theoretical peak INT8”, and in terms of performance per watt it was 2 times higher. It is important to note that large on-chip cache memory reduces bottlenecks/round trips to external memory, reducing energy costs. Lastly, the design flexibility offered by Intel’s FPGAs in supporting any range of data type precision (INT8, FP32, binary sets or custom data lengths) is the key FPGA feature when DNN applications are conducted.
2.- Energy Efficiency: Referencing once more , it was shown that an Arria 10 FPGA reduces power consumption almost 10x against a Titan X GPU. One factor influencing the power-hungry architecture found in GPUs originates in the additional module complexity required to facilitate software programmability (e.g. CUDA). On the other hand, FPGA reconfigurability, together with the complete software development stack offered by Intel’s OpenCL (Fig 5), gives developers much higher efficiency for any number of applications using C as the base language.
3.- Functional Adaptation: Intel FPGAs are used across multiple industries with application-critical functions where safety is a key deciding factor, in cases like factory automation, autonomous aviation and defense. From their inception, FPGAs have been designed to meet special requirements that typical consumer electronics do not need. Given that BiSUNA could be used to learn from environments that humans are not capable of reaching, it is essential to design products that can adapt to extreme conditions.
Fig 5. Compilation steps to obtain x86 binary and FPGA bitstream.
This project started from the identified need for an alternative to efficiently train binary neural networks (BNN). Looking through the literature, the authors found multiple papers that tried to adapt different alterations of the gradient descent concept to discrete state values; papers  are two of the most influential around BNNs. Even when their approaches worked well, their designs required keeping track of backpropagation derivatives with higher precision than binary values.
With BiSUNA, that need is completely eliminated: it only depends on agent swarms to perform the space search, keeping exclusively elements that obtain a higher reward. Another important distinction is organic growth, given that selection pressure will make the network adapt to the environment’s circumstances, changing the number of neurons/connections; in contrast to traditional hand-crafted NNs, where designers have a fixed topology adapted to a particular problem.
Another important contribution of this work is that the hardware requirements needed to execute BiSUNA are simpler, helping the adoption of NNs on edge devices. This work tested the OpenVINO Starter Kit by Terasic, which enabled the device to train a neuro-evolved topology from scratch. In the future, other platforms (like the DE10 Nano) could be tested, aiding in the reduction of data transfers to servers in the cloud while keeping all models highly adapted to the hardware they execute on.
As a result of BiSUNA’s generalization capabilities, it will be possible for data scientists and NN researchers to explore new applications where model training relies on rewards and not labeled data. Section “Performance Parameters” shows how BiSUNA can solve problems where the agent learns from the environment, to later apply actions that provide an incentive to continue the exploration.
As mentioned in paper , BNNs are naturally suited to the FPGA architecture because there is no need for DSPs (digital signal processors) to perform arithmetic operations; BiSUNA only requires ALUT/Memory Blocks, enabling high computational performance while keeping power consumption at a minimum. Intel FPGAs have shown their capabilities in different areas of research and applied sciences, especially with the OpenVINO platform, which has multiple code samples of traditional DNNs that perform speech recognition and face detection; this was critical given that documentation and examples are essential to learn a new platform.
With BiSUNA, then, Intel has the opportunity to extend such documentation with a framework capable of solving reinforcement learning problems such as those in the OpenAI gym environment . Even when most of them use continuous values, BiSUNA has a compiler flag to enable floating point calculations in its neurons/connections; an example of such versatility can be found in Fig 6.
Fig 6. An example of floating point (left) and binary with 16bitset (right)
With the brief sketch of how BiSUNA works from previous sections, this part dives deeper into the mechanics behind accelerating the algorithm using reconfigurable silicon, why that is important and a high level overview of how it is achieved. First, with the ”all binary” philosophy, it was possible to transfer the first C/C++ code implementation from , improving aspects like dynamic memory management, code organization and speed, and later translate the heavy duty of processing each agent into an OpenCL kernel.
Generally speaking, OpenCL has two modes of execution: NDRange and Single Task. The first refers to the ability to offload functions to multiple computing units at the same time. The second option performs the same operations, though focused on a single entity. Depending on the characteristics of the work to be performed, target device, memory available and many other factors, software architects need to define which execution mode suits their needs. In this particular case, it was decided to use a Pipelined Single Task model given the FPGA features, like the one shown in Fig 2.
Even though the kernel runs a single task, the FPGA offers the advantage of a data pipeline when correctly programmed, which means that every clock cycle the values of a new agent pass through a different process without stalling the others’ progress. Given the properties of the OpenCL kernel, it can be extended to multiple “compute units” if a larger device is targeted, for example training with a Stratix or Arria family. In general terms, this project used the following tools:
OpenVINO Starter Kit with Cyclone V (301K LUT, 13Kb Mem)
Intel i5-7600 (test CPU)
Ubuntu 18.04 with OpenCL development tools (Linux kernel 4.8.17)
Fig 7 shows how BiSUNA works. In general terms, when the executable starts, it performs initialization steps for the population, external configuration file, RL environment connection and OpenCL runtime. After those steps are completed, BiSUNA enters a loop stage where it continuously reads “observations” from the RL module, which are treated as input to all agents, with each attached to its own instance. Then all values are 64-bit memory aligned and sent to the Cyclone V via PCI-Express, which repeats the process analyzed in Fig 2.
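The 64-bit alignment step amounts to rounding each buffer size up to the next multiple of 8 bytes before the PCI-Express transfer. A minimal sketch, with an illustrative helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Round a byte count up to the next 64-bit (8-byte) boundary, so the
 * buffer handed to the PCI-Express transfer is 64-bit aligned. */
size_t align64(size_t num_bytes)
{
    return (num_bytes + 7u) & ~(size_t)7u;
}
```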
At this stage, BiSUNA waits for the results to be delivered back from the FPGA; once they arrive, they are sent to the RL environment in a loop, which finishes under two possible conditions: an agent has either completed or failed the task. When all of them conclude their interactions, BiSUNA steps forward to the evolution stage (Fig 1), which ends its execution once the loop has reached its maximum number of generations; otherwise it restarts the RL context.
Fig 7. Diagram flow of how BiSUNA works.
Access to the open source framework can be found in , which provides all source code needed to replicate the results shown in the next section, along with the Quartus project created by the Altera Offline Compiler and all raw data obtained from running the OpenAI gym riddles. That will allow anyone interested in deep learning to test multiple environments along with other functionality:
• Train nonlinear floating point / binary neural network models.
• Find the most efficient NN topology for a certain environment.
• Generalize its application to problems that can be formulated as “reinforcement learning” riddles.
• Train, infer and iterate faster using multiple processing elements programmed at the hardware level.
Fig 8. Screenshot of BiSUNA-OpenCL repository in Github
This section puts BiSUNA to the test under multiple RL enigmas, all of which can be found in the OpenAI gym repository . Some are adapted to discrete input/output (Roulette, NChain, Copy, Duplicated Input); one example that uses continuous values (Mountain Car) is also shown, confirming that it is possible to use discrete neurons with the correct translation between numeric spaces. The raw output of BiSUNA solving those environments can be found in ; all of them show how they were accelerated using the programmable elements present in FPGAs as well as traditional CPU executions, which is later summarized in Table 1.
Fig 9. Screenshot of two OpenAI Gym environments, left is Mountain Car, right is Copy.
Testing on different architectures makes it challenging to compare one objectively with another; therefore it was decided to focus on power consumption, bearing in mind that future plans embrace the deployment of intelligent low-power devices that are capable of learning from adverse environments. Nonetheless, Table 1 includes many other details as well, to give perspective on how BiSUNA behaves in both configurations.
Table 1. Summary of execution between FPGA/CPU OpenCL Kernels that connects to the OpenAI gym environments.
This project has explored the following metrics to confirm that FPGA development is correctly guided; from top to bottom, all of them are listed below along with a quick explanation of what each represents:
Exe Type: This signals the underlying operations used, either floating point (only Mountain Car), or binary.
Accelerator: The architecture used to execute the same OpenCL Kernel, options are CPU (Intel i5) or FPGA (Cyclone V).
Generations: Number of iterations each population had to evolve and reach a solution. In other words, a full cycle from Fig 1.
Best score: The highest reward obtained at the end of a generation.
Total time: How long it took to finish executing all generations. This also includes the time needed to set up the OpenCL runtime.
Process Neurons: OpenCL profiling flag that reported how long it took to perform the operation “process”, as described in Fig 2. The value reported is the maximum it took for the kernel to operate.
Output Read: OpenCL profiling flag that reported how long it took to read back the results from “process” to the main loop; graphically, it is the output arrow in Fig 4. The value reported is the maximum it took for the OpenCL context to retrieve its data.
TDP: Estimated nominal power consumption of each device. In the CPU case, the spec sheet for the model under test was considered. The FPGA reports the value from Quartus’ Power Analyzer Summary.
Frequency: Reported circuit operational frequency, CPU used the box specs, whereas the FPGA used the Quartus Report “kernel_clk_gen”.
Process / TDP: Ratio of the processing capacity to power consumption; in other words, it represents the number of agents the architecture can operate in comparison to the energy it draws. Larger is better, signaling that it can undertake more agents using fewer resources.
Process / Freq: Process neuron ratio as a proportion of the device’s frequency; it can also be understood as the capacity to operate agents in relation to the operating rate. The larger this number is, the more agents can be analyzed by a certain device.
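A small sketch of how the two derived ratios could be computed. The exact formula is not spelled out in the text, so this assumes "processing capacity" is the inverse of the maximum Process Neurons time (agents per second); the function names are illustrative.

```c
#include <assert.h>

/* Processing capacity (1 / process time) per watt of TDP. */
double process_per_tdp(double process_time_s, double tdp_watts)
{
    return (1.0 / process_time_s) / tdp_watts;
}

/* Processing capacity per Hz, i.e. agents handled per clock cycle. */
double process_per_freq(double process_time_s, double freq_hz)
{
    return (1.0 / process_time_s) / freq_hz;
}
```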
Looking at Table 1, at first glance accelerating this workload on the FPGA might not seem impressive if only time is considered, taking in some examples up to 13x longer to run the same environment (in the case of Roulette). If the time overhead of setting up the OpenCL runtime, passing data to the device, waiting for it to be processed and retrieving it back is taken into consideration, the picture still does not look bright.
Notwithstanding, if analyzed from a different angle, using the FPGA does indeed accelerate BiSUNA from the power standpoint; for that reason the last two ratios were included. The Cyclone V’s capacity to process neurons in relation to the energy used/frequency of operation is astonishing, with typical gains around 400x and 700x for TDP and frequency respectively. In other words, executing this kernel on the Cyclone V is 400x more power efficient than performing it on the Intel i5 used for testing. Of course, the raw data used here can be accessed in the same repository  as the open source implementation to confirm these claims.
When assembling the OpenCL kernel using the Altera Offline Compiler and reading the Quartus project, it was possible to obtain multiple parameters that are of special importance when designing hardware. Fig 10 shows three main blocks; the first, the resource utilization summary, displays that only 26% of logic units were used, but most importantly 0% DSP, which is the intended objective when executing binary neural networks.
Fig 10. Quartus Reports of resource utilization, Maximum Frequency and Power Dissipation.
On the other hand, when the continuous parameter flag (use floating point) was enabled for the OpenCL kernel, the estimated resource utilization jumped to 174%. Table 2 shows the comparison of trying to compile “float” against “binary”, showing that in principle, the gains from not using DSPs are about 58x.
Table 2. Estimated resource utilization when “Float” was intended to be compiled in AOC. * Note, ALUT is estimated given that the Quartus Report Summary does not provide the same value directly.
Another important discovery is the total thermal power dissipation of this design, using only 3.2 W, which compared to the nominal 65 W TDP of the Intel i5-7600 used for testing represents a roughly 20x reduction. Of course, the CPU runs at 3.50 GHz while the Cyclone V kernel does so at 117.5 MHz (about 30x lower).
This project requires tight integration between its software components and hardware acceleration. Fig 11 uncovers how important it is that both sections communicate effectively. On the software side, multiple OpenAI gym instances are executed providing observations, which are like “snapshots” of the environment after a certain “action” by an agent has taken place. Specifically, the “action” part executes on the FPGA, communicating with the OpenVINO Starter Kit over its PCI-Express port: it receives those “observations” as input, then for each binary neural network stored on the device, it returns the “actions” of multiple agents executing in parallel.
Fig 11. Software and hardware delimitation and shared data.
With the help of the Altera Offline Compiler (AOC), which translates OpenCL C to low level VHDL code, it is possible to develop in a high level language to communicate with the FPGA. As part of the development suite, the AOC shows every module in a cascade fashion, which also confirms the pipeline architecture mentioned in previous sections. Fig 12 shows the high level detail; the full report can be found in the open source repository for this project .
Fig 12. System viewer of the OpenCL kernel FPGANetSt.cl.
The code repository  for this project is located at https://github.com/rval735/bisunaocl and has the following sections:
Quartus: Contains most files automatically generated by the AOC when the OpenCL kernel “FPGANetSt.cl” is compiled. It was not possible to upload all files given the 10 MB file size restriction imposed by GitHub. Nonetheless, running the compilation command will recreate the whole project.
Resources: Mainly contains the configuration files required to execute BiSUNA given certain parameters.
Scripts: Some bash scripts that help automate the deployment of BiSUNA/Gym environments.
Src: BiSUNA source code. Inside “OCL/Kernels” are located all files needed to recreate the FPGA bitstream.
Makefile: Eases the compilation process; simply run “make” to build the executable.
License: Apache 2.0.
The video presentation of this project can be found in the following link: https://youtu.be/eX2jxLuIqj8
This work provides a novel approach to the DNN community that adds a new tool for training BNNs using evolutionary principles, a full alternative that completely sidesteps gradient descent. As a result of BiSUNA’s ability to conditionally compile between discrete/floating point systems, this work can be generalized to more environments, setting a precedent in the application of BNNs to reinforcement learning. It was also shown that only the BNN mode resulted in a bitstream that fits the Cyclone V in its current form, whereas the continuous case did not. Future research will analyze more architectures and extend the number of RL environments that use the OpenAI gym to test how BiSUNA solves those riddles.
As a benefit of how BiSUNA is architected, the simple change from continuous to binary values can easily form the foundation of research into applying discrete values to RL problems, the typical domain of NN algorithms like Q-Learning, which rely on floating point components to reach a solution. Simplifying the hardware requirements to train BNNs contributes towards the creation of more efficient networks and circuits, facilitating FPGA acceleration not only at inference time, but also for training. We envision a future where edge computation is localized, which will enable IoT devices to expand their capabilities, taking the technology to the next level: the Intelligent Internet of Things (IIoT).
8.1 Future Work
This project can be extended in multiple ways: take BiSUNA out of test environments and bring it into real world problems, as mentioned before, applications where intelligence has to be performed at the edge but cloud-trained models cannot be executed given memory/energy constraints, places where the FPGA overcomes the competition.
Another interesting avenue to explore is developing BiSUNA as a plug-in capable of interacting with other agents to share resources, genetic improvements and adaptations through a simple communication protocol, which would enable individuals to cooperate instead of compete for survival, taking advantage of FPGA reconfigurability.
Most likely, our next step will be to test this kernel on smaller devices to push their resource utilization, as well as to experiment on larger devices where it will be possible to execute in OpenCL NDRange mode.
 Michaela Blott et al. 2018. FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks. ACM Trans. Reconfigurable Technology.
 W. Tang, G. Hua, and L. Wang, “How to train a compact binary neural network with high accuracy?” in AAAI, 2017.
 Danilo Vargas and Junichi Murata. “Spectrum-Diverse Neuroevolution With Unified Neural Models”. In: IEEE Transactions on Neural Networks and Learning Systems 28.8 (2017), pp. 1759–1773. ISSN: 2162-237X. DOI: 10.1109/TNNLS.2016.2551748.
 Wenlai Zhao et al. “F-CNN: An FPGA-based framework for training Convolutional Neural Networks”. In: 2016 IEEE 27th International Conference on Application-specific Systems, Architectures and Processors (ASAP). 2016, pp. 107–114. doi: 10.1109/ASAP.2016.7760779.
 Jason Cong et al. “Understanding Performance Differences of FPGAs and GPUs”. In: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. FPGA ’18. Monterey, CA, USA: ACM, 2018, pp. 288–288. ISBN: 978-1-4503-5614-5.
 Open AI. Evolution Strategies as a Scalable Alternative to Reinforcement Learning. 2018.
 R. Valencia. (2019, Sep) Binary Spectrum-diverse Unified Neuroevolution Architecture. https://github.com/rval735/BiSUNA
 G. Brockman et al. “OpenAI Gym”. In: arXiv e-prints (June 2016). arXiv: 1606.01540.
 R. Valencia. (2019, Sep) BiSUNA-OCL. https://github.com/rval735/bisunaOCL
 Eriko Nurvitadhi et al. “Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?” In: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ISBN: 978-1-4503-4354-1.
 Yaman Umuroglu et al. “FINN: A Framework for Fast, Scalable Binarized Neural Network Inference”. In: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. FPGA ’17.
 Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. “BinaryConnect: Training Deep Neural Networks with binary weights during propagations”. In: Advances in Neural Information Processing Systems 28. Ed.