Annual: 2019

AS011 »
3D Stethoscope
📁High Performance Computing
👤Al Schneider
 (Schneider Software Systems)
📅Jun 29, 2019
Regional Final




2019 Innovate
FPGA Project Proposal
3D Stethoscope
By Al Schneider

This entry to the contest presents a new computer concept. The purpose of this entry is to show the concept is feasible.


The architect of this concept grew up in software in the seventies working on Univac multi-processor systems. Then, the power of such systems was apparent but shortcomings were also apparent. Years after leaving Univac this person discovered solutions to these shortcomings. The cost of implementing these solutions was high and prevented a serious attempt at demonstrating them. However, with the advent of FPGA technology, the solutions can be implemented in the real world at a reasonable cost.

Project Overview

In essence, this system enables large scale multiprocessing with all processors accessing a common memory without conflict. A system is planned with 256 processors to demonstrate the concept.

The method utilizes an approach somewhat opposite from traditional computer technology. Traditionally, computers move data to a hardware CPU. This system, on the other hand, utilizes virtual processor units (VPUs) as opposed to a central processor unit. They move within the system traveling to the data.

Experiments performed on a Max 10 FPGA indicate that the switching time for a LUT gate is 1.6 nanoseconds. Based on that, the aggregate throughput of the suggested system would be 200 Mega IPS.

A simple C-like language will be provided to program the many processors.

Proposed Entry

The proposed entry would use a Terasic Open VINO Starter Kit to analyze the frequencies from an audio input into 256 sub bands and translate the sound into moving images on a display screen. This is to be implemented as a stethoscope that displays a visual representation of heart sounds as well as audio. An important point here is that the device will do this in real time. This requires analysis of audio input, image rendering, and image display within 33 milliseconds to produce 30 frames per second.

This link visualizing music illustrates how the display might appear:

FPGA Virtues Realized in this Entry

Using FPGAs to develop concepts.
Demonstrates the value of having memory and logic on the same chip.
Demonstrates the ability to eliminate unneeded functionality.
Demonstrates the ability to add custom desired functionality.
Demonstrates performing many tasks simultaneously.
There may be a need to demonstrate the Cyclone V GX I/O capability and the PCIe interface.

Project Proposal

1. High-level Project Description

Introduction to High Level Project Description

An introduction is difficult as the ultimate goal imagined is about general purpose computing services. The entry proposed in this contest is to be a small example of a much larger plan. So the prose here is about my big dreams and wishes about the task. This introduction includes a bit of background of this plan.


Purpose of the Design

The purpose of this design is to demonstrate the viability of Large Scale Virtual Multiprocessing (LSVMP) by implementing a new kind of stethoscope presenting a visual representation of heart sounds.


In essence, the Large Scale Virtual Multiprocessor is a computer that supports hundreds if not thousands of processors executing out of a common memory without conflicts. Note that this does not mean making a bigger computer. It means using the same logic area used by a standard computer and rearranging the components to produce an LSVMP system.


Application Scope

The immediate scope is to demonstrate the use of FPGA technology in a stethoscope that provides a visual heart beat context as well as an audio context. The ultimate application scope is to advance the power of computers in our society. (Get past Moore’s Law.)


The scope of this contest entry is to implement a multiprocessing system with 258 processors on the Cyclone V GX. That system will be used to analyze a real time audio signal. The energy/frequency of 256 subdivisions of the input signal is to be fed to a program on a PC that renders a visual representation on the screen of the PC.


Targeted Users

The primary target of this contest entry is the architects that build today’s computers. The hope is that this audience will find something in the concept that is valuable in future designs. A secondary audience is medical personnel that would use the device to visualize body sounds as well as hear the normal audio representation.


Some Background and Why Intel FPGA

I grew up in software in the seventies working on Univac multi-processor systems. Then, the power of such systems was apparent but shortcomings were also apparent. Years after leaving Univac, I discovered solutions to these shortcomings. The cost of implementing these solutions was high and prevented a serious attempt at demonstrating them. Then FPGAs appeared on the market. They had possibilities for this idea. As time passed, their size seemed to increase suddenly and they looked like a good way to try the concept. After looking around at possibilities, I stumbled across a board utilizing the MAX 10 device. I viewed Altera as a large, well established company that would present an ideal support organization for my device. The appeal of the MAX 10 was that it did not require an off chip memory device to hold the fabric of the device. A factor in the back of my mind was also that Altera was dedicating effort to inter chip communication. I felt that might someday be useful if I were in a position to link several chips together for a multi-chip super computer. Furthermore, Altera was offering the software at no charge, and the board cost $50. I bought one and experimented with it, using VHDL in the process. I gave a lecture on the subject at an international FPGA conference in California in 2018. No one showed interest and the trip was very expensive. My enthusiasm for the project died as I saw no way a little guy like me could bring something like this to market.


This competition is a chance to present the concept to industry leaders. That is why I am here.

2. Block Diagram


The system consists of an audio input, audio analyzer, PC, and display.


From the right, a microphone picks up sound from a heart. The PC gets the audio signal and routes it to the Audio Analyzer. The Audio Analyzer establishes energy in 256 frequency bands. Those signals go back to the PC where that data is converted into a visual image that is sent to a display.


Audio Analyzer

The audio analyzer is an LSVMP (Large Scale Virtual Multiprocessing) system. The essence of this system is the implementation of virtual processors. A virtual processor is a set of data that are critical for the operation of a processor. This would include: Instruction pointer, Op Code, Processor ID, Flags, General purpose registers.


Note that this is not a statement that these are registers. Rather, these are data contained in register sets that support virtual processors. The register sets are temporarily held in buffers connected to each other. The following picture presents 16 buffers, each of which contains a register set.



The following picture shows that the buffers are connected to each other so the register sets held by the buffers can be moved from buffer to buffer.


Each buffer is controlled by a clock line. On a clock tick, the register sets that are virtual processors are transferred to a following buffer. The end buffer data is moved to the first buffer so the register sets or virtual processors move in a loop.



Processing components are placed between the buffers.



As the data moves from buffer to buffer, the data bits move through processing components.



The processing components represent a very small slice of the operations executed during the processing of an instruction. Processing may include adds, multiplies, memory reads, and memory writes. The following shows a slice performing an add. Note that the opcode data enters a decode cell that indicates this slice is to perform an add. The ADD circuit then adds the contents of register 0 to the contents of register 1 and places the result in register 0, which is moved into the right side buffer.



The add function consists of many smaller operations. That function can be further divided into several slices. The following depicts dividing a 16 bit add into 8 slices.
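
A minimal sketch of the idea, assuming each of the 8 slices adds one 2 bit chunk of the operands plus the carry handed over by the previous slice. The chunk width, the carry handling, and the function names are illustrative assumptions, not the actual VHDL.

```python
# Sketch only: a 16 bit add divided into 8 slices of 2 bits each.
# Assumption: the partial sum and the carry travel with the register set
# from slice to slice; add_slice and sliced_add16 are hypothetical names.

def add_slice(a, b, partial, carry, index, width=2):
    """One slice: add chunk `index` of a and b plus the incoming carry."""
    mask = (1 << width) - 1
    shift = index * width
    chunk = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
    partial |= (chunk & mask) << shift      # store this chunk of the result
    return partial, chunk >> width          # carry for the next slice

def sliced_add16(a, b):
    """Pass the operands through all 8 slices, lowest bits first."""
    partial, carry = 0, 0
    for index in range(8):
        partial, carry = add_slice(a, b, partial, carry, index)
    return partial                          # carry out of bit 15 is dropped

print(hex(sliced_add16(0x1234, 0x0FCD)))
```

Each slice does very little work, which is what lets the clock tick between buffers be short.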



The following depicts memory as a slice of the loop.


When this slice is executed, the address held in registers 0 and 1 may cause the memory to be read, with the output placed in register 0 of the right side output buffer.


Complex memories can be broken into smaller slices to enable faster reads of memory. The following shows memory broken into 8 slices.



Note that an instruction consists of one loop through all slices. The following represents the kinds of slices that could form the LSVMP computer.



A critical point is that an instruction is executed when a virtual processor makes one pass through the loop. During that loop, instructions are fetched, logic executed, data is processed, etc. That is, one instruction is executed in that loop and during that time many virtual processors execute instructions simultaneously.
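
The loop can be modeled in a few lines of software. The sketch below is an illustration under stated assumptions, not the actual design: four coarse slices (fetch, operand read, execute, write back) stand in for the many fine slices, there is one virtual processor per buffer, and every virtual processor repeats a single memory to memory ADD. All function and field names are hypothetical.

```python
# Sketch only: register sets (virtual processors) circulating through slices.

def fetch(vp, mem):
    vp["op"] = mem["code"][vp["ip"]]      # (opcode, dst, src_a, src_b)
    return vp

def read_operands(vp, mem):
    if vp["op"] is not None:
        _, _, a, b = vp["op"]
        vp["r0"], vp["r1"] = mem["data"][a], mem["data"][b]
    return vp

def execute(vp, mem):
    if vp["op"] is not None and vp["op"][0] == "ADD":
        vp["r0"] = (vp["r0"] + vp["r1"]) & 0xFFFF
    return vp

def write_back(vp, mem):
    if vp["op"] is not None:
        mem["data"][vp["op"][1]] = vp["r0"]
        mem["retired"] += 1               # count completed instructions
        vp["op"] = None                   # the next pass starts with a fresh fetch
    return vp

SLICES = [fetch, read_operands, execute, write_back]
N = len(SLICES)                           # one buffer and one virtual processor per slice

def tick(buffers, mem):
    """One clock tick: buffer i passes through slice i into buffer (i + 1) % N."""
    moved = [None] * N
    for i in range(N):
        moved[(i + 1) % N] = SLICES[i](buffers[i], mem)
    return moved

def run(ticks):
    one = N                               # data address holding the constant 1
    mem = {"code": [("ADD", i, i, one) for i in range(N)],
           "data": [0] * N + [1],
           "retired": 0}
    # Virtual processor i starts in buffer i and repeats the instruction at code[i].
    buffers = [{"ip": i, "op": None, "r0": 0, "r1": 0} for i in range(N)]
    for _ in range(ticks):
        buffers = tick(buffers, mem)
    return mem

mem = run(12)
print(mem["retired"], mem["data"][:N])    # prints: 9 [3, 2, 2, 2]
```

In this model, after a warm-up of N - 1 ticks one instruction completes on every tick, even though each individual virtual processor needs N ticks per instruction.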


One could easily call this an instruction pipeline. However, the intention is different. The purpose of an instruction pipeline is to execute slices of common parts of instructions that are executed serially. The purpose here is to slice all instructions into small components so that all processors involved can execute their independent streams at the same time. This not only means hundreds of processors are executing their own instructions at the same time, it also means that all processors are reading and writing a common memory at the same time. This means no memory access conflicts other than two processors attempting to write the same word. Actually, there is no conflict. It just means the last to write wins.


Audio Analysis

The audio analyzer runs on the LSVMP system. That software breaks the input signal into 256 frequency bands with an array of oscillator simulators. Such a simulator appears as follows.



The software simulates this oscillator, where M is the mass, D0 is the distance to the mass with no force on it, K is the spring constant, and L is the viscosity imposed on the oscillator. The input signal moves the bar up and down in the simulation. The energy of the system is the sum of the kinetic and potential energy of the spring pendulum.


The analyzer consists of a large array of these simulators to examine various frequency bands of the input heart signal. M, K, and L vary to establish the frequency bands studied.
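
The following is a minimal sketch of such an array, assuming a standard damped spring model: the spring force is K times the stretch between mass and bar, the viscosity L acts on the velocity of the mass, and positions are measured from the rest position so D0 drops out. Floating point is used for clarity, whereas the proposed hardware is a 16 bit system. The quality factor `q`, the integration method, and all names are illustrative assumptions.

```python
import math

# Sketch only: a small bank of simulated spring oscillators driven by one input.

def make_bank(freqs_hz, mass=1.0, q=20.0):
    """Choose K and L per band so each oscillator resonates near its frequency."""
    bank = []
    for f in freqs_hz:
        w = 2.0 * math.pi * f
        bank.append({"m": mass, "k": mass * w * w, "l": mass * w / q,
                     "y": 0.0, "v": 0.0})
    return bank

def step(osc, u, dt):
    """Advance one oscillator by dt with the bar at displacement u."""
    stretch = osc["y"] - u
    accel = (-osc["k"] * stretch - osc["l"] * osc["v"]) / osc["m"]
    osc["v"] += accel * dt                # semi-implicit Euler: velocity first
    osc["y"] += osc["v"] * dt

def energy(osc, u):
    """Kinetic energy of the mass plus potential energy stored in the spring."""
    stretch = osc["y"] - u
    return 0.5 * osc["m"] * osc["v"] ** 2 + 0.5 * osc["k"] * stretch ** 2

def analyze(samples, freqs_hz, sample_rate):
    bank = make_bank(freqs_hz)
    dt = 1.0 / sample_rate
    for u in samples:
        for osc in bank:                  # in the proposal, one processor per oscillator
            step(osc, u, dt)
    return [energy(osc, samples[-1]) for osc in bank]

rate = 8000
tone = [math.sin(2.0 * math.pi * 100.0 * n / rate) for n in range(rate // 2)]
energies = analyze(tone, [50.0, 100.0, 200.0], rate)
print(energies.index(max(energies)))      # index of the band with the most energy
```

Driven with a 100 Hz tone, the oscillator tuned to 100 Hz ends up with far more energy than its neighbors, which is the property the analyzer relies on.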



A software program in the PC examines the energy of each frequency band to produce virtual instruments. The program does not merely display the energy of each frequency but groups various combinations of energies and frequency bands to represent various sounds in the heart. Each combination is referred to as a virtual instrument. Thus the operator can associate a specific moving object on the screen with some particular function of the heart. The goal is to enable the operator to identify perhaps hundreds of heart functions at a glance. An advantage of this device is its ability to visually show high frequency events beyond the range of hearing. This link visualizing music illustrates how the display might appear:


3. Intel FPGA virtues in Your Project

As mentioned, the plan is to implement 258 virtual processors to analyze a heartbeat audio signal. The goal is to dedicate a virtual processor to one band of the frequency spread. The aggregate processing power is to be at least 200 mega instructions per second. During the realization of this experiment, a more complete understanding of the Cyclone V GX may result in a higher performance than this.  Also, I expect to find that the real estate needed for this experiment may be a small part of the resources available on the Cyclone. That would indicate the Cyclone could greatly surpass the performance used in this entry.


FPGA Virtues Realized in this Entry


Using FPGAs to develop concepts.

Demonstrates the value of having memory and logic on the same chip.

Demonstrates the ability to eliminate unneeded functionality.

Demonstrates the ability to add custom desired functionality.

Demonstrates performing many tasks simultaneously.

Demonstrates the Cyclone V GX I/O capability and the PCIe interface.


4. Design Introduction

Again, the overall plan is to use VHDL to build a multiprocessing system on the Terasic Open VINO Starter Kit. That board is to be installed on a PC supporting PCIe. The PC is to be programmed to input an audio signal and send it to the Cyclone. The Cyclone will analyze the sound in 256 bands and return the energy of each band to the PC. The PC will then use this data to display images representing the sound on the screen of the PC.


The sound on the Cyclone will be analyzed by using an array of simulated oscillators. Each simulation will use a different weight, spring tension, and viscosity. The input signal will be applied to each oscillator and each will oscillate at a different resonant frequency. The PC will read the energy/frequency of the oscillators from the Cyclone FPGA. That will be used to render moving images on the screen.


Here are some reasons the Terasic Open VINO Starter Kit is a good choice for this project. The PCIe interface of the board is convenient. Memory and logic on the same die is important. A clock line throughout the fabric is convenient and having more than one is very convenient. Versatility in altering VHDL code makes this project possible. Transmission off board is very convenient. The LUTs and flip flops can be broken out of the ALMs and used separately.


5. Function Description

The first step is to establish switching time of a gate formed from an LUT.


The second step is to establish memory access time.


If I can get the switching time below 1.6 nanoseconds and the memory access time under 3 nanoseconds, the following design plan may change.


Presently the goal is to implement a 16 bit word system with 258 processors. This should use a fraction of the Cyclone V GX resources. The unused portion will be used as expansion as necessary. Depending on the speeds above, the extra may be used for an additional system. Projected speed would then be an aggregate of 100 meg IPS. If I can use the other half for an additional system the aggregate would be 200 meg IPS. If I can get the memory access time down, perhaps I can jump it to 400 meg IPS.


My desired goal was 1 gig IPS with 1024 processors but I don’t think the Cyclone will support it. My initial plan was to implement an 8 bit system but I decided to go with a 16 bit system to avoid multi-byte math manipulation.


The design requires a great deal of duplication, so small blocks of VHDL will be utilized, with the goal of using VC++ to organize writing the VHDL.


The system will implement a limited language resembling simple C-like code. The language structure is a bit strange. The system will not be register oriented but memory to memory organized. An add might appear as:


C = A + B;


This is equivalent to an assembly instruction.


The instructions will be very limited but still supply needed functionality. For example, there are no immediate data instructions. This need is to be supplied by putting constants into data memory.
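
As an illustration of memory to memory instructions without immediate data, the sketch below assumes a hypothetical instruction format of (opcode, destination, source A, source B) in which all three operands are data memory addresses. The opcode names and the memory layout are assumptions made for the example, not the actual instruction list.

```python
# Sketch only: memory to memory execution with constants held in data memory.

def run_program(program, data):
    """Execute (opcode, dst, src_a, src_b) instructions on 16 bit data memory."""
    for op, dst, a, b in program:
        if op == "ADD":
            data[dst] = (data[a] + data[b]) & 0xFFFF
        elif op == "SUB":
            data[dst] = (data[a] - data[b]) & 0xFFFF
        else:
            raise ValueError("unknown opcode: " + op)
    return data

# Hypothetical data memory layout: symbolic names for addresses 0..4.
A, B, C, COUNT, ONE = range(5)
data = [7, 5, 0, 41, 1]                   # address ONE holds the constant 1

program = [
    ("ADD", C, A, B),                     # C = A + B;
    ("ADD", COUNT, COUNT, ONE),           # COUNT = COUNT + 1; with no immediate operand
]
run_program(program, data)
print(data[C], data[COUNT])               # prints: 12 42
```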


In this design little effort is devoted to making instructions efficient. In this design, all instructions require the same amount of time to execute. Extra instruction complexity results in additional processors being required. Aggregate speed is not altered.
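
The arithmetic behind that statement can be sketched as follows, assuming one virtual processor per buffer and one slice per buffer, so that a loop of N slices carries N processors. The 10 nanosecond tick is a hypothetical value chosen only because it reproduces the 100 meg IPS projection. It is not a measured figure.

```python
# Sketch only: throughput of a loop with one virtual processor per slice.

def aggregate_ips(tick_ns):
    """One instruction completes somewhere in the loop on every tick."""
    return 1e9 / tick_ns

def per_processor_ips(tick_ns, n_processors):
    """Each processor needs one full pass (n_processors ticks) per instruction."""
    return 1e9 / (tick_ns * n_processors)

tick_ns = 10.0                            # assumed tick period, not a measurement
for n in (258, 516):                      # adding slices adds processors
    print(n, aggregate_ips(tick_ns), round(per_processor_ips(tick_ns, n)))
```

Under these assumptions, doubling the number of slices halves the speed of each processor but doubles the number of processors, so the aggregate is unchanged.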


The execution plan is to let the processors run independently. Processor 0 will read input from the PC via a memory mapped interface. Processor 0 will distribute the input value to the 256 simulated oscillators. Processor 1 will read the energies from each simulator and send them to the PC via a memory mapped interface. These two processors will establish a handshaking protocol with the PC to ensure no data loss. The remaining 256 processors will continuously simulate oscillators with varying frequencies.
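
The handshaking protocol is not yet defined. One possible shape, shown purely as an assumption, is a memory mapped mailbox with a value word and a full flag: the sender writes only when the flag is clear, and the receiver clears the flag after reading. The class, its fields, and the polling period are hypothetical.

```python
# Sketch only: a one-word mailbox handshake between the PC and processor 0.

class Mailbox:
    """One memory mapped value word plus a 'full' flag (hypothetical layout)."""
    def __init__(self):
        self.full = 0
        self.value = 0

    def try_put(self, value):
        if self.full:
            return False                  # receiver has not taken the last value yet
        self.value = value
        self.full = 1
        return True

    def try_get(self):
        if not self.full:
            return None
        value = self.value
        self.full = 0                     # tell the sender the word is free again
        return value

def transfer(samples, consumer_period=3):
    """PC retries every step; processor 0 polls only every consumer_period steps."""
    box, received, pending = Mailbox(), [], list(samples)
    t = 0
    while pending or box.full:
        if pending and box.try_put(pending[0]):
            pending.pop(0)
        if t % consumer_period == 0:
            value = box.try_get()
            if value is not None:
                received.append(value)
        t += 1
    return received

print(transfer([10, 20, 30, 40]))         # prints: [10, 20, 30, 40]
```

The same scheme could run in the opposite direction for processor 1 returning energies to the PC.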

6. Performance Parameters

This step in the proposal appears to be a duplicate of a previous step.

7. Design Architecture

When this project is complete, this step will list all code used in the system and explain the parts in detail.


The parts include:


A VHDL overview picture

VHDL code

Assembly code

Snip code used in the application

Code used in the PC to render images

List of instructions

A software flowchart


------------------------------------- oOo -----------------------------------------------




To this old programmer, using this system is extraordinary. While programming this system there was no:


Test and set


Message Passing


Task Switching

State Save and Restore


Interrupt Latency

Program Control Tables

Priority establishment


It truly feels strange to program what I want. The FPGA allows me to do what I want to do without handling things I don’t care about. Without this versatility, this task would be impossible.


Finally, the long range dream of this project is to build powerful supercomputers. LSVMP offers modularity, extensibility and versatility. The machine personality could be morphed into the most complicated structure by loading a different fabric. The system would even allow several personalities to exist within the same box executing separate tasks independently. Fast chip interfaces and the FPGA versatility are critical ingredients to this end.



