An ASIP architecture framework to facilitate automated
design space exploration and synthesis for Iterative Repair
solvers
Aravind Dasu and Jonathan Phillips
Electrical and Computer Engineering
Utah State University
4120 Old Main Hill
Logan, UT 84321 USA
Abstract-Autonomous dynamic event scheduling, using
Iterative Repair techniques such as those employed by CASPER
and ASPEN, is an essential component of successful space
missions, as it enables spacecraft to adaptively schedule tasks in a
dynamic, real-time environment. Event rescheduling is a
compute-intensive process. Typical applications involve
scheduling hundreds of events that share tens or hundreds of
resources. We are developing a set of tools for automating the
derivation of application-specific processors (ASIPs) from ANSI
C source code that perform this scheduling in an efficient
manner. The tools will produce VHDL code targeted for a Xilinx
Virtex 4 FPGA (Field Programmable Gate Array). Features of
FPGAs, including large processing bandwidth, embedded
ASICs, and block RAMs, are exploited to optimize the design.
Efficiency is measured by combining the factors of execution
speed, circuit size, power consumption, and fault tolerance.
Iterative Repair problems are generally solved using a
combinatorial search heuristic, such as Simulated Annealing
(which is used by CASPER and ASPEN), Genetic Algorithms, or
Stochastic Beam Search. All of these methods operate by
gradually improving an initial solution over hundreds or
thousands of iterations. We propose an FPGA-based
architectural framework derived from ANSI C function-level
blocks for accelerating these computations. At a function level,
99% of the work done by any Simulated Annealing algorithm is
the repeated execution of three high-level steps: (1) generating a
new solution, (2) evaluating the solution, and (3) determining
whether the new solution should be accepted. The specifics of
how each step operates vary with the application and are
implemented in VHDL through data- and control-flow analysis
of the source C code. In this paper, we discuss specifics of an
architecture template for automated processor design.
I. INTRODUCTION
Field Programmable Gate Arrays (FPGAs) are becoming
increasingly popular as a platform of choice for spacecraft
computer systems. FPGA-based designs are highly cost
effective compared to Application-Specific Integrated Circuits
(ASICs), and provide more computing power and efficiency
than standard microprocessors. Current and planned NASA
missions that utilize FPGA technology include MARTE (Mars
Astrobiology Research and Technology Experiment) [1] and
the Discovery and New Frontier programs [2]. However, the
complexity of designing even reasonably efficient micro-
architectures on commodity FPGA devices is daunting for
engineers outside the realm of VLSI design.
Therefore, a methodology for automatically deriving
FPGA-based application-specific processors for the mission
planning and event scheduling computations performed by
satellites and deep-space probes will lower this steep barrier
and open the technology to a larger audience of engineers
without VLSI design skills. Through our methodology, custom
ASIPs on FPGAs can be designed quickly to exploit the
features of the scheduling algorithms and maximize the
efficiency of the system.
II. RELATED WORK
Our methodology leverages concepts from several different
research areas, including hardware implementations of
heuristic search techniques, the design of application-specific
instruction processors (ASIPs), and methods for performing
design space exploration for FPGA-based processors. Recent
advances in each of these fields are discussed in this section.
Iterative repair utilizes a combinatorial search heuristic,
such as a genetic algorithm (GA), simulated annealing (SA),
or a stochastic beam search (SBS), to arrive at a solution. In
theory, implementing these combinatorial search algorithms in
hardware could significantly speed up the search process.
Large amounts of parallelism and pipelining can be extracted
from GA and SBS, since deriving a new generation is largely
only a function of the previous generation.
FPGA-based GAs and SBS have been implemented for the
purposes of blind signal separation [3], filter design [4],
function interpolation [5], and speech recognition [6]. As long
as the solution length is kept reasonably small, this technique
in which entire solutions are passed between pipelined
modules works well. Iterative repair problems, however, are
complex enough that a solution can be hundreds of bytes in
length.
Design space exploration in the context of FPGA-based
architectures is a powerful tool. Exploring a design space is,
in essence, searching the combinatorial space of all possible
hardware architectures that can support a given function. The
goal is to identify the architecture that yields the best tradeoff
between conflicting goals, such as minimizing required FPGA
resources while maximizing system throughput. The design
space is generally very large, thus demanding a search
heuristic such as simulated annealing or a genetic algorithm to
arrive at a solution within a reasonable amount of time. An
FPGA design space can be searched at many levels, from the
low-level specification of individual look-up tables to high-
level complex modules.
An overview of the different types of processors that are
typically considered in a design space search is provided in [7].
Reduced Instruction Set (RISC), Complex Instruction Set
(CISC), VLIW (Very Long Instruction Word), dataflow, and
tagged-token architectures are all commonly utilized. A
design space explorer is generally restricted to one flavor of
processor in order to put an upper bound on the time needed to
search the design space. Trying to search across all possible
architectures is considered to be an intractable problem.
A thorough description of design space exploration for a
reconfigurable processor is given in [8].
Important elements to be considered in the design space
include allocation of computational, control, and memory
resources, along with the scheduling of operations onto these
resources. Exploration can occur in both parallelization
(spatial optimization) and pipelining (temporal optimization).
Simulated annealing is employed as the search heuristic. Over
thousands of iterations of the simulated annealing algorithm,
the throughput of the candidate design gradually improves.
III. IMPLEMENTATION
In order to develop an automated tool that derives a micro-
architecture from a C program describing applications within
the class of iterative-repair-based scheduling algorithms,
similar to that shown in fig. 1, we are taking the approach of
first defining and prototyping an application-oriented
architecture framework. This framework will then be used to
guide the tool to analyze the C program and determine the
specifics of different control, memory, and computation
modules that make up the application-specific processor. The
general hardware framework consists of an architecture that is
conducive to the execution of the simulated annealing
algorithm as employed by Iterative Repair. Based upon the
framework shown in fig. 1, a tool flow is derived for the
design of iterative repair processors. This tool flow is shown
in fig. 2. Source C code for an Iterative Repair problem is first
passed through GCC to obtain an intermediate .cfg format.
This is then passed through an Intermediate Format Generator
to produce custom Control-Data Flow Graphs (CDFGs). The
custom CDFGs are then partitioned by function and handed to
Design Space Explorers for the different pipeline stages. Each
Design Space Explorer takes the intermediate format code, a
stage-specific architecture template, and a constraint file, and
produces an architecture for its pipeline stage. In this paper, we
specifically discuss the custom intermediate code and the
templates that have been derived for each stage.
In a simulated annealing/iterative repair technique, solutions
are represented as a string of start times for events numbered 0
to n-1, for a problem consisting of n events that need to be
scheduled. Lists of available resources and of the resources
needed by each event are also provided. A generic framework
macro-architecture for such algorithms is shown in fig. 3.
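As a concrete illustration of this representation, the declarations
below sketch how a problem instance might appear in the source C
code; the names and sizes are illustrative, not a format mandated by
our tool.

#define MAX_EVENTS    256    /* n: events to schedule (illustrative)  */
#define MAX_RESOURCES 64     /* shared resources (illustrative)       */
#define MAX_LATENCY   1024   /* upper bound on any start time         */

int sched[MAX_EVENTS];       /* a solution: start time per event      */

int needs[MAX_EVENTS][MAX_RESOURCES]; /* needs[e][r]: event e uses r  */
int available[MAX_RESOURCES];         /* capacity of each resource    */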
The architecture is composed of a five-stage pipeline
coupled with six memory banks. A global controller
coordinates execution and data exchange between the units.
As this is a pipelined architecture, it can only operate as fast as
the slowest stage. Design Space Exploration techniques must
be employed in the more complex stages to minimize the
latency. Each of these stages is discussed in detail in this
section.
A. Memory Design
The architecture consists of six memory banks, derived from
Xilinx FPGA block RAMs. A 1024-word (32-bit word)
memory bank consumes 4 BRAMs. Each memory bank holds
a solution, the score of the solution, and provides some space
for temporary data storage.

Figure 1: Pseudocode for the Simulated Annealing algorithm. The main
loop consists of five steps.

Figure 2: High-level diagram showing tool flow from C source code to
application-specific architecture. Red text indicates portions discussed
in this paper.

At a given point in time, one memory bank is associated with
each of the five processing
stages in the pipeline. The sixth memory block holds the best
solution found so far. The main controller determines how
memory blocks are associated with different processing stages.
Details on the manner in which memory banks are managed
are discussed in the section on the main controller.
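A minimal sketch of one bank's memory map follows, reusing the
declarations above; the layout, with the solution string followed by its
score and scratch space, is an assumption on our part rather than a
specification from the tool.

/* One 1024-word bank, 32 bits per word (illustrative offsets only).  */
typedef struct {
    int solution[MAX_EVENTS];            /* start time per event      */
    int score;                           /* penalty score             */
    int scratch[1024 - MAX_EVENTS - 1];  /* temporary storage         */
} memory_bank;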
B. Copy Processor
As shown in fig. 1, the main loop of the simulated annealing
algorithm begins by making a copy of the best solution. This
copy is then altered to generate a new solution that could
potentially replace the best solution. In the architecture shown
in fig. 3, the Copy Processor performs this function.
Assuming the length of the solution is known, the contents
of the solution in the “best-solution” memory bank are copied,
word by word, into the memory bank currently associated with
the Copy Processor. There is no need to accelerate the copy
process, as this pipeline stage is guaranteed to complete in
n+1 clock cycles for a solution length of n. Other stages are
much more compute-intensive. As can be seen from fig. 4, the
copy processor is merely a controller to facilitate data
transfers. The “step” signal comes from the main controller,
indicating that a new pipeline step has begun. The copy
controller consists of a counter that generates addresses; it
produces a “done” signal when all data has been copied and
controls the write-enable line on the destination memory
bank. The source and destination addresses are identical,
because the data locations in each memory bank are identical.
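In C terms, the Copy Processor implements nothing more than the loop
below, one word per clock cycle in hardware; the function name is ours.

/* Duplicate the best solution so it can be altered without being lost. */
void copy_solution(const int *best, int *work, int n)
{
    int i;
    for (i = 0; i < n; i++)
        work[i] = best[i];
}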
C. Alter Processor
The second stage in the Iterative Repair pipeline is the Alter
Processor. The C code for this function is as follows:
#include <stdlib.h>

void alter(int *sched)
{
    int i, j;
    i = rand() % MAX_EVENTS;   /* pick an event at random       */
    j = rand() % MAX_LATENCY;  /* pick a new random start time  */
    sched[i] = j;
}
One event is selected at random from the solution string, and
its start time is changed to a random value smaller than the
maximum latency. This stage, shown in fig. 5, could be
accelerated by introducing an additional random number
generator and an additional divider, allowing for maximum
concurrency. This is unnecessary, however, as a 15-cycle
integer divider already allows the stage to terminate in 21 clock
cycles, regardless of the size of the solution string. As
solutions generally consist of hundreds of events, even the
simple Copy Processor will have a greater latency than the
Alter Processor. The alter controller consists of a counter that
starts when the “step” signal is received from the Main
Controller, control logic that enables writes to the “address”
and “data” registers on the proper clock cycles, and a “done”
signal to indicate that the stage has completed.
D. Accept Processor
The Accept Processor’s job is to determine whether to
accept the current solution as the new best solution.

Figure 3: Top-level architecture depiction for a pipelined Iterative Repair
processor. Black lines represent data buses and red lines signify control
signals.

Figure 4: The Copy Processor. Data is copied word by word from the
source memory bank to the destination memory bank.

Figure 5: The Alter Processor. A random number generator is used to
modify the incoming solution.

If the
current solution is better than the best solution, the current
solution is accepted unconditionally. According to the
Simulated Annealing algorithm, a solution that is worse than
the best solution can also be accepted with a computed
probability, defined in (1).
p = e^(-ΔE/T),  ΔE = S_cur - S_best                    (1)

S_cur and S_best are the current and best scores, respectively,
and T represents temperature. This probability is a function of
both the temperature and the difference between the score of
the current solution and the score of the best solution. When
the temperature is high, suboptimal solutions are more likely
to be accepted. This feature allows the algorithm to escape
from local minima as it searches the solution space and to zero
in on the true optimal solution.
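A minimal C sketch of this accept decision, assuming penalty-style
scores where lower is better (the function and variable names are
ours, not part of the tool flow):

#include <math.h>
#include <stdlib.h>

/* Returns nonzero if the current solution should replace the best.   */
int accept(int s_cur, int s_best, float temperature)
{
    float p;
    if (s_cur <= s_best)
        return 1;                    /* better or equal: always accept */
    p = expf(-(float)(s_cur - s_best) / temperature);   /* eq. (1)     */
    return ((float)rand() / RAND_MAX) < p;  /* accept w/ probability p */
}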
An architecture that supports this computation is shown in
fig. 6. The best score and the current score are read from their
respective memory banks. The temperature is provided by the
Main Controller. The random number generator (RNG) is a
simple 15-bit tapped shift register. The exponential block is a
BRAM-based lookup table. The I-to-F block is an integer-to-
float converter.
E. Adjust Temperature Processor
The Adjust Temperature Processor is a simple but critical
stage in the pipelined processor. The temperature is used to
compute the probability of acceptance in the Accept Processor
and by the Main Controller to determine when the algorithm
should complete. The architecture for the Adjust Temperature
Processor is shown in fig. 7. The current temperature is stored
in a register. When the “step” signal is received, the
temperature is multiplied by the constant “cooling rate”,
which is typically a value such as 0.9999. This cooling rate
allows the temperature to decrease slowly and geometrically,
allowing for the discovery of better solutions.
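In C terms this stage is a single multiply; after k pipeline steps the
temperature has decayed to T0 * 0.9999^k. A minimal sketch:

#define COOLING_RATE 0.9999f   /* typical value, as noted above */

float adjust_temperature(float temperature)
{
    return temperature * COOLING_RATE; /* one multiply per pipeline step */
}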
F. Main Controller
The main controller keeps track of the memory block that is
associated with each processing stage. Upon the completion
of a pipeline period, the main controller must determine how
to reassign the memory blocks to the different stages, keeping
track of which one holds the best solution and which one can
be recycled and assigned to the Copy Processor. The main
controller also performs global synchronization: it receives a
“done” signal from each of the
pipeline stages. When all stages have completed, the main
controller sends out a “step” signal to each processor,
indicating that they can proceed. The main controller also
monitors the temperature and halts the system when the
algorithm is complete.
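This paper does not fix a particular reassignment policy, so the
following C model of the controller's bank bookkeeping is purely
illustrative; it assumes banks are tracked by index and that an
accepted solution's bank is swapped with the old best-solution bank.

/* Illustrative bookkeeping: five stage-owned banks plus a best bank. */
int stage_bank[5];  /* index of the bank owned by each pipeline stage */
int best_bank;      /* index of the bank holding the best solution    */

void reassign_banks(int accepted, int accept_stage)
{
    if (accepted) {
        /* the accepted solution's bank becomes the best-solution
           bank, and the old best bank is recycled to that stage      */
        int old_best = best_bank;
        best_bank = stage_bank[accept_stage];
        stage_bank[accept_stage] = old_best;
    }
    /* ...then, once every "done" has arrived, broadcast "step"       */
}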
G. Evaluate Processor
The Evaluate Processor is by far the most complex of all the
pipeline stages in the Iterative Repair architecture. Work on
the design of this stage is currently in progress. The “score” of
a solution to the Iterative Repair problem consists of three
components. A penalty is incurred for total clock cycles
consumed by the schedule. A penalty is assessed for double-
booking a resource on a given clock cycle. Also, a penalty is
assigned for dependency violations, which occur when event
“b” depends upon the results of event “a”, but event “b” is
scheduled before event “a”.
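To make the three penalty terms concrete, the sketch below scores a
schedule in C, reusing the declarations from the earlier sketches;
duration[] and dep[][] are additional assumed tables, and weighting
the three terms equally is our choice for illustration.

extern int duration[MAX_EVENTS];        /* cycles consumed by event   */
extern int dep[MAX_EVENTS][MAX_EVENTS]; /* dep[b][a]: b depends on a  */

int evaluate(const int *sched)
{
    int penalty = 0, makespan = 0;
    int t, r, e, a, b, busy;

    /* 1. penalty for the total clock cycles consumed by the schedule */
    for (e = 0; e < MAX_EVENTS; e++)
        if (sched[e] + duration[e] > makespan)
            makespan = sched[e] + duration[e];
    penalty += makespan;

    /* 2. penalty for double-booking a resource on a given clock cycle */
    for (t = 0; t < MAX_LATENCY; t++)
        for (r = 0; r < MAX_RESOURCES; r++) {
            busy = 0;
            for (e = 0; e < MAX_EVENTS; e++)
                if (needs[e][r] && t >= sched[e] &&
                    t < sched[e] + duration[e])
                    busy++;
            if (busy > 1)
                penalty += busy - 1;
        }

    /* 3. penalty for dependency violations: "b" depends on "a" but   */
    /*    is scheduled before "a"                                     */
    for (b = 0; b < MAX_EVENTS; b++)
        for (a = 0; a < MAX_EVENTS; a++)
            if (dep[b][a] && sched[b] < sched[a])
                penalty++;

    return penalty;   /* lower is better */
}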
Figure 6: The Accept Processor. The new solution is always accepted if it
is better. If worse, it is accepted with a computed probability.

Figure 7: Adjust Temperature Processor. The temperature is reduced
geometrically each time this processing stage runs.

Fig. 8 shows an intermediate output of our tool as it works
upon the Evaluate Processor. It is a control-data flow
graph depicting basic blocks, data dependencies, control
dependencies, and data operations for the evaluate function
described above. The Evaluate Processor will be designed
using Design Space Exploration. A Simulated Annealing
heuristic will be utilized to identify the optimal processor. In
this case, optimal signifies the appropriate tradeoff between
latency of the Evaluate Processor and resource utilization on
the FPGA.
The manner in which the design space is searched is as
follows: A minimally adequate architecture is derived as a
starting point for the processor. The search is performed by
repeatedly increasing the resources (such as arithmetic units or
temporary register storage) allotted to the Evaluate Processor
to reduce its latency until either no further optimization can be
done or the processor latency falls below the latency of
another processing stage. Latency is determined by
simulating the performance of the processor using a software
model.
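A compact sketch of this greedy search loop, with estimate_latency()
standing in for the software performance model (both names are ours):

extern int estimate_latency(int resources); /* software timing model  */

/* Grow the resource allocation until the latency stops improving or  */
/* this stage is no longer the pipeline bottleneck.                   */
int explore(int resources, int bottleneck_latency)
{
    int latency = estimate_latency(resources);
    while (latency > bottleneck_latency) {
        int next = estimate_latency(resources + 1);
        if (next >= latency)
            break;               /* no further optimization possible  */
        resources++;
        latency = next;
    }
    return resources;
}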
IV. FUTURE WORK
The first priority for future work is to complete the Evaluate
Processor. The entire architecture can then be tested and
optimized. The end-to-end tool chain depicted in fig. 2 can
then be built.
The architecture described in this paper is only an initial
stage in our vision of hardware acceleration of the Iterative
Repair algorithm. Once the processing platform outlined here
has been completed, a future step would be to further exploit
the parallel nature of the iterative repair algorithm. Simulated
Annealing is a sequential algorithm that can be pipelined, but
not parallelized. However, similar heuristic search techniques
exist that are much more conducive to parallelization.
Stochastic Beam Search is one of these. It is almost identical
to Simulated Annealing, but a set of “best” solutions is
maintained rather than a single solution. The pseudocode for
the Stochastic Beam Search is shown in fig. 9. A modified
version of the Stochastic Beam Search could make better use
of FPGA resources whenever the traditional Simulated
Annealing implementation leaves significant space unused.
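In C terms, the difference can be sketched as follows, reusing the
functions from the earlier sketches; the beam width K and selection
rule are illustrative (fig. 9 gives the actual pseudocode).

#define K 8                        /* beam width (illustrative)       */

int beam[K][MAX_EVENTS];           /* the K best solutions so far     */
int work[K][MAX_EVENTS];           /* candidates for the next beam    */
int work_score[K];

void sbs_step(void)
{
    int k;
    for (k = 0; k < K; k++) {      /* members are independent, so the */
        /* copy/alter/evaluate work can proceed in parallel hardware  */
        copy_solution(beam[k], work[k], MAX_EVENTS);
        alter(work[k]);
        work_score[k] = evaluate(work[k]);
    }
    /* ...then stochastically select the next beam from the candidates */
}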
Finally, the tool will be designed to accept additional
optimization constraints such as using Triple Modular
Redundancy (TMR) or other techniques for implementing
fault tolerance along with power optimization strategies.
REFERENCES
[1] A. Winterholler, M. Roman, D. Miller, J. Krause, and T. Hunt,
"Automated core sample handling for future Mars drill missions," in 8th
International Symposium on Artificial Intelligence, Robotics and
Automation in Space, Germany, 2005.
[2] "New Space Communications Capabilities Available for NASA's
Discovery and New Frontier Programs," in NASA Technology
Discovery/New Frontier Roadmap, 2006.
[3] H. Emam, M. A. Ashour, H. Fekry, and A. M. Wahdan, "Introducing an
FPGA based genetic algorithms in the applications of blind signals
separation," in System-on-Chip for Real-Time Applications, 2003.
Proceedings. The 3rd IEEE International Workshop on, 2003, pp. 123-
127.
[4] M. S. Hamid and S. Marshall, "FPGA realisation of the genetic
algorithm for the design of grey-scale soft morphological filters," in
Visual Information Engineering, 2003. VIE 2003. International
Conference on, 2003, pp. 141-144.
[5] H. E. Mostafa, A. I. Khadragi, and Y. Y. Hanafi, "Hardware
implementation of genetic algorithm on FPGA," in Radio Science
Conference, 2004. NRSC 2004. Proceedings of the Twenty-First
National, 2004, pp. C9-1-9.
[6] T. Anantharaman and R. Bisiani, "A hardware accelerator for speech
recognition algorithms," in Proceedings of the 13th annual international
symposium on Computer architecture, Tokyo, Japan, 1986, pp. 216-223.
[7] G. Mehta, R. R. Hoare, J. Stander, and A. K. Jones, "Design space
exploration for low-power reconfigurable fabrics," in Parallel and
Distributed Processing Symposium, 2006. IPDPS 2006. 20th
International, 2006, 4 pp.
[8] B. Miramond and J. M. Delosme, "Design space exploration for
dynamically reconfigurable architectures," in Design, Automation and
Test in Europe, 2005. Proceedings, 2005, vol. 1, pp. 366-371.
Figure 8: Control-Data Flow Graph of the Evaluate function. Information
contained in this graph can be used to create an optimal application-specific
processor.
Figure 9: The Stochastic Beam Search algorithm is similar to
Simulated Annealing.