ASIP Design Methodologies : Survey and Issues

Manoj Kumar Jain, M. Balakrishnan, Anshul Kumar

Department of Computer Science and Engineering

Indian Institute of Technology, Delhi

{manoj,mbala,anshul}@cse.iitd.ernet.in

Abstract

Interest in synthesis of Application Speciﬁc Instruction Processors or ASIPs has increased considerably and a

number of methodologies have been proposed in the last decade. This paper attempts to survey the state of the art in

this area and identiﬁes some issues which need to be addressed. We have identiﬁed the ﬁve key steps in ASIP design as

application analysis, architectural design space exploration, instruction set generation, code synthesis and hardware

synthesis. A broad classiﬁcation of the approaches reported in the literature is done. The paper notes the need to

broaden the architectural space being explored and to tightly couple the various subtasks in ASIP synthesis.

1 Introduction

An Application Speciﬁc Instruction Processor (ASIP) is a processor designed for a particular application or for a

set of speciﬁc applications. An ASIP exploits special characteristics of application(s) to meet the desired performance,

cost and power requirements. ASIPs can be very efﬁcient when applied to a speciﬁc application such as digital signal

processing, servo motor control, automatic control systems, avionics, cellular phones etc [35, 5].

According to Liem et al [31], ASIPs are a balance between two extremes : ASICs and general programmable

processors. ASIPs offer the availability of custom sections for time critical tasks (e.g. a Multiply-Adder for real time

DSP), and offer ﬂexibility through an instruction-set. Hoon et al [15] observed that for the complex applications more

ﬂexibility is required to accommodate design errors and speciﬁcation changes, which may happen at the later design

stages. Since an ASIC is specially designed for one behavior, it is difﬁcult to make any changes at a later stage. In

such a situation, the ASIPs offer the required ﬂexibility at lower cost than general programmable processors.

There are several issues in the ASIP design process. Some are already addressed by various reserchers and

some more to be addressed. Integrated work is done rarely. Important steps involved in the ASIP design process are

identiﬁed. We attempted to classify the approaches on the basis of technique used for different steps. In this paper we

present a survey of work done on ASIP design. Approaches employed for each major step is classiﬁed and important

contributions are highlighted.

In section 2 we have described the steps involved in the ASIP design process, whereas, in section 3 we have dis-

cussed the approaches to application analysis which is crucial for synthesizing a suitable ASIP architecture. Sections

4 and 5 describe the various techniques to architectural exploration and instructionset generation respectively. Section

6 describes and classiﬁes the approaches to code generation, and ﬁnally, section 7 presents the conclusion and future

work.

Application Analysis

Architectural Design

Space Exploration

Instruction Set

Code

Hardware

Synthesis

Code

Synthesis

Object Code

Design Constraints

Application(s) and

Generation

Figure 1: Flow diagram of ASIP design methodology

2 Steps in ASIP Synthesis

Gloria et al [10] deﬁned some main requirements of the design of application-speciﬁc architectures. Important

among these are as follows.

Design starts with the application behavior.

Evaluate several architectural options.

Identify hardware functionalities to speed up the application.

Introduce hardware resources for frequently used operations only if it can be supported during compilation.

Various methodologies have been reported to meet these requirements. We have studied these methodologies

and found that typically there are ﬁve main steps followed in the synthesis of ASIPs. The steps are shown in ﬁg. 1.

1. Application Analysis : Input in the ASIP design process is an application or a set of applications, along with

their test data and design constraints. It is essential to analyze the application to get the desired characteristics/

requirements which can guide the hardware synthesis as well as instruction set generation. An application

written in a high level language is analyzed statically and dynamically and analyzed information is stored in

some suitable intermediate format, which is used in the subsequent steps.

2. Architectural Design Space Exploration : First a set of possible architectures is identiﬁed for a speciﬁc

application(s) using output of step 1 as well as the given design constraints. Performance of possible architec-

tures is estimated and suitable architectures satisfying performance and power constraints and having minimum

hardware cost is selected.

3. Instruction Set Generation : Instruction set is to be synthesized for that particular application, and for the

architecture selected. This instruction set is used during the code synthesis and hardware synthesis steps.

4. Code Synthesis : Compilergenerator or retargetable code generator is used to synthesize code for the particular

application or for a set of applications.

5. Hardware Synthesis : In this step the hardware is synthesized using the ASIP architectural template and

instruction set architecture starting from a description in VHDL/ VERLOG using standard tools.

Every methodologydoes not emphasize on all these steps. Some of them consider the processor micro-architecture

to be ﬁxed while only synthesizing the instructionset within the ﬂexibility provided by the microarchitecture, e.g. [11],

[15] and [29], while others consider the process of instruction set synthesis only after the parallelism and functionality

of the processor microarchitecture is ﬁnalized based on the application, e.g. [13], [17], [32] and [36].

3 Application Analysis

Typically ASIP design starts with analysis of the applications. These applications with their test data should be

analyzed statically and dynamically using some suitable proﬁler before proceeding further in the design process.

Sato et al [35] have developed an Application Program Analyzer (APA). The output of APA includes data types

and their access methods, execution counts of operators and functions etc used in application program, the frequency

of individual instructions and sequence of contiguous instructions.

The methodology suggested by Gupta et al [12] takes the application as well as the processor architecture as

inputs. Using SUIF as intermediate format, a number of application parameters are extracted. These include the

average basic block size, number of Multiply-Accumulate(MAC)operations, ratio of address computation instructions

to data computation instructions, ratio of input/output instructions to the total instructions, average number of cycles

between generation of a scalar and its consumption in the data ﬂow graph etc. Similar analysis is performed in the

scheme reported by Ghazal et al [9].

Saghir et al [34] have shown that dynamic behavior of DSP kernels signiﬁcantly differs from that of DSP appli-

cation programs. This implies that kernels should play a more signiﬁcant role in DSP application analysis especially

for parameters related to operation concurrency.

4 Architectural Exploration

Figure 2 shows a typical architecture exploration block diagram. Inputs from the application analysis step are

used along with the range of architecture design space to select a suitable architecture(s). The selection process

typically can be viewed to consist of a search technique over the design space driven by a performance estimator. The

methodologies considered differ both in the range and nature of architecture design space as well as the estimation

techniques employed.

4.1 Architecture design space

Almost all the techniques suggested by various researchers emphasized the need for a good parameterized model

for the architecture. Different values can be assigned to parameters (of course keeping design constraints into consid-

architecture

for a specific

Performace Estimator

Input from

Application analysis

Selected

Architecture(s)

Architecture

Design Space

Control

Figure 2: Block diagram of an architecture explorer

eration), which constitute the design space. So the size of the design space will depend on the number of parameters

and the range of values which can be assigned to these parameters.

The parameterized architecture model suggested by almost all the techniques includes the number of functional

units of different types. Gong et al [7] consider storage units and interconnect resources also in their architectural

model. Binh et al [2, 3] emphasize on incorporating pipelined functional units in the model. Kienhuis et al [19] put

available element types (like buffer, controller, router and functional unit) and their composition rules. Composition

rules generate alternative implementations of elements. All parameters together support a large design space.

Kin et al [20] consider issue width, number of branch units, number of memory units, the size of instruction

cache and size of data cache etc. in the model. Gupta et al [12] use a model which include parameters like number

of registers, number of operation slots in each instruction, concurrent load/store operations and latency of functional

units and operations. Ghazal [9] include optimizing features like addressing support, instruction packing, memory

pack/ unpack support, loop vectorization, complex arithmetic patterns such as dual multiply-accumulate, complex

multiplication etc in their architectural model. This results in a large design space. They have developed a retargetable

estimator which takes advantage of such an architecture model.

Architectures considered by different researchers also differ in terms of the instruction level parallelism they

support. For example [3] and [19] do not support instruction level parallelism, whereas [7] and [12] support VLIW

architecture and [9] and [20] support VLIW as well as super scalar architecture.

Most of these approaches consider only a ﬂat memory. Only [20] addresses consideration of instruction and data

cache sizes during design space exploration, but the range of architectures explored is rather limited. Similarly no

approach considers ﬂexibility in terms of number of stages in a pipeline, though a pipelined architecture is considered

in [3] and [10]. Table 1 compares various architecture models used in major approaches for design space exploration.

Approach Gloria Sato Gong Binh Kinhuis Kin Gupta Gazhal

Attribute [10] [35] [7] [2, 3] [19] [20] [12] [9]

Number of FUs of different

type

yes yes yes yes yes yes yes yes

Storage units yes yes yes yes yes

Interconnect resources yes yes

Pipelining yes yes

Controller yes

Variable Issue-width yes

No. of branch units yes

Memory hierarchy yes yes

Instruction level parallelism VLIW no VLIW no no VLIW, Super-

scalar

VLIW VLIW, Super-

scalar

addressing support yes yes

memory pack/ unpack yes

Loop vectorization yes

Complex arith. patterns yes yes

Distinguishing Complexity

of FU uncon-

strained

Considered stor-

age units inter-

connect resources

Introduced

pipelining

Alternative im-

plementation of

elements considered

Considered

memory

hierarchy

Few optimiz-

ing features

considered

Optimizing

features

considered

Table 1: Comparison of various architecture models used for design space exploration

4.2 Performance Estimation

In the literature two major techniques have been used for performance estimation. They are scheduler based and

simulator based.

1. Scheduler based : In this type of approach the problem is formulated as a resource constrained scheduling

problem with the selected architecture components as the resources and the application is scheduled to generate

an estimate of the cycle count. Proﬁle data is used to obtain the frequency of each operation. Examples of such

approaches are [7], [12] and [9].

Gong et al [7] proposed a scheduler which takes into account not only functional and storage unit resources but

also interconnectresources forperformance evaluation. They use a customized GNU C compiler (GCC) to trans-

late a C program into three address instructions. The three-address instruction format is based on a RISC like

load/store architecture in which only load and store operations access memory and all other operations work on

registers. A parallelizing compiler which is used to extract program parallelism is a ﬁne-grain VLIW compiler

which takes serial three-address instruction code produced by GCC as input and generates parallel three-address

instruction code in which an instruction may contain more than one operation. The VLIW compiler schedules

operations as soon as possible using unlimited resources. A ultra-ﬁne-grain scheduler is developed to exploit

hardware parallelism offered by various architectures. It takes either the serial three-address instruction code

or the parallel three-address instruction code as input and schedules them on various application-speciﬁc ar-

chitectures instantiated from the parameterized architecture model. The output is the parallel control code in

which register transfers in each control step are performed concurrently. A simulator is used to mimic the exe-

cution of the serial three-address code, the parallel three-address code as well as parallel control code to obtain

performance metrics.

Gupta et al [12] have developed a resource constrained scheduler that estimates the number of clock cycles,

given the description ﬁle of the target processor architecture, application timing constraints and the proﬁling

data. A key feature of this list scheduler is its capability to reﬂect the ﬂexibility of the instruction set to handle

concurrency. Processor selector rejects processors which do not meet the constraints. Ghazal et al [9] have

developed a retargetable estimator for performance estimation required for architectural exploration using a

better architectural model. A distinguishing feature of their approach is that they consider several optimizing

transformations while mapping the application on to the target architecture. The optimizations considered in-

clude : optimized multi-operation pattern matching (e.g. multiply-accumulate), address mode optimization (e.g.

auto-update, circular buffer), loop optimization (e.g. pre-fetched sequential loads), loop vectorization/ packing,

simple if-else conversion (by use of predicated instructions), rescheduling within basic blocks, loop unrolling,

software pipelining etc. The ﬂow diagram of the estimation scheme followed in both ([9] and [12]) approaches

is shown in ﬁg. 3

One of the important features of the architectural models used in [9] and [12] is that it captures the differentiating

capabilitiesof the instruction set and special functional resources, rather than the complete speciﬁcation required

for code synthesis or simulation. This helps in fast performance estimation. Further estimator used in [9] also

produces a trace of the application annotated with the suggested optimizations and ranked bottlenecks.

2. Simulator based : In the approaches suggested by [10], [19], [20], [18] [3] and [4], a simulation model of

the architecture based on the selected features is generated and the application is simulated on this model to

C Code

’SUIF’ Front End

Translation and

Annotation

Estimation

Cycle-level

Profiler

Parameterized

Architecture Model

Annotated Profile

Estimate,

. Translation of SUIF instructions to

Architecture-Compatible

Instructions

. Optimized Computation Patterns

. High-level Optimization Features

. Address Generation Cost

. Functional Unit Usage Rules

. Instruction Set Attributes

Figure 3: The Flow of Estimation Scheme : Ghazal [9] & Gupta [12]

compute the performance.

A system for the evaluation of architectures has been presented by Gloria et al [10]. The simulator, by using

pre-deﬁned modules and the architecture description, reports performance and statistics about resource utiliza-

tion. Its engine is an event-driven algorithm also supporting various memory hierarchy structures (data-cache,

instruction-cache, bank interleaving, etc.). The application is translated into an intermediate code, composed of

simple instructions directly expressing primitive hardware functionalities. Optimization techniques such as loop

unrolling, data-dependency analysis, variable renaming and loop interchanging, are further applied in order to

increase the degree of parallelism. A sequential code simulation is then performed, in order to validate the code,

and to extract some early features which can perform preliminary architectural choices. For instance, frequently

executed sequence of instructions can be substituted by the introduction of new instructions.

Kienhuis et al [19] constructed a retargetable simulator for an architecture template. For each architecture

instance of the architecture template, a speciﬁc simulator is derived in three steps. The architecture instance is

constructed, an execution model is added and the executable architecture is instrumented with metric collectors

to obtain performance numbers. Object oriented principles together with a high-level simulation mechanism are

used to ensure retargetability and efﬁcient simulation speed. Kin et al [20] propose an approach to design space

exploration for media processors considering power consumption. Aggressively optimized code is generated

using IMPACT compiler to increase instruction level parallelism. The optimized code is consumed by the Lsim

simulator. Measured run-times of benchmarks through simulations are used to compute the energy based on

the power model they have described. The power dissipation numbers are normalized with respect to a baseline

machine power dissipation.

Most researchers have focused on performance and area but do not address the power consumption. Only Kin et

al[20] and Imai et al[18] considered power consumption. Kin et al[20] computes power consumption of the generated

ASIP from a very coarse model which is based on the total cycle count. Main approaches considering performance

estimation are compared in table 2.

Approach Gloria Gong Binh Kinhuis Kin Gupta Gazhal

Attribute [10] [7] [2, 3] [19] [20] [12] [9]

Technique Simulation based Scheduler

based

Simulation

based

Simulation based Simulation based Scheduler based Scheduler based

Algorithm Extension of

list scheduling

branch-and-

bound

List scheduling List scheduling

FU consideration yes multiple iden-

tical FUs

yes yes

Interconnect

resources

yes yes yes

Optimizations loop-unrolling, data-

dependency analysis,

variable-renaming,

loop interchanging

no instruction level

parallelism

optimized multi-operation

pattern matching, address

modes, rescheduling, loop

unrolling

as above and loop vector-

ization, memory packing,

software pipelining etc.

Measure of per-

formance

cycle count and

resource allocation

statistics

performance

number

cycle count cycle count and trace of

the application annotated

with suggested optimiza-

tions

Memory hierar-

chy

yes yes

Other main fea-

tures

RAM and

ROM sizes

consideration

Metric collectors

used to obtain

performance

number

power consump-

tion and area con-

straints consider-

ation

only differentiating fea-

tures of instruction set is

captured

only differentiating fea-

tures of instruction set is

captured

Table 2: Comparison of major performance measurement techniques

4.3 Search Control

Algorithms for search range from branch-and-bound ([3] and [20]) to exhaustive over a restricted domain ([12]).

Binh et al [3] suggested a new HW/SW partitioning algorithm (branch-and-bound) for synthesizing the highest

performance pipelined ASIPs with multiple identical functional units. Constraints considered are given gate count

and power consumption. They have improvised their algorithm considering RAM and ROM sizes as well as chip

area constraints [4]. Chip area includes the hardware cost of the register ﬁle for a given application program with the

associated input data set. This optimization algorithm deﬁnes the best trade offs between the CPU core, RAM and

ROM of an ASIP chip to achieve highest performance while satisfying design constraints on the chip area. Kin et al

[20] ﬁrst reduced the search space by eliminating machine conﬁgurationsnot satisfying given area constraint and those

that are dominated by at least one another machine conﬁguration and then they used a branch-and-bound algorithm

for searching for an optimum solution fast.

Gupta et al [12] developed a processor selector which ﬁrst reduces the domain of search by restricting the various

architectural parameter ranges based on certain gross characteristics of the processors, such as branch-delay, presence/

absence of multiply-accumulateoperation, depth of pipeline and memory bandwidths. The output is a smaller subset of

processors. Processor description ﬁles of this restricted range of processors are supplied to the performance estimator

which exhaustively explores this set.

5 Instruction Set Generation

We have found that almost all approaches to instruction set generation can be classiﬁed as either instruction set

synthesis approach or instruction set selection approach on the basis of how they are generating instructions.

1. Instruction set synthesis : In this class of techniques, instruction set is synthesized for a particular application

based on the application requirements, quantiﬁed in terms of the required micro-operations and their frequen-

cies. Examples of such approaches are in [16, 17], [15] and [11]. Theses approaches differ in the manner the

instructions are formed, from a list of operations.

Huang and Despain [16, 17] integrated problem of instruction formation with scheduling problem. Simulated

annealing technique is used to solve the scheduling problem. Instructions are generated from time steps in the

schedule. Each time step corresponds to one instruction. Pipelined machine with data stationary control model

is assumed. They had considered goal of the instruction set design as to optimize and trade off the instruction

set size, the program size and the number of cycles to execute a program. General multiple-cycle instructions

are not considered. Multiple-cycle arithmetic/logic operations, memory access, and change of control ﬂow

(branch/jump/call) are supported by specifying the delay cycles as design parameters. The formulation takes

the application, architecture template, objective function and design constraints in the input and generates as

outputs the instruction set, resource allocation (which instantiates the architecture template) and assembly code

for the application. This approach is shown in ﬁg. 4.

Hoon et al [15] proposed another approach which supports multi-cycle complex instructions as well. First the

list of micro-operations is matched to the primary instructions in order to estimate the execution time if only

these instructions are employed. If the estimated performance is unacceptable then the complex instructions are

considered. If the performance is still unacceptable, special instructions which require special hardware units

Applications

(CDFGs)

Design Constraints

Objective Function

Micro Architecture

Specifications

Scheduling /

Allocation Formation

Instruction Set Compiled Code

Instruction

Figure 4: The integrated scheduling / instruction-formation process

are included. Gshwind [11] describe an approach for application speciﬁc design based on extendable micropro-

cessor core. Critical code portions are customized using the application-speciﬁc instruction set extensions.

2. Instruction selection : In this class of techniques a superset of instructions is available and a subset of them

is selected to satisfy the performance requirements within the architecture constraints. Examples of such ap-

proaches are in [1], [2], [18], [31], [29] and [36]. All the approaches in this technique differ in the algorithms

they are using to select the instructions from the super set.

Imai et al [35, 18] assumes that the instruction set that can be divided into two groups : operations and functions.

It is relatively easy for a compiler to generate instructions corresponding to operators used in C language com-

pared to those which corresponds to functions. It is because that the full set of operations is already described,

but the full set of user-deﬁned functions is not known a priori. They further divide the set of operations into two

subgroups : primitive operators and basic operators. The set of primitive operators is chosen so that other basic

operators can be substituted by a combination of primitive operators. Thus they have classiﬁed functionalities

as follows.

Primitive Functionalities (PF) : Can be realized by a minimal hardware component, such as ALU and

shifter.

Basic Functionalities (BF) : Set of operations used in C except those included in PF.

Extended Functionalities (XF) : Library functions or user deﬁned functions.

The intermediate instructionsare described as primary RTL, basic RTL, and extended RTLrespectively for these

classes. Generated ASIP will include hardware modules corresponding to all of the primary RTL, but only a

part of the basic RTL and extended RTL. Selection problem is formulated as an integer linear programming

problem with the objective to maximize the performance of the CPU under the constraints of the chip area and

power consumption. The branch-and-bound algorithm is used to solve the problem. Later, they have revised

their formulation considering functional module sharing constraints and pipelining [1, 2].

* *

Figure 5: Expression tree

Some approaches [31, 36, 29] use pattern matching for instruction selection. The set of template patterns is

extracted from the instruction set and the graph representing the intermediate code of the application is covered

by these patterns. By arranging instruction patterns in a tree structure, matching process becomes signiﬁcantly

fast.

For example there is an expression as follows.

y = w1 . x1 + w2 . x2 + w3 . x3 + w4 . x4

The corresponding expression tree is shown in ﬁg. 5. This type of expression evaluation is very frequent in

DSP domain. Hardware conﬁguration required to implement it is shown in ﬁg. 6 a. The instruction patterns are

shown in ﬁg. 6 b. Apart from intrinsic patterns MULT and ADD it has MAC also which performs multiplication

and addition together. If only intrinsicpatterns (ADD and MULT) are available for pattern matching then we get

matching as shown in ﬁg. 7 b. If we use the pattern MAC as well then we get optimal cover as shown in ﬁg. 7 a.

Using cost and delay of different patterns and the frequency count of this expression tree received from dynamic

proﬁling the cost-performance trade offs can be evaluated, and it can be decided that such complex instructions

should be included or not in the selected instruction set.

Liem et al [31] adopts a pattern matching approach to instruction set selection. A wide range of instruction

types is represented as a pattern set, which is organized in a manner such that matching is extremely efﬁcient

and retargetting to architectures with new instruction set is well deﬁned. Shu et al [36] have implemented the

matching method based on a pattern tree structureof instructions. Two genetic algorithms(GA) are implemented

for pattern selection: a pure GA which uses standard GA operators and a GA with backtracking which employs

variable-length chromosomes.

Our analysis also shows that there are design methodologies in both the instruction synthesis [11, 15] as well

as instruction selection [1, 2, 18] categories which start from a basic instruction set and only synthesize/ select

application-speciﬁc special or complex instructions. Comparison of various approaches for instruction set genera-

tion is shown in table 3.

There is an interesting approach given by Praet et al [32] which emphasizes instruction generation by a mix of

instruction selection and instruction synthesis. The main steps involved in this technique are instruction selection by

bundling and instruction set deﬁnition. A bundle is a maximal sequence of micro-operations in which each micro-

operation is directly coupled to its neighbors. Directly coupled micro-operations are primitive processor activities

Pipeline Register

Accumulator

MULT

MAC

ADD

Figure 6: a) Multiplier-accumulator , b) Instruction Patterns

* *

MULT

MAC

* *

MULT

ADD

MULT

ADD

Figure 7: a) Optimal cover with MAC , b) Cover with intrinsic patterns only

Instruction

Set Architecture (ISA)

Architecture Template Application

Retargetable Code Generator

Object Code

Figure 8: Retargetable Code Generator

which pass data to each other through a wire or latch. Graph structure was used for pattern matching rather than

tree structure. Instruction set deﬁnition includes selecting the representive application parts, selecting initial data part

out of a library based on statistics obtained from the analysis tool, iteratively updating these parts considering area/

performance trade offs and deﬁning instruction encodings.

6 Code Synthesis

The work reported have followed two different approaches for code synthesis. They are retargetable code gener-

ator and compiler generator.

1. Retargetable Code Generator : Taking architecture template, instructionset architecture (ISA) and application

as inputs, object code is generated (ﬁg. 8). [6], [38], [24, 25, 29], [21], [33], [13], [37] and [39] follow such an

approach.

All these approaches try to address following sub-problems while synthesizing instructions

(a) Instruction mapping: Instruction patterns are mapped to CDFG of given application using tree or graph

pattern matching technique and an optimal cover is formed.

(b) Resource allocation and binding: Resources like registers, memory, FUs, buses etc are allocated and cor-

responding bindings are performed.

Some approaches address issue of code compaction also while generating code. Examples are [21], [24], [25]

and [29].

Cheng and Lin [6] address issue of reducing the memory access and register usage conﬂicts. The goal is to

maximize the possibility of scheduling more instructions into a control step. Wilson et al [38] formulated an

integer linear programming model to solve the three subproblems concurrently. Their model considers spill over

to memory as well. Leupers et al [24, 25, 29] suggest a two phase scheme using pattern matching for instruction

mapping. In the ﬁrst phase the instructions are modeled as tree patterns and an optimal cover is computed, while

in the second phase a code compaction considering instruction level parallelism is done.

Approach Imai Liem Praet Huang Shu Leupers Hoon Gshwind

Attribute [35, 18, 1, 2] [31] [32] [16, 17] [36] [29] [15] [11]

Generation mech-

anism

selection selection selection

and syn-

thesis

synthesis selection selection synthesis synthesis

Technique/ Algo-

rithm

ILP, branch-and-

bound

tree pattern

matching,

dynamic

programming

graph

pattern

matching

modiﬁed scheduling prob-

lem, simulated annealing

tree pattern

matching,

genetic algo-

rithms

ILP, dynamic

programming

sub-set-sum prob-

lem

hw/sw co-

evaluation

Instruction types primary, basic and

special

single-cycle single-cycle,

multi-cycle

complex

intrinsic,

complex

Other main issues

and distinguish-

ing features

primary, basic and

special RTL for ba-

sic, primary and spe-

cial functionalities

capture be-

havioral

representation

of instructions

instruction utilization, in-

struction operand encod-

ing, delay load/ store and

delay branches. Integrated

instruction formation with

scheduling problem

better opti-

mization than

tree pattern

matching

Generated single

and multi-cycle

complex in-

structions then

sub-set-sum

technique

extendable

core tech-

nique, appli-

cation speciﬁc

instruction set

extensions

Table 3: Comparison of major instruction generation approaches

Assembler

Generator

Binary

Environment

Architecture

Design

Code

Source

C/C++

Compiler

Front End

Compiler

Back end

Assembler

Simulation

SUIF

ISDL

Assembly

AVIV

Figure 9: Compiler frame work used by AVIV

Kreuzer et al [21] suggest a method for code compaction and register optimization to exploit instruction level

parallelism based on trellis tree straight-linecode generation algorithm. Retargetability is provided by the use of

target architecture description ﬁle. Algorithms for intermediate code generation, compaction and optimizations

are processor independent. Praet et al [33] have used graph based processor model in their approach which

including the processor connectivity and parallelism. They have deﬁned the sub-problems of code generation

in terms of this processor model. Yamaguchi et al [39] presented new binding and scheduling algorithms for a

retargetable compiler which can deal with diverse architectures.

Hanono and Devadas [13] address all three subproblems concurrently using branch-and-bound approach. It

accomplishes this by converting the input application to a graphical representation that speciﬁes all possible

ways of implementing the application on target processor. A detailed register allocation is carried out in the

second step, whereas its estimates are generated in the ﬁrst step considering insertion of loads and spill over

to memory if the available resources are exceeded. Their approach is shown in Fig. 9. The compiler front

end receives a source program written in C/C++. It performs machine independent optimizations and generates

an intermediate format description in SUIF. The instruction set description language (ISDL) description of

the target processor is generated either by hand or by a high-level CAD tool. The compiler back end takes

the SUIF code and the ISDL description as inputs and produces assembly code speciﬁc to, and optimized for,

the target processor. The ISDL description is also used to create an assembler which transforms the code

produced by the compiler to a binary ﬁle that is used as input to an instruction-level simulator. Visser [37] used

simulated annealing technique for code generation. This technique focuses especially on highly irregular DSP

architectures. It fully tackles the compiler phase-coupling problem. Phase-coupling problem is how to order

various phases of code generation.

2. Compiler Generator : Taking architecture template and instruction set architecture as the inputs, a retargetable

compiler is generated. This is used to generate object code for the given application written in a high level

language (ﬁg. 10). Examples are Hatche[14], Kuroda[23] and Leupers[30]. A compiler generated in this manner

generally has phases similar to general compilers, namely, program analysis, intermediate code optimization

Instruction

Set Architecture (ISA)

Architecture Template Application

Retargetable Compiler Generator

Object Code

Customized Compiler

Figure 10: Retargetable Compiler

and code generation. The difference is that the optimizations are tailored to the speciﬁc architecture-application

combination.

Hatcher and Tuller [14] developed a code generator generator that accepts a machine description in a YACC-

like format and a set of C structure deﬁnitions for valid tree nodes and produces C source (both functions and

initialized data) for a code generator which is time and space efﬁcient. Its important features of their system

include compact machine description via operator factoring and leaf predicates, fast production of time and

efﬁcient code generators. The System supports arbitrary evaluation orders and goal oriented code generation.

Kuroda and Nishitani [23] proposed a knowledge-based compiler for application speciﬁc signal processors.

Code optimization knowledge employed by expert programmer is implemented in the knowledge-base and is

applied in various phases of the compiler. The generated retargetable compiler RECORD by Leupers and Mar-

wedel [30] does not require tool-speciﬁc modeling formalism, but starts from general HDL processor models.

All processor aspects needed for code generation are automatically derived. Table 4 compares various code

synthesis approaches.

7 Conclusion and future work

We have identiﬁed ﬁve key steps in the process of designing ASIPs. We have presented a survey on the work

done while attempting classiﬁcation of the approaches for each step in the synthesis process. Performance estimation

is based either on scheduler based or simulation based technique. Instruction set is generated either through synthesis

or a selection process, whereas, code is synthesized either by a retargetable code generator or by a custom generated

compiler.

Though, a variety of approaches have evolved in addressing each of the key steps, the target architecture space

being explored by these methodologies is still limited. With the increase in integration, it should be possible to support

memory hierarchies on the chip and the same has not been addressed in an integrated manner. Similarly issues of

pipelined ASIP design as well as low power ASIP design is in its infancy. We also observe that the problems of

processor synthesis and retargetable code generation have been considered in isolation.

Approach Hatcher Kuroda Saghir Cheng Wilson Leupers Kreuzer Praet Gebotys Yamaguchi Hanono Visser

Attribute [14] [23] [34] [6] [38] [25, 27, 26, 28] [21, 22] [33] [8] [39] [13] [37]

Code generator/

Compiler generator

compiler

genera-

tor

compiler

genera-

tor

code

genera-

tor

code

gener-

ator

code

gen-

erator

code generator code

genera-

tor

code

genera-

tor

code genera-

tor

code gen-

erator

code generator code

genera-

tor

Instruction mapping tree

pattern

match-

ing

graph pattern

matching

tree

pattern

match-

ing

graph

pattern

match-

ing

yes yes yes

Resource allocation

and binding

yes yes yes yes yes yes yes yes yes yes yes

Scheduling yes yes yes yes yes yes yes yes yes yes

Code optimization yes yes yes yes yes yes yes yes yes yes yes

Technique/ Algo-

rithm

tree

pattern

match-

ing

ILP branch-and-

bound

simulated

anneal-

ing

Retargetability yes yes yes yes yes yes yes yes yes yes yes yes

Instruction level par-

allelism

yes yes yes

Phase-ordering yes yes

Distinguishing fea-

tures

arbitrary

evalu-

ation

order

uses

experts

knowl-

edge

need for

kernel

synthe-

sis

compaction based

on a heuristic

scheduler, behav-

ioral/ structural

target arch., ad-

dress assignment

scheduling

followed by

instruction

mapping

mapping, binding

and schedul-

ing concurrent,

detailed reg. allo-

cation at second

step

Table 4: Comparison of major code synthesis approaches

References

[1] Alomary, A.; Nakata, T.; Honma, Y.; Imai, M.; Hikichi, N. : “An ASIP instruction set optimization algorithm with functional

module sharing constraint.”, Proceedings of the IEEE/ACM International Conference on Computer Aided Design 1993.

ICCAD-93. Digest of Technical Papers 1993 , 7-11 Nov. 1993, Pages: 526-532.

[2] Binh, N.N.; Imai, M.; Shiomi, A.; Hikichi, N. : “A hardware/software partitioning algorithm for pipelined instruction set

processor.”, Proceedings of the Design Automation Conference, 1995, with EURO-VHDL, EURO-DAC ’95, 18-22 Sept.

1995, Pages: 176-181.

[3] Binh, N.N.; Imai, M.; Shiomi, A. : “A new HW/SW partitioning algorithm for synthesizing the highest performance

pipelined ASIPs with multiple identical FUs.”, Proceedings of the Design Automation Conference, 1996, with EURO-VHDL

’96 , EURO-DAC ’96, 16-20 Sept. 1996, Pages: 126-131.

[4] Binh, N.N.; Imai, M.; Takeuchi, Y. : “A performance maximization algorithm to design ASIPs under the constraint of chip

area including RAM and ROM sizes.”, Proceedings of the Asia and South Paciﬁc Design Automation Conference 1998

(ASP-DAC ’98), 10-13 Feb. 1998, Pages: 367-372.

[5] Childers, B.R.; Davidson J.W. : “Application Speciﬁc Pipelines for Exploiting Instruction-Level Parallelism.”, University of

Virginia Technical Report No. CS-98-14, May 1, 1998.

[6] Cheng W.; Lin Y. : “Code generation for a DSP processor.”, Proceedings of the Seventh International Symposium on High-

Level Synthesis 1994, 18-20 May 1994, Pages: 82-87.

[7] Gong J.; Gajski, D.D.; Nicolau, A. : “Performance evaluation for application-speciﬁc architectures.”, IEEE Transactions on

Very Large Scale Integration (VLSI) Systems, Dec. 1995 Vol. 3 Issue: 4, Pages: 483-490.

[8] Gebotys, C.H. : “An efﬁcient model for DSP code generation: performance, code size, estimated energy.”, Proceedings of

the Tenth International Symposium on System Synthesis 1997, 17-19 Sept. 1997, Pages: 41-47.

[9] Ghazal, N.; Newton, R.; Jan Rabaey. : “Retargetable estimation scheme for DSP architectnre selection.”, Proceedings of the

Asia and South Paciﬁc Design Automation Conference 2000 (ASP-DAC 2000), 25-28 Jan. 2000, Pages: 485-489.

[10] Gloria A. D.; Faraboschi, P. : “An evaluation system for application speciﬁc architectures.”, Proceedings of the 23rd Annual

Workshop and Symposium on Microprogramming and Microarchitecture. (Micro 23), 27-29 Nov. 1990, Pages: 80-89.

[11] Gschwind, M. : “Instruction set selection for ASIP design.”, Proceedings of the Seventh International Workshop on Hard-

ware/Software Codesign 1999 (CODES ’99), 3-5 May 1999, Pages: 7-11.

[12] Gupta, T.V.K.; Sharma, P.; Balakrishnan, M.; Malik, S. : “Processor evaluation in an embedded systems design environ-

ment.”, Proceedings of Thirteenth International Conference on VLSI Design 2000, 3-7 Jan. 2000, Pages: 98-103.

[13] Hanono, S.; Devadas, S. : “Instruction selection, resource allocation, and scheduling in the AVIV retargetable code genera-

tor.”, Proceedings of the Design Automation Conference 1998, 15-19 June 1998, Pages: 510-515.

[14] Hatcher, P.J.; Tuller, J.W. : “Efﬁcient retargetable compiler code generation.”, Proceedings of the International Conference

on Computer Languages 1988, 9-13 Oct. 1988, Pages: 25-30.

[15] Hoon Choi; In-Cheol Park; Seung Ho Hwang; Chong-Min Kyung : “Synthesis of application speciﬁc instructions for

embedded DSP software.”, Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 1998.

ICCAD 98. Digest of Technical Papers 1998, 8-12 Nov. 1998, Pages: 665 - 671.

[16] Huang, I.J.; Despain, A.M. : “Generating Instruction Sets And Microarchitectures From Applications.”, Proceedings of the

IEEE/ACM International Conference on Computer-Aided Design 1994, 6-10 Nov. 1994, Pages: 391-396.

[17] Huang I.J.; Despain, A.M. : “Synthesis of application speciﬁc instruction sets.”, IEEE Transactions on Computer-Aided

Design of Integrated Circuits and Systems, June 1995 Vol. 14 Issue: 6 Pages: 663-675.

[18] Imai, M.; Alomary, A.; Sato, J.; Hikichi, N. : “An integer programming approach to instruction implementation method

selection problem.”, Proceedings of the European Design Automation Conference, 1992., EURO-VHDL ’92, EURO-DAC

’92, 7-10 Sept. 1992, Pages: 106 - 111.

[19] Kienhuis, B.; Deprettere, E.; Vissers, K.; van der Wolf, P. : “The construction of a retargetable simulator for an architecture

template.”, Proceedings of the Sixth International Workshop on Hardware/Software Codesign 1998 (CODES/CASHE ’98),

15-18 March 1998, Pages: 125-129.

[20] Kin, J.; Chunho Lee; Mangione-Smith, W.H.; Potkonjak, M. : “Power efﬁcient mediaprocessors: design space exploration.”,

Proceedings of the 36th Design Automation Conference 1999, 21-25 June 1999, Pages: 321-326.

[21] Kreuzer, W.; Gotschlich, M.; Wess, B. : “A retargetable optimizing codegeneratorfor digital signalprocessors.”,Proceedings

of the IEEE International Symposiumon Circuits and Systems 1996.ISCAS ’96., Connecting the World., 1996, Vol.2, Pages:

257-260.

[22] Kreuzer, W.; Wess, B. : “Cooperative register assignment and code compaction for digital signal processors with irregular

datapaths.”, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997. ICASSP-

97, 21-24 April 1997, Vol. 1, Pages: 691-694.

[23] Kuroda, I.; Nishitani, T. : “A knowledge-basedretargetable compiler for application speciﬁcsignal processors.”, Proceedings

of the IEEE International Symposium on Circuits and Systems 1989, 8-11 May 1989, Vol. 1, Pages: 631-634.

[24] Leupers, R.; Schenk, W.; Marwedel, P. : “Retargetable assembly code generation by bootstrapping.”, Proceedings of the

Seventh International Symposium on High-Level Synthesis 1994, 18-20 May 1994, Pages: 88-93.

[25] Leupers, R.; Niemann, R.; Marwedel, P. : “Methods for retargetable DSP code generation.”, Proceedings of the International

Workshop on VLSI Signal Processing, VII, 1994, 26-28 Oct. 1994, Pages: 127 - 136.

[26] Leupers, R.; Marwedel, P. : “A BDD-based frontend for retargetable compilers.”, Proceedings of the European Design and

Test Conference, 1995. ED&TC 1995, 6-9 March 1995, Pages: 239-243.

[27] Leupers, R.; Marwedel, P. : “Time-constrained code compaction for DSPs.”, IEEE Transactions on Very Large Scale In-

tegration (VLSI) Systems, March 1997, Vol. 5, Issue: 1, Proceedings of the International Conference on Very Large Scale

Integration (VLSI) Systems 1995, 18-20 Jan. 1995, Pages: 112-122.

[28] Leupers, R.; Marwedel, P. : “Algorithms for address assignment in DSP code generation.”, Proceedings of the IEEE/ACM

International Conference on Computer-Aided Design, 1996. ICCAD-96. Digest of Technical Papers 1996, 10-14 Nov. 1996,

Pages: 109-112.

[29] Leupers, R.; Marwedel, P. : “Instruction selection for embedded DSPs with complex instructions.”, Proceedings of the

European Design Automation Conference, 1996, with EURO-VHDL ’96, EURO-DAC ’96, 16-20 Sept. 1996, Pages: 200-

205.

[30] Leupers, R.; Marwedel, P. : “Retargetable generation of code selectors from HDL processor models.”, Proceedings of the

European Design and Test Conference, 1997. ED&TC 97, 9-13 June 1997, Pages: 534-539.

[31] Liem, C.; May, T.; Paulin, P. : “Instruction-set matching and selection for DSP and ASIP code generation.”, Proceedings of

the European Design and Test Conference, 1994. EDAC, The European Conference on Design Automation. ETC European

Test Conference. EUROASIC, 28 Feb.-3 March 1994, Pages: 31-37.

[32] Praet J.V.; Goossens, G.; Lanneer, D.; De Man, H. : “Instruction set deﬁnition and instruction selection for ASIPs.”, Pro-

ceedings of the Seventh International Symposium on High-Level Synthesis 1994, 18-20 May 1995, Pages: 11-16.

[33] Praet, J.V.; Lanneer, D.; Goossens, G.; Geurts, W.; De Man, H. : “A graph based processor model for retargetable code

generation.”, Proceedings of the European Design and Test Conference 1996 (ED&TC 96), 11-14 March 1996, Pages: 102-

107.

[34] Saghir, M.A.R.; Chow, P.; Lee, C.G. : “Application-driven design of DSP architectures and compilers.”, Proceedings of the

IEEE International Conference on Acoustics, Speech, and Signal Processing 1994 (ICASSP-94), 19-22 April 1994, Pages:

II/437-II/440 Vol.2.

[35] Sato, J.; Imai, M.; Hakata, T.; Alomary, A.Y.; Hikichi, N. : “An integrated design environment for application speciﬁc

integrated processor.”, Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and

Processors 1991, ICCD ’91, 14-16 Oct. 1991, Pages: 414-417.

[36] Shu, J.; Wilson, T.C.; Banerji, D.K. : “Instruction-set matching and GA-based selection for embedded-processor code

generation.”, Proceedings of the Ninth International Conference on VLSI Design 1996, 3-6 Jan. 1996, Pages: 73-76.

[37] Visser, B.-S. : “A framework for retargetable code generation using simulated annealing.”, Proceedings of the 25th EU-

ROMICRO Conference 1999, 8-10 Sept. 1999, Pages: 458-462.

[38] Wilson, T.; Grewal, G.; Halley, B.; Banerji, D. : “An integrated approach to retargetable code generation.”, Proceedings of

the Seventh International Symposium on High-Level Synthesis 1994, 18-20 May 1994, Pages: 70-75.

[39] Yamaguchi, M.; Ishiura, N.; Kambe, T. : “Binding and scheduling algorithms for highly retargetable compilation.”, Pro-

ceedings of the Asia and South Paciﬁc Design Automation Conference 1998 (ASP-DAC ’98), 10-13 Feb. 1998, Pages:

93-98.