Heterogeneous Chip Multiprocessor Design
(HTCMP)
People:
Asst. Prof. Ozcan Ozturk
MS. Student Dilek Demirbas
Ph.D. Student Ismail Akturk
Funding: Funded by
European Commission (EC) FP7-PEOPLE MARIE CURIE ACTIONS
Duration: 2009 – 2013
Amount: €100,000
Abstract:
Increasing
complexity of applications and their large dataset sizes make it imperative to
consider novel architectures that are efficient from both performance and power
angles. Chip Multiprocessors (CMP) are one such example where multiple
processor cores are placed into the same die. As technology scales, the
International Technology Roadmap for Semiconductors (ITRS) projects that the
number of cores in a chip multiprocessor (CMP) will drastically increase to
satisfy performance requirements of future applications. A critical question
that needs to be answered in CMPs is the size and strength of the cores.
Homogeneous chip multiprocessors provide only one type of core to match these
various application requirements, consequently not fully utilizing the
available chip area and power budget. The ability to dynamically switch between
different cores, and power down unused cores gives a
key advantage to heterogeneous chip multiprocessing. One of the challenging
problems in the context of heterogeneous chip multiprocessor systems is the
placement of processor cores and storage blocks within the available chip area.
Focusing on such a heterogeneous chip multiprocessor, we address different
design decision problems. First, decide on the memory hierarchy design and its
distribution within the available chip area. Second, distribute effectively the
available area among the processor cores and the memory blocks (cache). Third,
select the optimum number of processors and their types among the available
processor types. Fourth, perform thread and data distribution within the given
processor and memory design. Fifth, evaluate improvements brought by advanced
techniques, such as 3D designs. Our past experience and preliminary results
indicate that the proposed approach will be able to generate promising results.
Report Period I:
PUBLISHABLE
SUMMARY
HTCMP aims at
designing efficient and powerful heterogeneous chip multiprocessors (HTCMP). A
challenging problem in the context of heterogeneous chip multiprocessor systems
is the placement of processor cores and storage blocks within the available
chip area. Focusing on such a heterogeneous chip multiprocessor, we address
different design decision problems: (i) effective
distribution of the available area among the processor cores and the memory
blocks (cache), (ii) memory hierarchy design (iii) selection of number of
processors and their types from the processor pool, (iv)
thread and data distribution (v) advanced techniques such as 3D designs.
Our main
objective is to make significant contributions towards the development of
compiler-based techniques for emerging and future HTCMPs. The software support
for such systems is lagging way behind current advancements at the circuit and
architecture levels. Effective compilation support for these architectures will
make programming them much easier, thereby helping scientists to port their
applications to these architectures. Outcomes of this research will be
beneficial to the computer architecture field in that it will reveal the types
of processor cores and memory components that are needed by the compiler for
achieving the best application adaptation under dynamically changing power,
performance and thermal conditions.
After the
initial setup, we profiled the benchmarks and estimated their memory and
processing requirements. Based on these requirements, we have implemented two
major components of the project. In both components, after parallelization and
mapping, the input code is fed to a compiler analysis module. Purpose of this
module is to identify the set of CMP nodes that communicate with each other.
This information is subsequently passed to the solver which determines the
location of each node within the NoC based CMP and
the type of processor used for each node. Specific solvers we used in
implementing this approach are: (1) a genetic algorithm (GA) based solver
implemented using Java and (2) an integer linear programming (ILP) based solver
implemented on a commercial tool.
Project
resulted in one direct journal publication and four indirect journal
publications. During the course of the first part of the project one M.S.
student supported and is about to graduate and a Ph.D. student is still being
supported.
PROJECT OBJECTIVES FOR THE PERIOD
In the first
nine months of the project we have worked on the project setup and initial
requirements. As explained in the initial project proposal, main tasks
performed in this context are hiring research assistants, training of researchassistants, literature survey, software system
design, collecting benchmarks. Main objectives of the project for this period
apart from the initial setup are:
Distribute the available area among the processor
cores and the memory blocks
Limited die
area needs to be effectively utilized as applications differ in their
processing needs and memory requirements. While some applications can be
processor hungry and require relevantly less data, others may process bigger
data sets. Former will prefer a larger processing power, whereas the latter
will benefit from a bigger cache. For a given application, we will first analyze
the application requirements using a profiler which, in turn, will be used in
formulating the area distribution problem.
Processor selection
When Alpha
cores EV4 and EV8 is compared [2,3], one can see that EV8 consumes 80 times
larger area and 12 times more power to gain just above 2 times better
performance. Marginal benefit provided by larger cores is lower compared to
smaller cores; consequently one should try to utilize the available chip area
and power budget according to the application needs. Based on the application
requirements obtained by profiler, proposed approach will select the suitable
processing core using ILP and heuristic techniques.
WORK PROGRESS AND ACHIEVEMENTS DURING THE PERIOD
One of the
main achievements in the first period of the project is evolutionary-based
optimization. Evolutionary computational techniques start with an initial
population, where individuals of the population are generated through
variations in the input parameters. New generations are generated using new models
created through cross-over and mutation. Similar to natural selection, some of
the individuals in the population are promoted to the next generation using a
fitness function. Cross-over is performed on highest fitness individuals,
thereby forcing low-fitness individuals disappear.
We try to
minimize the communication and computation of a given parallel application by
selection and placement of nodes through a genetic algorithm (GA). We start our
GA implementation with a set of randomly generated chromosomes which form our
initial population. Size of this initial population is selected based on the NoC size and the number of different type of processors, as
it greatly affects the solution time. In our GA implementation, chromosomes are
represented as triplets, (t, x, y), where t indicates the type of processor
used for that node, whereas (x, y) indicates the coordinates of the location of
the node.
We have also
implemented an integer linear programming (ILP) based formulation of the
problem of minimizing communication and computation by determining the optimal
selection and placement of nodes in an NoC based CMP. ILP provides a set of techniques that solve
those optimization problems in which both the objective function and
constraints are linear functions and the solution variables are restricted to
be integers. The 0-1 ILP is an ILP problem in which each (solution) variable is
restricted to be either 0 or 1.
We performed
experiments with eight different array-based benchmark codes parallelized
through an optimizing compiler built upon SUIF. We performed experiments with
four different execution models for each benchmark; Homogeneous (same type of
processors), Random (randomly selected processors), GA (genetic algorithm based
processor selection), and ILP (integer linear programming based processor
selection). According to the experiments, while some of the benchmarks prefer a
homogeneous NoC topology, as in the case of ammp and vortex, most of the benchmarks take advantage of
heterogeneous NoC. The effectiveness of our GA-based
and ILP approaches increases with increasing number of processor types.
DISSEMINATION ACTIVITIES
Directly-related
journal publications (specifically acknowledged the FP7 support):
Heterogeneous NoC Design Through Evolutionary Computing, by Ozcan Ozturk and
Dilek Demirbas, International Journal of Electronics, Francis & Taylor,
Volume 97, No. 10, pages 1139-1161, 2010.
Indirect journal
publications (specifically acknowledged the FP7 support):
On-Chip Memory Space Partitioning for Chip Multiprocessors using
Polyhedral Algebra, by Ozcan Ozturk, Mahmut Kandemir, and Mary J. Irwin,
IET Computers & Digital Techniques, Volume 4, Issue 6, pages 484-498, 2010.
Data Locality and Parallelism Optimization Using A
Constraint-Based Approach, by O. Ozturk, Journal of Parallel and
Distributed Computing (JPDC), DOI: 10.1016/j.jpdc.2010.08.005.
Improving Chip Multiprocessor Reliability Through Code Replication, by Ozcan Ozturk. Computers
& Electrical Engineering, Elsevier, Issue 36, pages 480-490, 2010, ISSN
0045-7906, DOI: 10.1016/j.compeleceng.2009.11.004).
Compiler Directed Communication Reliability Enhancement for
Chip Multiprocessors, by O. Ozturk, M. Kandemir, S. Narayanan, and M. J.
Irwin. ACM SIGPLAN Notices, Vol. 45, No. 4, pp. 85-94, 2010.