Heterogeneous Chip Multiprocessor Design (HTCMP)


 

 

 

 

People:

Asst. Prof. Ozcan Ozturk

MS. Student Dilek Demirbas

Ph.D. Student Ismail Akturk

Funding: Funded by European Commission (EC) FP7-PEOPLE MARIE CURIE ACTIONS

Duration: 2009 2013

Amount: 100,000

 

Abstract:

 

Increasing complexity of applications and their large dataset sizes make it imperative to consider novel architectures that are efficient from both performance and power angles. Chip Multiprocessors (CMP) are one such example where multiple processor cores are placed into the same die. As technology scales, the International Technology Roadmap for Semiconductors (ITRS) projects that the number of cores in a chip multiprocessor (CMP) will drastically increase to satisfy performance requirements of future applications. A critical question that needs to be answered in CMPs is the size and strength of the cores. Homogeneous chip multiprocessors provide only one type of core to match these various application requirements, consequently not fully utilizing the available chip area and power budget. The ability to dynamically switch between different cores, and power down unused cores gives a key advantage to heterogeneous chip multiprocessing. One of the challenging problems in the context of heterogeneous chip multiprocessor systems is the placement of processor cores and storage blocks within the available chip area. Focusing on such a heterogeneous chip multiprocessor, we address different design decision problems. First, decide on the memory hierarchy design and its distribution within the available chip area. Second, distribute effectively the available area among the processor cores and the memory blocks (cache). Third, select the optimum number of processors and their types among the available processor types. Fourth, perform thread and data distribution within the given processor and memory design. Fifth, evaluate improvements brought by advanced techniques, such as 3D designs. Our past experience and preliminary results indicate that the proposed approach will be able to generate promising results.

 

Report Period I:

 

PUBLISHABLE SUMMARY

 

HTCMP aims at designing efficient and powerful heterogeneous chip multiprocessors (HTCMP). A challenging problem in the context of heterogeneous chip multiprocessor systems is the placement of processor cores and storage blocks within the available chip area. Focusing on such a heterogeneous chip multiprocessor, we address different design decision problems: (i) effective distribution of the available area among the processor cores and the memory blocks (cache), (ii) memory hierarchy design (iii) selection of number of processors and their types from the processor pool, (iv) thread and data distribution (v) advanced techniques such as 3D designs.

 

Our main objective is to make significant contributions towards the development of compiler-based techniques for emerging and future HTCMPs. The software support for such systems is lagging way behind current advancements at the circuit and architecture levels. Effective compilation support for these architectures will make programming them much easier, thereby helping scientists to port their applications to these architectures. Outcomes of this research will be beneficial to the computer architecture field in that it will reveal the types of processor cores and memory components that are needed by the compiler for achieving the best application adaptation under dynamically changing power, performance and thermal conditions.

 

After the initial setup, we profiled the benchmarks and estimated their memory and processing requirements. Based on these requirements, we have implemented two major components of the project. In both components, after parallelization and mapping, the input code is fed to a compiler analysis module. Purpose of this module is to identify the set of CMP nodes that communicate with each other. This information is subsequently passed to the solver which determines the location of each node within the NoC based CMP and the type of processor used for each node. Specific solvers we used in implementing this approach are: (1) a genetic algorithm (GA) based solver implemented using Java and (2) an integer linear programming (ILP) based solver implemented on a commercial tool.

 

Project resulted in one direct journal publication and four indirect journal publications. During the course of the first part of the project one M.S. student supported and is about to graduate and a Ph.D. student is still being supported.

 

PROJECT OBJECTIVES FOR THE PERIOD

 

In the first nine months of the project we have worked on the project setup and initial requirements. As explained in the initial project proposal, main tasks performed in this context are hiring research assistants, training of researchassistants, literature survey, software system design, collecting benchmarks. Main objectives of the project for this period apart from the initial setup are:

 

Distribute the available area among the processor cores and the memory blocks

 

Limited die area needs to be effectively utilized as applications differ in their processing needs and memory requirements. While some applications can be processor hungry and require relevantly less data, others may process bigger data sets. Former will prefer a larger processing power, whereas the latter will benefit from a bigger cache. For a given application, we will first analyze the application requirements using a profiler which, in turn, will be used in formulating the area distribution problem.

 

Processor selection

 

When Alpha cores EV4 and EV8 is compared [2,3], one can see that EV8 consumes 80 times larger area and 12 times more power to gain just above 2 times better performance. Marginal benefit provided by larger cores is lower compared to smaller cores; consequently one should try to utilize the available chip area and power budget according to the application needs. Based on the application requirements obtained by profiler, proposed approach will select the suitable processing core using ILP and heuristic techniques.

 

WORK PROGRESS AND ACHIEVEMENTS DURING THE PERIOD

 

One of the main achievements in the first period of the project is evolutionary-based optimization. Evolutionary computational techniques start with an initial population, where individuals of the population are generated through variations in the input parameters. New generations are generated using new models created through cross-over and mutation. Similar to natural selection, some of the individuals in the population are promoted to the next generation using a fitness function. Cross-over is performed on highest fitness individuals, thereby forcing low-fitness individuals disappear.

 

We try to minimize the communication and computation of a given parallel application by selection and placement of nodes through a genetic algorithm (GA). We start our GA implementation with a set of randomly generated chromosomes which form our initial population. Size of this initial population is selected based on the NoC size and the number of different type of processors, as it greatly affects the solution time. In our GA implementation, chromosomes are represented as triplets, (t, x, y), where t indicates the type of processor used for that node, whereas (x, y) indicates the coordinates of the location of the node.

 

We have also implemented an integer linear programming (ILP) based formulation of the problem of minimizing communication and computation by determining the optimal selection and placement of nodes in an NoC based CMP. ILP provides a set of techniques that solve those optimization problems in which both the objective function and constraints are linear functions and the solution variables are restricted to be integers. The 0-1 ILP is an ILP problem in which each (solution) variable is restricted to be either 0 or 1.

 

We performed experiments with eight different array-based benchmark codes parallelized through an optimizing compiler built upon SUIF. We performed experiments with four different execution models for each benchmark; Homogeneous (same type of processors), Random (randomly selected processors), GA (genetic algorithm based processor selection), and ILP (integer linear programming based processor selection). According to the experiments, while some of the benchmarks prefer a homogeneous NoC topology, as in the case of ammp and vortex, most of the benchmarks take advantage of heterogeneous NoC. The effectiveness of our GA-based and ILP approaches increases with increasing number of processor types.

 

DISSEMINATION ACTIVITIES

 

Directly-related journal publications (specifically acknowledged the FP7 support):

 

Heterogeneous NoC Design Through Evolutionary Computing, by Ozcan Ozturk and Dilek Demirbas, International Journal of Electronics, Francis & Taylor, Volume 97, No. 10, pages 1139-1161, 2010.

 

Indirect journal publications (specifically acknowledged the FP7 support):

 

On-Chip Memory Space Partitioning for Chip Multiprocessors using Polyhedral Algebra, by Ozcan Ozturk, Mahmut Kandemir, and Mary J. Irwin, IET Computers & Digital Techniques, Volume 4, Issue 6, pages 484-498, 2010.

Data Locality and Parallelism Optimization Using A Constraint-Based Approach, by O. Ozturk, Journal of Parallel and Distributed Computing (JPDC), DOI: 10.1016/j.jpdc.2010.08.005.

Improving Chip Multiprocessor Reliability Through Code Replication, by Ozcan Ozturk. Computers & Electrical Engineering, Elsevier, Issue 36, pages 480-490, 2010, ISSN 0045-7906, DOI: 10.1016/j.compeleceng.2009.11.004).

Compiler Directed Communication Reliability Enhancement for Chip Multiprocessors, by O. Ozturk, M. Kandemir, S. Narayanan, and M. J. Irwin. ACM SIGPLAN Notices, Vol. 45, No. 4, pp. 85-94, 2010.