HTCMP - Heterogeneous Chip Multiprocessor Design

People:

Asst. Prof. Ozcan Ozturk

M.S. Student Dilek Demirbas

M.S. Student Ismail Akturk

M.S. Student Mahmut Sami Dikici

Ph.D. Student Naveed Ul Mustafa

Funding: European Commission (EC) FP7-PEOPLE Marie Curie Actions

Duration: 2009 – 2013

Amount: €100,000

Abstract:

The increasing complexity of applications and their large dataset sizes make it imperative to consider novel architectures that are efficient in terms of both performance and power. Chip multiprocessors (CMPs) are one such example, in which multiple processor cores are placed on the same die. As technology scales, the International Technology Roadmap for Semiconductors (ITRS) projects that the number of cores in a CMP will increase drastically to satisfy the performance requirements of future applications. A critical question that needs to be answered for CMPs is the size and strength of the cores. Homogeneous chip multiprocessors provide only one type of core to match widely varying application requirements, and consequently do not fully utilize the available chip area and power budget. The ability to dynamically switch between different cores and to power down unused cores gives heterogeneous chip multiprocessing a key advantage. One of the challenging problems in the context of heterogeneous chip multiprocessor systems is the placement of processor cores and storage blocks within the available chip area. Focusing on such a heterogeneous chip multiprocessor, we address five design decision problems. First, decide on the memory hierarchy design and its distribution within the available chip area. Second, effectively distribute the available area among the processor cores and the memory blocks (cache). Third, select the optimum number of processors and their types from the available processor types. Fourth, perform thread and data distribution within the given processor and memory design. Fifth, evaluate the improvements brought by advanced techniques, such as 3D designs. Our past experience and preliminary results indicate that the proposed approach will be able to generate promising results.

Report Period II:

PUBLISHABLE SUMMARY

HTCMP aims at designing efficient and powerful heterogeneous chip multiprocessors. A challenging problem in the context of heterogeneous chip multiprocessor systems is the placement of processor cores and storage blocks within the available chip area. Focusing on such a heterogeneous chip multiprocessor, we address different design decision problems: (i) effective distribution of the available area among the processor cores and the memory blocks (cache), (ii) memory hierarchy design, (iii) selection of the number of processors and their types from the processor pool, (iv) thread and data distribution, and (v) advanced techniques such as 3D designs.

Our main objective is to make significant contributions towards the development of compiler-based techniques for emerging and future HTCMPs. The software support for such systems is lagging well behind current advancements at the circuit and architecture levels. Effective compilation support for these architectures will make programming them much easier, thereby helping scientists to port their applications to these architectures. The outcomes of this research will benefit the computer architecture field by revealing the types of processor cores and memory components that the compiler needs to achieve the best application adaptation under dynamically changing power, performance, and thermal conditions.

During the second report period, we built on the work of the first report period. The two major contributions of the first report period were 1) distribution of the available area among the processor cores and the memory blocks and 2) processor selection. Our contributions in the second period were 1) thread and data distribution, 2) communication reduction, and 3) advanced optimizations. The first part introduces an application-specific heterogeneous Network-on-Chip (NoC) design algorithm that considers the given constraints and generates a floorplan for the desired many-core chip. The second part aims at minimizing the communication costs of 3D NoC architectures. The third part proposes a reliability-aware 3D NoC design that reduces inter-layer communication.

During the second reporting period, the project resulted in two direct journal publications, one conference publication, one poster, and three indirect journal publications. Moreover, three M.S. students and one Ph.D. student were supported.

PROJECT OBJECTIVES FOR THE PERIOD

After the first reporting period, we worked on the three main objectives listed below.

Thread and data distribution

In the design of chip multiprocessors, architects and designers must decide how thread and data distribution should happen to realize the desired chip. However, the immense variation among processing cores makes this decision tedious and error-prone. Thus, determining the types of processing cores to be used is of great importance. Our algorithm identifies the processing cores that will be used in the design and places them on a given chip area such that the total on-chip latency is minimized while the given area is utilized as much as possible. It should be noted that regular NoC topologies are inappropriate for such cases because heterogeneous architectures have non-uniform sets of processing cores. Thus, an effective custom NoC is the key to achieving the desired performance of a heterogeneous multi-core.

Communication Reduction

NoC architectures have been extended to the third dimension with the help of through-silicon vias (TSVs). 3D NoCs have the potential to achieve better performance with higher scalability and lower power consumption. Most of the related work on 3D NoCs considers homogeneous cores. While 3D NoCs provide the aforementioned benefits, they cannot be fully exploited without heterogeneity, because every application (and each part of an application) has different characteristics. Enabling heterogeneity in 3D NoC architectures makes it possible to match these varied requirements while keeping energy consumption and heat dissipation as low as possible. Since heat is one of the most critical issues in 3D ICs, heterogeneity is a promising way to meet these requirements. A well-known heterogeneous (asymmetric) chip multiprocessor (CMP) is IBM’s Cell processor, in which one PPU (power processing unit) and eight SPUs (synergistic processing units) are combined to perform more efficiently. It was shown that a representative heterogeneous processor using two core types achieves as much as a 63 percent performance improvement over an equivalent-area homogeneous processor, mainly by matching execution resources to application needs effectively. One of the challenging problems in the context of 3D NoC heterogeneous chip multiprocessor systems is the placement of processor cores within the available chip area. Focusing on such a heterogeneous 3D NoC, we explored how different types of processors can be placed to minimize data access costs.

Advanced Optimizations

The ability to stack separate chips in a single package enables three-dimensional integrated circuits (3D ICs). Heterogeneous 3D ICs provide even better opportunities to reduce power and increase performance per unit area. An important issue in designing a heterogeneous 3D IC is reliability. To achieve high reliability, one needs to select the data mapping and processor layout carefully. This work addresses the problem using an integer linear programming (ILP) approach. Specifically, on a heterogeneous 3D CMP, it explores how applications can be mapped onto 3D ICs to maximize reliability. Preliminary experimental evaluation indicates that the proposed technique generates promising results in both reliability and performance.

WORK PROGRESS AND ACHIEVEMENTS DURING THE PERIOD

Along with the generation of a custom NoC topology, two other important concerns affect the overall performance of a multi-core: task scheduling and core mapping. Our work is complementary to task scheduling and core mapping because their effectiveness is tightly coupled to the NoC that is used. Our algorithm cooperates with task-scheduling and core-mapping algorithms during the generation of the desired NoC and the floorplan. Although the given constraints belong to separate stages of the design process, we take them into account as a whole. For example, maximization of area utilization is considered together with latency. We target application-specific heterogeneous NoC-based architecture generation. We implemented our approach using a latency-aware Least-Wasted-First (LWF) 2D bin-packing algorithm.

The global interconnect problem has become more important as the number of processor cores in chip multiprocessors increases. 3D designs and NoC architectures have been unified as 3D NoCs to overcome the interconnect scaling bottleneck. We map heterogeneous processors onto the given 3D chip area with minimal data communication costs. To this end, we present an Integer Linear Programming (ILP) formulation of the problem of minimizing the data communication cost of a given application, which is achieved through optimal placement of nodes in a 3D NoC. Our initial results indicate that the proposed approach generates promising results within tolerable solution times.
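
The placement objective can be illustrated with a small sketch. It is not the project's ILP formulation: exhaustive search stands in for the ILP solver, and the 2 x 2 x 2 mesh, the traffic volumes, and the hop-count cost model are all assumptions made for illustration.

```python
from itertools import permutations

# Hypothetical 2 x 2 x 2 mesh: every (x, y, z) cell can host one node.
CELLS = [(x, y, z) for x in range(2) for y in range(2) for z in range(2)]

# Hypothetical traffic volumes between pairs of application nodes 0..3.
TRAFFIC = {(0, 1): 8, (1, 2): 5, (2, 3): 6, (0, 3): 2}
NUM_NODES = 4


def hops(a, b):
    """Manhattan distance between two cells, used as the per-unit link cost."""
    return sum(abs(p - q) for p, q in zip(a, b))


def comm_cost(placement):
    """Total cost: traffic volume times hop count, summed over all pairs."""
    return sum(vol * hops(placement[i], placement[j])
               for (i, j), vol in TRAFFIC.items())


def best_placement():
    """Try every assignment of nodes to distinct cells and keep the cheapest."""
    return min(permutations(CELLS, NUM_NODES), key=comm_cost)


if __name__ == "__main__":
    placement = best_placement()
    for node, cell in enumerate(placement):
        print(f"node {node} -> cell {cell}")
    print("communication cost:", comm_cost(placement))
```

In this toy instance every communicating pair can be made adjacent, so the minimum equals the sum of the traffic volumes (21). Exhaustive search only works at this scale; an ILP solver is what makes realistic mesh sizes tractable.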

As technology shrinks, one of the challenging problems in the context of 3D NoC systems is reliability. The reliability of 3D ICs is affected by both temperature and thermo-mechanical stress, largely because of the limited cooling capability between the layers. Specifically, vias become increasingly sensitive, and when a via fails to make a proper connection, yield loss and decreased reliability may result. Reliability for 3D ICs has been explored from different angles. Through-silicon vias (TSVs) are the most recent medium for stacking multiple dies in a 3D IC. However, these vias become more sensitive at higher temperatures, which can be caused by more activity or traffic. Since TSVs are bridges between layers, they are potentially more prone to thermal stress. Therefore, reducing the TSV communication load has the potential to improve reliability. This work aims at increasing the reliability of an application through effective mapping on a 3D heterogeneous IC. The contribution of the approach is twofold: 1) an ILP formulation of the problem of maximizing the reliability of a given application, achieved through optimal placement of nodes in a 3D NoC; and 2) minimization of the communication cost between the nodes, thereby improving both performance and energy consumption. The ILP-based approach presented here aims to reduce the amount of layer-to-layer communication on TSVs while keeping the overall communication overhead minimal.
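
To make the two quantities concrete, the sketch below computes a TSV load and a total communication cost for two candidate mappings. The node names, traffic volumes, the 2 x 1 x 2 stack, and the definition of TSV load as traffic volume times the number of layers crossed are assumptions for illustration, not the metrics used in the project.

```python
# Hypothetical traffic volumes between communicating nodes.
TRAFFIC = {("a", "b"): 9, ("b", "c"): 1, ("c", "d"): 9}


def tsv_load(mapping):
    """Traffic volume weighted by the number of layer crossings (z distance)."""
    return sum(vol * abs(mapping[i][2] - mapping[j][2])
               for (i, j), vol in TRAFFIC.items())


def total_cost(mapping):
    """Traffic volume weighted by the full 3D Manhattan hop count."""
    return sum(vol * sum(abs(p - q) for p, q in zip(mapping[i], mapping[j]))
               for (i, j), vol in TRAFFIC.items())


# Two candidate mappings of nodes to (x, y, z) cells on a 2 x 1 x 2 stack.
STACKED = {"a": (0, 0, 0), "b": (0, 0, 1), "c": (1, 0, 1), "d": (1, 0, 0)}
LAYERED = {"a": (0, 0, 0), "b": (1, 0, 0), "c": (1, 0, 1), "d": (0, 0, 1)}

if __name__ == "__main__":
    for name, mapping in (("stacked", STACKED), ("layered", LAYERED)):
        print(name, "TSV load:", tsv_load(mapping),
              "total cost:", total_cost(mapping))
```

Both mappings have the same total hop-weighted cost (19), but keeping the heavily communicating pairs within a layer drops the TSV load from 18 to 1. This is the kind of trade-off a reliability-aware objective can exploit without giving up communication performance.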

DISSEMINATION ACTIVITIES

Directly-related journal publications (specifically acknowledged the FP7 support):

Application-Specific Heterogeneous Network-on-Chip Design, by Dilek Demirbas; Ismail Akturk; Ozcan Ozturk; Ugur Gudukbay. The Computer Journal 2013; doi: 10.1093/comjnl/bxt011.

Reliability-Aware Heterogeneous 3D Chip Multiprocessor Design, by Ismail Akturk and Ozcan Ozturk. Accepted. Journal of Electronic Testing: Theory and Applications.

ILP-Based Communication Reduction for Heterogeneous 3D Network-on-Chips, Ismail Akturk and Ozcan Ozturk. In Proc. of 21st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP'13), Belfast, Northern Ireland, February 2013.

Reliability-Aware 3D Chip Multiprocessor Design, Ismail Akturk and Ozcan Ozturk. In Proc. of Manufacturable and Dependable Multicore Architectures at Nanoscale (MEDIAN'12), Annecy, France, June 2012.

Indirect journal publications (specifically acknowledged the FP7 support):

Compiler-Directed Energy Reduction Using Dynamic Voltage Scaling and Voltage Islands for Embedded Systems, by O. Ozturk, M. Kandemir, and G. Chen, IEEE Transactions on Computers (TC), Vol. 62, No. 2, pages 268-278, February 2013.

Reducing Memory Space Consumption Through Dataflow Analysis, by O. Ozturk, Computer Languages, Systems & Structures, Volume 37, Issue 4, October 2011, pages 168-177.

Multicore Education Through Simulation, by O. Ozturk, IEEE Transactions on Education (TE), Volume 54, Issue 2, pages 203-209, May 2011.

Report Period I:

PUBLISHABLE SUMMARY

After the initial setup, we profiled the benchmarks and estimated their memory and processing requirements. Based on these requirements, we implemented two major components of the project. In both components, after parallelization and mapping, the input code is fed to a compiler analysis module. The purpose of this module is to identify the set of CMP nodes that communicate with each other. This information is subsequently passed to the solver, which determines the location of each node within the NoC-based CMP and the type of processor used for each node. The specific solvers we used in implementing this approach are (1) a genetic algorithm (GA) based solver implemented in Java and (2) an integer linear programming (ILP) based solver implemented on a commercial tool.

The project resulted in one direct journal publication and four indirect journal publications. During the first part of the project, one M.S. student was supported and is about to graduate, and a Ph.D. student is still being supported.

PROJECT OBJECTIVES FOR THE PERIOD

In the first nine months of the project, we worked on the project setup and initial requirements. As explained in the initial project proposal, the main tasks performed in this context were hiring research assistants, training research assistants, conducting a literature survey, designing the software system, and collecting benchmarks. Apart from the initial setup, the main objectives of the project for this period are:

Distribute the available area among the processor cores and the memory blocks

Limited die area needs to be utilized effectively, as applications differ in their processing needs and memory requirements. While some applications are processor hungry and require relatively little data, others process bigger data sets. The former will prefer more processing power, whereas the latter will benefit from a bigger cache. For a given application, we will first analyze the application requirements using a profiler; the resulting profile will, in turn, be used in formulating the area distribution problem.

Processor selection

When the Alpha cores EV4 and EV8 are compared [2,3], one can see that EV8 requires 80 times more area and 12 times more power to gain just above 2 times better performance. The marginal benefit provided by larger cores is lower than that of smaller cores; consequently, one should try to utilize the available chip area and power budget according to the application needs. Based on the application requirements obtained by the profiler, the proposed approach will select the suitable processing cores using ILP and heuristic techniques.
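
The diminishing return can be made explicit with a quick calculation. The sketch below uses only the ratios quoted above, rounding "just above 2 times" down to 2.0, which is an assumption; the function and variable names are illustrative.

```python
# Relative figures quoted in the text for EV8 versus EV4 (EV4 = 1.0 baseline).
# "Just above 2 times" is rounded down to 2.0 here, which is an assumption.
AREA_RATIO = 80.0
POWER_RATIO = 12.0
PERF_RATIO = 2.0


def efficiency_vs_baseline(perf_ratio, cost_ratio):
    """Performance obtained per unit of cost, relative to the baseline core."""
    return perf_ratio / cost_ratio


perf_per_area = efficiency_vs_baseline(PERF_RATIO, AREA_RATIO)
perf_per_power = efficiency_vs_baseline(PERF_RATIO, POWER_RATIO)

if __name__ == "__main__":
    print(f"EV8 performance per unit area:  {perf_per_area:.3f}x that of EV4")
    print(f"EV8 performance per unit power: {perf_per_power:.3f}x that of EV4")
```

Under these figures, EV8 delivers roughly 1/40 of EV4's performance per unit area and about 1/6 of its performance per unit power, which is why a workload that parallelizes well may be better served by many small cores than by one large core.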

WORK PROGRESS AND ACHIEVEMENTS DURING THE PERIOD

One of the main achievements in the first period of the project is evolutionary optimization. Evolutionary computational techniques start with an initial population, whose individuals are generated through variations in the input parameters. New generations are produced from new individuals created through cross-over and mutation. Similar to natural selection, some of the individuals in the population are promoted to the next generation according to a fitness function. Cross-over is performed on the highest-fitness individuals, thereby forcing low-fitness individuals to disappear.

We try to minimize the communication and computation cost of a given parallel application by selection and placement of nodes through a genetic algorithm (GA). We start our GA implementation with a set of randomly generated chromosomes, which form our initial population. The size of this initial population is selected based on the NoC size and the number of different types of processors, as it greatly affects the solution time. In our GA implementation, chromosomes are represented as triplets, (t, x, y), where t indicates the type of processor used for that node and (x, y) indicates the coordinates of the location of the node.
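
A minimal sketch of such a GA is given below, assuming one (t, x, y) gene per node. The project's solver was written in Java and its details are not given here, so the mesh size, processor catalogue, workload, cost function, repair step, and all parameter values are hypothetical.

```python
import random

# Hypothetical 3 x 3 mesh and processor catalogue: type -> (speed, area penalty).
ALL_CELLS = [(x, y) for x in range(3) for y in range(3)]
PROC_TYPES = {0: (1.0, 0.0), 1: (2.5, 1.5)}
# Hypothetical workload: compute demand per node and traffic between node pairs.
WORK = [4.0, 1.0, 3.0, 2.0]
TRAFFIC = {(0, 1): 10, (1, 2): 4, (2, 3): 7, (0, 3): 1}


def random_chromosome(rng):
    """One (t, x, y) gene per node, with every node on a distinct cell."""
    cells = rng.sample(ALL_CELLS, len(WORK))
    return [(rng.choice(list(PROC_TYPES)), x, y) for x, y in cells]


def cost(chrom):
    """Lower is fitter: communication plus computation plus area penalty."""
    comm = sum(vol * (abs(chrom[a][1] - chrom[b][1]) + abs(chrom[a][2] - chrom[b][2]))
               for (a, b), vol in TRAFFIC.items())
    comp = sum(w / PROC_TYPES[t][0] + PROC_TYPES[t][1]
               for w, (t, _, _) in zip(WORK, chrom))
    return comm + comp


def repair(chrom, rng):
    """Move any gene whose cell is already taken to a random unused cell."""
    used, fixed = set(), []
    for t, x, y in chrom:
        if (x, y) in used:
            x, y = rng.choice([c for c in ALL_CELLS if c not in used])
        used.add((x, y))
        fixed.append((t, x, y))
    return fixed


def mutate(chrom, rng, rate=0.2):
    """Randomly change a gene's processor type and/or its location."""
    out = []
    for t, x, y in chrom:
        if rng.random() < rate:
            t = rng.choice(list(PROC_TYPES))
        if rng.random() < rate:
            x, y = rng.choice(ALL_CELLS)
        out.append((t, x, y))
    return repair(out, rng)


def evolve(generations=60, pop_size=30, seed=0):
    """Run the GA and return the cheapest chromosome found."""
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)
        elite = population[: pop_size // 2]    # fittest half is promoted
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)        # parents come from the elite
            cut = rng.randrange(1, len(WORK))  # single-point cross-over
            children.append(mutate(a[:cut] + b[cut:], rng))
        population = elite + children
    return min(population, key=cost)


if __name__ == "__main__":
    best = evolve()
    print("best chromosome:", best, "cost:", round(cost(best), 2))
```

Because the fittest half is carried over unchanged, the best cost never gets worse from one generation to the next. The repair step is one simple way to keep two nodes from landing on the same cell after cross-over or mutation.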

We have also implemented an integer linear programming (ILP) based formulation of the problem of minimizing communication and computation by determining the optimal selection and placement of nodes in an NoC-based CMP. ILP provides a set of techniques for solving optimization problems in which both the objective function and the constraints are linear functions and the solution variables are restricted to be integers. The 0-1 ILP is an ILP problem in which each (solution) variable is restricted to be either 0 or 1.
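
In generic notation, a 0-1 ILP has the form shown first below. The symbols c, A, b, x, and n are standard textbook notation and are not taken from the project reports. The second group of constraints is an assumed, commonly used way to encode "each node gets exactly one position and each position holds at most one node"; the actual formulation is not given here.

```latex
% Generic 0-1 ILP: linear objective, linear constraints, binary variables.
\begin{align*}
\min_{x}\quad & c^{\top} x \\
\text{s.t.}\quad & A x \le b, \\
& x \in \{0,1\}^{n}.
\end{align*}

% Assumed placement encoding: x_{i,p} = 1 iff node i is placed at position p.
\begin{align*}
\sum_{p} x_{i,p} &= 1 \quad \forall i, \\
\sum_{i} x_{i,p} &\le 1 \quad \forall p.
\end{align*}
```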

We performed experiments with eight different array-based benchmark codes parallelized through an optimizing compiler built upon SUIF. For each benchmark, we evaluated four execution models: Homogeneous (same type of processors), Random (randomly selected processors), GA (genetic algorithm based processor selection), and ILP (integer linear programming based processor selection). According to the experiments, while some of the benchmarks prefer a homogeneous NoC topology, as in the case of ammp and vortex, most of the benchmarks take advantage of a heterogeneous NoC. The effectiveness of our GA-based and ILP-based approaches increases with an increasing number of processor types.

DISSEMINATION ACTIVITIES

Directly-related journal publications (specifically acknowledged the FP7 support):

Heterogeneous NoC Design Through Evolutionary Computing, by Ozcan Ozturk and Dilek Demirbas, International Journal of Electronics, Taylor & Francis, Volume 97, No. 10, pages 1139-1161, 2010.

Indirect journal publications (specifically acknowledged the FP7 support):

On-Chip Memory Space Partitioning for Chip Multiprocessors using Polyhedral Algebra, by Ozcan Ozturk, Mahmut Kandemir, and Mary J. Irwin, IET Computers & Digital Techniques, Volume 4, Issue 6, pages 484-498, 2010.

Data Locality and Parallelism Optimization Using A Constraint-Based Approach, by O. Ozturk, Journal of Parallel and Distributed Computing (JPDC), DOI: 10.1016/j.jpdc.2010.08.005.

Improving Chip Multiprocessor Reliability Through Code Replication, by Ozcan Ozturk. Computers & Electrical Engineering, Elsevier, Issue 36, pages 480-490, 2010, ISSN 0045-7906, DOI: 10.1016/j.compeleceng.2009.11.004.

Compiler Directed Communication Reliability Enhancement for Chip Multiprocessors, by O. Ozturk, M. Kandemir, S. Narayanan, and M. J. Irwin. ACM SIGPLAN Notices, Vol. 45, No. 4, pp. 85-94, 2010.