Bilkent University
Department of Computer Engineering


Reducing Dependent-Data Application Costs with Co-Placement on HDFS


H.Saygın Arkan
MSc Student
Computer Engineering Department
Bilkent University

MapReduce has emerged as a programming paradigm for data-parallel applications, especially the processing of vast amounts of data. The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably and to stream them at high bandwidth to user applications. The Hadoop platform, with its MapReduce framework and its file system HDFS, has been successful at processing independent (embarrassingly parallel) tasks, i.e., tasks whose partitioned data can be processed (almost) independently. Recently, as data processing demands on the Hadoop platform have grown more complex, schemes have emerged that enable Hadoop to handle problems with data dependencies. Our research can be categorized as an effort in this direction. We plan to modify the HDFS NameNode structure so that related (dependent) data can be stored on the same or nearby datanodes, with the expectation that queries over dependent data can be answered faster if unnecessary data transfers are avoided. We envisage improvements especially in the processing of web queries and data mining applications. Another focus of our study is keeping fault tolerance at acceptable levels while ensuring lossless data processing.
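The co-placement idea above can be illustrated with a minimal sketch (not actual HDFS code; node names and the notion of a "relation key" tagging dependent blocks are assumptions for illustration): if the placement decision is a deterministic function of a key shared by dependent blocks, all such blocks land on the same datanode and cross-node transfers are avoided.

```python
# Illustrative sketch only: simulates a co-placement policy in which
# blocks sharing a hypothetical "relation key" are assigned to the same
# datanode. Real HDFS placement is decided by the NameNode and also
# accounts for replication and rack awareness, which are omitted here.
import hashlib

NODES = ["node0", "node1", "node2", "node3"]  # hypothetical datanode pool


def place_block(relation_key: str, nodes=NODES) -> str:
    """Deterministically map a relation key to a datanode, so every
    block tagged with that key is co-placed on the same node."""
    digest = hashlib.sha256(relation_key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


# Blocks of dependent data (same relation key) always co-locate,
# while unrelated keys spread across the pool.
assert place_block("user:42") == place_block("user:42")
```

A production version would instead plug into the NameNode's block placement logic and would still need to spread replicas across nodes and racks, which is the fault-tolerance tension the abstract mentions.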

Keywords: Hadoop, Distributed Systems, MapReduce


DATE: 09 April, 2012, Monday @ 16:00