Homework III

In this homework you will implement a simple function that executes map/reduce jobs.

A map/reduce job contains two tasks, namely: a mapper and a reducer.

The mapper takes as input a data item and produces a list of (key, value) pairs. As an example, a mapper can take a line of text as input, and for each unique word, compute the number of times it appears in the line of text. E.g.:

in: "to be or not to be"
out: [("to",2), ("be",2), ("or",1), ("not",1)]

The reducer takes as input a key and a list of values and produce a (key, value) pair. For instance, a mapper can take a word and list of counts, and reduce this into a sum. E.g.:

in: "be", [2, 4, 5, 7]
out: ("be", 18)

The goal of the map/reduce function is to apply the mapper to each input data, organize the output such that for each unique key, the list of values for it are kept in a list, and then apply the reducer to compute the final results. Continuing with the running example, assume our input text is:

"to be or not to be"
"be where the change is"
"to be is to do"

We apply the mapper to get:

[("to",2),("be",2),("or",1),("not",1)]
[("be",1),("where",1),("the",1),("change",1),("is",1)]
[("to",2),("be",1),("is",1),("do",1)]

We then reorganize so that values for the same key are organized together:

{
"to" => [2, 2],
"be" => [2, 1, 1],
"or" => [1],
"not" => [1],
"where" => [1],
"the" => [1],
"change" => [1],
"is" => [1, 1],
"do" => [1]
}

Then for each key, we apply the reducer, to get:

{
"to" => 4,
"be" => 4,
"or" => 1,
"not" => 1,
"where" => 1,
"the" => 1,
"change" => 1,
"is" => 2,
"do" => 1 
}

Part a) Complete the following code:

def computeWordCounts(line):
    # TODO

def sumupWordCounts(word, counts):
    # TODO

def mapReduce(data, mapper, reducer):
    # TODO

def test():
    with open('/Users/bgedik/Desktop/zzz.txt') as f:
        lines  = f.read().splitlines()
    counts = mapReduce(lines, mapper=computeWordCounts, reducer=sumupWordCounts);
    for word, count in counts:
        print word, " => ", count

Part b) Write a brief report on the following:

Logistics

Put your code and report under a directory named lastname_name_hw3 and make an archive from that directory. Your report should be named lastname_name_hw3.pdf (or .txt). For example, the following Unix commands could be used:

    mkdir lastname_name_hw3
    cd lastname_name
        ...
        (edit and test your files in this directory)
        ...
    cd ..
    tar -cvzf lastname_name_hw3.tar.gz lastname_name_hw3

Then e-mail this newly generated file (named lastname_name_hw3.tar.gz) to your TA Doğukan Çağatay (dogukan.cagatay@bilkent.edu.tr).

Collaboration on the homework is not allowed. Reports in formats other than .txt and .pdf are not accepted.