In this homework you will implement a simple function that executes map/reduce jobs.
A map/reduce job contains two tasks, namely: a mapper and a reducer.
The mapper takes as input a data item and produces a list of (key, value) pairs. As an example, a mapper can take a line of text as input, and for each unique word, compute the number of times it appears in the line of text. E.g.:
in: "to be or not to be"
out: [("to",2), ("be",2), ("or",1), ("not",1)]
The reducer takes as input a key and a list of values and produce a (key, value) pair. For instance, a mapper can take a word and list of counts, and reduce this into a sum. E.g.:
in: "be", [2, 4, 5, 7]
out: ("be", 18)
The goal of the map/reduce function is to apply the mapper to each input data, organize the output such that for each unique key, the list of values for it are kept in a list, and then apply the reducer to compute the final results. Continuing with the running example, assume our input text is:
"to be or not to be" "be where the change is" "to be is to do"
We apply the mapper to get:
[("to",2),("be",2),("or",1),("not",1)]
[("be",1),("where",1),("the",1),("change",1),("is",1)]
[("to",2),("be",1),("is",1),("do",1)]
We then reorganize so that values for the same key are organized together:
{
"to" => [2, 2],
"be" => [2, 1, 1],
"or" => [1],
"not" => [1],
"where" => [1],
"the" => [1],
"change" => [1],
"is" => [1, 1],
"do" => [1]
}
Then for each key, we apply the reducer, to get:
{
"to" => 4,
"be" => 4,
"or" => 1,
"not" => 1,
"where" => 1,
"the" => 1,
"change" => 1,
"is" => 2,
"do" => 1
}
Part a) Complete the following code:
def computeWordCounts(line):
# TODO
def sumupWordCounts(word, counts):
# TODO
def mapReduce(data, mapper, reducer):
# TODO
def test():
with open('/Users/bgedik/Desktop/zzz.txt') as f:
lines = f.read().splitlines()
counts = mapReduce(lines, mapper=computeWordCounts, reducer=sumupWordCounts);
for word, count in counts:
print word, " => ", count
Part b) Write a brief report on the following:
Put your code and report under a directory named lastname_name_hw3 and make an archive from that directory. Your report should be named lastname_name_hw3.pdf (or .txt). For example, the following Unix commands could be used:
mkdir lastname_name_hw3
cd lastname_name
...
(edit and test your files in this directory)
...
cd ..
tar -cvzf lastname_name_hw3.tar.gz lastname_name_hw3
Then e-mail this newly generated file (named lastname_name_hw3.tar.gz) to your TA Doğukan Çağatay (dogukan.cagatay@bilkent.edu.tr).
Collaboration on the homework is not allowed. Reports in formats other than .txt and .pdf are not accepted.