Big data is a large part of data science, since extracting knowledge and insights requires handling very large volumes of data. This is a challenge for distributed systems: such data has to be transferred in large quantities and may arrive in many different formats.
> [!warning]
> This data may not always be correct.
In many domains, information is relevant when it is fresh and loses its value with the passage of time. This means that this enormous amount of data also needs to be processed quickly.
So the objective becomes a system that is:
- Scalable
- Easy to change
- Has:
	- Automatic parallelization
	- [[Fault tolerance]]
	- Monitoring tools
	- A clean abstraction for programming
In a MapReduce problem the computation is divided into two phases:
- Map: turn all the input data into key-value pairs
- Reduce: merge all the values of a given key into a single result
![[map reduce.png]]
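To make the two phases concrete, here is a minimal single-machine word-count sketch in Python; the `map_phase`/`reduce_phase` names and the in-memory shuffle are illustrative assumptions, not a real distributed runtime.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (key, value) pair for every word in the input."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    """Reduce: collapse all values of a given key into a single result."""
    return (key, sum(values))

def map_reduce(documents):
    # Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            groups[key].append(value)
    # Reduce every group independently; this is the part a framework
    # parallelizes across workers.
    return dict(reduce_phase(k, vs) for k, vs in groups.items())

print(map_reduce(["the quick fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```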
> [!example]- Consider Google Maps
> The map of the world can be represented as a [[Graphs]] in which all the nodes are points of interest and every edge is a directed road. We want to find the shortest path from, say, a given point to the nearest gas station.
> ![[example map.png]]
The first thing that comes to mind is to use one of the well-known [[Search problem]] algorithms to find such a path, but given their [[Complexity of an algorithm]] and the amount of available data this is not feasible. And even if it were, just sending that amount of data through some channel to perform the computation would be impossible.
To solve this problem we need to make the right assumptions:
- Not all the data is required: we can take only the relevant portion, say blocks of 50 km around each node
- All blocks are independent and can reside on any server
Now the computation is divided into two steps (see the sketch after this list):
- Every block finds the distance of each of its nodes from the gas station
- The partial results are re-assembled: using the nodes shared at the intersections of the blocks, the global shortest path is found
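A minimal sketch of these two steps, assuming each block is a small in-memory graph, that roads can be traversed in both directions for simplicity, and with hypothetical names throughout:

```python
import heapq

def dijkstra(adj, source):
    """Single-source shortest paths inside one block (plain Dijkstra).
    adj maps a node to a list of (neighbor, weight) pairs."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def map_block(block):
    """Step 1 (map, one task per block): distance of every node in the
    block to the nearest gas station inside that block."""
    best = {}
    for station in block["gas_stations"]:
        for node, d in dijkstra(block["adj"], station).items():
            best[node] = min(best.get(node, float("inf")), d)
    return best

def reduce_blocks(per_block_results):
    """Step 2 (reduce): boundary nodes appear in several blocks; keeping
    the minimum of their per-block distances stitches the blocks together.
    A real system would repeat map/reduce rounds until boundary distances
    stop improving."""
    merged = {}
    for result in per_block_results:
        for node, d in result.items():
            merged[node] = min(merged.get(node, float("inf")), d)
    return merged

blocks = [
    {"adj": {"g": [("a", 2)], "a": [("g", 2), ("x", 1)], "x": [("a", 1)]},
     "gas_stations": ["g"]},   # block 1: gas station g, boundary node x
    {"adj": {"x": [("c", 4)], "c": [("x", 4)]}, "gas_stations": []},  # block 2
]
print(reduce_blocks([map_block(b) for b in blocks]))
# {'g': 0, 'a': 2, 'x': 3}  -- node c needs another round that propagates x's distance
```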
To design a system able to compute such problems we need to solve three problems. We use a master-worker architecture in which:
- The input data is divided into M map tasks of 64 MB each
- Reduce tasks merge the intermediate results
- The master automatically assigns and manages the workers' tasks
The master node also needs to take the locality of the data into account. The main objective is to minimize the network operations of the system: files are divided into 64 MB chunks, with 3 copies each, all saved on different machines that are nevertheless geographically close.
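As a toy illustration of this placement policy (the machine names and the round-robin choice are assumptions for the example), here is a sketch that computes the chunks of a file and assigns 3 replicas to distinct machines:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per chunk
REPLICAS = 3

def split_into_chunks(file_size):
    """Number of fixed-size chunks a file is divided into."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

def place_replicas(chunk_id, machines):
    """Pick 3 distinct machines for a chunk. A real system would also
    prefer machines that are geographically close to each other."""
    n = len(machines)
    return [machines[(chunk_id + i) % n] for i in range(REPLICAS)]

machines = ["srv-a", "srv-b", "srv-c", "srv-d"]
for c in range(split_into_chunks(200 * 1024 * 1024)):  # a 200 MB file -> 4 chunks
    print(c, place_replicas(c, machines))
```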
On worker failure, the master detects it through periodic pings and assigns the task to a new worker for re-execution. On master failure, we can simply elect a new master and resume the computation.
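A minimal sketch of the worker-failure path, assuming the master keeps a last-heartbeat timestamp per worker (all names and the timeout value are hypothetical):

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds without a ping before a worker is considered dead

class Master:
    def __init__(self):
        self.last_ping = {}    # worker id -> time of last heartbeat
        self.assignments = {}  # worker id -> set of task ids

    def on_ping(self, worker):
        """Called whenever a worker's heartbeat arrives."""
        self.last_ping[worker] = time.monotonic()

    def detect_failures(self, pending_queue):
        """Dead workers lose their tasks, which go back to the queue
        for re-execution on a healthy worker."""
        now = time.monotonic()
        for worker, last in list(self.last_ping.items()):
            if now - last > HEARTBEAT_TIMEOUT:
                pending_queue.extend(self.assignments.pop(worker, set()))
                del self.last_ping[worker]
```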