Question 20: Slow completion
Ankur is a machine learning researcher who regularly runs computations spanning tens of thousands of machines. All the machines run independent computations (i.e., they don't interact with each other) and send their results to a central machine. He has spent a large amount of time optimizing his computations for speed, and it seems to have worked. However, he notices that in every job some machines take much longer than the others to complete, which increases the overall computation time. For example, 99% of the machines will be done with their computation while the remaining 1% are still far behind.
Can you suggest a mechanism to reduce the overall computation time for Ankur's jobs?
Solution
Ankur is facing the problem of stragglers. When dealing with a large number of machines, a small percentage of them will statistically be slower than the others for a variety of reasons, ranging from hardware issues to software upgrades. This becomes a problem for workloads that require the jobs on all machines to finish. A common strategy to address stragglers is to restart the straggling jobs on new machines once they have been identified (either after a certain fraction of the jobs has completed, or when a job exceeds a duration threshold). One caveat: the jobs must be idempotent so that they can be safely restarted on new machines.
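The strategy above can be sketched in a few lines of Python. This is a minimal single-process simulation, not Ankur's actual cluster setup: threads stand in for machines, the `backup_threshold` parameter plays the role of the "fraction of jobs completed" trigger, and `run_with_backups` and `flaky_square` are hypothetical names. Once most tasks have finished, the remaining ones are speculatively re-submitted, and whichever copy finishes first wins.

```python
import concurrent.futures as cf
import time

def run_with_backups(task, inputs, max_workers=8, backup_threshold=0.9):
    """Run `task` over `inputs`, launching backup copies of stragglers.

    Once `backup_threshold` of the tasks have finished, the remaining
    (straggling) tasks are speculatively re-submitted, and whichever
    copy finishes first wins. `task` must be idempotent, since a
    straggling primary and its backup may both run to completion.
    """
    results = {}
    executor = cf.ThreadPoolExecutor(max_workers=max_workers)
    primaries = {i: executor.submit(task, x) for i, x in enumerate(inputs)}
    backups = {}
    while len(results) < len(inputs):
        # Collect whichever attempt (primary or backup) finishes first.
        for attempts in (primaries, backups):
            for i, fut in attempts.items():
                if fut.done() and i not in results:
                    results[i] = fut.result()
        # Once most tasks are done, speculatively re-run the stragglers.
        if len(results) >= backup_threshold * len(inputs):
            for i, x in enumerate(inputs):
                if i not in results and i not in backups:
                    backups[i] = executor.submit(task, x)
        time.sleep(0.01)
    executor.shutdown(wait=False)
    return [results[i] for i in range(len(inputs))]

# Hypothetical workload for illustration: the first attempt at input 0
# lands on a "slow machine" and stalls; its backup runs at normal speed.
_slow_first_attempt = {0}

def flaky_square(x):
    if x in _slow_first_attempt:
        _slow_first_attempt.discard(x)
        time.sleep(1.0)  # simulated straggler
    else:
        time.sleep(0.01)
    return x * x
```

With this workload, `run_with_backups(flaky_square, list(range(10)), max_workers=4)` returns all ten squares as soon as the fast backup of task 0 completes, rather than waiting out the slow primary.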
The MapReduce paper from Google describes this problem and proposes backup tasks as a solution.