After Hours Academic

Question 19: Slow completion

Ankur is a machine learning researcher who regularly runs computations spanning tens of thousands of machines. The machines run independent computations (i.e., they don't interact with each other) and send their results to a central machine. He has spent a lot of time optimizing his computations to be fast, and it seems to have worked. However, he notices that in every job a few machines take much longer than the rest to finish, which increases the overall computation time. For example, 95% of the machines might be done with their computation while the remaining 5% are still far behind.
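
A toy simulation (with made-up numbers) of why a small fraction of stragglers dominates: the job finishes only when the slowest machine does, so completion time is gated by the maximum, not the median:

```python
import random

random.seed(0)
# Hypothetical timings: 95% of machines finish in ~60s,
# 5% are stragglers taking 5-10x longer.
times = [random.uniform(55, 65) for _ in range(9500)]
times += [random.uniform(300, 600) for _ in range(500)]

print(f"median machine time: {sorted(times)[len(times) // 2]:.0f}s")
print(f"job completion time: {max(times):.0f}s")  # gated by the slowest machine
```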

Can you suggest a mechanism to reduce the overall computation time for Ankur's jobs?

Solution coming up in the next post!


Solution for scalable file creation:

File creation in the same directory can indeed be scaled. One way to do it is to store the directory as a hash table whose keys are the file names and whose values are the files' inode numbers. With key-level locking, creations of files with different names scale easily. This seems counterintuitive because we conventionally think of a file system directory as a shared resource. However, that is a design/implementation choice, not a fundamental property.
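
Here is a minimal sketch of this idea in Python, using a fixed number of bucket locks as a stand-in for true per-key locking; the names (Directory, create, NUM_BUCKETS) are made up for illustration and are not from the paper:

```python
import threading

class Directory:
    """Toy directory stored as a hash table: file name -> inode number.

    Per-bucket locks approximate key-level locking: creations of files
    with different names usually hit different buckets and proceed in
    parallel; only creations that hash to the same bucket contend.
    """
    NUM_BUCKETS = 64  # assumption: fixed bucket count for the sketch

    def __init__(self):
        self._buckets = [{} for _ in range(self.NUM_BUCKETS)]
        self._locks = [threading.Lock() for _ in range(self.NUM_BUCKETS)]
        self._next_inode = 1
        self._inode_lock = threading.Lock()  # inode allocation kept simple here

    def _bucket(self, name):
        return hash(name) % self.NUM_BUCKETS

    def create(self, name):
        """Create `name` and return its inode number; fail if it exists."""
        b = self._bucket(name)
        with self._locks[b]:  # lock only this bucket, not the whole directory
            if name in self._buckets[b]:
                raise FileExistsError(name)
            with self._inode_lock:
                inode = self._next_inode
                self._next_inode += 1
            self._buckets[b][name] = inode
            return inode
```

Note that there is no directory-wide lock anywhere: two threads creating differently named files serialize only if their names happen to hash to the same bucket, which is exactly what lets same-directory creations scale.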

I lifted this example from the scalable commutativity paper ("The Scalable Commutativity Rule", Clements et al., SOSP 2013). The paper is an excellent read and provides a way to reason about which operations can or cannot scale based purely on the interface (and not the implementation).

#distributed-systems #qna