Cloud computing APIs, like other system or application APIs, have their success and failure modes. However, cloud computing APIs have another failure mode that most system or application APIs don’t have: timeouts. One problem with timeouts is that they take time :), which means that whatever higher-level operation is being performed has much higher latency before a response (most probably an error) can be returned to the user.
There is another problem with timeouts. Timeouts are required in distributed systems because of the problem of consensus [ http://en.wikipedia.org/wiki/Consensus_(computer_science) ] and the FLP impossibility result, which shows that no deterministic protocol can guarantee consensus in an asynchronous system if even one process may fail. Cloud computing APIs require data consistency (consensus on data) all the time, but they cannot fundamentally guarantee it. So they shift the problem to consensus in mutual time. A conservative timeout is an engineering approximation of a consensus that the operation has completed with an error.
Now, computer clocks rarely run at the same rate. Strictly speaking, the difference in rates is clock drift, and the difference in readings that accumulates from it is clock skew; a bound on that skew over the life of an operation is the clock skew error. This clock skew error must be added to the timeout of every cloud computing API one is using, so that we are really sure that the cloud computing operation beneath the API has completed with an error.
Normally, this is not a big problem. However, with the proliferation of cloud APIs and the increased composition and layering of these APIs, the clock skew error needs to be added to the timeouts at every layer. This results in highly inflated timeout values for the end user.
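To make the layering effect concrete, here is a minimal sketch of how a timeout grows as it passes up through layers. The function name `end_user_timeout` and the per-layer skew margins are hypothetical, and the purely additive model is an assumption chosen for illustration; real systems may pad differently.

```python
from typing import Sequence


def end_user_timeout(origin_timeout_s: float, skew_margins_s: Sequence[float]) -> float:
    """Return the timeout seen at the top of a stack of layered APIs.

    origin_timeout_s: timeout of the lowest-level operation (hypothetical value).
    skew_margins_s:   clock skew error each layer adds on top of the timeout
                      it inherits from the layer below (assumed additive).
    """
    timeout = origin_timeout_s
    for margin in skew_margins_s:
        # Each layer must wait at least as long as the layer below it,
        # plus its own allowance for clock skew.
        timeout += margin
    return timeout


if __name__ == "__main__":
    # A 30 s operation wrapped by five layers, each padding by 2 s.
    print(end_user_timeout(30.0, [2.0] * 5))
```

Under these assumed numbers the user waits 40 seconds for an error that the lowest layer could have reported after 30.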
So what can we do? IMHO, the first thing to do is to reduce timeouts at their origins. Most timeouts originate with heartbeats between components, so it is important to have shorter heartbeat intervals. Heartbeats cannot be made very frequent because they use up network IOPS. However, it’s definitely something that should be tuned. Also, the timeout of an operation is often the maximum of all the timeouts that we will experience in its sub-operations. Thus, sub-operations should be chosen such that their maximum timeout is lower than that of the other alternatives.
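Both suggestions can be sketched in a few lines. The names below (`heartbeat_timeout`, `operation_timeout`, `pick_alternative`) are hypothetical, and the heartbeat model, in which a peer is declared failed after a fixed number of missed beats, is an assumption; it is one common scheme, not necessarily the one any given cloud uses.

```python
from typing import Dict, Sequence


def heartbeat_timeout(interval_s: float, missed_beats_allowed: int) -> float:
    """Time until a silent peer is declared failed (assumed model:
    failure is declared after `missed_beats_allowed` missed heartbeats)."""
    return interval_s * missed_beats_allowed


def operation_timeout(sub_timeouts_s: Sequence[float]) -> float:
    """An operation's timeout, taken as the maximum of its sub-operation timeouts."""
    return max(sub_timeouts_s)


def pick_alternative(alternatives: Dict[str, Sequence[float]]) -> str:
    """Choose the set of sub-operations whose maximum timeout is lowest."""
    return min(alternatives, key=lambda name: operation_timeout(alternatives[name]))


if __name__ == "__main__":
    # Shortening the heartbeat interval shrinks the timeout at its origin.
    print(heartbeat_timeout(5.0, 3), heartbeat_timeout(1.0, 3))
    # Two hypothetical ways to implement the same operation.
    print(pick_alternative({"plan_a": [10.0, 30.0], "plan_b": [15.0, 20.0]}))
```

Here `plan_b` wins even though its fastest sub-operation is slower, because only the largest sub-operation timeout determines how long the caller may have to wait.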