Azure for Architects
上QQ阅读APP看书,第一时间看更新

Retry pattern

The retry pattern is an extremely important pattern to make applications and services more resilient to transient failures. Think about a situation that you are trying to connect and use the service, and the service is not available due to any reason. If the service is going to come up in short duration, it makes sense to keep retrying for a successful connection. This will make the application more robust, fault tolerant, and bring instability. In Azure, most of the components are running on the internet and that internet connection can produce transient faults intermittently. Because of these failures that rectify within seconds, an application should not be allowed to crash. The application should be designed in a manner that it can re-try to use the service repeatedly during failures. It should also stop retrying if even after certain number of retries the service is not available.

This pattern should be implemented when an application could experience transient faults as it interacts with a remote server or accesses a remote resource. These faults are expected to be short-lived, and repeating a request that has previously failed could succeed on a subsequent attempt.

The retry pattern can adopt different retry strategies based on the nature of errors and application:

  • Retry a fixed number of times: It denotes that the application will try to communicate to the service a fixed number of times before it can determine a failure and raise an exception. For example, retry three times to connect to another service. If it is successful in connecting within these three tries, the entire operation is successful, or after the expiry of this count, it will raise an exception.
  • Retry based on schedule: It denotes that the application will try to communicate to the service repeatedly for a fixed number of seconds or minutes and wait for a fixed number of seconds or minutes before retrying. For example, retrying every three seconds for 60 seconds to connect to another service. If it is successful in connecting within these tries, the entire operation is successful, or after the expiry of 60 seconds, it will raise an exception.
  • Sliding and delaying the retry: It denotes that the application will try to communicate to the service repeatedly based on schedule and keeps adding an incremental delay in subsequent tries. For example, retrying for a total of 60 seconds where the first retry happens after one second, the second try happens after two seconds from the last retry, the third one after four seconds from the last try, and so on. This helps in reducing the overall number of retries.

The retry pattern is depicted by means of a picture where the first request gets HTTP 500 as a response message, the second retry again gets HTTP 500 as a response, and finally, the request is successful and gets HTTP 200 as a response message.