In a distributed system, calls to other services can fail because of timeouts, slow network connections, or overloaded resources. Most of these faults are transient and clear up on their own after a short while. Cloud services should be designed to handle such events, and one way to do that is the retry pattern.
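As a rough illustration of the retry pattern, here is a minimal Python sketch. The names `TransientError` and `call_with_retry`, the attempt count, and the exponential backoff delays are all assumptions made for this example, not part of any particular library.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying (timeouts, brief overload)."""


def call_with_retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up: the fault did not clear within the allowed attempts
            # wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable only so the delays can be observed or skipped in tests. A retry like this helps with short-lived faults, but it keeps calling the callee even when the callee is down for a long time.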
In simple terms, a circuit breaker's main function is to interrupt the flow of current after a fault has been detected. What does that mean in technical terms?
When an external system or process stops working, the circuit breaker prevents that failure from bringing down the entire system. It detects failures and wraps the logic that stops a failure from constantly recurring during maintenance, temporary external system failures, or unexpected system difficulties.
However, there are scenarios where the entire service is down and the turnaround time (TAT) for recovery is long. In such cases, continuously retrying an operation that will not succeed is pointless. Instead, the caller service should handle the error on the assumption that the callee service will be down for an extended period.
Additionally, one part of the system could be configured to reply with an error message once a timeout period has elapsed. The problem is that there may be many concurrent calls from the caller for the same operation, and each one has to wait until the timeout expires. The waiting calls hold on to resources, which could be fatal to the entire system. Setting a shorter timeout does not solve this either: if the service legitimately takes longer than the timeout to respond, every call will fail.
Fault Tolerance and Resiliency: Introduction to the Circuit Breaker Pattern
The Circuit Breaker pattern can prevent cascading system failures. Such a cascading effect may itself have been introduced by the Retry pattern, which was added to improve the system's overall resilience.
A circuit breaker helps the system avoid the problem mentioned above. It prevents an application from repeatedly calling a callee service that is likely to fail. It also detects when the service is up again and then resumes invoking its operations.
A circuit breaker acts as a proxy for operations that tend to fail. Based on the failure rate, the proxy decides whether subsequent calls are forwarded to the service or whether it immediately returns the configured exception or message.
The operation to be performed is wrapped in a circuit breaker object, which monitors failures. Configured thresholds, such as a timeout, determine whether an operation counts as a failure or a success. Once the failures reach the configured limit, the circuit breaker trips and all further requests passed to it fail immediately.
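The wrapping described above can be sketched in Python as follows. This is a simplified illustration, not a production implementation: the class and parameter names (`CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `sleep_window`) are hypothetical, and it counts consecutive failures instead of computing a failure rate to keep the example short.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the operation."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, sleep_window=20.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.sleep_window = sleep_window            # seconds to stay open after tripping
        self.clock = clock                          # injectable time source (for testing)
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.sleep_window:
                # Open: fail fast instead of waiting on a callee presumed down.
                raise CircuitOpenError("circuit is open")
            # Sleep window elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-open after a failed trial
            raise
        # Success: reset the counter and close the circuit.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(lambda: fetch_profile(user_id))`, where `fetch_profile` stands in for any operation that may fail. While the circuit is open, the caller gets `CircuitOpenError` immediately and can return a fallback response.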
The following points highlight the need for a circuit breaker pattern:
The Different States of a Circuit Breaker
Example to explain the above states.
Suppose a system calls a service whose normal response time is 100-200 ms. The circuit breaker is configured to trip once 75% of the requests in a 10-minute window take longer than 200 ms, and the sleep window is 20 seconds. If 100 calls are made and 80 of them take more than 200 ms, the circuit breaker trips and allows no further requests to the service. After the 20-second sleep window, the circuit breaker calls the service again to check whether requests succeed within the configured response time. If they do, the circuit breaker moves back to the closed state. Otherwise it stays in the open state and retries after each further 20-second sleep window.
The time-series events below show how the caller and callee services interact at the listed average failure percentages. They illustrate a more sophisticated implementation of the Circuit Breaker pattern, in which the system returns to the Closed state only after n consecutive successful checks (n = 5 in this example). Fewer than n consecutive successful checks keep the system in the Half-Open state.
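The trip condition in this example (75% of calls slower than 200 ms within 10 minutes) can be sketched as a rolling window in Python. The class `SlowCallWindow` and its parameter names are invented for illustration. The sketch assumes samples are recorded in time order and that reaching exactly 75% is enough to trip.

```python
from collections import deque


class SlowCallWindow:
    """Rolling window of call samples used to decide whether to trip."""

    def __init__(self, slow_threshold_ms=200, trip_ratio=0.75, window_s=600):
        self.slow_threshold_ms = slow_threshold_ms  # calls slower than this count as failures
        self.trip_ratio = trip_ratio                # share of slow calls that trips the breaker
        self.window_s = window_s                    # window length: 10 minutes
        self.samples = deque()                      # (timestamp_s, was_slow) in time order

    def record(self, timestamp_s, duration_ms):
        self.samples.append((timestamp_s, duration_ms > self.slow_threshold_ms))

    def slow_ratio(self, now_s):
        # Drop samples that have aged out of the window.
        while self.samples and now_s - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        if not self.samples:
            return 0.0
        return sum(slow for _, slow in self.samples) / len(self.samples)

    def should_trip(self, now_s):
        return self.slow_ratio(now_s) >= self.trip_ratio


# 100 calls, 80 of them slower than 200 ms: 80% >= 75%, so the breaker trips.
window = SlowCallWindow()
for i in range(100):
    window.record(timestamp_s=i, duration_ms=250 if i < 80 else 150)
tripped = window.should_trip(now_s=100)
```

With 70 slow calls out of 100 the ratio would be 70%, below the threshold, and the circuit would stay closed.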
| Elapsed Time (Min) | Avg. Failures % | State | Callee Service |
| --- | --- | --- | --- |
| 0-1 | 0 | Closed | Called |
| 1-2 | 50 | Closed | Called |
| 2-3 | 77 | Closed | Called |
| 3-4 | 80 | Closed | Called |
| 4-5 | 80 | Closed | Called |
| 5-6 | 81 | Closed | Called |
| 6-7 | 81 | Open | Not Called |
| 7-8 | 82 | Open | Not Called |
| 8-9 | 83 | Open | Not Called |
| 9-10 | 88 | Open | Not Called |
| 10-11 | 70 | Open | Not Called |
| 11-12 | 75 | Open | Not Called |
| 12-13 | 70 | Half-Open | Not Called |
| 13-14 | 70 | Half-Open | Not Called |
| 14-15 | 60 | Half-Open | Not Called |
| 15-16 | 60 | Half-Open | Not Called |
| 16-17 | 60 | Closed | Called |
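The rule used in this example, returning to Closed only after n consecutive successful checks, can be sketched in Python as below. `HalfOpenGate` and `required_successes` are hypothetical names. Following the description above, a failed check resets the streak and leaves the breaker Half-Open. Many implementations instead move back to Open on a failed check, which this sketch does not model.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class HalfOpenGate:
    """Tracks health checks made while the breaker is half-open."""

    def __init__(self, required_successes=5):
        self.required_successes = required_successes  # n consecutive successes needed
        self.state = State.HALF_OPEN
        self.streak = 0                               # current run of successful checks

    def record_check(self, succeeded):
        if self.state is not State.HALF_OPEN:
            return self.state  # checks only matter while half-open
        if succeeded:
            self.streak += 1
            if self.streak >= self.required_successes:
                self.state = State.CLOSED  # enough consecutive successes: close
        else:
            self.streak = 0  # streak broken: stay half-open and start counting again
        return self.state
```

Requiring several consecutive successes guards against closing the circuit on a single lucky response while the callee is still unstable.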
The Circuit Breaker design pattern is used in both monolithic and microservice-based deployments. It keeps the system from sending unnecessary load to a failed callee service and gives that backend service time to recover from errors.