exponential backoff


description: retry and congestion-control algorithm that multiplicatively increases the delay between attempts after each failure

14 results

pages: 725 words: 168,262

API Design Patterns
by JJ Geewax
Published 19 Jul 2021

Now that we have an idea of the various requests and whether they’re retriable (and in what circumstances), let’s switch topics and start looking at the cases where we’ve decided it’s acceptable to retry a request. In particular, let’s look at the way we’ll decide on how long to wait before retrying a request.

29.3.2 Exponential back-off

As we learned earlier, exponential back-off is an algorithm by which the delay between requests grows exponentially, doubling each time a request returns an error result. Perhaps the best way to understand this would be by looking at some code.

Listing 29.1 Example demonstrating retrial with exponential back-off

async function getChatRoomWithRetries(id: string): Promise<ChatRoom> {
  return new Promise<ChatRoom>(async (resolve, reject) => {
    let delayMs = 1000;                        // ❶
    while (true) {
      try {
        return resolve(GetChatRoom({ id }));   // ❷
      } catch (e) {
        await new Promise((resolve) => {       // ❸
          return setTimeout(resolve, delayMs);
        });
        delayMs *= 2;                          // ❹
      }
    }
  });
}

❶ We define the initial delay as 1 second (1,000 milliseconds).
❷ First, attempt to retrieve the resource and resolve the promise.
❸ If the request fails, wait for a fixed amount of time, defined by the delay.
❹ Finally, double the amount of time we’ll need to wait next time.

It’s particularly tricky because we have no idea what’s happening once the request is sent, leaving us to simply make guesses about when the request is more likely to be handled successfully on the other side. Based on these constraints, the most effective, battle-tested algorithm for this problem is exponential back-off. The idea of exponential back-off is pretty simple. For a first retry attempt, we might wait one second before making the request again. After that, for every subsequent failure response we take the time that we waited previously and double it. If a request fails a second time, we wait two seconds and then try again.

We’ll introduce some extra pieces to this algorithm, but the core concept will remain unchanged. As we noted, this algorithm is great when the system on the other side is a complete black box. In other words, if we know nothing else about the system, exponential back-off works quite well. But is that the case with our web API? In truth, this is not the case and, as we’ll see in the next section, there is actually a better solution for a unique subset of scenarios.

29.2.2 Server-specified retry timing

While exponential back-off is still a reasonable algorithm, it turns out that there are certain cases in which there is a better option available. For example, consider the scenario where a request fails because the API service says that this request can only be executed once per minute.
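
Illustrative sketch (not from the book): one way a client can honor server-specified retry timing is to prefer a delay hint supplied by the error over its own exponential schedule. The ApiError shape and its retryAfterSeconds field are assumptions here, standing in for something like a parsed HTTP Retry-After header.

// Hypothetical error shape; retryAfterSeconds would be populated by the client
// library, e.g., from an HTTP Retry-After header.
interface ApiError extends Error {
  retryAfterSeconds?: number;
}

async function getWithServerTiming<T>(call: () => Promise<T>): Promise<T> {
  let delayMs = 1000;
  while (true) {
    try {
      return await call();
    } catch (e) {
      const hinted = (e as ApiError).retryAfterSeconds;
      // If the server said how long to wait, believe it; otherwise fall back
      // to the client's own exponential back-off.
      const waitMs = hinted !== undefined ? hinted * 1000 : delayMs;
      await new Promise((resolve) => setTimeout(resolve, waitMs));
      delayMs *= 2;
    }
  }
}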

pages: 523 words: 143,139

Algorithms to Live By: The Computer Science of Human Decisions
by Brian Christian and Tom Griffiths
Published 4 Apr 2016

Since the maximum delay length (2, 4, 8, 16…) forms an exponential progression, it’s become known as Exponential Backoff. Exponential Backoff was a huge part of the successful functioning of the ALOHAnet beginning in 1971, and in the 1980s it was baked into TCP, becoming a critical part of the Internet. All these decades later, it still is. As one influential paper puts it, “For a transport endpoint embedded in a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations, only one scheme has any hope of working—exponential backoff.” But it is the algorithm’s other uses that suggest something both more prescriptive and more profound.

But it is the algorithm’s other uses that suggest something both more prescriptive and more profound. Beyond just collision avoidance, Exponential Backoff has become the default way of handling almost all cases of networking failure or unreliability. For instance, when your computer is trying to reach a website that appears to be down, it uses Exponential Backoff—trying again one second later, again a few seconds after that, and so forth. This is good for everyone: it prevents a host server that’s down from getting slammed with requests as soon as it comes back online, and it prevents your own machine from wasting too much effort trying to get blood from a stone.

The longer the round-trip time between sender and receiver, the longer it takes a silence to be significant—and the more information can be potentially “in flight” before the sender realizes there’s a problem. In networking, having the parties properly tune their expectations for the timeliness of acknowledgments is crucial to the system functioning correctly. The second question, of course, once we do recognize a breakdown, is what exactly we should do about it. Exponential Backoff: The Algorithm of Forgiveness The world’s most difficult word to translate has been identified as “ilunga,” from the Tshiluba language spoken in south-eastern DR Congo.… Ilunga means “a person who is ready to forgive any abuse for the first time, to tolerate it a second time, but never a third time.”

Designing Web APIs: Building APIs That Developers Love
by Brenda Jin, Saurabh Sahni and Amir Shevat
Published 28 Aug 2018

If your infrastructure can support additional quota and there is no better way to achieve the same result, you might want to consider giving them an exception.

• Starting with lower rate-limit thresholds is helpful. Increasing rate-limit thresholds is easier than reducing them because reducing them can negatively affect active developer apps.
• Implement exponential back-off in your client SDKs and provide sample code to developers on how to do that. This way, developers are less likely to continue hitting your API when they are rate-limited. For more information, see “Error Handling and Exponential Back-Off” on page 115.
• Use rate limits to reduce the impact of incidents by heavily rate-limiting noncritical traffic during an outage.

Lessons Learned from Slack’s Rate-Limiting

In March 2018, Slack rolled out an evolved rate-limiting system.

Caching Frequently Used Data

You can add support for storing API responses or frequently used data locally in a cache. This can help in reducing the number of API calls you will receive. If you have concerns or policies around what data clients can store, ensuring that the cache automatically expires in a few hours can help.

Error Handling and Exponential Back-Off

Errors are often handled poorly by developers. It’s difficult for developers to reproduce all possible errors during development, and that’s why they might not write code to handle those errors gracefully. As you build your SDK, first, you could implement local checks to return errors on invalid requests.

Some failures, like authorization errors, cannot be addressed by a retry. Your SDK should surface appropriate errors for these failures to the developer. For other errors, it’s simply better for the SDK to automatically retry the API call. To help developers avoid making too many API calls to your server, your SDK should implement exponential back-off. This is a standard error-handling strategy in which client applications periodically retry a failed request over an increasing amount of time. Exponential back-off can help in reducing the number of requests your server receives. When your SDKs implement this, it helps your web application recover gracefully from outages.
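
A minimal sketch of the SDK behavior described above, under the assumption of a hypothetical isRetriable check (a real SDK would inspect the actual status codes it receives): permanent failures such as authorization errors are surfaced to the developer immediately, while transient failures are retried with exponential back-off up to a fixed attempt limit.

// Hypothetical classification helper: retry rate-limit (429) and server (5xx)
// errors; never retry permanent failures such as 401/403 authorization errors.
function isRetriable(e: unknown): boolean {
  const status =
    typeof e === "object" && e !== null ? (e as { status?: number }).status ?? 0 : 0;
  return status === 429 || status >= 500;
}

async function sdkCall<T>(call: () => Promise<T>, maxAttempts = 4): Promise<T> {
  let delayMs = 500;
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (e) {
      // Surface permanent errors and give up once the attempt budget is spent.
      if (!isRetriable(e) || attempt >= maxAttempts) throw e;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      delayMs *= 2; // exponential back-off between retries
    }
  }
}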

pages: 719 words: 181,090

Site Reliability Engineering: How Google Runs Production Systems
by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy
Published 15 Apr 2016

When issuing automatic retries, keep in mind the following considerations:

• Most of the backend protection strategies described in “Preventing Server Overload” apply. In particular, testing the system can highlight problems, and graceful degradation can reduce the effect of the retries on the backend.
• Always use randomized exponential backoff when scheduling retries. See also “Exponential Backoff and Jitter” in the AWS Architecture Blog [Bro15]. If retries aren’t randomly distributed over the retry window, a small perturbation (e.g., a network blip) can cause retry ripples to schedule at the same time, which can then amplify themselves [Flo94].
• Limit retries per request.
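
The considerations above combine two ideas: randomize the backoff (jitter) and cap the number of retries. Here is a minimal sketch, not Google's implementation, using "full jitter": pick a random delay inside an exponentially growing, capped window.

async function callWithRandomizedBackoff<T>(
  attempt: () => Promise<T>,
  maxRetries = 5,       // limit retries per request
  baseDelayMs = 1000,   // size of the first retry window
  maxDelayMs = 32000    // ceiling on the retry window
): Promise<T> {
  for (let retry = 0; ; retry++) {
    try {
      return await attempt();
    } catch (err) {
      if (retry >= maxRetries) throw err; // give up once the retry budget is spent
      // The window doubles each time but is capped; the actual wait is a random
      // point inside the window so that clients do not retry in lockstep.
      const windowMs = Math.min(maxDelayMs, baseDelayMs * 2 ** retry);
      const delayMs = Math.random() * windowMs;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}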

You might consider some of the following production tests:

• Reducing task counts quickly or slowly over time, beyond expected traffic patterns
• Rapidly losing a cluster’s worth of capacity
• Blackholing various backends

Test Popular Clients

Understand how large clients use your service. For example, you want to know if clients:

• Can queue work while the service is down
• Use randomized exponential backoff on errors
• Are vulnerable to external triggers that can create large amounts of load (e.g., an externally triggered software update might clear an offline client’s cache)

Depending on your service, you may or may not be in control of all the client code that talks to your service. However, it’s still a good idea to have an understanding of how large clients that interact with your service will behave.

As capacity becomes scarce, the service no longer returns pictures alongside text or small maps illustrating where a story takes place. And depending on its purpose, an RPC that times out is either not retried (for example, in the case of the aforementioned pictures), or is retried with a randomized exponential backoff. Despite these safeguards, the tasks fail one by one and are then restarted by Borg, which drives the number of working tasks down even more. As a result, some graphs on the service dashboard turn an alarming shade of red and SRE is paged. In response, SREs temporarily add capacity to the Asian datacenter by increasing the number of tasks available for the Shakespeare job.

Principles of Protocol Design
by Robin Sharp
Published 13 Feb 2008

Analysis of this situation (see for example Chapter 4 in [11]) shows that when the (new) traffic generated by the users exceeds a certain value, then the retransmissions themselves cause so many extra collisions that the throughput of the system actually begins to fall as the load of new traffic rises further. This leads to instability in the system, which can only be avoided by reducing the rate of retransmission as the load increases. Doubling the average retransmission delay for each failed attempt is a simple way of doing this. The technical term for this is Binary Exponential Backoff (BEB); or, if the doubling process terminates after a certain maximum number of attempts, as in the case of the ISO/IEEE CSMA/CD protocol, truncated BEB. The receiver processes, described by R[i], are much simpler. These are controlled by the signals cs[i], which indicate the presence of an arriving transmission, and nc[i], which indicate the absence of such a transmission.
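
As an illustrative sketch of the truncated BEB rule just described (the attempt limit of 16 and backoff limit of 10 are the usual IEEE 802.3 parameters, assumed here rather than taken from the text):

// After the n-th collision, wait k slot times, with k drawn uniformly from
// [0, 2^min(n, 10) - 1]; report failure once 16 attempts have been used.
function csmacdBackoffSlots(collisionCount: number): number | null {
  const MAX_ATTEMPTS = 16;   // give up after this many failed attempts
  const BACKOFF_LIMIT = 10;  // the window stops doubling here ("truncated")
  if (collisionCount >= MAX_ATTEMPTS) return null;
  const windowSize = 2 ** Math.min(collisionCount, BACKOFF_LIMIT); // 2, 4, ..., 1024
  return Math.floor(Math.random() * windowSize); // number of slot times to wait
}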

Since poor estimates of the timeout setting give rise to excessive retransmissions, which in turn may cause more congestion and further delays, the retransmission timeout timer value is doubled after each unsuccessful retransmission. As in the case of the CSMA/CD protocol described by Protocol 11, this mechanism, known as Binary Exponential Backoff, ensures stability in the system. After a successful retransmission, the normal way of evaluating the round trip time – and thus the timer value – is resumed.

Fig. 7.25 Congestion control using choking. System X has detected excessive utilisation of the link from X to Y.

Assumed underlying service: CSMA/CD LAN Physical Layer (ISO8802.3).
Connection phase: None.
Data transfer phase: Unacknowledged data transfer.
Disconnection phase: None.
Other features: Statistical time-division multiplexing with distributed control. Collision resolution by random wait with truncated binary exponential backoff.
Coding: Start of DPDU marked by delimiter field. CRC-32 block checksum. Ad hoc binary coding of all fields in DPDU.
Addressing: 16-bit or 48-bit flat addressing. Single-station or group addressing.
Fault tolerance: Corruption of data PDUs.

A number of other LAN MAC protocols, which are described in other parts of the ISO8802/IEEE 802 family of standards, are listed in Table 9.1.

pages: 1,025 words: 150,187

ZeroMQ
by Pieter Hintjens
Published 12 Mar 2013

Take a look at the Paranoid Pirate worker in Example 4-13.

Example 4-13. Paranoid Pirate worker (ppworker.c)

//
//  Paranoid Pirate worker
//
#include "czmq.h"
#define HEARTBEAT_LIVENESS 3      //  3-5 is reasonable
#define HEARTBEAT_INTERVAL 1000   //  msec
#define INTERVAL_INIT      1000   //  Initial reconnect
#define INTERVAL_MAX      32000   //  After exponential backoff

//  Paranoid Pirate Protocol constants
#define PPP_READY     "\001"      //  Signals worker is ready
#define PPP_HEARTBEAT "\002"      //  Signals worker heartbeat

//  Helper function that returns a new configured socket
//  connected to the Paranoid Pirate queue
static void *
s_worker_socket (zctx_t *ctx) {
    void *worker = zsocket_new (ctx, ZMQ_DEALER);
    zsocket_connect (worker, "tcp://localhost:5556");

    //  Tell queue we're ready for work
    printf ("I: worker ready\n");
    zframe_t *frame = zframe_new (PPP_READY, 1);
    zframe_send (&frame, worker, 0);
    return worker;
}

We have a single task that implements the worker side of the Paranoid Pirate Protocol (PPP).

\n", interval); zclock_sleep (interval); if (interval < INTERVAL_MAX) interval *= 2; zsocket_destroy (ctx, worker); worker = s_worker_socket (ctx); liveness = HEARTBEAT_LIVENESS; } // Send heartbeat to queue if it's time if (zclock_time () > heartbeat_at) { heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL; printf ("I: worker heartbeat\n"); zframe_t *frame = zframe_new (PPP_HEARTBEAT, 1); zframe_send (&frame, worker, 0); } } zctx_destroy (&ctx); return 0; } Some comments about this example: The code includes simulation of failures, as before. This makes it (a) very hard to debug, and (b) dangerous to reuse. When you want to debug this code, disable the failure simulation. The worker uses a reconnect strategy similar to the one we designed for the Lazy Pirate client, with two major differences: it does an exponential backoff, and it retries indefinitely (whereas the client retries a few times before reporting a failure). You can try the client, queue, and workers by using a script like this: ppqueue & for i in 1 2 3 4; do ppworker & sleep 1 done lpclient & You should see the workers die one by one as they simulate a crash, and the client eventually give up.

In the zmq_poll() loop, whenever we pass this time, we send a heartbeat to the queue. Here’s the essential heartbeating code for the worker:

#define HEARTBEAT_LIVENESS 3      //  3-5 is reasonable
#define HEARTBEAT_INTERVAL 1000   //  msec
#define INTERVAL_INIT      1000   //  Initial reconnect
#define INTERVAL_MAX      32000   //  After exponential backoff
...
//  If liveness hits zero, queue is considered disconnected
size_t liveness = HEARTBEAT_LIVENESS;
size_t interval = INTERVAL_INIT;

//  Send out heartbeats at regular intervals
uint64_t heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;

while (true) {
    zmq_pollitem_t items [] = { { worker, 0, ZMQ_POLLIN, 0 } };
    int rc = zmq_poll (items, 1, HEARTBEAT_INTERVAL * ZMQ_POLL_MSEC);

    if (items [0].revents & ZMQ_POLLIN) {
        //  Receive any message from queue
        liveness = HEARTBEAT_LIVENESS;
        interval = INTERVAL_INIT;
    }
    else
    if (--liveness == 0) {
        zclock_sleep (interval);
        if (interval < INTERVAL_MAX)
            interval *= 2;
        zsocket_destroy (ctx, worker);
        ...
        liveness = HEARTBEAT_LIVENESS;
    }
    //  Send heartbeat to queue if it's time
    if (zclock_time () > heartbeat_at) {
        heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;
        //  Send heartbeat message to queue
    }
}

The queue does the same, but manages an expiration time for each worker.

pages: 1,380 words: 190,710

Building Secure and Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems
by Heather Adkins, Betsy Beyer, Paul Blankinship, Ana Oprea, Piotr Lewandowski and Adam Stubblefield
Published 29 Mar 2020

If many clients are caught in this loop, the resulting demand makes recovering from the outage difficult.9 Client software should be carefully designed to avoid tight retry loops. If a server fails, the client may retry, but should implement exponential backoff—for example, doubling the wait period each time an attempt fails. This approach limits the number of requests to the server, but on its own is not sufficient—an outage can synchronize all clients, causing repeated bursts of high traffic. To avoid synchronous retries, each client should wait for a random duration, called jitter. At Google, we implement exponential backoff with jitter in most of our client software. What can you do if you don’t control the client? This is a common concern for people operating authoritative DNS servers.

Sooner or later, overload, network issues, or some other issue will result in a dependency being unavailable. In many cases, it would be reasonable to retry the request, but implement retries carefully in order to avoid a cascading failure (akin to falling dominoes).1 The most common solution is to retry with an exponential backoff.2 A good framework should provide support for such logic, rather than requiring the developer to implement the logic for every RPC call. A framework that gracefully handles unavailable dependencies and redirects traffic to avoid overloading the service or its dependencies naturally improves the reliability of both the service itself and the entire ecosystem.

    = nil { return ctx, err }
    if ai.allowedRoles[callInfo.User] { return ctx, nil }
    return ctx, fmt.Errorf("Unauthorized request from %q", callInfo.User)
}

func (*authzInterceptor) After(ctx context.Context, resp *Response) error {
    return nil // Nothing left to do here after the RPC is handled.
}

Example 12-3. Example logging interceptor that logs every incoming request (before stage) and then logs all the failed requests with their status (after stage); WithAttemptCount is a framework-provided RPC call option that implements exponential backoff

type logInterceptor struct {
    logger *LoggingBackendStub
}

func (*logInterceptor) Before(ctx context.Context, req *Request) (context.Context, error) {
    // callInfo was populated by the framework.
    callInfo, err := FromContext(ctx)
    if err != nil { return ctx, err }
    logReq := &pb.LogRequest{
        timestamp: time.Now().Unix(),
        user:      callInfo.User,
        request:   req.Payload,
    }
    resp, err := logger.Log(ctx, logReq, WithAttemptCount(3))
    return ctx, err
}

func (*logInterceptor) After(ctx context.Context, resp *Response) error {
    if resp.Err == nil { return nil }
    logErrorReq := &pb.LogErrorRequest{
        timestamp: time.Now().Unix(),
        error:     resp.Err.Error(),
    }
    resp, err := logger.LogError(ctx, logErrorReq, WithAttemptCount(3))
    return err
}

Common Security Vulnerabilities

In large codebases, a handful of classes account for the majority of security vulnerabilities, despite ongoing efforts to educate developers and introduce code review.

pages: 629 words: 109,663

Docker in Action
by Jeff Nickoloff and Stephen Kuenzli
Published 10 Dec 2019

If that container was configured to always restart, and Docker always immediately restarted it, the system would do nothing but restart that container. Instead, Docker uses an exponential backoff strategy for timing restart attempts. A backoff strategy determines the amount of time that should pass between successive restart attempts. An exponential back-off strategy will do something like double the previous time spent waiting on each successive attempt. For example, if the first time the container needs to be restarted Docker waits 1 second, then on the second attempt it would wait 2 seconds, 4 seconds on the third attempt, 8 on the fourth, and so on. Exponential backoff strategies with low initial wait times are a common service-restoration technique.

UNIX® Network Programming, Volume 1: The Sockets Networking API, 3rd Edition
by W. Richard Stevens, Bill Fenner, Andrew M. Rudoff
Published 8 Jun 2013

This function is called when the retransmission timer expires.

int
rtt_timeout(struct rtt_info *ptr)
{
    ptr->rtt_rto *= 2;      /* next RTO */

    if (++ptr->rtt_nrexmt > RTT_MAXNREXMT)
        return (-1);        /* time to give up for this packet */
    return (0);
}

Figure 22.14 rtt_timeout function: applies exponential backoff (lib/rtt.c).

The current RTO is doubled: this is the exponential backoff. If we have reached the maximum number of retransmissions, −1 is returned to tell the caller to give up; otherwise, 0 is returned. As an example, our client was run twice to two different echo servers across the Internet in the morning on a weekday.

This value will be overridden by the SCTP stack if it is greater than the maximum allowable streams the SCTP stack supports.

• sinit_max_attempts expresses how many times the SCTP stack should send the initial INIT message before considering the peer endpoint unreachable.
• sinit_max_init_timeo represents the maximum RTO value for the INIT timer. During exponential backoff of the initial timer, this value replaces RTO.max as the ceiling for retransmissions. This value is expressed in milliseconds.

Note that when setting these fields, any value set to 0 will be ignored by the SCTP socket. A user of the one-to-many-style socket (described in Section 9.2) may also pass an sctp_initmsg structure in ancillary data during implicit association setup.

Indeed, in the TCP kernel implementation (Section 25.7 of TCPv2) these calculations are normally performed using fixed-point arithmetic for speed, but for simplicity, we use floating-point calculations in our code that follows. Another point made in [Jacobson 1988] is that when the retransmission timer expires, an exponential backoff must be used for the next RTO. For example, if our first RTO is 2 seconds and the reply is not received in this time, then the next RTO is 4 seconds. If there is still no reply, the next RTO is 8 seconds, and then 16, and so on. Jacobson’s algorithms tell us how to calculate the RTO each time we measure an RTT and how to increase the RTO when we retransmit.

pages: 933 words: 205,691

Hadoop: The Definitive Guide
by Tom White
Published 29 May 2009

Reduce-side tuning properties

mapred.reduce.parallel.copies (int, default 5): The number of threads used to copy map outputs to the reducer.
mapred.reduce.copy.backoff (int, default 300): The maximum amount of time, in seconds, to spend retrieving one map output for a reducer before declaring it as failed. The reducer may repeatedly reattempt a transfer within this time if it fails (using exponential backoff).
io.sort.factor (int, default 10): The maximum number of streams to merge at once when sorting files. This property is also used in the map.
mapred.job.shuffle.input.buffer.percent (float, default 0.70): The proportion of total heap size to be allocated to the map outputs buffer during the copy phase of the shuffle.

If, after trying all of the servers in the ensemble, it can’t connect, then it throws an IOException. The likelihood of all ZooKeeper servers being unavailable is low; nevertheless, some applications may choose to retry the operation in a loop until ZooKeeper is available. This is just one strategy for retry handling—there are many others, such as using exponential backoff where the period between retries is multiplied by a constant each time. The org.apache.hadoop.io.retry package in Hadoop Core is a set of utilities for adding retry logic into your code in a reusable way, and it may be helpful for building ZooKeeper applications. A Lock Service A distributed lock is a mechanism for providing mutual exclusion between a collection of processes.

Smart Grid Standards
by Takuro Sato
Published 17 Nov 2015

ESI shall be able to receive a message from a service-provider’s back office and deliver to devices in the HAN at the beginning of the valid period. Devices that have negotiated to receive messages shall accept those messages. If unable to deliver to a certain device, ESI must retry sending with exponential back-off time until successful or message expiration. Directed messages shall be able to be addressed to a specific premise. These messages shall be delivered to all devices configured to receive messages from the ESI through which they are delivered. Consumers may grant authorization to the service provider to communicate with or control a HAN device.
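
A hedged sketch of the delivery rule described above: keep retrying with an exponentially growing delay until the send succeeds or the message's valid period ends. The deliver callback and the expiration timestamp are assumptions for illustration, not part of the standard's interface.

// Retry with exponential back-off until successful delivery or message expiration.
async function deliverUntilExpiration(
  deliver: () => Promise<void>, // assumed send-to-HAN-device operation
  expiresAtMs: number,          // end of the message's valid period (epoch millis)
  initialDelayMs = 1000
): Promise<boolean> {
  let delayMs = initialDelayMs;
  while (Date.now() < expiresAtMs) {
    try {
      await deliver();
      return true; // delivered successfully
    } catch {
      // Wait before the next attempt, but never sleep past the expiration time.
      const waitMs = Math.min(delayMs, expiresAtMs - Date.now());
      if (waitMs > 0) await new Promise((resolve) => setTimeout(resolve, waitMs));
      delayMs *= 2;
    }
  }
  return false; // message expired before it could be delivered
}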

Seeking SRE: Conversations About Running Production Systems at Scale
by David N. Blank-Edelman
Published 16 Sep 2018

Service discovery
Distributed applications need to find each other. Mechanisms range in complexity from Domain Name System (DNS) to fully consistent solutions like Consul.

Distributed system best practices
At a theoretical level, microservice practitioners are told that they need to employ best practices such as retries with exponential back-off, circuit breaking, rate limiting, and timeouts. The actual implementations of these best practices are usually varied or missing entirely.

Authentication and authorization
Although most internet architectures utilize edge encryption via Transport Layer Security (TLS) and some type of edge authentication, the implementations vary widely, from proprietary to OAuth.

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
by Martin Kleppmann
Published 17 Apr 2017

Although retrying an aborted transaction is a simple and effective error handling mechanism, it isn’t perfect:

• If the transaction actually succeeded, but the network failed while the server tried to acknowledge the successful commit to the client (so the client thinks it failed), then retrying the transaction causes it to be performed twice—unless you have an additional application-level deduplication mechanism in place.
• If the error is due to overload, retrying the transaction will make the problem worse, not better. To avoid such feedback cycles, you can limit the number of retries, use exponential backoff, and handle overload-related errors differently from other errors (if possible).
• It is only worth retrying after transient errors (for example due to deadlock, isolation violation, temporary network interruptions, and failover); after a permanent error (e.g., constraint violation) a retry would be pointless.
• If the transaction also has side effects outside of the database, those side effects may happen even if the transaction is aborted.

pages: 1,237 words: 227,370

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
by Martin Kleppmann
Published 16 Mar 2017

Although retrying an aborted transaction is a simple and effective error handling mechanism, it isn’t perfect:

• If the transaction actually succeeded, but the network failed while the server tried to acknowledge the successful commit to the client (so the client thinks it failed), then retrying the transaction causes it to be performed twice—unless you have an additional application-level deduplication mechanism in place.
• If the error is due to overload, retrying the transaction will make the problem worse, not better. To avoid such feedback cycles, you can limit the number of retries, use exponential backoff, and handle overload-related errors differently from other errors (if possible).
• It is only worth retrying after transient errors (for example due to deadlock, isolation violation, temporary network interruptions, and failover); after a permanent error (e.g., constraint violation) a retry would be pointless.