Good Retry, Bad Retry: An Incident Story

Beej Jorgensen · 1 month ago

@[email protected] · 1 month ago

Ideally, you’d limit your resource utilization to always leave enough of a buffer that your management tools can run. But even if that’s not the case, you should be able to disable incoming traffic so that your servers stop seeing the requests at all. Or you can just plain destroy and recreate with a new version.

But none of that addresses the fact that your retrying clients are basically DDoSing you. That can be mitigated by having your WAF filter requests so that only a fraction are passed to the server, as mentioned in the article, but preferably you’d just scale up to handle the load, or fix your clients to retry less frequently. Even a large number of clients shouldn’t be retrying so often that they overwhelm your system. Even if you’re selling Taylor Swift tickets, with millions of clients hammering you, you can scale horizontally enough to at least implement a queue for users, so they’re not hitting refresh every time they get a blank screen.
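To make the "only a fraction are passed to the server" idea concrete, here is a minimal sketch of probabilistic load shedding in Python. It is not how any particular WAF does it; the names `make_shedder` and `pass_fraction` are made up for illustration, and a real filter would sit in front of the server rather than inside it.

```python
import random


def make_shedder(pass_fraction, rng=random.random):
    """Return a function that admits roughly `pass_fraction` of requests.

    Hypothetical helper for illustration only: `rng` is injectable so the
    behaviour can be tested with a seeded generator.
    """
    if not 0.0 <= pass_fraction <= 1.0:
        raise ValueError("pass_fraction must be between 0 and 1")

    def admit():
        # rng() is uniform in [0, 1), so this is True with
        # probability pass_fraction.
        return rng() < pass_fraction

    return admit


if __name__ == "__main__":
    admit = make_shedder(0.25)
    passed = sum(admit() for _ in range(10_000))
    print(f"admitted {passed} of 10000 requests")
```

Rejected requests would typically get a cheap error response so the server spends almost nothing on them. The catch is that clients which retry on that error keep the load up, which is why the fraction alone doesn’t fix the retry storm.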

@[email protected] · 1 month ago

All of what you’re saying seems correct. I think this is more of a meta discussion on how (in this case) retries, even with exponential backoff, aren’t a solution by themselves when you look at the system overall. There are interesting hidden caveats to any common solution, and this is one I personally wasn’t aware of.

Practically, adding a timeout budget so that the clients themselves just error out (forcing a manual refresh) sorta accomplishes the same as what you’re positing.
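A rough sketch of what a client-side timeout budget could look like, in Python. This is one possible reading of the idea, not the article’s implementation: `call_with_budget`, `BudgetExceeded`, and all the default values are assumptions made up for the example. The client retries with capped exponential backoff and jitter, but gives up once the next wait would run past an overall deadline.

```python
import random
import time


class BudgetExceeded(Exception):
    """Raised when the overall time budget runs out before a call succeeds."""


def call_with_budget(operation, budget_s=10.0, base_delay_s=0.1,
                     max_delay_s=2.0, clock=time.monotonic,
                     sleep=time.sleep, rng=random.random):
    """Retry `operation` until it succeeds or the time budget is spent.

    All names and defaults here are hypothetical. `clock`, `sleep`, and
    `rng` are injectable so the function can be tested without waiting.
    """
    deadline = clock() + budget_s
    attempt = 0
    while True:
        try:
            return operation()
        except Exception as exc:  # real code should catch only retryable errors
            # Capped exponential backoff with full jitter; the exponent is
            # bounded so the multiplication cannot overflow.
            cap = min(max_delay_s, base_delay_s * (2 ** min(attempt, 20)))
            delay = rng() * cap
            if clock() + delay >= deadline:
                # The next wait would exceed the budget: surface the error
                # instead of adding more retries to the pile.
                raise BudgetExceeded(
                    f"gave up after {attempt + 1} attempts") from exc
            sleep(delay)
            attempt += 1
```

Once `BudgetExceeded` is raised, the UI can show an error and leave the next attempt to a manual refresh, so the automatic retry traffic from that client stops.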