The Problem We Were Actually Solving
When I first started looking at the issue, we were seeing an average of 50,000 failed login attempts every month from users in these countries. Our error logs were full of "502 Bad Gateway" responses, most of them caused by transient network outages in East Africa. The platform itself wasn't unstable; the weak link was our API, which had been built on the assumption of seamless global access.
What We Tried First (And Why It Failed)
Our initial solution was a failover mechanism that switched the API to a backup server whenever the network failed. Sounds simple enough. The problem was that our engineers got overly optimistic and started implementing various 'smarter' failover solutions, like dynamically routing users to the nearest available server. We ended up with a Frankenstein API that took, on average, 2 seconds longer to respond to a request, and in some cases it still failed. The error