What happened
On April 8, about 1.5% of admin panel traffic started failing. API endpoints were not affected. The issue was fully resolved on April 9.
Why it happened
We recently rolled out a new version of HAProxy, the software that routes user traffic to our services, across our fleet of six load balancers. The upgrade had been tested in a separate test deployment and initially looked healthy in production as well. However, about seven hours after the upgrade, one of the six load balancers started failing to reach backend services, and that is what caused the errors users experienced.
The root cause is a bug in the new HAProxy version that only shows up under a very specific, still-unidentified set of conditions. Our test deployment and the other five production balancers kept working fine, which is why the issue slipped past our pre-rollout checks.
Why it took us a while to notice
When a request failed, our system automatically retried it, and the retry usually succeeded. So from the outside, things looked a bit slow or occasionally flaky rather than clearly broken. Automatic retries are normally a good thing, since they hide small, transient glitches from users, but in this case they also hid the growing problem from our monitoring, delaying detection.
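To illustrate the masking effect, here is a minimal simulation sketch. It assumes an idealized setup that is not how our fleet actually behaved: traffic is spread uniformly at random across six balancers, the unhealthy one fails every request it receives (in reality it failed only a fraction), and a single automatic retry is re-routed independently. The numbers it prints are illustrative only.

```python
import random

FLEET_SIZE = 6  # from the incident: six load balancers, one unhealthy

def send(request_id: int) -> bool:
    """Simulate one attempt: fails if routed to the broken balancer (index 0)."""
    return random.randrange(FLEET_SIZE) != 0

def send_with_retry(request_id: int) -> bool:
    """One automatic retry; the retry is re-routed independently."""
    if send(request_id):
        return True
    return send(request_id)

random.seed(42)  # deterministic run for the illustration
attempts = 100_000
first_try_failures = sum(not send(i) for i in range(attempts))
final_failures = sum(not send_with_retry(i) for i in range(attempts))

print(f"per-attempt failure rate:  {first_try_failures / attempts:.1%}")  # ~16.7% (1/6)
print(f"user-visible failure rate: {final_failures / attempts:.1%}")     # ~2.8% (1/36)
```

The point of the sketch: a retry roughly squares the failure probability, so a sizable per-attempt error rate shrinks into a small user-visible one. A monitor that only counts final request outcomes (as ours effectively did) sees the small number, not the large one.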
What we did
- Rolled back the affected balancer to the previous HAProxy version, which immediately restored normal traffic.
- Paused all further HAProxy upgrades across the fleet until we understand the trigger.
What's next
We're working to reproduce the bug in a controlled test environment so we can pinpoint the trigger, confirm a fix (either a patched version or a config change), and safely resume the rollout.