maestra - Degraded access to admin panel – Incident details

Degraded access to admin panel

Resolved
Degraded performance
Started 14 days ago
Lasted 1 day

Affected

Admin panel

Degraded performance from 3:37 PM to 4:28 PM

Updates
  • Postmortem

    What happened
    On April 8, about 1.5% of admin panel traffic started failing. API endpoints were not affected. The issue was fully resolved on April 9.

    Why it happened
    We recently rolled out a new version of HAProxy, the software that routes user traffic to our services, across our fleet of six load balancers. The rollout had been tested in a test deployment and initially looked healthy in production as well. However, about 7 hours after the upgrade, one of the six load balancers started failing to reach backend services, and that caused the errors users experienced.

    The root cause is a bug in the new HAProxy version that only shows up under a very specific, still-unidentified set of conditions. Our test deployment and the other five production balancers kept working fine, which is why the issue slipped past our pre-rollout checks.
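
    Because only one of the six balancers misbehaved, a fleet-wide average can dilute the signal. The following is a minimal sketch of a per-balancer comparison, not a description of our actual tooling. The function name, the balancer names, the thresholds, and the sample error rates are all hypothetical and chosen only for illustration.

```python
from statistics import median
from typing import Dict, List


def find_outlier_balancers(
    error_rates: Dict[str, float],
    factor: float = 5.0,
    floor: float = 0.01,
) -> List[str]:
    """Return balancers whose error rate exceeds both `floor` and
    `factor` times the fleet median error rate."""
    baseline = median(error_rates.values())
    # The floor avoids flagging noise when the whole fleet is near zero.
    threshold = max(floor, factor * baseline)
    return sorted(name for name, rate in error_rates.items() if rate > threshold)


# Illustrative numbers only, not measurements from this incident.
sample_rates = {
    "lb-1": 0.001,
    "lb-2": 0.002,
    "lb-3": 0.001,
    "lb-4": 0.09,
    "lb-5": 0.001,
    "lb-6": 0.002,
}
outliers = find_outlier_balancers(sample_rates)
print(outliers)
```

    Comparing each balancer against the fleet median, rather than against a fixed global error budget, is one way to surface a single unhealthy node while the rest of the fleet looks normal.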

    Why it took us a while to notice
    When a request failed, our system automatically tried it again, and the retry usually succeeded. So from the outside things looked a bit slow or occasionally flaky rather than clearly broken. That automatic retry behavior is normally a good thing — it hides small, transient glitches from users — but in this case it also hid the growing problem from our monitoring long enough to delay detection.
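
    The masking effect described above can be illustrated with a small sketch. It assumes a retry wrapper that counts every failed attempt separately from the final outcome, so that monitoring can alert on first-attempt failures even when the retry succeeds. The `metrics` counter, the `call_with_retry` function, and the `flaky_backend` stand-in are hypothetical names, not part of our actual system.

```python
from collections import Counter
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Hypothetical in-process counters; a real service would export these
# to its metrics backend instead.
metrics: Counter = Counter()


def call_with_retry(request: Callable[[], T], max_attempts: int = 2) -> T:
    """Run `request`, retrying on failure, and count every failed attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        metrics["attempts"] += 1
        try:
            result = request()
        except Exception as exc:
            # Recorded even if a later attempt succeeds, so the
            # underlying failure stays visible to monitoring.
            metrics["failed_attempts"] += 1
            last_error = exc
            continue
        if attempt > 1:
            metrics["succeeded_after_retry"] += 1
        return result
    # Every attempt failed: this is what the user actually sees.
    metrics["failed_requests"] += 1
    assert last_error is not None
    raise last_error


# Stand-in for a backend call that fails once, then succeeds.
call_count = {"n": 0}


def flaky_backend() -> str:
    call_count["n"] += 1
    if call_count["n"] == 1:
        raise ConnectionError("backend unreachable")
    return "ok"


response = call_with_retry(flaky_backend)
print(response, dict(metrics))
```

    In this sketch the caller receives a successful response, yet `failed_attempts` still increases. An alert on that counter, or on `succeeded_after_retry`, would fire even while user-visible failures stay low.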

    What we did
    - Rolled back the affected balancer to the previous HAProxy version, which immediately restored normal traffic.
    - Paused all further HAProxy upgrades across the fleet until we understand the trigger.

    What's next
    We're working to reproduce the bug in a controlled test environment so we can pinpoint the trigger, confirm a fix (either a patched version or a config change), and safely resume the rollout.

  • Resolved

    This incident has been resolved. We'll get back with details at a later date.

  • Investigating

    We are currently investigating elevated error rates affecting approximately 1.5% of traffic to some internal web services. API endpoints are operating normally and are not affected by this incident.