A Note on Today’s Outage
We had an outage this morning from 06:32 to 07:42 PDT. One of the file servers experienced an unusually high load that caused the heartbeat monitor on that file server pair to behave abnormally and confuse the dynamic hostname that points to the active file server in the pair. This in turn caused the frontends to start timing out and resulted in their removal from the load balancer. Here is what we intend to do to prevent this from happening in the future:
- The slave file servers are still in standby mode from the migration. We will have a maintenance window tonight at 22:00 PDT in order to ensure that slaves are ready to take over as master should the existing masters exhibit this kind of behavior.
- To identify the root cause of the load spikes, we will be enabling process accounting on the file servers so that we can inspect which processes are causing the high load.
- As a related item, the site still gives a “connection refused” error when all the frontends are out of load balancer rotation. We are working on determining why the placeholder site that should be shown during this type of outage is not being brought up.
- We’ve also identified a problem with the single Unix domain socket upstream approach in Nginx. By default, any upstream failure causes Nginx to consider that upstream defunct and remove it from service for a short period. With only a single upstream, this means there is nothing left to serve requests until that period ends. We are testing a change to the configuration that should make Nginx always try upstreams.
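To illustrate the last item, here is a toy Python model of passive failure tracking for a single upstream. It is only a sketch and not Nginx's actual implementation: the `Upstream` class and `handle` function are hypothetical, and the `max_fails` and `fail_timeout` fields are named after Nginx's upstream server parameters on the assumption that the defaults are one failure and ten seconds. The post does not say which configuration change is being tested, so treating `max_fails=0` as "always try" is an assumption here as well.

```python
from dataclasses import dataclass


@dataclass
class Upstream:
    """Toy model of one upstream with passive failure tracking (hypothetical)."""
    max_fails: int = 1          # failures tolerated before benching; 0 = never bench
    fail_timeout: float = 10.0  # seconds the upstream stays benched
    fails: int = 0
    down_until: float = 0.0

    def available(self, now: float) -> bool:
        # The upstream is usable once its bench period has passed.
        return now >= self.down_until

    def record_failure(self, now: float) -> None:
        if self.max_fails == 0:
            return  # failure accounting disabled: keep trying the upstream
        self.fails += 1
        if self.fails >= self.max_fails:
            self.down_until = now + self.fail_timeout
            self.fails = 0


def handle(upstream: Upstream, now: float, backend_ok: bool) -> str:
    """Return what a client would see for one request at time `now`."""
    if not upstream.available(now):
        return "rejected: no live upstreams"
    if backend_ok:
        return "ok"
    upstream.record_failure(now)
    return "error: upstream failed"
```

In this model, one failed request at time 0 benches the only upstream, so a request one second later is rejected even if the backend has already recovered. With `max_fails=0`, the same request reaches the backend and succeeds.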
We apologize for the downtime and any inconvenience it may have caused. Thank you for your patience and understanding as we continue to refine our Rackspace setup and deal with unanticipated events.