In July we experienced an incident that resulted in degraded availability. We’d like to share our learnings from this incident with the community, in the spirit of being transparent about our service disruptions and helping other services improve their own operations.

July 13 08:18 UTC (lasting for four hours, 25 minutes)

The incident started when our production Kubernetes Pods began being marked as unavailable. This cascaded through our clusters, reducing capacity and ultimately bringing down our services. Investigation into the Pods revealed that a single container within each Pod was exceeding its defined memory limit and being terminated. Even though that container is not required for production traffic to be processed, Kubernetes requires all containers in a Pod to be healthy for the Pod to be marked as available.
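As an illustration of that failure mode (all names and values here are hypothetical, not the actual production configuration), a Pod is only considered available when every container in it is healthy, so a non-critical sidecar exceeding its memory limit and being OOM-killed still takes the whole Pod out of service:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server              # hypothetical name
spec:
  containers:
    - name: app                 # serves production traffic
      image: registry.example.com/app:v1
      resources:
        limits:
          memory: "512Mi"
    - name: metrics-sidecar     # not needed to serve traffic, but its
      image: registry.example.com/metrics:v1
      resources:
        limits:
          memory: "64Mi"        # exceeding this limit gets the container
                                # OOM-killed, marking the Pod unavailable
```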

Normally when a Pod runs into this failure mode, the cluster recovers within a minute or so. In this case, the failing container was configured with an imagePullPolicy of Always, which instructs Kubernetes to fetch a fresh copy of the container image on every start. However, due to a routine DNS maintenance operation that had been completed earlier, our clusters were unable to reach our registry, so Pods failed to start. The impact of this issue grew when a redeploy was triggered in an attempt to mitigate, and we saw the failure propagate across our production clusters. It wasn’t until we restarted the process with the cached DNS records that we were able to successfully fetch container images, redeploy, and recover our services.
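With imagePullPolicy: Always, every container start requires a successful registry fetch, so a DNS or registry problem turns a routine container restart into a startup failure. A minimal sketch of the difference (image name is hypothetical):

```yaml
containers:
  - name: metrics-sidecar
    image: registry.example.com/metrics:v1
    # Always: the kubelet contacts the registry on every container start,
    # so an unreachable registry blocks the Pod from starting at all.
    imagePullPolicy: Always
    # IfNotPresent would instead let the kubelet reuse a locally cached
    # image and ride out a temporary registry or DNS outage:
    # imagePullPolicy: IfNotPresent
```

This trade-off is one reason pull policies are worth revisiting: Always guarantees freshness at the cost of a hard runtime dependency on the registry.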

Moving forward, we’ve identified a number of areas to address this quarter: 

  • Enhancing monitoring to ensure Pod restarts cannot fail again in this same pattern
  • Minimizing our dependency on the image registry
  • Expanding validation during DNS changes
  • Reevaluating all the existing Kubernetes deployment policies

In parallel, we have an ongoing workstream to improve our approach to progressive deployments that will provide the ability to carefully evaluate the impact of deployments in a more incremental fashion. This is part of a broader engineering initiative focused on reliability that we will have more details on in the coming months.
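One basic form of incremental rollout is already built into Kubernetes Deployments: the rolling-update strategy bounds how many Pods are replaced at once, limiting the blast radius of a bad deploy. This is a generic sketch under assumed names and replica counts, not necessarily the progressive-deployment approach described above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                 # hypothetical service
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1     # take down at most one Pod at a time
      maxSurge: 1           # add at most one extra Pod above replicas
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: registry.example.com/app:v2
```

Because each new Pod must become Ready before the next old Pod is removed, a rollout that fails readiness checks stalls early instead of replacing the whole fleet.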

In summary

We place great importance on the reliability of our service and on the trust that our users place in us every day. We look forward to continuing to share more details of our journey and hope you can learn from our experiences along the way.