Design Processing and API Downtime
Incident Report for Avocode
Postmortem

You probably noticed that it was difficult to access Avocode last Friday. We are really sorry about that. After every production incident (even if it doesn't impact customers), we do an internal postmortem to identify what went wrong and how we can improve in the future. Friday's incident impacted you and your team for a significant period of time, so we think it would be helpful to share our findings.

What went wrong?

The root cause of the incident was that our cloud service provider, Google Cloud Platform (GCP), experienced a critical issue in its networking control plane [1]. The issue was significant because it lasted for many hours and affected every GCP region at once.

The primary effect of Google's incident was that new compute instances could be started, but they came up without the correct network routes and were therefore unusable. Our monitoring detected the issue at 23:56 UTC, but it didn't start affecting customers until 05:34 UTC, right around the time Europe began its workday. As traffic increased, we couldn't scale our infrastructure, so we were serving peak traffic with off-peak capacity. We started seeing significant API slowdowns after 08:00 UTC.
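
For illustration, here is a minimal, hypothetical sketch (not our actual tooling) of the kind of startup connectivity check that can keep an instance without working network routes from ever taking traffic. The endpoints and timeout below are placeholder values.

    import socket
    import sys

    # Placeholder endpoints a freshly booted instance must be able to reach
    # before it is allowed to join the serving pool. Illustrative values only.
    REQUIRED_ENDPOINTS = [
        ("metadata.google.internal", 80),    # GCP metadata server
        ("api.internal.example.com", 443),   # hypothetical internal gateway
    ]
    TIMEOUT_SECONDS = 5

    def can_connect(host, port):
        """Return True if a TCP connection to host:port succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=TIMEOUT_SECONDS):
                return True
        except OSError:
            return False

    def main():
        unreachable = [ep for ep in REQUIRED_ENDPOINTS if not can_connect(*ep)]
        if unreachable:
            # A non-zero exit code lets a health check or autoscaler mark the
            # instance as unusable instead of routing customer traffic to it.
            print("network check failed for: %s" % unreachable, file=sys.stderr)
            return 1
        print("network check passed")
        return 0

    if __name__ == "__main__":
        sys.exit(main())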

The issue impacted our customers until 16:09 UTC. At that point, Avocode was fully functional and the processing backlog had been cleared.

How are we going to address it?

  1. The most difficult thing about this incident was that Google's issue affected all regions globally. We have the ability to set up compute infrastructure in different regions and temporarily redirect users there until an incident is resolved, but because every region was affected, that didn't help in this case.

    Towards the end of the incident, we managed to set up a functional disaster recovery environment in Amazon Web Services (AWS) and were about to start redirecting users to it when Google's services came back up. We plan to invest time in tooling that lets us spin up Kubernetes clusters in AWS as an additional disaster recovery location (a rough sketch of what such tooling could look like follows this list).

  2. During the incident, users inside the Avocode app had no easy way to tell that we were experiencing issues. We fixed this a few hours into the incident.
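
As mentioned in item 1, here is a rough idea of what that disaster recovery tooling could look like: the sketch below uses boto3 to create an EKS control plane and a managed node group in AWS. It is a minimal illustration only; the cluster name, region, IAM roles, subnets, and sizes are assumed placeholders, and real tooling would also cover VPC setup, DNS cutover, and data replication.

    import boto3

    # Placeholder identifiers -- real tooling would create or look these up.
    CLUSTER_NAME = "avocode-dr"
    CLUSTER_ROLE_ARN = "arn:aws:iam::123456789012:role/eks-cluster-role"
    NODE_ROLE_ARN = "arn:aws:iam::123456789012:role/eks-node-role"
    SUBNET_IDS = ["subnet-aaaa1111", "subnet-bbbb2222"]

    eks = boto3.client("eks", region_name="eu-west-1")

    # 1. Create the EKS control plane and wait until it is ACTIVE.
    eks.create_cluster(
        name=CLUSTER_NAME,
        roleArn=CLUSTER_ROLE_ARN,
        resourcesVpcConfig={"subnetIds": SUBNET_IDS},
    )
    eks.get_waiter("cluster_active").wait(name=CLUSTER_NAME)

    # 2. Add a managed node group sized for emergency capacity.
    eks.create_nodegroup(
        clusterName=CLUSTER_NAME,
        nodegroupName="workers",
        scalingConfig={"minSize": 3, "maxSize": 30, "desiredSize": 10},
        subnets=SUBNET_IDS,
        nodeRole=NODE_ROLE_ARN,
        instanceTypes=["m5.xlarge"],
    )
    eks.get_waiter("nodegroup_active").wait(
        clusterName=CLUSTER_NAME, nodegroupName="workers"
    )

    print("Cluster %s is ready for workload deployment." % CLUSTER_NAME)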

We will continue to work hard to provide a product that you can trust and depend on. We thank you for your business and your continued trust in Avocode.

[1] https://status.cloud.google.com/incident/compute/19008

Posted Nov 05, 2019 - 15:04 CET

Resolved
This incident has been resolved.
Posted Nov 01, 2019 - 17:10 CET
Update
All design processing has been completed. Avocode should be back at 100% now. If you find any further issues with the service, please contact our support team by clicking the chat icon in the bottom right of the app or on avocode.com.

Thanks for your patience. We know that you rely on Avocode during your work day (we do as well) and that downtime means lost productivity. We'll be doing a full postmortem and will discuss how to better handle situations like this.
Posted Nov 01, 2019 - 17:08 CET
Monitoring
The networking issue in Google Cloud seems to be fixed now (fingers crossed). We're slowly scaling our processing back up to prevent the backlog from overwhelming our systems.

We will update here once the backlog is consumed and everything is operational.
Posted Nov 01, 2019 - 16:21 CET
Update
We're continuing to investigate this.

Google Cloud Platform is experiencing a global issue that affects the networking of new compute nodes. As a result, we are unable to scale up to meet current demand.

The incident now affects the Avocode API and app.avocode.com as well.
Posted Nov 01, 2019 - 10:04 CET
Update
We're continuing to investigate this.

Google Cloud Platform is experiencing a global issue that affects the networking of new compute nodes. As a result, we are unable to scale up to meet current demand. The Avocode service may still be functional but it will be slow for the duration of the Google Cloud Platform incident. More details here: https://status.cloud.google.com/incident/compute/19008
Posted Nov 01, 2019 - 08:58 CET
Identified
Our design processing infrastructure is experiencing delays, which means that new designs will be safely uploaded but not available for commenting or inspecting for the duration of the incident.

This is caused by an issue at our cloud provider; we are currently investigating workarounds.

Your designs are safe and will eventually be processed once the incident is over. You can still comment on and inspect designs that have already been uploaded and processed.
Posted Nov 01, 2019 - 08:22 CET
This incident affected: Avocode API, Design Processing, and Avocode on the Web (app.avocode.com).