You probably noticed that it was difficult to access Avocode last Friday. We are really sorry about that. After every production incident (even if it doesn't impact customers), we do an internal postmortem to identify what went wrong and how we can improve in the future. Friday's incident impacted you and your team for a significant period of time, so we think it would be helpful to share our findings.
The root cause of the incident was that our cloud service provider, Google Cloud Platform (GCP), experienced a critical issue in their networking control plane. The issue was significant because it lasted for an extended period of time and affected every GCP region globally.
The primary effect of Google's incident was that new instances could be started, but they were created without the correct routes for network connectivity, which made them unusable. Our monitoring detected the issue at 23:56 UTC, but it didn't start affecting customers until 05:34 UTC, right around the time that Europe started its workday. As traffic increased, we couldn't scale up our infrastructure, so we were serving peak traffic with off-peak capacity. We started seeing significant API slowdown after 08:00 UTC.
The issue impacted our customers until 16:09 UTC. At that point, Avocode was fully functional and the processing backlog was completed.
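For readers who want the timeline as durations, the arithmetic can be sketched as below. This is a minimal illustration only: the times of day come from this post, while the calendar dates are arbitrary placeholders, and the sketch assumes the 23:56 UTC detection fell on the evening before the customer impact began.

```python
from datetime import datetime, timedelta

# Placeholder dates (assumption): only the times of day come from the post.
DAY0 = datetime(2000, 1, 1)
DAY1 = DAY0 + timedelta(days=1)

detected = DAY0.replace(hour=23, minute=56)      # monitoring detected the issue
impact_start = DAY1.replace(hour=5, minute=34)   # customers first affected
impact_end = DAY1.replace(hour=16, minute=9)     # fully functional, backlog done


def fmt(delta: timedelta) -> str:
    """Format a timedelta as hours and minutes, e.g. '5h38m'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}m"


detection_lead = impact_start - detected       # time between detection and impact
impact_duration = impact_end - impact_start    # length of customer impact

print(fmt(detection_lead))   # 5h38m
print(fmt(impact_duration))  # 10h35m
```

In other words, roughly five and a half hours passed between detection and the first customer impact, and customers were affected for about ten and a half hours.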
The most difficult thing about this incident was that Google's issue affected all regions globally. Normally, we can set up compute infrastructure in a different region and temporarily redirect users there until an incident is resolved. In this case, that didn't help, because no GCP region was left unaffected.
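To make the reasoning concrete, here is a minimal sketch of the kind of decision involved. It is not our actual tooling: the function name, the region names, and the `dr_target` parameter are all hypothetical and exist only for illustration.

```python
from typing import Dict, Optional


def pick_failover_target(
    region_health: Dict[str, bool],
    primary: str,
    dr_target: Optional[str] = None,
) -> Optional[str]:
    """Pick where to send traffic (hypothetical helper, for illustration).

    Prefer the primary region, then any other healthy region of the same
    provider, then an optional disaster recovery target outside the provider.
    Returns None when nothing usable is available.
    """
    if region_health.get(primary, False):
        return primary
    for region, healthy in region_health.items():
        if healthy:
            return region
    return dr_target


# A single-region outage: traffic can move to another region.
regional = {"region-a": False, "region-b": True}
print(pick_failover_target(regional, "region-a"))  # region-b

# A global outage: no region qualifies, so only an outside target helps.
global_outage = {"region-a": False, "region-b": False}
print(pick_failover_target(global_outage, "region-a"))  # None
print(pick_failover_target(global_outage, "region-a", dr_target="other-cloud"))  # other-cloud
```

The second case is the one we hit: with every region unhealthy, region-level failover has nowhere to go, which is why a recovery location outside the provider matters.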
Towards the end of the incident, we were able to set up a functional Disaster Recovery environment in Amazon Web Services (AWS) and we were about to start redirecting users to it when we realized that Google was back up and running. We plan to invest some time into making sure that we have tooling to spin up Kubernetes clusters in AWS as an additional disaster recovery location.
During the incident, it wasn't easy for users inside the Avocode app to know that we were experiencing issues. We fixed this a few hours into the incident.
We will continue to work hard to provide a product that you can trust and depend on. We thank you for your business and your continued trust in Avocode.