Incident: apps not functioning from 4 May, 7 AM to 5 May, 1 AM Pacific time (18 hours)
Which of our marketplace apps were affected?
- Due Time
We recently experienced an outage which started around 7 AM Pacific time on 4 May 2017. One of our servers was reported by our monitoring service to be low on disk space. At the time, SweetHawk staff were unfortunately still on an international flight and could not be notified immediately. After landing, we discovered the server had already run out of disk space and was no longer serving web requests. The other server could not keep up with the traffic, so many users saw an error message or simply the animated loading dots, and the apps would not function.
What did we do next?
We immediately tried to create another server to handle the extra load. This failed due to an error on the part of our deployment service, which could not provision the server automatically. We have since learnt that the cause was a mismatch between the version of the Ruby package on the existing server and the version on the new one; they are working on a resolution. By then the other web server was also running low on disk space and appeared to have become completely unresponsive. While we worked on restoring the existing servers, we decided to also rebuild the server stack completely on the side. The new stack was up and running before the existing servers could be restored, so we pointed our DNS at the new servers, and once the records started propagating, service was restored. This happened around 1 AM Pacific time the next day (5 May 2017). The incident was open for a total of 18 hours.
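For readers curious about the DNS switchover step: a check along the lines of the sketch below can confirm that public resolvers have picked up the new record. The hostname, IP address, and resolver here are illustrative placeholders, not our real values or tooling.

```ruby
require "resolv"

# Placeholder values -- not our real hostname or server address.
HOSTNAME = "app.example.com"
NEW_IP   = "203.0.113.10"

# True once the resolved addresses include the new server's IP.
def matches_expected?(addresses, expected_ip)
  addresses.map(&:to_s).include?(expected_ip)
end

# Query a specific public resolver so we see what users see,
# rather than what a locally cached resolver returns.
def propagated?(hostname, expected_ip, nameserver: "8.8.8.8")
  resolver = Resolv::DNS.new(nameserver: [nameserver])
  matches_expected?(resolver.getaddresses(hostname), expected_ip)
rescue Resolv::ResolvError
  false
end
```

Polling a check like this from a few geographic locations gives a rough picture of when most users have been moved over to the new stack.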
Why won't this issue occur in the future?
We regret that this issue happened. While no service can rule out future incidents entirely, we are applying what we have learnt to prevent anything like this from happening again and, if something does go wrong, to minimise the impact to you.
Here are some of the changes to our infrastructure and processes that we have already implemented:
- We use two separate servers for handling web requests. Our new servers run 6 web threads each, where the old servers ran 2. We increased this number because many requests spend most of their time waiting on Zendesk API calls, so each server can handle far more concurrent load on its own. This restores real redundancy: should one web server fail again, the other can carry the full load while we repair the spare.
- We've added more detailed monitoring so we can proactively add extra redundant servers as our load increases.
- Because our recovery process relies on being able to add servers to our load balancer, we will routinely exercise this step to verify that it works reliably and quickly.
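The monitoring change above is about catching disk pressure before it takes a server down. As an illustration of the idea (our real monitoring is an external service; the threshold and `df` parsing below are assumptions for the sketch, not our actual setup):

```ruby
# Illustrative sketch of a proactive disk-space check.
ALERT_THRESHOLD_PCT = 80  # warn well before the disk is actually full

# Used-space percentage for a mount point, parsed from `df -P` output.
def used_percent(mount = "/")
  `df -P #{mount}`.lines.last.split[4].delete("%").to_i
end

# True when usage has crossed the alert threshold.
def disk_alert?(pct, threshold = ALERT_THRESHOLD_PCT)
  pct >= threshold
end

if __FILE__ == $PROGRAM_NAME
  pct = used_percent("/")
  if disk_alert?(pct)
    warn "ALERT: disk #{pct}% used -- add capacity before web requests fail"
  else
    puts "OK: disk #{pct}% used"
  end
end
```

Alerting at 80% rather than at "out of space" is the whole point: it leaves time to add a server or free space while both web servers are still healthy.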
This was our first major outage in the past 18 months, and we pride ourselves on offering a reliable service. We apologise to our customers, and we thank those who contacted us. Your continued support means a lot. Please write to us if you have any questions or concerns at all.