LeanKit Kanban Outage Retrospective

LeanKit Kanban Outage Report
Thursday March 22nd
8:00 AM CDT – 5:45 PM CDT (GMT-5)

Background:
An outage of LeanKit Kanban on March 22nd resulted in approximately 9.5 hours of unscheduled downtime for our customers. For many of our customers, this occurred during peak business hours and caused hardships and difficulties far beyond what is acceptable from our point of view. At this point, we can offer apologies and assurances that we are taking corrective actions to prevent this type of outage from ever happening again.

What Caused the Outage:
At approximately 8:00 AM CDT (GMT-5), the hosting provider that we selected to host our infrastructure mistakenly reallocated the IP addresses for our private subnet. After that point, no network traffic was able to reach any of our infrastructure. This resulted in a complete loss of access to our systems and therefore limited the amount of diagnostic activity that we could perform. We immediately contacted our hosting provider and sent them all the troubleshooting information that we were able to compile. Our provider performed their diagnostic and corrective actions, including the replacement of major networking components. Unfortunately, despite numerous diagnostic attempts, they were unable to discover the root cause of the issue until late in the afternoon. Once the problem was discovered, the issue was resolved quickly and our system connectivity was re-established.

What We Should Have Done Better:
In addition to the underlying network issues, we discovered two significant flaws in our disaster recovery plans. Firstly, capacity planning for our failover environment was not sufficient to handle the growth in our user base. As we increased our production environment resources, we did not sufficiently grow the capabilities of the failover environment. Secondly, as we prepared to switch to the failover environment, we discovered that our data replication process was not sufficient to transfer the offsite data onto the failover servers. Problems with network throughput prevented a quick transfer of data into the failover environment. These two factors, together with continued assurances from the hosting provider that the main infrastructure would be up shortly, prevented us from switching to the failover site, which could have reduced the total downtime.

What We Plan to Do Better:
Based on the events of March 22nd, we have installed or plan to install the following safeguards to ensure that we do not have a similar problem in the future.

Improve Internal Communication – We have established a far more robust communication plan with our hosting provider. During the outage, the lack of status communications prevented us from relaying the most up-to-date information to our customers and left us without the timely information we needed to make informed decisions. We have now established direct communication pathways with resources that will ensure that appropriate action is being taken and information is being communicated effectively.

Improve External Communication – Despite our attempts to communicate system status information to all our customers, we now recognize that not all customers knew the correct avenues for receiving status information about the LeanKit Kanban system. We plan to have better communication and training so that all customers know the correct avenues for retrieving this status information.

Separate, Independent Hosting Environments – We have begun to establish a completely independent hosting environment that will result in dual hosting environments with warm failover capabilities between the environments. Additionally, we plan to implement processes that ensure that the data on the failover site is kept current in near real time. Both environments will be included in capacity planning exercises to ensure that parity is maintained between them.

Stringent Disaster Recovery testing – We will be increasing the scope and frequency of our Disaster Recovery testing to ensure that any issues are discovered before the process is required.

All of us here at LeanKit have a great deal of appreciation and respect for our customers and users of our software. We get a great deal of satisfaction from the knowledge that people are using LeanKit to improve their workplace as well as their lives. We recognize the difficulties that this outage caused for the companies, organizations and individuals using LeanKit. We humbly apologize for this and assure you that we are fully dedicated to putting the safeguards in place to reduce the likelihood and severity of any service disruption in the future.

Best Regards,
Stephen Franklin
Chief Technology Officer
LeanKit

Stephen Franklin

Stephen Franklin is CIO and co-founder of LeanKit. Before LeanKit, he led development teams in practicing Lean and Kanban as a Domain Architect at Healthtrust Purchasing Group and Chief Software Architect at eDoc4u. Follow Stephen on Twitter @leankitstephen.