Unplanned Outage Report for May 26, 2011

Today, from 9:57 AM to 11:15 AM CDT (14:57-16:15 GMT), the LeanKit Kanban application was intermittently unavailable or unresponsive while we worked to resolve database server performance issues. By 11:15 AM the application was back up and running smoothly.

We sincerely apologize for the inconvenience to all of our customers. This 78-minute outage was by far the longest we’ve ever had in just over two years of operation. We will continue to be proactive about catching and dealing with any issues that arise, and keeping you informed of all the important details.

The technical details of what caused the outage are listed below.

The database server performance issue was caused by the combination of a recently introduced bug fix and an unanticipated SQL Server optimization that got out of hand.

On May 19, we deployed a bug fix for the following situation:
– When a user had the archive tab open, and the board they were viewing had a large number of archived cards that were older than 90 days…
– If the board version updated (live updates from some other user), then the update query was returning ALL archived cards, including those older than 90 days. This made the board update operation slower than normal, because lots of additional data was being returned in the response.

We introduced a fix that altered the database query used to return board updates, so that it would only return archived cards that should be displayed on the archive tab (those less than 90 days old).
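As a rough illustration of that kind of filter, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for our SQL Server database. The `cards` table, its `board_id` and `archived_at` columns, and the `visible_archived_cards` function are hypothetical names invented for this example; they are not our actual schema or query.

```python
import sqlite3
from datetime import datetime, timedelta

ARCHIVE_WINDOW_DAYS = 90  # the archive tab shows cards archived within this window


def visible_archived_cards(conn, board_id, now):
    """Return ids of archived cards on a board that are less than 90 days old."""
    cutoff = (now - timedelta(days=ARCHIVE_WINDOW_DAYS)).isoformat()
    rows = conn.execute(
        "SELECT id FROM cards "
        "WHERE board_id = ? AND archived_at IS NOT NULL AND archived_at >= ? "
        "ORDER BY id",
        (board_id, cutoff),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cards (id INTEGER PRIMARY KEY, board_id INTEGER, archived_at TEXT)"
)
now = datetime(2011, 5, 26, 12, 0, 0)
conn.executemany(
    "INSERT INTO cards VALUES (?, ?, ?)",
    [
        (1, 1, (now - timedelta(days=10)).isoformat()),   # recently archived
        (2, 1, (now - timedelta(days=200)).isoformat()),  # archived long ago
        (3, 1, None),                                     # not archived
        (4, 2, (now - timedelta(days=5)).isoformat()),    # different board
    ],
)

recent = visible_archived_cards(conn, 1, now)
```

With this sample data, only card 1 falls inside the 90-day window for board 1, so the response no longer carries the older archived card.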

When one of our customers who had a large board with a lot of history older than 90 days ran into that same situation this morning, the updated query from May 19 ran. In this situation, the query and the amount of data involved caused SQL Server to attempt to optimize the query by running it in parallel. In some cases, SQL Server will attempt to break a query up into multiple queries and run them in parallel, piecing them back together as each part finishes.

When that happened, we had a runaway query on our hands. SQL Server’s parallel query became a “massively parallel” query. The SQL Server process pegged the processor on the database server, which caused the LeanKit Kanban application to become unresponsive for many of our users.

Fixing it was as simple as discovering what was happening and adding an appropriate index to the database.
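To show in general terms how an index changes the way a database executes a filtered query, here is a small sketch using Python's standard-library sqlite3 module. It is only an analogy: SQLite does not run parallel plans the way SQL Server does, and the `cards` table, the query, and the index name `ix_cards_board_archived` are assumptions made up for this example, not the index we actually added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema, for illustration only.
conn.execute(
    "CREATE TABLE cards (id INTEGER PRIMARY KEY, board_id INTEGER, archived_at TEXT)"
)

QUERY = "SELECT id FROM cards WHERE board_id = ? AND archived_at >= ?"


def plan(conn):
    """Return SQLite's query plan for QUERY as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (1, "2011-02-25")).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " | ".join(row[-1] for row in rows)


# Without an index, the database has to scan the whole table.
plan_before = plan(conn)

# An index on the filtered columns lets it seek directly to the matching rows.
conn.execute("CREATE INDEX ix_cards_board_archived ON cards (board_id, archived_at)")
plan_after = plan(conn)

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing the two plans shows the shift from a full table scan to an index search, which is the general reason an appropriate index can tame an expensive query.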

Again, we apologize for any inconvenience this may have caused.

Chris Hefley
CEO and Founder
LeanKitKanban.com

Chris Hefley

Chris Hefley is a co-founder of LeanKit. After years of coping with “broken” project management systems in software development, Chris helped build LeanKit as a way for teams to become more effective. He believes in building software and systems that make people’s lives better and transform their relationship with work. In 2011, he was nominated for the Lean Systems Society’s Brickell Key Award. Follow Chris on Twitter @indomitablehef.