Recent Outages - Causes, Fixes, and Future Stability

Details behind the recent series of outages in the Public Cloud - Their causes, how we’ve fixed them, and expectations about future stability.

Why Have My Apps Gone Offline So Frequently?

V2 is off to a blazing start… as in “it looks like we’re on fire.” We’re mostly joking, of course, but it doesn’t feel that way to us or to users during a series of reboots. We recognize that six outages in a week have been incredibly painful and cause for serious reflection. We’re sorry. You should know that every issue we’ve experienced has now been resolved, but here’s a deeper look at the causes, which have been diverse:

Background Information - We’re on a different Operating System

All V1 services ran on SmartOS servers after October 2013, almost completely without incident. After the shift from Linux, SmartOS behaved beautifully, boosting our stability and making it possible to introspect in ways not possible on Linux. For the release of V2, we engineered Pagoda Box to fully integrate native SmartOS zone technology, which is proving to be a great long-term decision... with short-term hiccups (explained below).

The Source of Recent Outages - Short Answer

Put simply, the shift to a new operating system is the source of the recent outages. Our production usage (which differs from that of traditional SmartOS providers like Joyent) has uncovered a handful of issues not identified during testing or beta. The good news is that SmartOS provides the visibility and control to fix issues as they arise. Here are the issues behind the recent outages, all of which have been corrected in code (except OS Issue 5, as noted) and are awaiting a maintenance window to be applied to all servers:

The Source of Recent Outages - Long Answer

For Joyent and others, SmartOS traditionally manages a few, large, long-running virtual machines. On Pagoda Box, SmartOS manages many, smaller, more frequently updated virtual zones. Our volume and frequency of zone updates on Pagoda Box are pushing the boundaries of SmartOS zone management. We’ve been actively coding to bring this into acceptable ranges (see below).

Actual usage on V2 has brought to light a few unanticipated differences between running constant production load through a few large Linux virtual machines on SmartOS KVM (V1 since 2013) and running the same load through hundreds of smaller native zones (V2 since November 2014). Here are the two biggest differences.

GLOBAL ISSUE 1 - Factorially Slow Zone Ops and Reboots

Status - FIXED
Fixed in the following commits:
(Awaiting Server Maintenance Window)

Just a few weeks after opening V2 to production loads, the time required for each zone operation (create, restart, reboot, destroy) had increased to unacceptable levels, and continued to grow proportionally with each newly added zone. The trend was unsustainable, as it lengthened downtimes by over 400% during server maintenance or recovering from an outage.

Engineers worked steadily to resolve this issue beginning in late November, reducing zone deployment time over December from 5 minutes to 2 minutes, then 1 minute, then 30 seconds, and, just a few days ago, to around 10 seconds. This has been a deep and complex effort with important consequences. Notably, deploys that used to take minutes will occur in seconds, and, more importantly, servers now recover within 30 minutes rather than 4+ hours.
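To see why per-operation slowdown compounds, consider a simple illustrative cost model (not Pagoda Box code): if each zone operation has to scan every zone that already exists, per-op cost grows linearly with zone count, and the total work to recover a full server grows quadratically.

```c
#include <assert.h>

/* Illustrative cost model (not Pagoda Box code): if each zone
 * operation scans every zone that already exists, per-op cost grows
 * linearly with zone count and total cost grows quadratically. */
long scans_to_recover(long n_zones) {
    long total = 0;
    for (long z = 1; z <= n_zones; z++)
        total += z;              /* the op on zone z scans z zones */
    return total;                /* = n_zones * (n_zones + 1) / 2  */
}
```

Under this model, doubling the number of zones roughly quadruples total recovery work, which is why trimming the per-operation cost pays off so dramatically during full-server recovery.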

GLOBAL ISSUE 2 - Excessively Paging RAM to Disk

Status - FIXED
Fixed in the following commits:
(Awaiting Server Maintenance Window)

When a few individual VMs consume their allocated RAM, SmartOS by default begins to page memory in and out of disk at a global level. Paging (often called swapping in the Linux community) is orders of magnitude slower and more CPU-intensive than in-memory operations. With only a small number of traditionally large, long-running zones, this relatively isolated occurrence is usually handled without incident. However, when smaller zones number in the hundreds, even a small percentage of paging zones can quickly tip an entire server into a downward spiral, consuming all global CPU until the server becomes unresponsive.

Updating the hardware to include a tuned combination of SSDs and traditional drives alleviates the performance hit associated with paging. Additionally, a custom SmartOS configuration setting keeps the paging of an individual VM from dragging down the server globally.
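As rough arithmetic (the latency figures below are assumed, order-of-magnitude numbers, not measurements from our servers), even a tiny page-fault rate dominates average memory access time, and the gap between HDD and SSD page-in latency is why the tuned drive mix helps:

```c
#include <assert.h>

/* Assumed order-of-magnitude latencies in nanoseconds; illustrative,
 * not measured figures. */
#define RAM_NS 100.0
#define SSD_NS 100000.0      /* ~0.1 ms page-in from SSD  */
#define HDD_NS 10000000.0    /* ~10 ms page-in from disk  */

/* Average cost per memory access when a given fraction of accesses
 * must be paged in from backing storage. */
double effective_ns(double fault_rate, double page_in_ns) {
    return (1.0 - fault_rate) * RAM_NS + fault_rate * page_in_ns;
}
```

With a fault rate of just 0.1%, paging to HDD makes the average access roughly 100x slower (about 10,100 ns vs 100 ns), while paging to SSD keeps it around 2x (about 200 ns).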

OS Issue 1 - Deadman Timer

Status - FIXED
Fixed in commit 238ea93df4e059c5a7588858efc2db2f94365859
(Awaiting Server Maintenance Window)

This issue wasn’t really the cause of an outage, but fixing it was critical in preventing future downtime. If a server becomes unresponsive, this patch triggers a core dump on reboot, helping engineers identify the cause.
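For context, a deadman timer can be thought of as a watchdog on the system tick: if the tick stops advancing, the machine is hung and should panic so a core dump is captured on reboot. The sketch below is a hypothetical illustration of that decision logic, with invented names, not the actual patch:

```c
#include <assert.h>

/* Hypothetical sketch of a deadman-style check (names invented):
 * if the system tick counter has not advanced for several
 * consecutive checks, the box is considered hung and should panic
 * so a core dump is captured on reboot. */
typedef struct {
    long last_tick;
    int  stalled_checks;
    int  threshold;      /* consecutive stalled checks before panic */
} deadman_t;

/* Returns 1 when the watchdog decides the system is hung. */
int deadman_check(deadman_t *d, long current_tick) {
    if (current_tick == d->last_tick) {
        d->stalled_checks++;
    } else {
        d->stalled_checks = 0;
        d->last_tick = current_tick;
    }
    return d->stalled_checks >= d->threshold;
}
```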

OS Issue 2 - Available Memory Limit for Lightweight Processes

Status - FIXED
Fixed in commit f90eb5a5652b1f9dcefb32563522f43e944bba2a
(Awaiting Server Maintenance Window)

We originally configured servers with the OS-recommended memory limit for lightweight processes. Unfortunately, production usage pushed one server over that limit, causing an immediate reboot. We have adjusted our base server configuration to use the maximum setting, 4 times the original limit.

OS Issue 3 - Excessively Logging Superfluous Events

Status - FIXED
Fixed in commit dd384e06789ef730da77852c1296a8f0cef1231e
(Awaiting Server Maintenance Window)

The illumos kernel logs events that Pagoda Box considers superfluous, such as an individual zone exceeding its resource limits. For the hypervisor, these events are unhelpful, as the individual zones are responsible for acting on them (they are alerted as well). Initially this doesn’t seem like a burden; however, as hundreds of zones are loaded onto the hypervisor, a steady stream of events from multiple zones can quickly jam the log stream, throttle disk I/O, and cause the hypervisor to become unresponsive. A code change in the kernel ensures that the global zone (hypervisor) no longer receives these events, as they are only relevant to the user zones.
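The fix amounts to filtering events by the zone they concern before they reach the global zone's log stream. A hypothetical sketch of that filter (zone ID 0 conventionally denotes the global zone; the function name is invented):

```c
#include <assert.h>

#define GLOBAL_ZONEID 0   /* zone ID 0 conventionally denotes the global zone */

/* Hypothetical sketch of the filtering described above: a resource
 * event is logged by the global zone only if it concerns the global
 * zone itself; events about user zones go to those zones alone. */
int log_in_global_zone(int event_zoneid) {
    return event_zoneid == GLOBAL_ZONEID;
}
```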

OS Issue 4 - Optimize Directory Name Lookup Cache Bug

Status - FIXED
Fixed in commit 1e36aae7adab16ef14114a6a568aadd271b0f6b8
(Awaiting Server Maintenance Window)

This is an issue we inadvertently introduced while trying to optimize performance and recovery time. Pagoda Box engineers added a reverse lookup to make zone deletes faster, using a cache rather than locking and iterating through zones one at a time. The patch passed functional testing, but production usage uncovered a race condition. Engineers updated the kernel to remove the race condition.
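The general shape of such a fix is to serialize access to the shared cache. The sketch below is a hypothetical illustration using a mutex-guarded lookup table, not the actual kernel patch:

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical sketch, not the actual kernel patch: a reverse-lookup
 * cache whose insert and lookup paths are serialized by a mutex, so
 * concurrent zone deletes can no longer race on the shared state. */
#define CACHE_SLOTS 128

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static int cache[CACHE_SLOTS];   /* cache[hash] = zone id (0 = empty) */

void cache_insert(unsigned hash, int zone_id) {
    pthread_mutex_lock(&cache_lock);
    cache[hash % CACHE_SLOTS] = zone_id;
    pthread_mutex_unlock(&cache_lock);
}

int cache_lookup(unsigned hash) {
    pthread_mutex_lock(&cache_lock);
    int zone_id = cache[hash % CACHE_SLOTS];
    pthread_mutex_unlock(&cache_lock);
    return zone_id;
}
```

The trade-off is the classic one the patch navigated: a lock-free cache is faster but must be proven race-free; a mutex is slower but makes the invariant obvious.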

OS Issue 5 - ZFS Corrupted Data Set

Status - RARE
Reported Upstream

This bug is known and has been reported to ZFS engineers, but it is considered extremely rare. If a ZFS dataset becomes corrupted in some way, attempts to delete the zone result in a reboot. This caused a server to loop repeatedly through reboots until we were able to isolate the affected zone.

OS Issue 6 - LX Brand Bug

Status - FIXED
Bug Report
Fixed in commit 0c3d73e940e1cc8a2daee9fafdc0701d34f363f0
(Awaiting Server Maintenance Window)

While adding code to accommodate LX Brand containers, Joyent introduced a race condition upstream. When two lightweight processes are created at the same time, one returns success from the kernel while the other fails. However, the code assumed success in both cases, so the failing path dereferenced a NULL pointer and crashed the server. An update to the kernel code corrected this issue.
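The underlying pattern is a classic one: a fallible call whose failure path yields NULL must be checked before the pointer is used. A hypothetical sketch of that pattern (the type and function names are invented, not from the actual patch):

```c
#include <assert.h>
#include <stddef.h>

typedef struct lwp { int id; } lwp_t;

/* Hypothetical sketch of the pattern behind the fix: lightweight
 * process creation can legitimately fail when two creations race,
 * so the result must be checked before it is dereferenced. */
int start_lwp(lwp_t *newlwp) {
    if (newlwp == NULL)
        return -1;        /* creation raced and lost: fail gracefully */
    newlwp->id = 1;       /* safe: pointer verified non-NULL */
    return 0;
}
```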

OS Issue 7 - IRQ Null Pointer

Status - FIXED
Fixed in commit 238ea93df4e059c5a7588858efc2db2f94365859
(Awaiting Server Maintenance Window)

An unsafe operation in the OS kernel required the addition of a mutex.

What to Expect Moving Forward

Fundamentally, SmartOS has performed exceptionally well as the foundation of V2. We have been able to track, diagnose, and correct issues, often within minutes, in spite of the official 9,306,359 lines of code in our OS. This is visibility and control we never had on V1, and it is a testament to the architecture and the upgrade.

However, users are bringing V2 up to production speed for the first time, and doing so has uncovered a handful of unique differences. Having run V1 without incident on SmartOS for over a year, we’re confident in the overall solution and comfortable with how the issues thus far have been corrected. We’re excited to roll out the most recent updates, which will reduce automated recovery time significantly, improve deployment speed, and make correcting any undiscovered issues far less painful in the future. Thank you for trusting us.
