
Recent Outages - Causes, Fixes, and Future Stability

Details behind the recent series of outages in the Public Cloud: their causes, how we’ve fixed them, and expectations about future stability.

Why Have My Apps Gone Offline So Frequently?

V2 is off to a blazing start… as in “it looks like we’re on fire.” We’re mostly joking, of course, but it doesn’t feel that way to us or to users during a series of reboots. We obviously recognize that six outages in a week have been incredibly painful and are cause for serious concern. We’re sorry. You should know that every issue we’ve experienced has been resolved at this point, but here’s a deeper look at the causes, which have been diverse.

Background Information - We’re on a different Operating System

All V1 services have run on SmartOS servers since October 2013, almost entirely without incident. After the shift from Linux, SmartOS behaved beautifully, boosting our stability and letting us introspect the system in ways that weren’t possible on Linux. For the release of V2, we engineered Pagoda Box to fully integrate native SmartOS zone technology, which is proving to be a great long-term decision... with short-term hiccups (explained below).

The Source of Recent Outages - Short Answer

Put simply, the shift to a new operating system is the source of the recent outages. Our production usage (which differs from that of traditional SmartOS providers like Joyent) has uncovered a handful of issues not identified during testing or beta. The good news is that SmartOS provides the visibility and control to fix issues as they arise. Here are the issues behind the recent outages, all of which have been corrected in code (except for OS Issue 5, as noted) and are awaiting a maintenance window to be applied to all servers:

The Source of Recent Outages - Long Answer

For Joyent and others, SmartOS traditionally manages a few large, long-running virtual machines. On Pagoda Box, SmartOS manages many smaller, more frequently updated virtual zones. Our volume and frequency of zone updates on Pagoda Box are pushing the boundaries of SmartOS zone management, and we’ve been actively writing code to bring them back within acceptable ranges (see below).

Actual usage on V2 has brought to light a few unanticipated differences between running constant production load through a few large Linux virtual machines on SmartOS KVM (V1 since 2013) and running the same load through hundreds of smaller native zones (V2 since November 2014). Here are the two biggest differences.

GLOBAL ISSUE 1 - Factorially Slow Zone Ops and Reboots

Status - FIXED
Fixed in the following commits:
81c6e9169fbd6706b91ff43b5817d673e5ae53d7
cb97f4b359fa41f614a35762de7257a9d4470d07
abc5cf44d9dd987fd41b159978528c7438a07592
4ec73aadf9c3023cff031dca2d30ab9e729fe859
025a8a3eaab180ec6abe09ec6815156b3a74f82a
63b2317a2fcadc12e51b1816e1afa53c310a7c91
1e36aae7adab16ef14114a6a568aadd271b0f6b8
045db5afa64f2ffa5b197350fe67a9408d988e3a
075ac5bf837fcbb1300c383a5520af9f4c69c03a
f5c2b7bd712d6250bbfb8ca569398ce7e0620b30
8f57deaf646489e8ed46557c81d3ab984792b165
327a5c59ea807b552611032a5108c86fb73928e8
d717a264c54cdb15888147744624a8f9dd835e11
68f4be78b33504b2977678e1fb239ab46ba01c84
f0c8d695c23b09c138be7b6aef32258f3c29a933
(Awaiting Server Maintenance Window)

Just a few weeks after opening V2 to production loads, the time required for each zone operation (create, restart, reboot, destroy) had grown to unacceptable levels, and it continued to grow with each newly added zone. The trend was unsustainable: it lengthened downtime by over 400% during server maintenance or recovery from an outage.

Engineers began working steadily on this issue in late November, reducing zone deployment time over the course of December from 5 minutes to 2 minutes, then 1 minute, then 30 seconds, and, just a few days ago, to around 10 seconds. This has been a deep and complex effort with important consequences. Notably, deploys that used to take minutes will now complete in seconds, and, more importantly, servers will recover within 30 minutes rather than 4+ hours.
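
To make the scaling problem concrete, here’s a small, self-contained C sketch. Every number in it is invented for illustration (not a measurement from our servers); it simply shows how a per-operation cost that grows with the number of zones compounds when a server has to rebuild every zone during recovery, compared to a constant per-operation cost:

#include <stdio.h>

/* Illustrative only: compares total recovery time for N zones when each
 * zone operation scans every existing zone (cost grows with zone count)
 * versus a constant per-operation cost. All numbers are made up. */
int main(void) {
    int zones = 300;              /* hypothetical zones on one server        */
    double fixed_ms = 6000.0;     /* hypothetical constant cost per zone op  */
    double scan_ms  = 250.0;      /* hypothetical added cost per existing zone */

    double growing = 0.0, constant = 0.0;
    for (int n = 1; n <= zones; n++) {
        growing  += fixed_ms + n * scan_ms;  /* per-op cost rises with n */
        constant += fixed_ms;                /* per-op cost stays flat   */
    }

    printf("recovery, per-op cost growing with zone count: %.0f minutes\n",
           growing / 60000.0);
    printf("recovery, constant per-op cost:                %.0f minutes\n",
           constant / 60000.0);
    return 0;
}

With the growing cost, recovering 300 hypothetical zones takes several hours; with the flat cost, about half an hour. That is the same order of difference described above.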

GLOBAL ISSUE 2 - Excessively Paging RAM to Disk

Status - FIXED
Fixed in the following commits:
b58a9d5b87b39356240c9ee999fa6f01508adb4c
265fdf1fa798aca8469eca7893f2da49b927ec33
(Awaiting Server Maintenance Window)

When a few individual VMs consume their allocated RAM, default SmartOS begins to page RAM in and out of disk at a global level. Paging (often called swapping in the Linux community) is orders of magnitude slower and more CPU-intensive than in-memory operations. With only a small number of traditionally large, long-running zones, this relatively isolated occurrence is usually handled without incident. However, when smaller zones number in the hundreds, even a small percentage of paging zones can quickly tip an entire server into a downward spiral, consuming all global CPU until the server becomes unresponsive.

Updating the hardware to include a tuned combination of SSDs and traditional drives helps alleviate the performance hit associated with paging. Additionally, a custom configuration setting in SmartOS decouples this global behavior from the memory pressure of individual VMs.
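
As a rough back-of-the-envelope model (every number below is invented for illustration, not measured on our hardware), here is why even a small fraction of paging zones can push a whole server over the edge once zones number in the hundreds:

#include <stdio.h>

/* Toy model with invented numbers: a handful of paging zones can consume
 * the server's CPU before any application work gets done. */
int main(void) {
    int    total_zones        = 300;  /* hypothetical zones on one server          */
    double paging_fraction    = 0.05; /* only 5% of zones exhaust their RAM        */
    double cpu_per_paging_pct = 8.0;  /* invented: global CPU burned per paging zone */

    double paging_zones = total_zones * paging_fraction;
    double global_cpu   = paging_zones * cpu_per_paging_pct;

    printf("zones that are paging: %.0f\n", paging_zones);
    printf("global CPU consumed by paging alone: %.0f%%\n", global_cpu);
    /* 15 paging zones * 8% each = 120% -- past saturation, which is the
     * downward spiral described above. */
    return 0;
}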

OS Issue 1 - Deadman Timer

Status - FIXED
Fixed in commit 238ea93df4e059c5a7588858efc2db2f94365859
(Awaiting Server Maintenance Window)

This issue wasn’t really the cause of an outage, but fixing it is critical to preventing future downtime. With this patch, if a server becomes unresponsive, it triggers a core dump on reboot, helping engineers identify the cause.
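
For readers unfamiliar with the idea, here is a minimal userland sketch of a “deadman” watchdog in C. It illustrates the concept only, not the illumos kernel patch: a worker bumps a heartbeat, and a watchdog thread forces a core dump if progress stops.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Userland sketch of the deadman idea (not the illumos kernel code): the
 * worker bumps a heartbeat; the watchdog checks it periodically and calls
 * abort() to force a core dump if the system stops making progress. */
static atomic_long heartbeat;

static void *watchdog(void *arg) {
    (void)arg;
    long last = atomic_load(&heartbeat);
    for (;;) {
        sleep(3);                               /* deadman check interval */
        long now = atomic_load(&heartbeat);
        if (now == last) {
            fprintf(stderr, "deadman: no progress, dumping core for analysis\n");
            abort();                            /* leaves a core dump behind */
        }
        last = now;
    }
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, watchdog, NULL);

    for (int i = 0; i < 5; i++) {               /* healthy work: heartbeat advances */
        atomic_fetch_add(&heartbeat, 1);
        sleep(1);
    }
    pause();                                    /* simulate a hang: the heartbeat stops
                                                   and the watchdog aborts the process */
}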

OS Issue 2 - Available Memory Limit for Lightweight Processes

Status - FIXED
Fixed in commit f90eb5a5652b1f9dcefb32563522f43e944bba2a
(Awaiting Server Maintenance Window)

We originally configured servers to run with the OS-recommended memory limit for lightweight processes. Unfortunately, production usage pushed one server over that limit, causing an immediate reboot. We have adjusted our base server configuration to use the maximum setting, which is 4 times the original memory limit.
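
As a rough illustration of why lightweight processes consume a memory budget at all (this is an assumption-laden sketch, not the actual limit we hit): every userland thread is backed by a kernel lightweight process (LWP), and every LWP needs stack space, so the footprint scales with thread count.

#include <pthread.h>
#include <stdio.h>

/* Sketch: default per-thread stack size times a hypothetical LWP count.
 * Values come from the local libc defaults, not from our servers. */
int main(void) {
    pthread_attr_t attr;
    size_t stack = 0;

    pthread_attr_init(&attr);
    pthread_attr_getstacksize(&attr, &stack);   /* default per-thread stack */

    long threads = 20000;                       /* hypothetical LWP count   */
    printf("default stack per LWP: %zu KB\n", stack / 1024);
    printf("%ld LWPs need roughly %.1f GB of stack alone\n",
           threads, (double)threads * stack / (1024.0 * 1024.0 * 1024.0));

    pthread_attr_destroy(&attr);
    return 0;
}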

OS Issue 3 - Excessively Logging Superfluous Events

Status - FIXED
Fixed in commit dd384e06789ef730da77852c1296a8f0cef1231e
(Awaiting Server Maintenance Window)

The illumos kernel logs events that Pagoda Box would consider superfluous, such as an individual zone exceeding its resource limits. For the hypervisor, these events are unhelpful, since the individual zones are responsible for acting on them (they are alerted as well). Initially this doesn’t seem like much of a burden; however, as hundreds of zones are loaded onto the hypervisor, a steady stream of events from multiple zones can quickly jam the log stream, throttle disk I/O, and cause the hypervisor to become unresponsive. A code change in the kernel ensures that the global zone (hypervisor) no longer receives these events, since they are only relevant to the user zones.
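
The shape of the change is simple to sketch. The following C snippet is purely illustrative (the struct and function names are hypothetical, not the actual illumos code): the zone that exceeded its limit is still notified, but the event is kept out of the global zone’s log stream.

#include <stdio.h>
#include <stdbool.h>

/* Illustrative sketch, not the actual kernel patch: stop queueing per-zone
 * resource events to the global zone's log, since the affected zone already
 * receives the same event. */

#define GLOBAL_ZONEID 0   /* the hypervisor's own zone */

struct rctl_event {
    int         zoneid;   /* zone that exceeded a resource limit */
    const char *what;     /* e.g. "cpu-cap exceeded"              */
};

static void deliver_to_zone(const struct rctl_event *ev) {
    printf("zone %d notified: %s\n", ev->zoneid, ev->what);  /* stand-in for zone alert */
}

static void log_event(const struct rctl_event *ev, bool patched) {
    deliver_to_zone(ev);                      /* the zone always hears about it */
    if (patched && ev->zoneid != GLOBAL_ZONEID)
        return;                               /* patched: keep it out of the global log */
    printf("GLOBAL LOG: zone %d %s\n", ev->zoneid, ev->what);
}

int main(void) {
    struct rctl_event ev = { 42, "cpu-cap exceeded" };
    log_event(&ev, false);    /* before: the global zone logs every event  */
    log_event(&ev, true);     /* after: only the affected zone is notified */
    return 0;
}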

OS Issue 4 - Optimize Directory Name Lookup Cache Bug

Status - FIXED
Fixed in commit 1e36aae7adab16ef14114a6a568aadd271b0f6b8
(Awaiting Server Maintenance Window)

This is an issue we introduced inadvertently while trying to optimize performance and recovery time. Pagoda Box engineers added a reverse lookup to make zone deletes faster: the patch used a cache rather than locking and iterating through zones one at a time. The patch tested as functional, but production usage uncovered a race condition. Engineers updated the kernel to remove the race condition.
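
To show the pattern in miniature (this is a generic userland sketch, not the kernel patch): a small reverse-lookup cache lets a delete find its entry without walking and locking every zone, but the cache itself still needs its own lock, because a concurrent lookup and removal can otherwise race, which is exactly the kind of bug production exposed.

#include <pthread.h>
#include <string.h>
#include <stdio.h>

#define MAX_ZONES 4

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static const char *cache[MAX_ZONES];     /* slot -> zone name, NULL = free */

static int cache_lookup(const char *name) {
    int slot = -1;
    pthread_mutex_lock(&cache_lock);      /* the missing piece in the racy version */
    for (int i = 0; i < MAX_ZONES; i++)
        if (cache[i] && strcmp(cache[i], name) == 0) { slot = i; break; }
    pthread_mutex_unlock(&cache_lock);
    return slot;
}

static void cache_remove(int slot) {
    pthread_mutex_lock(&cache_lock);
    cache[slot] = NULL;                   /* invalidate under the same lock */
    pthread_mutex_unlock(&cache_lock);
}

int main(void) {
    cache[0] = "zone-web1";
    cache[1] = "zone-db1";
    printf("zone-db1 at slot %d\n", cache_lookup("zone-db1"));
    cache_remove(1);
    printf("zone-db1 at slot %d (removed)\n", cache_lookup("zone-db1"));
    return 0;
}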

OS Issue 5 - ZFS Corrupted Data Set

Status - RARE
Reported Upstream

This bug is known and has been reported to ZFS engineers, but it is considered extremely rare. If a ZFS dataset becomes corrupted in some way, attempts to delete the affected zone result in a reboot. This caused a server to loop repeatedly through reboots until we were able to isolate the affected zone.

OS Issue 6 - LX Brand Bug

Status - FIXED
Bug Report
Fixed in commit 0c3d73e940e1cc8a2daee9fafdc0701d34f363f0
(Awaiting Server Maintenance Window)

While adding code to accommodate LX Brand containers, Joyent introduced a race condition bug upstream. When two lightweight processes are created at the same time, one returns ‘success’ from the kernel while the other fails. However, the code assumed both had succeeded, and the resulting NULL pointer would crash the server. An update to the kernel code corrected this issue.
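
The bug class is easy to illustrate. The following C sketch uses hypothetical names (it is not Joyent’s LX brand code): when one of two concurrent creations fails, the caller must check the result instead of assuming success and dereferencing a NULL pointer.

#include <stdio.h>
#include <stdlib.h>

struct lwp { int id; };

/* Stand-in for the kernel routine: the "loser" of a concurrent create fails. */
static struct lwp *lwp_create(int lose_the_race) {
    if (lose_the_race)
        return NULL;
    struct lwp *l = malloc(sizeof(*l));
    if (l) l->id = 1;
    return l;
}

int main(void) {
    struct lwp *l = lwp_create(1);   /* simulate the failing creation */

    /* Buggy pattern: printf("created lwp %d\n", l->id);  -- NULL deref, crash */

    /* Fixed pattern: treat a failed create as an error instead of assuming success. */
    if (l == NULL) {
        fprintf(stderr, "lwp creation failed, backing out safely\n");
        return 1;
    }
    printf("created lwp %d\n", l->id);
    free(l);
    return 0;
}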

OS Issue 7 - IRQ Null Pointer

Status - FIXED
Fixed in commit 238ea93df4e059c5a7588858efc2db2f94365859
(Awaiting Server Maintenance Window)

An unsafe operation in the OS kernel’s interrupt (IRQ) handling could dereference a NULL pointer; adding a mutex made the operation safe.
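
In userland terms, the fix looks like the following sketch (illustrative only; the real change is in kernel interrupt handling): a pointer shared between two contexts is replaced and read only while holding a mutex, so the reader can never catch it mid-update.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static char *shared;                       /* resource touched by both contexts */

static void writer_replace(const char *msg) {
    char *fresh = strdup(msg);
    pthread_mutex_lock(&lock);
    free(shared);                          /* without the lock, the reader could  */
    shared = fresh;                        /* see freed or NULL memory right here */
    pthread_mutex_unlock(&lock);
}

static size_t reader_use(void) {
    pthread_mutex_lock(&lock);
    size_t n = shared ? strlen(shared) : 0;
    pthread_mutex_unlock(&lock);
    return n;
}

int main(void) {
    writer_replace("interrupt payload");
    return reader_use() > 0 ? 0 : 1;
}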

What to Expect Moving Forward

Fundamentally, the system underpinning V2 has performed exceptionally well. We have been able to track, diagnose, and correct issues, often within minutes, in spite of the 9,306,359 official lines of code in our OS. This is visibility and control we never had on V1, and it is a testament to the architecture and the upgrade.

However, users are also putting V2 under production load for the first time, and that load has uncovered a handful of unique issues. Having run V1 on SmartOS without incident for over a year, we’re confident in the overall solution, and we’ve been comfortable correcting the issues found thus far. We’re excited to roll out the most recent updates, which will reduce automated recovery time significantly, improve deployment speed, and make correcting any undiscovered issues far less painful in the future. Thank you for trusting us.
