The First Year of Pagoda Box v2 - A Retrospective

A look back at the end of v1, the beginning of v2, and where we are today.

This month marks one year of running on Pagoda Box v2. As we look back to the final days of Pagoda Box v1, we find ourselves wanting to laugh, cry, forget, and relish the nostalgia of constant midnight emergency phone calls and all-night server maintenance windows (a.k.a. movie marathons) … those were the days. At one year out, we thought it’d be fun (and cathartic) to take a look back and see how far things have come.

The End of v1

As detailed in our previous post, v1 was architected based on faulty assumptions; assumptions that ultimately led to v1’s demise. The most notable of these were (1) that we could depend on an upstream virtualization layer that was semi-stable at best, and (2) that we could move to self-managed Linux Containers (LXC), a technology which, at the time, was unfinished and not well-supported.

We took steps to stabilize the v1 infrastructure and minimize its negative effects on customers. We migrated what we could to bare metal machines; however, many key elements of our infrastructure still depended on faltering virtualization technologies. The major problems stemmed from the lack of both control and visibility. When things went wrong, it was nearly impossible to see why. When we could see why, restrictions in these technologies kept us from being able to do anything about it.

Only a handful of us were tasked with keeping v1 running while the majority of our team focused on getting v2 out the door as soon as possible. We continued to push changes to v1 and extend our runway, but everything we did amounted to this:

v1 was crumbling.

Luckily, it held out until v2 was ready. When it finally came time to decommission v1 servers, it probably should’ve felt like Travis having to put down Old Yeller…

But in all honesty, it felt more like this:

The Early Days of v2

What many may not fully realize is that Pagoda Box v2 was a top-to-bottom overhaul of our service. Not a single line of code from v1 was used in v2. We switched to a completely bare metal infrastructure; a new operating system (SmartOS) with virtualization baked right into the kernel, allowing us to confidently manage virtualization ourselves; and a completely re-architected storage system that decentralized writable storage (network storage). It was a huge undertaking, over a year in the making, with three main internal goals:

  1. Improve Stability
  2. Improve Visibility
  3. Ease Maintenance

We accomplished 2 & 3 right out of the gate. The switch to SmartOS proved to be key in both assessing and resolving issues. However, there was still room for improvement when it came to stability. Admittedly, many of you felt like this in the early days of v2:

We learned a lot during those early months. There were interruptions and outages, and we thank you for sticking with us as we ironed issues out. Our low-level visibility and ability to fix root causes of issues helped to reduce potentially day-long outages to minute-long or, at worst, hour-long outages. While zero outages was (and continues to be) the goal, the reduced duration of outages represented a major success.

Since the migration to v2, outages have been fewer and farther between. When they have happened, we’ve been able to bring apps back online quickly.

Where We Are Today

I’m happy to say that we are coming up on the four-month mark of zero non-hardware-related issues as well as zero global issues. It’s been a long road, but I’m confident in saying that Pagoda Box is now more stable than it’s been in our five-year history. This assessment doesn’t come only from measuring uptime, but from knowing the condition of the underlying technologies. The platform is solid.

I’m confident in saying that Pagoda Box v2 has accomplished what it was meant to. We have visibility into the infrastructure from top to bottom. When things happen, we can dig in and fix them. And most importantly, it is stable.
