4 Mistakes that Shook V1 and How They Are Shaping V2
V2 will introduce a new Virtual Data Center, illumos Operating System, Isolated Network Storage and PagodaGrid IaaS
The New V2: PaaS + IaaS
From a high level, Pagoda Box V2 merges an elastic, 3-tier IaaS with a fully upgraded version of our PaaS. The result is a simple, customizable, powerful Platform / Infrastructure as a Service for developers. Now, rapid deployments and managing Public, Private and Bare Metal resources will be simply awesome. We’re convinced you’re going to love it, no matter how big your app gets.
Even so, we are keenly aware that Pagoda Box V1 hosting and customer support have had a rocky history, especially early on. Moreover, there have been no updated features since late 2012, which has frustrated many V1 users. Recent Shared Writable Storage outages have also been painful. While this won’t change history, what follows is a frank explanation and a look behind the Pagoda Box curtain.
We’ve been dark for 2 years, fixing (not just patching) architectural V1 issues.
Our Naive V1 Assumptions
Pagoda Box’s first launch was based on 4 ill-fated assumptions:
- Building on a 3rd party cloud would accelerate and simplify operations
- Linux containers were production ready
- A global, distributed Shared Writable Storage solution would be fast and stable
- Standard Ops tools would provide our engineers adequate visibility and control (assuming 1, 2, and 3)
1 - Cloud vs Bare Metal
We built Pagoda Box V1 on a virtualized cloud. Even though we spent nearly a year researching and leveraging the best cloud option available, within 5 months we began experiencing frequent, severe outages. Engineers traced the issues down into the Xen virtualization layer, but without control of the Xen implementation or the underlying hosts, our hands were largely tied. We realized Pagoda Box’s success required extensive low-level control, which we simply didn’t have.
We spent several months restructuring internal services and migrating to a bare metal infrastructure, confident that direct metal access would resolve our stability issues. This move did fix the earlier cause of outages, but it also revealed another, more foundational limitation: Linux containers.
2 - Linux Containers in Production*
We love open source, and with LXC integrated into the mainstream Linux kernel, we assumed it was a seasoned, battle-tested, production-ready solution. We automated extensive systems and processes around that assumption, then traced erratic production behavior back to LXC containers. When our engineers reached out to the LXC development team, we discovered the project was maintained by a single part-time developer. Too late, we realized the technology was neither ready for our needs nor extensively supported.*
*Note: Thanks to the popularity of Docker, the collaboration effort on LXC has profoundly increased. There are notable vendors providing production solutions hosted in LXC containers, but our particular workloads overwhelmed the scope of Linux containers. While some of our initial issues with Linux containers may have been resolved, illumos and SmartOS have proven a much more production-ready hypervisor for our services.
3 - Shared Writable Storage
For V1, we launched a shared writable storage cluster, accessible through a FUSE client and bind-mounted directly into containers. Even though the storage cluster was secure and fully redundant, it provided no way to isolate disk I/O. ‘Noisy neighbors’ impacted users of V1 writable storage, as excessive reads and writes from even a few heavy apps could slow writable storage for all users. Many of our ‘outages’ have simply been excessive disk I/O impacting group writable storage.
4 - Outpacing Our Ops Team
Finally, Pagoda Box V1 provided great workflow enhancements for web developers. Unfortunately, because we assumed our virtualized cloud provider would manage underlying layers, nearly all our intuitive automation was developed for end-users. As we hurriedly migrated to bare metal, we were ill-prepared. Our engineers had only a basic array of system admin tools, without extensive automation. As our first three assumptions failed, expanding responsibilities and lack of tools left our Ops Team unable to adequately manage growth.
Brutal Reality = No Low Level Shortcuts
In layman’s terms, we made faulty assumptions about foundational technologies, and focused almost exclusively on the top-most layers. Mass adoption exposed those false assumptions in the worst possible way: unstable customer applications. At the very time we should have been releasing features and delighting customers, all resources were channeled into reassessing foundational technologies. After months of exploring possible solutions and trying to stem the tide of mounting issues, we shut down sales and marketing, limited support to existing customers, and threw all resources into rebuilding Pagoda Box V2 from the ground up. To correct every issue we experienced, we vowed to create a fully architected, every-layer, top-to-bottom solution with no shortcuts and no turtles.
Step 1 - New Virtual Data Center
V2 has replaced tunnels and port forwarding with a native SHOVEL (SmartOS Hypervisor Overlay of Ethernet LANs) network, fully virtualized with a native TCP stack that allows web components to talk to databases on a native IP address. Our SHOVEL Network uses the VXLAN protocol, which enables attaching dynamic IP addresses to a virtual server, accessible only to other services inside that app. SHOVEL runs in kernel space, so it’s extremely fast, and due to encapsulation and the illumos network stack implementation, totally secure.
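SHOVEL’s internals are our own, but the VXLAN encapsulation it builds on is standardized in RFC 7348: each overlay segment is identified by a 24-bit VXLAN Network Identifier (VNI) carried in an 8-byte header, tunneled inside a UDP packet on port 4789. As a rough illustration (the `vxlan_header` helper below is purely for explanation, not our implementation):

```python
import struct

# IANA-assigned UDP destination port for VXLAN traffic
VXLAN_PORT = 4789

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header defined in RFC 7348.

    The 0x08 flags byte marks the VNI field as valid; the 24-bit
    VNI identifies one overlay segment (e.g. one app's private LAN).
    """
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # flags (1 byte) + 3 reserved bytes, then VNI (3 bytes) + 1 reserved byte
    return struct.pack("!B3x", 0x08) + vni.to_bytes(3, "big") + b"\x00"

hdr = vxlan_header(42)
print(len(hdr), hex(hdr[0]), int.from_bytes(hdr[4:7], "big"))  # prints: 8 0x8 42
```

Because the VNI scopes every frame to its own segment, traffic for one app’s virtual LAN is invisible to every other app on the same wire.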
Step 2 - SmartOS Virtualization*
We chose a Joyent SmartOS distribution of the illumos kernel to replace our prior Linux operating system. It’s reliable, accessible, and proven. illumos is a Solaris variant, forked from OpenSolaris before it was closed to development by Oracle. Sun Microsystems stabilized the container technology (called zones) long before LXC was introduced to Linux. As a result, the illumos kernel is extremely stable, with decades of enterprise workload experience and native OS virtualization as a first-class citizen, and Pagoda Box engineers now have extensive troubleshooting visibility with DTrace, MDB, ptools, and others (we’ll probably blog about these at some point).
*Most of the existing V1 infrastructure has been hosted on SmartOS within kvm-branded zones since October 2013, even while app components retained LXC technology. We saw an immediate and significant improvement in server stability.
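For a sense of what provisioning on this stack looks like, SmartOS zones are created by feeding a JSON payload to its `vmadm` tool. A hypothetical OS-virtualized (joyent-brand) zone might be defined like this (the alias, image UUID, addresses, and sizes are placeholders, not our actual configuration):

```json
{
  "brand": "joyent",
  "image_uuid": "00000000-0000-0000-0000-000000000000",
  "alias": "example-web-1",
  "max_physical_memory": 512,
  "quota": 10,
  "nics": [
    {
      "nic_tag": "admin",
      "ip": "10.0.0.10",
      "netmask": "255.255.255.0"
    }
  ]
}
```

Passed to `vmadm create`, a definition like this boots a zone in seconds, because a zone is an isolated slice of the running kernel rather than a full guest OS.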
Step 3 - Isolated Network Storage
Writable storage is no longer centralized in a storage cluster. V2 provides private, scalable network storage inside each app, exposing a writable directory through an NFS server. Services inside each app can mount their own directories just as they would a local file system. Thanks in part to SmartOS, disk I/O can now be isolated for each app, so users can manage their own resources as usage increases.
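From inside an app, attaching that storage works like any standard NFS client mount. As a purely illustrative example (the hostname and paths here are hypothetical), a service could mount its writable directory with an `/etc/fstab` entry such as:

```
storage.app.internal:/data  /var/www/shared  nfs  rw,hard  0 0
```

Because each app talks only to its own NFS server, one app’s heavy disk I/O no longer degrades storage performance for anyone else.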
Step 4 - Robust Internal Automation
Correcting our V1 assumptions has taken almost 2 1/2 years, and the dashboard is just the ‘tip of the iceberg’. More than a year was spent creating tools and automation for our internal Ops Team, who are now working inside a full IaaS, with ‘back-end’ automation and control rivaling our customer-facing PaaS. This development will enable more rapid feature releases, less intrusive maintenance, and more visibility for customer support.
Shortly before the release of Pagoda Box V1, we met with two industry luminaries to discuss the future of Pagoda Box, and their past experience launching a groundbreaking PaaS. They stated generally, “if we had known how hard it would be to get [our PaaS] off the ground, we never would have tried it.” At the time, we thought we could relate. Three and a half years later, we recognize that hard lessons led to the innovations that are V2. While V1 has been painful for us and our users, we’re confident you’ll appreciate the result.