Enterprise Grade

Background

A small agency provides online presence and social media management services to high-end customers: hedge funds, law firms, oil tycoons. There's many players in the field, from low-cost freelancers to established brand names; the agency is able to survive in this crowded market because they offer top-notch security and 5-star customer service.

Most of the computing infrastructure is hosted locally, in the basement of the agency's headquarters. It's an aging data center, with a capricious diesel generator, multiple unreliable high-speed internet connections, and overloaded server racks. The infrastructure is a major liability, with repeated outages causing the loss of customer accounts.

A thorough analysis indicates that to be truly resilient, the data center would require significant upgrades; a secondary site for disaster recovery would have to be built, and systems administrators would have to be hired. None of those are part of the agency's core competencies, so a decision is made to migrate the entire infrastructure to the cloud

Initial approach

The agency's business doesn't run on a commercial software platform, because none of the products that have been explored proved flexible enough to meet the needs of the demanding customers. The agency has instead opted for a highly customized approach, building new capabilities as needed. The result is a complex ecosystem with a constellation of bespoke systems partially integrated with shared components.

As the migration to cloud computing begins, the IT staff wants to rearchitect the entire infrastructure and make it cloud-friendly. They want to use microservices, an API gateway, event queues, containers. They want to go serverless and use native services.

A year goes by. Then two. The new cloud architecture looks good on paper but the agency cannot do a stop-the-world migration, so the IT staff spends a lot of time in firefigthing mode: repairing the old infrastructure, adding features to old systems, patching old databases. There's no end in sight and not much time is spent on the cloud project.

Black Tuesday

Things take a turn for the worse on a Tuesday. As they perform repairs on broken pipes, a city crew accidentally damages power lines down the street. The agency's building goes dark, including the data center in the basement where the diesel generator refuses to start. Employees scramble to install new generators and UPS units, but for hours the agency is dead in the water. As a result, two of the biggest accounts are lost.

Management can no longer afford delays in the cloud project. They know they need external help to move things along, so they reach out to us.

Throwing out the baby with the bathwater

As we get introduced to the IT staff, we realize that while the agency's ecosystem is quite complex, they have a solid handle on things. The cloud architecture they designed is excellent, and every single member of the staff embodies the agency's core values of security and customer service. The target is therefore not the issue; they just need a better roadmap.

The least elegant way to leverage cloud computing is called a lift and shift: simply clone the local infrastructure and host it in the cloud. It's crude and typically not cost-effective. In their quest for a better infrastructure, the IT staff initially turned away from a lift and shift, and opted instead for something much, much worse: the "might as well" approach.

We're migrating this app to the cloud? Might as well implement that REST API we keep postponing!
We're migrating this database to the cloud? Might as well modernize things and upgrade to NoSQL!
We're migrating this old message queue to the cloud? Might as well build a proper publisher-subscriber solution that runs in containers!

This type of approach works well when there's unlimited time and budgets, not when the local infrastructure is falling apart and eating up 40% of the IT staff time.

Time to get real

We decide to implement a strategy loosely based on the four steps of the VisibleOps approach.

Stabilize the patient. The first thing we do is a simple lift & shift. We replicate the entire infrastructure in the cloud, down to server names and operating system versions. Although we initially face resistance from the IT staff, who still believe they should do things "right", they soon realize that no longer having to deal with hardware issues or loss of network connectivity saves them a lot of time and stress.
Find fragile artifacts. We identify all the components that can be quickly transitioned to managed services: databases, application servers, message queues. This frees up even more time for the IT staff, since they no longer have to deal with operating system patches, database backups and similar chores.
Establish repeatable build library. Using the cloud vendor's API, we build a foundation for full-blown infrastructure automation. The IT staff, which was so far handling individual machines and databases as if they were unique artifacts, soon embrace the rebuild, don't repair philosophy of VisibleOps.
Enable continuous improvement. We help the IT staff transition to a continuous delivery model, embracing things such as blue/green deployment and feature flags. With these new processes and tools in place, they can now slowly move toward the target architecture they've had in mind for a long time.

The future looks bright

The agency's infrastructure is now rock-solid, and the IT staff can take things to the next level without our assistance. We helped the client reach this happy state in a much shorter time than they initially expected and helped boost morale in the organization as a whole in the process. The agency can now focus on delivering value to their customers.

As for the aging data center: with the exception of the diesel generator, everything has been left in place. It was initially considered an emergency exit plan in case cloud computing proved unreliable, but has since been "upgraded" to a gaming infrastructure for the employees to enjoy during lunch time.

Stairway to the cloud