I recently had the pleasure of setting up a continuous deployment environment for a frequently visited .NET-based web application that runs 24/7 and allows visitors to purchase tickets for hundreds of events. Since this application is routinely updated with new functionality by a variety of developers, the need arose for non-disruptive deployments without a complicated and risky manual deployment process. In this post, I share the design we came up with and explain how we tackled some of the problems we ran into.
Deployment, or how we used to do it
Originally, the application was deployed manually from Visual Studio by publishing it to the webserver with WebDeploy, but not before making a backup of both the application and the database to allow a fallback in case of unforeseen problems. Changes to the database (schema and data) were manually synchronized from a staging server to the production server. When the publish was completed, a set of manual smoke tests was performed to see if everything was operating correctly. This whole procedure took around an hour on average. Not too bad. But since the publish potentially disrupted active users (500 errors, missing files or timeouts), the site was usually taken offline for the duration. Most publishes were done late in the evening or on weekends to minimize the impact on visitors and customers.
This approach was limited for several reasons:
- As traffic to the application increased, it became increasingly difficult to find a window to perform the publish, perform tests and roll back when necessary;
- Although the publish itself was quite straightforward, it did involve a lot of manual steps (making backups, taking the site offline, running tests, rolling back when necessary, tagging the published commit) that are prone to errors (forgetting, wrong order, etc.);
- The publish was usually done by the developer that added the functionality. Since it involved many manual steps, it was hard to delegate the publish to others;
- Deployments had to be done in the evenings or on weekends, which was obviously not preferred by developers.
Taken together, these limitations made publishing new functionality increasingly hard. It’s no surprise that the publish was often postponed or put on hold until more functionality was finished. This made deployments even harder, as the number of changes and smoke tests increased. Not very agile: because deployment was painful, it wasn’t done as often as it could have been. A healthier approach would be to focus on making the deployment less painful.
Deployment, or what the new design looked like
The new approach was inspired by the blue/green deployment pattern as described by Martin Fowler. This pattern involves a router, proxy or load-balancer and several webservers that act as ‘active’ (green) and ‘passive’ (blue) nodes. Whenever a publish is performed, the passive nodes are updated to the latest release. Since users are routed to the active nodes only, there are no disruptions for them. Available automated tests and checks are run on the blue nodes and, when everything is ok, the ‘passive’ and the ‘active’ nodes switch roles by routing traffic from the green to the blue nodes. The previously active - but now idle - nodes stick around, but with a previous version of the software.
A schematic view of blue/green deployment. Courtesy of Martin Fowler.
What’s nice about this pattern is that you have a built-in option for rolling back in case of problems. If you run into issues after the switch, you can simply route traffic back to the previously active nodes, which still run the previous version, and try again.
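The switch and the rollback can be sketched in a few lines. This is a minimal, language-neutral illustration in Python, not how any particular router or platform implements it; `Pool`, `Router` and the check function are hypothetical names made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A group of webservers that all run the same release."""
    name: str
    version: str


@dataclass
class Router:
    """Sends all traffic to the active pool; the passive pool stays idle."""
    active: Pool
    passive: Pool

    def deploy(self, version, checks):
        """Update the passive pool and switch only if every check passes."""
        self.passive.version = version
        if not all(check(self.passive) for check in checks):
            return False  # visitors never saw the failed release
        self.switch()
        return True

    def switch(self):
        """Swap the roles; calling this a second time is the rollback."""
        self.active, self.passive = self.passive, self.active


router = Router(active=Pool("green", "1.0"), passive=Pool("blue", "1.0"))
deployed = router.deploy("1.1", checks=[lambda pool: pool.version == "1.1"])
print(deployed, router.active.name, router.active.version)  # True blue 1.1

router.switch()  # roll back to the nodes that still run the previous release
print(router.active.name, router.active.version)  # green 1.0
```

Note that a failed check leaves the router untouched, so a broken release only ever exists on the idle nodes.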
To further benefit from this approach, I wanted to automate the deployment process to such an extent that committing code changes to the master branch (in Git) was sufficient to trigger a publish to production. The goal was to make publishing so simple and non-disruptive that new changes can be pushed to production without worries at any time of day. But this design presented us with a few challenges:
- Do we set up this environment ourselves, or do we use an existing Platform-as-a-Service?
- How do you update a database and data?
- How do you deal with user sessions?
- How do you avoid publishing broken code?
I will discuss these challenges below.
What platform to use?
This was the easiest challenge. Although setting up your own continuous deployment environment is possible with tools such as AppVeyor and Octopus Deploy (more about this in a future post), I opted for the existing PaaS platform AppHarbor. AppHarbor is easy to set up, free to use for small websites (and testing purposes), and supports a lot of the features we were looking for.
AppHarbor allows you to automatically deploy new builds to web workers (in this case 2). Notice the number of deployments done to production in a single hour, and the average deployment duration of only a few minutes.
First of all, AppHarbor pretty much does blue/green deployment out of the box. You can make AppHarbor listen to a Git repository and pull, build, test and deploy new commits automatically. By design, builds are deployed to a new web worker instance (in Amazon AWS). Only when the deployment has succeeded are visitors transparently redirected to the new instances. AppHarbor updates its router (Nginx) automatically. No manual work is involved.
Second, AppHarbor makes setting up load-balanced clusters easy. For our application we opted for two beefed-up, load-balanced web workers and a dedicated SQL-Server instance. Because of AppHarbor’s design we can easily add more web workers in the future as load increases. The deployment procedure remains the same; the new instances only become active when the deployment has succeeded (and the instances respond).
AppHarbor is a PaaS platform, so you pay for the instances and the features (like IP-based SSL, custom hostnames or a custom load-balancer) that you use. Its freemium model is useful for very small sites (like my own project, TeamMetrics), but the pricing is quite steep after that. There are alternatives, like Azure, but I prefer AppHarbor’s pricing model and feel that configuration is much simpler. AppHarbor’s staff was also very helpful when setting up different SSL-certificates for different hostnames.
How do you update the database?
One of the biggest challenges in continuous deployment is how to update the database. A new version may contain new or updated columns or require specific data in the database. Since the application was already built on top of Entity Framework, I simply configured it to use Code-First migrations. This requires that every schema change is scripted into a C# class that is added to the solution and automatically run wherever the application is deployed. Generating the class is automatic, but has to be triggered manually by the developer. Data migrations can also be added to these scripts. You can read more about Entity Framework and Code-First Migrations here.
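The general idea behind such migrations, independent of Entity Framework, is an ordered list of changes plus a record of which ones a given database has already received. The sketch below illustrates that idea in Python with an in-memory SQLite database; it is not the Entity Framework API, and the migration names, the `migration_history` table and the `events` table are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical migrations, listed in the order in which they must run.
MIGRATIONS = [
    ("001_create_events",
     "CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    ("002_add_venue",
     "ALTER TABLE events ADD COLUMN venue TEXT"),
    ("003_seed_first_event",
     "INSERT INTO events (name, venue) VALUES ('Opening night', 'Main hall')"),
]


def migrate(conn):
    """Apply every migration that this database has not recorded yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS migration_history (id TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT id FROM migration_history")}
    newly_applied = []
    for migration_id, sql in MIGRATIONS:
        if migration_id in applied:
            continue  # this database already has the change
        conn.execute(sql)
        conn.execute(
            "INSERT INTO migration_history (id) VALUES (?)", (migration_id,)
        )
        newly_applied.append(migration_id)
    conn.commit()
    return newly_applied


conn = sqlite3.connect(":memory:")
first_run = migrate(conn)   # applies all three migrations
second_run = migrate(conn)  # applies nothing: the history is up to date
```

Because the runner skips what is already recorded, it is safe to call on every deployment, which is what makes the "automatically run wherever the application is deployed" part work.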
One limitation of our approach is that we use a single SQL-Server instance for both the green and the blue environment. This means that breaking database changes (like dropping a column, a table or a record that is needed by the old version) will introduce short disruptions while the database is updated and the new version is not yet active. Developers have to be aware of this and push breaking changes only during hours with low traffic. Since the vast majority of database updates to this application are additive, we felt comfortable with this limitation. When breaking changes start occurring more frequently, we will have to set up two database instances (a blue and a green one).
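The difference between an additive and a breaking change is easy to demonstrate against a shared database. The following is a small Python sketch with an in-memory SQLite database standing in for the shared SQL-Server instance; the `events` table and the query are hypothetical, and the table rename is used as a stand-in for any change that removes something the old version still needs.

```python
import sqlite3


def old_version_query(conn):
    """The query the previous release still runs while it serves traffic."""
    return conn.execute("SELECT name FROM events").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO events (name) VALUES ('Opening night')")

# Additive change: the old release simply ignores the new column.
conn.execute("ALTER TABLE events ADD COLUMN venue TEXT")
rows_after_additive_change = old_version_query(conn)

# Breaking change: to the old release, a renamed table looks like a dropped one.
conn.execute("ALTER TABLE events RENAME TO shows")
try:
    old_version_query(conn)
    old_version_still_works = True
except sqlite3.OperationalError:
    old_version_still_works = False
```

After the additive change the old query returns its row as before; after the rename it fails, which is the short disruption visitors would see until the new version takes over.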
How do you deal with session data?
Another challenge was how to deal with user sessions. The application features a multi-step purchase process where state (selected products, purchase information) is persisted in the user’s session until the purchase is finalized. Furthermore, logged-in customers have a user session. One of the problems we quickly ran into was that user sessions were lost after every build or switch to a different web worker in the cluster. This was undesirable, as it caused annoying disruptions for users, like having to log in again or restart the purchase.
The solution I came up with was to use MemCached to write the session data to a remote server (in this case MemCachier). I installed the EnyimMemcached NuGet-package and added a bit of configuration to the Web.config. That was all that was needed to make all calls to HttpContext.Session in the code work with MemCachier instead of using the server’s own memory. Apart from the Web.config file, nothing in the code had to be changed. Pretty cool.
The result of this approach is that users don’t lose their sessions when new builds become active or when they switch from load-balanced web worker A to B. This is really cool, as users can continue to use the application during the deployment. The only way to notice that an update took place is by checking the (incremented) version number in the footer.
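Why a remote session store solves both problems at once can be shown with a toy model. In this Python sketch a plain dictionary stands in for the remote cache, so it says nothing about the actual Memcached or EnyimMemcached API; `Worker` and its methods are hypothetical names.

```python
class Worker:
    """A web worker that keeps sessions in whatever store it is given."""

    def __init__(self, name, session_store):
        self.name = name
        self.sessions = session_store

    def add_to_basket(self, session_id, ticket):
        basket = self.sessions.get(session_id, [])
        self.sessions[session_id] = basket + [ticket]  # write the state back

    def basket(self, session_id):
        return list(self.sessions.get(session_id, []))


# In-process sessions: every worker only knows its own memory.
worker_a = Worker("A", session_store={})
worker_b = Worker("B", session_store={})
worker_a.add_to_basket("visitor-1", "concert ticket")
lost = worker_b.basket("visitor-1")  # empty: B never saw the session

# Remote sessions: one store that outlives any single worker.
remote_store = {}  # stands in for the remote cache
worker_a = Worker("A", session_store=remote_store)
worker_b = Worker("B", session_store=remote_store)
worker_a.add_to_basket("visitor-1", "concert ticket")
kept = worker_b.basket("visitor-1")

# A new build replaces the workers, but not the store.
worker_c = Worker("C (new build)", session_store=remote_store)
kept_after_deploy = worker_c.basket("visitor-1")
```

With the shared store, both the hop from worker A to B and the arrival of a freshly deployed worker find the basket intact.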
How do you avoid publishing broken code?
Continuous delivery only works well if the code is of good quality. Publishing bugs or broken code is going to frustrate users and customers alike. Any new builds should be thoroughly tested before going live. In the case of AppHarbor, this is done automatically as available (unit) tests are located and run by the build server. The new deployment is only successful when all tests pass. You can set up a similar ‘gated check’ on your own build or deployment server.
Obviously, this approach requires discipline from developers. They have to write unit and regression tests for new functionality. New functionality should be tested carefully before pushing it to the repository.
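Such a gate boils down to "run the tests, and only call the deploy step when all of them pass". Below is a minimal sketch of that rule in Python, using the standard `unittest` module; it is not how AppHarbor implements it, and `ticket_total`, `gated_deploy` and the build label are hypothetical names for illustration.

```python
import io
import unittest


def ticket_total(price, quantity):
    """Hypothetical application code under test."""
    return price * quantity


class TicketTotalTests(unittest.TestCase):
    def test_total_for_multiple_tickets(self):
        self.assertEqual(ticket_total(25, 3), 75)


def gated_deploy(test_case, deploy):
    """Run the tests and call deploy() only when all of them pass."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case)
    # Capture the runner's report instead of printing it.
    result = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
    if result.wasSuccessful():
        deploy()
    return result.wasSuccessful()


released = []
passed = gated_deploy(TicketTotalTests, deploy=lambda: released.append("build 42"))
```

The deploy step is passed in as a function so that the gate itself never needs to know how a release is published.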
A good approach is to use Git Flow. This means that the source repository has a master branch that contains the ‘production version’. AppHarbor (or your CI server) deploys only new commits on this branch. Whenever new functionality is written, it is done either on a special ‘development branch’ or even on a ‘feature branch’. When the work is done, the code is stable and everything is ready for deployment, the changes are merged back into the master branch. This triggers the aforementioned automatic deployment.
It’s also a good idea to have a bunch of sensors that frequently check if your application is still responding and loading correctly. I installed NewRelic for this and use the uptime monitoring feature to notify the development team when issues arise after a publish. I’m still investigating how to add UI tests to the test battery. This would allow us to run a number of integration tests that simulate a limited number of usage scenarios (like purchasing a ticket). But I haven’t found a workable approach for this yet.
We’re pretty happy with the end result. New functionality is now automatically deployed within 2 minutes of committing the changes to the application’s master branch in Git. Deployments are done transparently, so users can continue to use the application and don’t lose their session. Even the database schema and data are updated as part of the deployment. As an added bonus, the application is now running in a load-balanced environment that can be easily scaled with additional web workers. In other words, this application is now ready for actual Agile Development.
But the best part is that as developers we don’t have to worry about deployment any longer. This is all taken care of automatically. Instead of stressing over a deployment, we can now focus on doing what we like: delivering valuable functionality.