Quis custodiet ipsos custodes?
A majority of Ansible use cases are in application deployment and continuous delivery, a job at which Ansible truly excels. But when using Ansible for such mission critical things, an age-old question might arise:
Who is going to guard the guardians?
In other words, how can we claim continuous delivery and super-cool automated deployments if the Ansible scripts themselves don’t go through the same process?
In my previous post, Testing Ansible, I identified four different steps in testing our Ansible scripts.
This DTAP (Development, Testing, Acceptance, Production) process should give an overview and a loosely coupled framework for putting our Ansible code into production.
Development – “Ground zero”
The most important thing about development is to be completely fearless about making errors and to make errors completely and easily reversible.
To achieve that comfort in development, a virtual environment is a must: be it a container, a VPS, or a VM… Whatever suits the project you are developing. My recommended development environment is HashiCorp’s Vagrant.
Vagrant gives you the comfort of putting up multiple virtual machines in a single environment – properly abstracting the production infrastructure you might have.
When the development is finished, the following tests need to be run:
- Syntax test – did we write code or did we write gibberish?
- Dry run – are all the prerequisites for the configuration changes present?
- Run the scripts – will the script actually run?
- Idempotency test – will I do any harm by running the configuration script again on an already configured machine?
It’s important to note that these tests can be fully automated and take almost no time to run – ideal for development.
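As a rough illustration of how the four tests could be automated, here is a minimal Python sketch. It assumes a playbook called `site.yml` and an inventory file called `inventory` (both hypothetical names), and it treats a second run as idempotent when every host in the play recap reports `changed=0`. It is one possible wiring, not the only way to do it.

```python
import re
import subprocess

# Hypothetical file names; adjust to your project layout.
PLAYBOOK = "site.yml"
INVENTORY = "inventory"


def smoke_test_commands(playbook=PLAYBOOK, inventory=INVENTORY):
    """Return the first three tests as ansible-playbook command lines."""
    base = ["ansible-playbook", "-i", inventory, playbook]
    return [
        base + ["--syntax-check"],  # syntax test
        base + ["--check"],         # dry run
        base,                       # actually run the scripts
    ]


def is_idempotent(recap_output):
    """True if every host line in the play recap reports changed=0."""
    changed = [int(n) for n in re.findall(r"changed=(\d+)", recap_output)]
    return bool(changed) and all(n == 0 for n in changed)


def run_smoke_tests(playbook=PLAYBOOK, inventory=INVENTORY):
    """Run all four tests in order and stop at the first failure."""
    commands = smoke_test_commands(playbook, inventory)
    for command in commands:
        if subprocess.run(command).returncode != 0:
            return False
    # Idempotency test: run the playbook again and inspect the recap.
    second = subprocess.run(commands[-1], capture_output=True, text=True)
    return second.returncode == 0 and is_idempotent(second.stdout)
```

Calling `run_smoke_tests()` from the Vagrant project directory would then give a single pass/fail answer for the whole development check.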
Testing – “The first trials”
The development is finished – great! We’re 1/4 of the way closer to production.
We are left with one important test mentioned in Testing Ansible: Delayed assertion.
Delayed assertion is just you writing more code to accurately test if all conditions required by the feature are met.
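What that extra code looks like depends on the feature. As a sketch, assuming the condition is "the service answers on its port shortly after provisioning", a delayed assertion could be a small retry loop. The helper names `assert_eventually` and `port_is_open` are made up for this example.

```python
import socket
import time


def port_is_open(host, port):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=2):
            return True
    except OSError:
        return False


def assert_eventually(check, timeout=30.0, interval=1.0):
    """Call check() until it returns True; fail once the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return
        if time.monotonic() >= deadline:
            raise AssertionError(
                "condition not met within %.1f seconds" % timeout
            )
        time.sleep(interval)


# Example usage (hypothetical address of a freshly provisioned VM):
# assert_eventually(lambda: port_is_open("192.168.33.10", 80))
```

The retry loop matters because a service may need a few seconds to come up after the playbook finishes, so a single immediate check could fail for the wrong reason.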
After running the smoke tests mentioned in the Development phase and running the Delayed assertion tests, we need to ask the authority to give us clearance for staging.
Authorization for staging
Our code is now swimming with the big fishes – no more development comfort, ad-hoc changes and SSH sessions…
Once we ask the authority for permission to go to staging, we are in the rapids flowing to production: everything from now on is fully automated.
The authority, in our case, is the CI Server.
The CI Server’s role in this step is to re-run all the steps done in development, guarding against a common cause of failure: the developer not testing properly.
The CI Server workflow:
- Code is committed to the repository.
- The needed VMs/VPSs are spun up exactly as in the development environment.
- Tests are run on the machines exactly as they are run in the development environment.
- If everything is ok – we are ready to get serious.
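Most CI tools express this workflow in their own configuration format, so the following is only a tool-neutral sketch of the idea: run the steps in order and refuse to continue after the first failure. It assumes Vagrant is used on the CI server as well, and `smoke_tests.py` is a hypothetical script that runs the development tests.

```python
import subprocess

# Steps the CI server would run after a commit, in order.
# "smoke_tests.py" is a hypothetical test script, not a real tool.
CI_STEPS = [
    ("spin up machines", ["vagrant", "up", "--no-provision"]),
    ("run tests", ["python", "smoke_tests.py"]),
]


def run_pipeline(steps, runner=subprocess.call):
    """Run (name, command) steps in order.

    Returns the name of the first failing step, or None if all passed.
    """
    for name, command in steps:
        if runner(command) != 0:
            return name
    return None


def ready_for_staging(steps=CI_STEPS, runner=subprocess.call):
    """True only if every step finished with exit code 0."""
    return run_pipeline(steps, runner) is None
```

Passing the runner in as a parameter keeps the pipeline logic testable without starting any real machines.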
Staging / Acceptance – “Getting serious”
We have the code, we have the proof that the tests are passing. Now comes staging.
The attitude we have towards the staging environment should be identical to the production environment – otherwise we simply haven’t set the stage well.
The only difference between the staging and the production environment is that no end users are using it!
The same tests as in the previous step are run but this time on an exact copy of the production infrastructure.
Depending on the CI tool you are using, this will be easier or harder to set up, but the ideal workflow is the following:
- Run tests in the staging environment
- If the tests fail, mark the build as unsafe and don’t destroy the staging environment
- If the tests pass, mark the build as passing and destroy the staging environment
Why don’t we destroy the staging environment when the tests are failing, but quickly dispose of it when they pass?
Simply because we want to have access to the environment which failed to provision normally, to gather data on the failure and to make sure we avoid it in the next build. Marking the build as unsafe in this case simply means that this specific revision CANNOT finish up in the production environment – no excuses.
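The rule described here is small enough to write down directly. The sketch below uses made-up names (`finish_staging_build`, `may_deploy`) and takes the "destroy the environment" action as a callable, since how that is done depends on your infrastructure.

```python
PASSING = "passing"
UNSAFE = "unsafe"


def finish_staging_build(tests_passed, destroy_environment):
    """Decide the fate of a staging build.

    On success the environment is thrown away; on failure it is kept
    so the broken machines can be inspected.
    """
    if tests_passed:
        destroy_environment()
        return PASSING
    return UNSAFE


def may_deploy(build_status):
    """Only a passing build may ever reach production."""
    return build_status == PASSING
```

For example, `finish_staging_build(False, destroy)` returns `"unsafe"` without calling `destroy`, and `may_deploy("unsafe")` is `False`.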
Production – “The point of no return”
There isn’t much to say about production. I recommend visiting List of religions and spiritual traditions on Wikipedia and picking who to pray to that nothing breaks. Once you stop praying that nothing breaks and start praying that the tests you have written have good coverage, you know you’re getting better. Once you stop praying even that the tests are ok, and leave the office immediately after deploying to production, you know that you are a sociopath who just likes to watch the world burn. Congratulations!
CI tool – I’m currently experimenting with Go CD, which, coming from a short and troublesome experience with Jenkins, seems like a refreshing change.
Anti-concurrent deployments – this is a major issue once you’ve built an automated workflow from development to production. You don’t want people running deployments at the same time, because something will break, and it will break hard. If you can’t set up this kind of control in your CI tool, I recommend Etsy’s PushBot, an IRC bot which lets developers queue up for their turn to deploy.
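If neither option is available, a crude fallback is a lock file. The sketch below is an assumption-laden example rather than a replacement for proper CI controls: it only works when all deployments start from the same machine, and the names `deployment_lock` and `DeploymentInProgress` are invented for it.

```python
import os
from contextlib import contextmanager


class DeploymentInProgress(Exception):
    """Raised when somebody else already holds the deployment lock."""


@contextmanager
def deployment_lock(path):
    """Hold an exclusive lock file for the duration of a deployment."""
    try:
        # O_EXCL makes creation fail if the file already exists,
        # so only one process can win.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise DeploymentInProgress(path)
    try:
        # Record who holds the lock, to help with debugging stale locks.
        with os.fdopen(fd, "w") as lock_file:
            lock_file.write(str(os.getpid()))
        yield
    finally:
        os.remove(path)


# Example usage (hypothetical lock path):
# with deployment_lock("/var/run/deploy.lock"):
#     run_deployment()
```

A deployment that crashes hard enough to skip the cleanup leaves a stale lock behind, which is one reason a queue managed by the CI tool or a bot is the better long-term answer.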
Military-grade ACLs – you don’t want to trust anyone, not even yourself. Make access to each part of the workflow as granular as possible, wherever and whenever you can. A good practice would be to implement a sharded key shared by multiple members of the team for deploying changes to production, after successfully passing the tests in the staging environment.
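One possible reading of a "sharded key" is a secret split into shares so that no single team member can deploy alone. The sketch below uses a simple XOR split where every share is required to rebuild the key. That is an assumption on my part: if you want "any k of n members", you would need a threshold scheme such as Shamir’s secret sharing instead.

```python
import secrets


def _xor(left, right):
    """XOR two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(left, right))


def split_key(key, holders):
    """Split key into one share per holder; all shares are needed."""
    if holders < 2:
        raise ValueError("need at least two holders")
    # All shares but the last are random; the last one is chosen so
    # that XOR-ing every share together yields the original key.
    shares = [secrets.token_bytes(len(key)) for _ in range(holders - 1)]
    last = key
    for share in shares:
        last = _xor(last, share)
    return shares + [last]


def combine_shares(shares):
    """Rebuild the key from the complete set of shares."""
    key = bytes(len(shares[0]))
    for share in shares:
        key = _xor(key, share)
    return key
```

Each member would keep one share, and the production deployment step would only run once all of them have been supplied and `combine_shares` returns the deploy key.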