How to avoid breaking production — Part 2
If you haven't read Part 1 yet, we strongly advise you to do it here!
This article is also authored by Lucas Tagliani and Thayse Onofrio, since we recently paired in a tech lead role.
image source: Nappy
6. Deploy as much as possible
The more we introduce changes in production, the smaller those changes will be. If there are any issues in our deployment, it will be pretty easy to pinpoint what caused the issue. Besides that, when deploying something that was recently developed, it's still fresh in our minds, and we remember all the context about it.
In contrast, bundling several days of work into a release candidate before deploying to production means the release contains many unrelated changes. The team will most likely not remember the context of each change so well, and it will take days to get feedback on any one of them.
We used to work that way ourselves, deploying to production about once a week. We gradually changed that, encouraging the team to deploy more often. What we've seen is that deploying to production every day has improved the confidence in our releases. We get fast feedback on a change that was just developed and are able to quickly roll back or deactivate the feature in case it isn't working as expected, without affecting other functionality. Besides that, the team is delivering value every day.
To increase the frequency of our deployments, we used several practices, such as changing from blue/green deployment to canary releases, making extensive use of feature flags, and enabling A/B testing.
7. Blue/Green deployment
When you deal with thousands of simultaneous users, as we do, you probably want zero downtime when deploying.
If you need to keep your app available while deploying to prod, blue/green can be a very good strategy. With it, you keep two routes up at the same time, usually called blue and green. One of them is the "live" version of your app, which is available to your users. The other is the "dormant" route, which usually doesn't receive much traffic.
We need to make sure the code artifact works nicely with the production environment. So this is what happens: we deploy the new release to the dormant route, then we run all the user journey tests against it, using a real environment instead of an application inside a Docker container, as we do in previous steps. It is also important to keep in mind that each environment consumes data from different sources. It also has different feature flags enabled, and probably even some unique A/B test configurations. If any of our tests fail, the deployment job stops right there.
This is an important part of our deployment process and has helped us catch issues before promoting changes a number of times in the past. We also do manual testing there before promoting the dormant route to be the live one. Once all that is done, and we are confident that everything is fine, we just click a button and the routes are switched (dormant becomes live at the same time that live becomes dormant).
Besides that, if we find a problem after switching the routes (hours later, for example), all it takes is to "click a button" to roll back to the now-dormant route, which was live and stable before the latest release. This is easier than setting up a new deploy, and you don't need to work out which version was the last working one, because it is still deployed on the dormant route.
Note that it might be smart to manage the number of instances (horizontal scaling) of each route. If your dormant route does not receive many requests, running it with the minimum number of instances will probably save money.
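The switch-and-rollback flow described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: `BlueGreenRouter`, `deploy`, and the `journey_tests_pass` callback are hypothetical names, not part of any real deployment tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Tracks which of the two routes is live; the other one is dormant."""
    live: str = "blue"
    dormant: str = "green"

    def switch(self) -> None:
        # Dormant becomes live at the same time that live becomes dormant.
        self.live, self.dormant = self.dormant, self.live


def deploy(router: BlueGreenRouter,
           release: str,
           versions: Dict[str, str],
           journey_tests_pass: Callable[[str], bool]) -> bool:
    """Deploy to the dormant route and promote it only if the tests pass."""
    versions[router.dormant] = release
    if not journey_tests_pass(router.dormant):
        return False  # the deployment job stops here; live is untouched
    router.switch()
    return True


# Hypothetical usage: green holds the previous release, blue is live.
router = BlueGreenRouter()
versions = {"blue": "v1", "green": "v0"}
promoted = deploy(router, "v2", versions, lambda route: True)
live_after_deploy = versions[router.live]

# Rolling back is just another switch: the old release is still deployed.
router.switch()
live_after_rollback = versions[router.live]
```

Because the previous release stays deployed on the dormant route, the rollback needs no knowledge of version numbers, only a second call to `switch()`.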
8. Canary releases
Could we be even more careful when deploying to production? Yes, we could. Canary releases could be considered the next level of blue/green deployment for us. Beyond deploying to a dormant route, with canary we can release the new version to our users little by little.
We usually start with only 1% of users seeing the new release for the first ~15 minutes. Then we check the numbers and monitor the charts and logs to see whether anything looks wrong for that 1% of users. If it does, we abort the deployment. If it does not, we keep increasing the number of users who see the new version of our app.
We increase to something like 5%, 15%, 30%, 50%, 75%... until we reach 100%, looking at the data before each increase.
This way, even if you are releasing a version of your app that has problems, it will be visible to fewer users than with the regular way of deploying to production. The benefits include saving you money and avoiding an issue that impacts all of your users.
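To make the rollout loop concrete, here is a minimal sketch. The percentages are the ones mentioned above; `set_traffic` and `is_healthy` are hypothetical callbacks standing in for whatever your routing layer and monitoring provide, and the waiting period between steps is left out.

```python
from typing import Callable, Iterable

# Traffic percentages from the text above; the last step is a full release.
ROLLOUT_STEPS = (1, 5, 15, 30, 50, 75, 100)


def canary_rollout(set_traffic: Callable[[int], None],
                   is_healthy: Callable[[int], bool],
                   steps: Iterable[int] = ROLLOUT_STEPS) -> bool:
    """Increase canary traffic step by step; abort on the first bad check."""
    for percent in steps:
        set_traffic(percent)
        # A real pipeline would wait here (about 15 minutes in our case)
        # and then look at charts, logs, and error numbers.
        if not is_healthy(percent):
            set_traffic(0)  # abort: all users go back to the stable version
            return False
    return True


# Hypothetical usage: record every traffic change and pretend the
# metrics start looking bad once 15% of users see the new version.
history = []
completed = canary_rollout(history.append, lambda percent: percent < 15)
```

In this made-up run the rollout stops at the 15% step, so at most 15% of users ever saw the faulty version.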
If you are curious about why it has this name, open this link, scroll all the way down and check the Notes.
9. Feature Flags
Feature flags are an important part of what enables us to deploy frequently and with confidence. This technique allows us to modify production functionality without having to deploy code changes. Whenever we're developing a new feature, or even doing a big refactor, we hide the code behind a feature flag. This way, we can safely merge code while a feature is still in progress, knowing that it will not affect production behavior. When the feature is complete, we can simply turn the flag on.
Besides that, from a business point of view, feature flags are essential when we need a feature or a functionality change to go live at a specific date or time of day. We deploy the code changes ahead of time behind a feature flag, and then at the given release date we click a button to turn the flag on and the change goes live.
After working with this technique for a while, we believe there are some things it's important to be mindful of. First, tests should cover scenarios with the feature both on and off. That's the only way to be sure we're not changing any behavior when the flag is off, and that we're adding the correct behavior when the flag is on.
Also, feature flags should live only for as long as they are needed, so it's important to clean them up. Removing feature flags that are no longer used reduces complexity in the code and in the tests related to it. Each part of the code affected by a feature flag can have double the complexity, since the conditional adds a new logic flow. To make sure the cleanup happens, we started writing the cards for removing a feature flag as soon as we create it, so we don't forget to remove the code that is no longer needed or to delete the flag from whatever tool is used to manage it.
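As a small illustration of hiding code behind a flag and testing both states, consider the sketch below. The flag name `new_greeting`, the `greeting` function, and the plain dictionary used as a flag store are all made up for this example; a real project would read flags from its feature flag tool.

```python
from typing import Mapping


def is_enabled(flags: Mapping[str, bool], name: str) -> bool:
    # Unknown flags default to off, so unfinished code stays hidden.
    return flags.get(name, False)


def greeting(user_name: str, flags: Mapping[str, bool]) -> str:
    if is_enabled(flags, "new_greeting"):  # hypothetical flag name
        return f"Welcome back, {user_name}!"
    return f"Hello, {user_name}"


def test_flag_off_keeps_existing_behavior() -> None:
    assert greeting("Ana", {}) == "Hello, Ana"
    assert greeting("Ana", {"new_greeting": False}) == "Hello, Ana"


def test_flag_on_enables_new_behavior() -> None:
    assert greeting("Ana", {"new_greeting": True}) == "Welcome back, Ana!"
```

Once the flag is permanently on, the cleanup card would remove the `if`, the old return statement, and the flag-off test together.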
10. A/B Tests
Somewhat similar to feature flags, A/B tests can be used to have different functionality that can be enabled without deployments. However, instead of a functionality that is turned on or off, with A/B tests we can have different functionalities enabled at the same time, each for a specific percentage of users. As an example, we could change the color of a button that leads to promotions for a percentage of users, and A/B test it to find out which button color leads more users through the sales funnel and drives more revenue. Therefore, it is valuable from a business point of view.
Before deciding if we want to release a new functionality, we can A/B test it and release it to a specific set of users, based on user attributes, and then compare business metrics to determine the value the functionality brings - or doesn't. From a technical perspective, we can't get too attached to the code. Based on the experiment results, new functionality can end up being removed, and that's completely OK: we don't want to keep unnecessary complexity in our system.
Similar to feature flags, it's important to test every variation of an A/B test and to remember to clean the test up afterwards, enabling by default the functionality chosen based on the test results.
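One common way to give a variation to a fixed percentage of users is to hash the user ID into a bucket from 0 to 99. The sketch below assumes that approach; it is not necessarily how our A/B testing tool works internally, and `assign_variant` and the experiment name `button_color` are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, percent_in_b: int) -> str:
    """Deterministically place a user in variant "A" or "B"."""
    # Hashing experiment + user ID keeps a user's variant stable across
    # visits while keeping different experiments independent of each other.
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "B" if bucket < percent_in_b else "A"


# Hypothetical usage: show the new button color to about 10% of users.
variant = assign_variant("user-42", "button_color", 10)
same_variant_again = assign_variant("user-42", "button_color", 10)
```

Because the assignment depends only on the inputs, a user sees the same variation on every visit, which keeps the business metrics for each group comparable.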
11. On-Call Model
Even with all the quality layers that we have in our process plus the techniques we have to release our application, we are still aware that we work with software, which means something unexpected can happen at any time. When it happens, no matter the time of the day, we need to be ready and capable of handling it.
Given we are a big team, people on call work in pairs. Usually, we try to match one experienced developer with someone who doesn't have that much experience yet. We do it as part of the onboarding process. It is important to make sure that experienced team members are not the only ones who can understand and deal with those tough situations. Besides that, we truly believe two brains are better than one when solving problems, especially under the pressure of a production incident.
Here is how it happens: whenever we get an incident in production, it will trigger a phone call, SMS and push notification to two developers of our team at the same time. Then they will log in, take a look at the issue and address it together as soon as possible.
Besides all the other techniques and tools we mentioned previously, observability is important for maintaining a working system in production. It gives us the ability to troubleshoot our system when an incident happens.
When a pair of developers is on call and is paged to address an issue, they need to be able to quickly and easily find out what the issue is and how to solve it. This is made possible by having good monitoring tools, logs, and dashboards in place that outline the problem and allow devs to find the root cause. With those in place, we can set up alerts based on specific things, like logs, the number of errors, or a response time threshold, and have them automatically page the developers on call.
The alerts you set up will depend on your project and its needs. As an example, we keep a close eye on the response time and the number of errors, on both the server and client sides. If those numbers are suddenly higher than expected, we'll get alerted and be able to act fast.
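A threshold-based alert like the ones described above can be sketched as follows. The metric names and the limits (5% error rate, 800 ms average response time) are invented for illustration. Real values depend on your project, and in practice the rule would live in your monitoring tool rather than in application code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    """Numbers collected over one monitoring window (hypothetical shape)."""
    requests: int
    errors: int
    avg_response_ms: float


@dataclass
class Thresholds:
    # Illustrative limits only; tune them to your own project.
    max_error_rate: float = 0.05
    max_avg_response_ms: float = 800.0


def alerts_for(window: Metrics, limits: Thresholds) -> List[str]:
    """Return the names of the alerts that should page the on-call pair."""
    if window.requests == 0:
        return []  # no traffic in this window, nothing to evaluate
    triggered = []
    if window.errors / window.requests > limits.max_error_rate:
        triggered.append("error_rate")
    if window.avg_response_ms > limits.max_avg_response_ms:
        triggered.append("response_time")
    return triggered


# Hypothetical usage: 10% of requests failed, response time is fine.
fired = alerts_for(Metrics(requests=1000, errors=100, avg_response_ms=200.0),
                   Thresholds())
```

Here only the error-rate alert fires, because 100 errors in 1000 requests is above the 5% limit while the response time stays under 800 ms.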
Identifying issues in production is essential, but we also need to be able to watch our applications and find trends before they become an issue. That's why we have the habit of constantly looking into our dashboards and exploring any different behavior we see there. The same is done for lower environments, so that we can detect any issues before the artifact reaches other environments.
These are the main practices we follow to avoid issues in production. This does not mean issues don’t happen at all, but we face fewer of them than we would without this whole process. We know that because we faced a lot of those issues in the past, and through each one we were able to adapt and learn new practices that improved the reliability of our app. By the way, none of this is set in stone. We continue to learn, update and improve our practices.
Did you know all the topics from this article? How different is this process in your current project? Are you planning to add one of these into your current path to production?
Don’t hesitate to comment here or send us a message!