The GOV.UK Notify team at Government Digital Service uses the open source Python library Celery to manage our task queues.
Task queues let applications work asynchronously, independent of user requests. The ability to do this is a fundamental part of how Notify works.
Updating software dependencies is not always the most glamorous part of the job but it is still something that we can learn from.
This post explains how we recently upgraded to version 5.2.0 of Celery, the impact this had and what we learned by doing this.
Using task queues on GOV.UK Notify
On GOV.UK Notify we use task queues for almost everything, including:
- sending notifications
- calculating nightly billing data
- sending services HTTP callbacks
- deleting notifications as part of our data retention policy
When you send an HTTP request to the Notify API to send an email, we don’t actually send the email before returning a response. Instead we put a task on a queue to send the email, then we return the HTTP response.
There are several advantages to doing this:
- API response times are quicker if we put the task on a queue rather than trying to send the email while we process the HTTP request.
- If something goes wrong when we try to send the email, it’s easy to retry it.
- If we receive too many requests at once, we can let the tasks sit on a queue until we can process them a few moments later.
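The pattern described above can be sketched in a few lines. This is a minimal illustration that uses Python's standard library `queue` module as a stand-in for Celery and Amazon SQS. The names `handle_send_email_request`, `process_queue` and `deliver` are hypothetical and are not Notify's actual code.

```python
import queue

# In-memory stand-in for a real task queue such as Amazon SQS (assumption for illustration).
task_queue = queue.Queue()


def handle_send_email_request(email_address, body):
    """Hypothetical API handler: queue the work, then respond straight away."""
    task_queue.put({"to": email_address, "body": body, "attempts": 0})
    # The email has not been sent yet; the response only confirms it was accepted.
    return {"status": "accepted"}


def process_queue(deliver, max_attempts=3):
    """Hypothetical worker loop: drain the queue, retrying failed deliveries."""
    sent = []
    while not task_queue.empty():
        task = task_queue.get()
        task["attempts"] += 1
        try:
            deliver(task["to"], task["body"])
            sent.append(task["to"])
        except Exception:
            # Put the task back so it can be retried, up to max_attempts.
            if task["attempts"] < max_attempts:
                task_queue.put(task)
    return sent
```

The handler returns as soon as the task is queued, and a failed delivery goes back on the queue rather than failing the original request.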
We use Amazon SQS as our queuing service. It stores and keeps track of all the tasks on our queues. We use Celery to put tasks on those queues, pull them off and process them.
Why we upgraded to Celery 5.2.0
Until recently, we were running version 3.1.26 of Celery on the Notify API. We wanted to upgrade to a more recent version of Celery so we could:
- (hopefully) improve performance and throughput
- take advantage of newer features, such as exponential backoffs
- make sure we did not end up using a version of Celery that was no longer supported
- upgrade the version of Python we use (Celery 3.1.26 only supports up to Python 3.6, which became an end-of-life product on 23 December 2021)
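To illustrate what exponential backoff means for retries: the wait before each retry doubles, up to a cap. The sketch below shows the arithmetic only. The function name and default values are assumptions for illustration, not Celery's implementation or Notify's settings.

```python
def backoff_delay(retries, factor=1, maximum=600):
    """Seconds to wait before the next retry, doubling each time up to a cap."""
    return min(maximum, factor * (2 ** retries))


# Delays for the first five retries with the illustrative defaults above.
delays = [backoff_delay(n) for n in range(5)]
```

In recent versions of Celery, a task can opt in to this kind of behaviour with the `retry_backoff` option, used alongside `autoretry_for`.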
We delayed the upgrade for almost 2 years so we could focus on supporting services that were set up in response to the coronavirus (COVID-19) pandemic. But, because we have a commitment not to use end-of-life software, we could not wait any longer.
How we rolled out the changes safely
When making a big change to our applications, we need to be confident the change will not cause any problems. In this case, we:
- Read through the Celery changelog to spot any issues or changes we needed to make.
- Upgraded to Celery 5 in our local development environment.
- Merged the change into our main branch and ran our continuous integration tests.
- Deployed the upgrade to our preview environment and ran our functional tests.
- Deployed the upgrade to our staging environment and ran load tests. We compared the results to load tests done on Celery 3.
- Deployed a single instance to production, known as a ‘canary’.
- Slowly increased the number of canary instances until Celery 5 was serving 100% of traffic.
- Monitored for a week so we could spot any problems or concerns.
What we learned from using a canary
When we tested Celery 5 in our staging environment we saw a clear improvement in performance. Even so, we were cautious about deploying changes to our production environment.
We introduced a canary, a single instance of Celery 5 that we could use to monitor performance. For continuity, all the other instances in production continued to use Celery 3.
Using a canary is generally considered good practice, but we ran into a problem.
When we deployed our canary we saw its CPU usage burst to 500%. Celery 5 was hogging almost all the available tasks while the Celery 3 instances did very little.
To try and fix the issue, we increased the scale of our canary to 5 instances. This reduced the CPU usage by distributing the tasks across 5 instances instead of 1.
We think this issue was caused by the difference between long polling and short polling. Celery 5 uses long polling by default, but our Celery 3 instances were still using short polling.
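For context, with the SQS transport these polling settings are passed through Celery's `broker_transport_options`. The fragment below is a sketch of how the 2 behaviours could be configured. The values and the names `long_polling_options` and `short_polling_options` are illustrative assumptions, not Notify's actual configuration.

```python
# Illustrative Celery settings for an Amazon SQS broker (values are assumptions).
broker_url = "sqs://"

# Long polling: each request to SQS waits up to wait_time_seconds for messages.
long_polling_options = {
    "wait_time_seconds": 10,
    "polling_interval": 1,
}

# Short polling: each request returns immediately, even if the queue is empty.
short_polling_options = {
    "wait_time_seconds": 0,
    "polling_interval": 1,
}

# Pick one set of options for the worker.
broker_transport_options = long_polling_options
```

A worker that holds its request open is ready to receive a task the moment one arrives, which may explain why a long-polling instance would pick up more tasks than short-polling instances that sleep between requests.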
Our deployment and load test on staging might have caught this if we had used a canary in that environment too.
Next time, we’ll consider running load tests against 2 situations:
- A single (canary) instance with all the other instances running the old version.
- All instances running the new version.
The impact on performance
Since we deployed Celery 5 on 5 November 2021, tasks have been picked up and processed more quickly.
For example, we used to send 96% of notifications to our email or text message providers within 10 seconds. Since the upgrade, this has increased to 99.98%.
As a result, we updated our performance dashboard to show data rounded to 2 decimal places. This was because we felt that displaying 100% might look suspiciously good!
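The rounding problem is easy to see with a small worked example. This is a sketch with made-up sample data; the function name `percent_within` and the sample durations are hypothetical.

```python
def percent_within(durations_seconds, threshold_seconds=10):
    """Percentage of notifications sent within the threshold."""
    within = sum(1 for d in durations_seconds if d <= threshold_seconds)
    return 100 * within / len(durations_seconds)


# Hypothetical sample: 9,998 of 10,000 notifications sent within 10 seconds.
sample = [2.0] * 9998 + [30.0] * 2
value = percent_within(sample)

rounded_whole = round(value)    # rounds up to 100
rounded_2dp = round(value, 2)   # keeps 99.98
```

Rounded to a whole number the figure reads as 100%, while 2 decimal places keeps the small share of slower notifications visible.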
One user even contacted Notify because they were surprised (and pleased) that their HTTP request speeds were 40% faster.
In the past couple of months we’ve upgraded some of our core dependencies. Next, we plan to retest the performance limits of Notify to make sure we can handle whatever 2022 will throw at us!