On Friday 24 June, following the vote to leave the European Union, the Petitions service had to scale fast to accommodate unprecedented traffic levels. This is a guest post from Andrew White, the chief technology officer of Unboxed, on how they scaled to cope with this demand and how they recognised fraud.
Unboxed is a G-Cloud supplier that now runs the Petitions service on behalf of Parliament.
Our first attempts to scale
When any big news breaks I check the Petitions site regularly to make sure our service will be able to withstand demand. I was making these checks when the results of the EU referendum came in on Friday morning, and I noticed that one petition calling for a second EU referendum had picked up around 20,000 signatures in the space of around an hour. At that point I knew the petition was going to take off.
On that Friday we scaled to 4 application servers and 2 background workers (these are servers that run tasks independent of web requests; ours are primarily used to send the confirmation emails that validate signatures). This setup had handled the traffic from the Meningitis petition, which had seen 54,000 users per hour, so at that point I was feeling pretty confident. However, we soon realised that the referendum petition would surpass any traffic levels we had experienced before, and the service began to struggle.
Employing further measures
Sites like GOV.UK, which are primarily read-only, can use content delivery networks to cache and serve content, but this won’t work for the Petitions website, which is highly transactional, with content changing all the time. The pinch point in our system is invariably the database. Petitions like the one calling to block Donald Trump from visiting the UK had already taught us this: scaling to 6 application servers slowed the database to a crawl. We can’t just ramp up the number of application servers without putting undue strain on the database.
The load on our database becomes an issue during popular petitions because everyone is trying to write to the same row and column in the petitions database table, the one that records the signature count. The database manages these writes by creating locks that ensure each writer takes its turn. When the list of users grows, the locks cause the service to start running slowly. On the Friday, when we noticed the service struggling, we looked for a fast way to reduce the load on the database – our number of users at that point was in the range of 60,000 to 70,000 per hour.
We already had a robust set of tools in our technology stack to help us scale. Our primary tool is a combination of Amazon CloudFormation and Elastic Compute Cloud (EC2) which we use to automatically deploy and scale instances in response to demand. Backing these instances is a PostgreSQL RDS database which acts as our primary data store. One other component in our stack is Amazon ElastiCache, a key-value store configured with the open-source Memcached engine, which caches generated fragments of HTML so we can speed up page build times.
While this set of tools gives us an effective means to scale, we had to make some changes to how the Petitions application runs to reduce load on the database. Our ElastiCache cluster is lightly used since we don’t cache much HTML because the site constantly updates. It seemed sensible for ElastiCache to take on the additional function of keeping track of signature counts so the database would carry less load.
We did this by using the counter feature of Memcached to increment signature counts and then periodically writing them back to the database. Even if the counters went out of sync, because ElastiCache runs on a different server to the database, we knew we could always recalculate the correct total (we had to do this once) because each individual signature is stored as a row in a table.
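A minimal sketch of the idea in Python, not the production code. `FakeCounterCache` is a hypothetical in-memory stand-in for a cache with an atomic increment, and its `take` method (read and reset in one step) is a simplification that a real Memcached client would have to approximate. The key names and the `db_counts` dictionary standing in for the petitions table are also assumptions.

```python
import threading


class FakeCounterCache:
    """Hypothetical in-memory stand-in for a cache with atomic counters."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def incr(self, key, delta=1):
        # Atomically add delta to the counter and return the new value.
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + delta
            return self._counts[key]

    def take(self, key):
        # Return the pending count and reset it to zero in one step.
        with self._lock:
            return self._counts.pop(key, 0)


def record_signature(cache, petition_id):
    # A cheap cache increment replaces an UPDATE on the hot petitions row.
    cache.incr(f"petition:{petition_id}:pending")


def flush_to_database(cache, db_counts, petition_id):
    # One database write per flush, however many signatures arrived.
    pending = cache.take(f"petition:{petition_id}:pending")
    db_counts[petition_id] = db_counts.get(petition_id, 0) + pending
    return pending


def recalculate(signature_rows, petition_id):
    # Fallback if the counters drift: count the individual signature rows.
    return sum(1 for row in signature_rows if row["petition_id"] == petition_id)
```

The point of the design is that many signers contend on a fast in-memory counter rather than on a database row lock, and the database sees a single write each time the counter is flushed.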
The other change we made was related to a feature that tracks signatures by country. Because the vast majority of signatures came from the UK, this feature caused a database bottleneck, so we had to stop recording UK signatures in the per-country statistics while we fixed the problem. The country was still recorded on each individual signature and the signature was still recorded in the overall count during this period. The API published an incorrect count for UK signatures for a while but this count isn’t exposed by the main public website, and our petition map website, which displays signatures by constituency, wasn't affected.
Both these changes immediately reduced the load on our database and we managed to get through Friday evening reasonably okay apart from the backlog of jobs that had built up. This we could easily deal with by scaling to 8 background workers. By the morning we’d cleared the job queue.
We also took advantage of the lull during the night to double the size of our database instance. Unfortunately, when we hit 70,000 concurrent users on Saturday afternoon, we realised that even doubling the database size wouldn’t be enough, so we resized again up to the largest that Amazon offer - 40 vCPUs and 160GB of RAM.
We wouldn’t normally resize in a peak period because there is a small amount of downtime as the relational database service (RDS) instance switches from the main instance to the failover instance. However, the new database size meant we could increase to 12 application servers and 6 background workers. These changes helped us comfortably cope with the peak evening traffic of over 100,000 concurrent users who were signing the petition at over 140,000 signatures per hour.
Dealing with fraud
Petitions tend to follow a predictable pattern once they start to become popular, so on Sunday we were confident that the signing rate had peaked. But we couldn’t relax: reports started to come in about possible fraud, so at this point we had to shift focus.
When the new Petitions service was built last year one of the primary design goals was to make it fast and accessible - especially on mobile devices where we get the majority of our traffic. This means that we can’t use anti-abuse technologies like captchas as they have significant accessibility, privacy and performance issues. Any UK citizen wherever they are in the world can sign petitions, as well as any UK resident, so we can’t mandate the use of official IDs like National Insurance numbers for authorisation purposes. Therefore our primary tool for validation is a confirmation link sent via email.
Since 90% of our emails are sent to large providers like Google, Apple and Microsoft that have their own anti-abuse measures, we are generally confident that these are valid accounts. This assurance allowed us to focus our search on emails coming from so-called ‘disposable domains’, which offer temporary email accounts that exist only for short periods before being discarded and can be created by scripts. We have a list of these domains, but new ones are created constantly, so it was a matter of checking for these new domains.
Another warning sign is large numbers of signatures coming from the same IP address and this coupled with the domain checking allowed us to invalidate around 30,000 signatures that purportedly came from the Vatican.
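The two signals can be combined in a simple review pass. The sketch below is illustrative only: the function name, the dictionary fields (`email`, `ip`) and the `max_per_ip` threshold are assumptions, not details of the Petitions codebase.

```python
from collections import Counter


def flag_signatures(signatures, disposable_domains, max_per_ip=10):
    """Return (signature, reasons) pairs for signatures worth reviewing."""
    # Count how many signatures share each IP address.
    per_ip = Counter(sig["ip"] for sig in signatures)
    flagged = []
    for sig in signatures:
        # Take the part after the last "@" and compare case-insensitively.
        domain = sig["email"].rsplit("@", 1)[-1].lower()
        reasons = []
        if domain in disposable_domains:
            reasons.append("disposable domain")
        if per_ip[sig["ip"]] > max_per_ip:
            reasons.append("busy IP address")
        if reasons:
            flagged.append((sig, reasons))
    return flagged
```

A signature that trips both checks is a much stronger candidate for invalidation than one that trips either alone, since many legitimate signers can share an IP address (for example behind an office network).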
We had similar reports about signatures coming from Bracknell. By the time the reports were published, we’d already removed the signatures as we were also checking the predicted number of signatures for each constituency compared to those we observed (an analyst also confirmed independently the signatures were fake).
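Comparing predicted and observed counts can be sketched as follows. How the predictions were produced is not described here, so the sketch simply takes them as an input. The function name and the `ratio_threshold` value are assumptions chosen for illustration.

```python
def constituency_outliers(observed, predicted, ratio_threshold=3.0):
    """Return constituencies whose observed count far exceeds the prediction.

    observed and predicted map constituency name -> signature count.
    The result maps each outlier to its observed/predicted ratio.
    """
    outliers = {}
    for name, count in observed.items():
        expected = predicted.get(name, 0)
        if expected <= 0:
            # No signatures were predicted here, so any at all stand out.
            if count > 0:
                outliers[name] = float("inf")
            continue
        ratio = count / expected
        if ratio > ratio_threshold:
            outliers[name] = ratio
    return outliers
```

An outlier is only a prompt for closer inspection: a constituency can legitimately exceed its prediction, so the flagged signatures still need checking against other signals.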
In the end we decided not to use Memcached to record signature counts going forward because doing so involves 2 different servers, which could cause confusion. Instead we moved the counting to a background worker job, which could more efficiently smooth out signing peaks caused by high demand.
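One way a background job can do the counting is to tally the signatures validated since its last run and apply them in a single write. This is a sketch under assumptions, not the actual job: the field names (`last_counted_at`, `validated_at`, `signature_count`) are hypothetical, and timestamps are plain numbers of seconds to keep the example short.

```python
def update_signature_count(petition, signatures, now):
    """Add signatures validated since the last run to the petition's count."""
    since = petition["last_counted_at"]
    # Count only signatures validated in the window (since, now].
    new = sum(
        1
        for sig in signatures
        if sig["petition_id"] == petition["id"]
        and since < sig["validated_at"] <= now
    )
    # One update per run, regardless of how many people signed.
    petition["signature_count"] += new
    petition["last_counted_at"] = now
    return new
```

Because only the worker writes to the count, signers no longer queue on the same row lock, and a burst of signatures becomes one larger increment on the next run.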
We feel now we’re in a good position in terms of scaling. The changes we made to our application and systems allowed us to cope with the Saturday peak demand with room to spare. It’s hard to imagine a scenario that would result in a more popular petition but should it ever arise, we’ll be ready.
Our main focus now is improving our approach to fraud. We’ll be using a combination of tools that make the invalidation of signatures into an admin task rather than relying on developers to do the work. We’ll also be relying less on blacklists of domains that endlessly need updating. Instead we’ll be switching to whitelists and then applying aggressive rate limiting to unknown domains and/or IP addresses. We’re looking forward to seeing how this will perform next time.
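As a rough illustration of that approach, the sketch below lets whitelisted domains through and applies a sliding-window limit to everything else, keyed on both the email domain and the IP address. The class name, the limit and the window are assumptions for the example, and timestamps are plain numbers of seconds.

```python
from collections import defaultdict, deque


class SignatureRateLimiter:
    """Sliding-window limiter for signatures from unknown domains."""

    def __init__(self, allowed_domains, limit, window_seconds):
        self.allowed_domains = {d.lower() for d in allowed_domains}
        self.limit = limit
        self.window = window_seconds
        # Recent accepted timestamps per ("domain", x) or ("ip", x) key.
        self._recent = defaultdict(deque)

    def allow(self, email, ip, now):
        domain = email.rsplit("@", 1)[-1].lower()
        if domain in self.allowed_domains:
            return True  # whitelisted domains are not rate limited
        keys = [("domain", domain), ("ip", ip)]
        for key in keys:
            recent = self._recent[key]
            # Drop timestamps that have fallen out of the window.
            while recent and recent[0] <= now - self.window:
                recent.popleft()
        if any(len(self._recent[key]) >= self.limit for key in keys):
            return False
        for key in keys:
            self._recent[key].append(now)
        return True
```

Unlike a blacklist, this does not need updating each time a new disposable domain appears: an unknown domain is throttled by default until it is reviewed.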