https://technology.blog.gov.uk/2014/08/27/taking-another-look-at-gov-uks-disaster-recovery/

Taking another look at GOV.UK's disaster recovery

Kushal Pisavadia, 27 August 2014 - GOV.UK, Tools

As GOV.UK gets bigger, we often need to revisit the ways that we originally solved some problems. One thing that's changed recently is how we prepare for disaster recovery.

Disaster Recovery

The reality of working in technology is that software systems fail, more often than we'd like and usually in ways that are beyond our control. The process of thinking up high-level failure scenarios and solutions for them is called disaster recovery.

One extreme scenario for GOV.UK is that all our infrastructure disappears. All of our applications, the servers they run on and even the infrastructure provider we're using. Gone.

Trying to solve this problem directly and up front is difficult. It's too generic and we don't know the root cause yet. But, we can give ourselves some time while we assess the situation and start to resume normal service for our users.

Creating a static copy of GOV.UK

Back in 2012, one of the nice properties of GOV.UK was that the majority of pages were static. Static HTML, CSS, JavaScript and images. Additionally, there were only 20,000 pages in total.

One of our solutions at the time was to run a task overnight to visit every page on GOV.UK that wasn't a form and save it to disk. This process would take a couple of hours to complete. We'd then transfer the files to some backup machines that were ready to be switched over to should the worst occur.

We called the code that did this the GOV.UK Mirror.

Fast forward to present day

Various agencies and departments have transitioned to GOV.UK and we're now up to over 140,000 pages. We found that the GOV.UK Mirror was now taking close to 50 hours to complete. It meant that a given page could be up to two days out of date should we need to switch to our mirror. For pages like foreign travel advice that update more regularly than once a day this was a problem.

What problems are we trying to solve?

We knew that the full crawl was taking too long at 50 hours; we wanted it to complete within a day. We also had no way to crawl frequently updated pages, such as foreign travel advice, on an ad-hoc basis. Finally, there was no way of pausing or stopping the mirror process mid-run. We couldn't easily continue from the last good state, so we had to restart from the beginning each time.
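
To put those figures in perspective, here is a back-of-the-envelope calculation in Go. It treats "over 140,000 pages" as a round 140,000 and assumes a steady crawl rate, neither of which is exact. On those assumptions the 50-hour crawl averages roughly 0.78 pages per second, and finishing within a day needs roughly 1.62 pages per second.

```go
package main

import "fmt"

// pagesPerSecond returns the average crawl rate needed to cover
// the given number of pages in the given number of hours.
func pagesPerSecond(pages, hours float64) float64 {
	return pages / (hours * 3600)
}

func main() {
	// Assumption: "over 140,000 pages" rounded down to 140,000.
	const pages = 140000

	fmt.Printf("50-hour crawl:  %.2f pages/s\n", pagesPerSecond(pages, 50))
	fmt.Printf("24-hour target: %.2f pages/s\n", pagesPerSecond(pages, 24))
}
```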

Building it

We made a conscious decision to split the GOV.UK Mirror into two components: a producer to give us the initial set of URLs, and a consumer to crawl them and write the pages to disk. The two components would communicate using a message queue. This way, we'd remove our reliance on the nightly task to complete the work and could use the message queue for crawling ad-hoc pages. Using a message queue also meant we could continue where we left off.
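
The shape of that split can be sketched in a few lines of Go. This is only an illustration, not the real crawler: a buffered channel stands in for the message queue, the `fetch` function stands in for the HTTP request, and the names `produce` and `crawl` and the example paths are made up for this sketch.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// produce publishes the URLs to a queue and closes it when done.
// Assumption: the channel is a stand-in for a real message queue.
func produce(urls []string) <-chan string {
	queue := make(chan string, len(urls))
	for _, u := range urls {
		queue <- u
	}
	close(queue)
	return queue
}

// crawl drains the queue with the given number of workers and returns
// the page bodies it collected, keyed by URL.
func crawl(queue <-chan string, workers int, fetch func(string) string) map[string]string {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		pages = make(map[string]string)
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker takes the next URL as soon as it is free.
			for url := range queue {
				body := fetch(url)
				mu.Lock()
				pages[url] = body
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return pages
}

func main() {
	// Hypothetical paths, used only as sample input.
	queue := produce([]string{"/", "/foreign-travel-advice", "/example-page"})
	pages := crawl(queue, 2, func(url string) string {
		return "<html>" + url + "</html>"
	})

	urls := make([]string, 0, len(pages))
	for u := range pages {
		urls = append(urls, u)
	}
	sort.Strings(urls)
	for _, u := range urls {
		fmt.Println(u, len(pages[u]))
	}
}
```

Because the producer only has to put URLs on the queue, anything else that can publish a URL (an ad-hoc request, for example) feeds the same consumers. An in-process channel does not survive a restart, though; the ability to continue where we left off comes from using a real broker.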

The producer is now a simpler process that retrieves a list of URLs from our Content API and publishes them to an exchange. Most of the work is done in the consumer component, which is written in Go. The message queue broker we're using is RabbitMQ.
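
One small part of a consumer's job is deciding where on disk a crawled page should live. The sketch below shows one possible mapping from URL to file path. The layout (host directory, `index.html` for the root, `.html` appended to extensionless paths) and the function name `mirrorPath` are assumptions made for illustration, not a description of what govuk_crawler_worker does.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// mirrorPath maps a page URL to a slash-separated file path under root.
// Assumption: this layout is illustrative only.
func mirrorPath(root, rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	if u.Host == "" {
		return "", fmt.Errorf("URL %q has no host", rawURL)
	}
	// path.Clean resolves "." and ".." segments in the URL path.
	p := path.Clean("/" + u.Path)
	if p == "/" {
		return path.Join(root, u.Host, "index.html"), nil
	}
	// Pages without a file extension are stored as HTML files.
	if path.Ext(p) == "" {
		p += ".html"
	}
	return path.Join(root, u.Host, p), nil
}

func main() {
	p, err := mirrorPath("/var/mirror", "https://www.gov.uk/foreign-travel-advice")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(p)
}
```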

We wanted the ability to horizontally scale out crawling to improve the rate at which we completed the work we were given. We could achieve parallelism on the queue by increasing the number of consumers, but we wouldn't be able to keep track of URLs that had been crawled across the nodes. We needed to think beyond a single process running at one time.

We used Redis to share state across the workers. It keeps track of URLs that have already been crawled, so each worker can check whether to crawl a URL as it picks it up from the message queue. Now we can run as many message queue consumers as our workload needs to get through the work faster. The total time for a full crawl is now 4 hours.
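
The important property of the shared state is that "have we seen this URL?" and "mark it as seen" happen as one atomic step, so two workers can't both decide to crawl the same page. Redis offers this kind of operation (for example `SETNX`, which sets a key only if it does not already exist), although this post doesn't say which commands the crawler uses. The sketch below uses an in-memory store behind a mutex as a stand-in for Redis; the `SeenStore` interface and all other names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// SeenStore records which URLs have been crawled.
// Assumption: a real implementation would be backed by Redis.
type SeenStore interface {
	// MarkIfNew records url and reports whether it was new.
	MarkIfNew(url string) bool
}

// memoryStore is an in-process stand-in for the shared store.
type memoryStore struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newMemoryStore() *memoryStore {
	return &memoryStore{seen: make(map[string]struct{})}
}

// MarkIfNew checks and records the URL under a single lock,
// so the check-and-set is atomic across workers.
func (s *memoryStore) MarkIfNew(url string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[url]; ok {
		return false
	}
	s.seen[url] = struct{}{}
	return true
}

// crawlOnce runs the given number of workers over urls and returns
// how many URLs were actually crawled after de-duplication.
func crawlOnce(store SeenStore, urls []string, workers int) int {
	queue := make(chan string, len(urls))
	for _, u := range urls {
		queue <- u
	}
	close(queue)

	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		crawled int
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range queue {
				if !store.MarkIfNew(u) {
					continue // another worker already has this URL
				}
				// A real worker would fetch and save the page here.
				mu.Lock()
				crawled++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return crawled
}

func main() {
	urls := []string{"/a", "/b", "/a", "/a", "/b"}
	fmt.Println("crawled:", crawlOnce(newMemoryStore(), urls, 4))
}
```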

What have we learnt?

We had been running the GOV.UK Mirror for long enough to know which areas we didn't like, from operating it through to functionality we knew to be missing. Not only that, but we also understood the problem better than we did at GOV.UK's release. There was no magical epiphany that occurred; this is the nature of writing software – you have to adapt and update as you know more.

After two years of running GOV.UK we're finding that we have to revisit many of the choices we had made. The site has grown a lot and it's time to take another look at many of the applications we built back in 2012.

Iterate. Then iterate again.

You can find the code here: https://github.com/alphagov/govuk_crawler_worker/

If this sounds like a good place to work, take a look at Working for GDS - we're usually in search of talented people to come and join the team.

14 comments

Comment by Jemima posted on 27 August 2014

Does the CMS add pages to the queue when pages are updated? Allowing you to prioritise updated content?

- Replies to Jemima>
  Comment by Kushal Pisavadia posted on 27 August 2014
  
  Hi Jemima, we've just started to do this. You can see some of the work in progress in the content-store application here: https://github.com/alphagov/content-store/blob/master/app/models/content_item.rb#L51
  
  - Replies to Kushal Pisavadia>
    
    Comment by Jemima posted on 27 August 2014
    
    Love it. Super clever.
    
Comment by Felicity posted on 28 August 2014

Is this really a unique problem to GDS and thus requires an expensive custom solution?

- Replies to Felicity>
  
  Comment by Brad Wright posted on 28 August 2014
  
  Hi Felicity, while static mirrors of content-managed data aren't a unique problem, our particular administration system is entirely bespoke because it's built around user needs. As Kush says, this mirroring is part of our disaster recovery setup so we absolutely need the mirror to be up to date and comprehensive. This requires integration with our publishing tools which we wouldn't easily get from a commercial system without customisation and lock-in - we outline some of our software buying vs. building thinking in the choosing technology page on the Service Manual.
  
Comment by Peter Smith posted on 28 August 2014

Could not you just rsync the webserver contents in a cron job? Do you not have all the documents source controlled anyway? Appears to be massively over-engineered, unless there's something I'm missing.

- Replies to Peter Smith>
  
  Comment by Brad Wright posted on 28 August 2014
  
  Hi Peter, GOV.UK's content is served from a content management system and is passed through some presentation layers to provide consistent headers and footers - it's not stored on disk in full form, hence the need for a mirroring process to give us the content to rsync.
  
Comment by Sam posted on 31 August 2014

You mentioned it now only takes 6 hours for a full backup. Out of curiosity what now seems to be the bottleneck preventing you going any faster (other than just adding more crawler workers)?

- Replies to Sam>
  
  Comment by Kushal Pisavadia posted on 08 September 2014
  
  Honestly, we haven't looked into this at any great depth yet. I've just taken a look at the various metrics[1] we're sending to graphite and the workflow step that goes out over the network to fetch the page is by far the slowest. At a guess it would be that some pages take a while to generate and that can slow down the response times.
  
  [1] https://github.com/alphagov/govuk_crawler_worker/blob/master/workflow.go#L164
  
Comment by serverhorror posted on 02 September 2014

Sounds like you could also achieve vast speedups by using information in the headers (like those being used for client caching)

- Replies to serverhorror>
  
  Comment by Kushal Pisavadia posted on 08 September 2014
  
  Potentially, but our intention is to use this as part of a workflow where when we 'publish' a page we can then immediately send a message to the exchange for the crawler. That way, we don't have to rely as much on cache control headers.
  
Comment by Storix posted on 08 September 2014

Your approach seems to be "over engineered". The pages are published through a CMS using templates and data from within a database. This is a very common approach and most administrators create a backup or sync the template data and the database on the web server. In this case, I would recommend using a backup product rather than crawling the pages for content.

One important aspect to remember is backing up relational databases. You should either stop the database temporarily, or create a snapshot of the data so that it is backed up in a consistent state. I'm not here to sell you a solution, but we have customers who use our backup product to accomplish exactly what you are trying to achieve. You might want to re-evaluate your disaster recovery plans by incorporating backup software. Good luck.

- Replies to Storix>
  
  Comment by Kushal Pisavadia posted on 09 September 2014
  
  As already stated in the blog post, this is just one of our many disaster recovery options. We have backups in place for our databases.
  
  In a scenario where our infrastructure disappears we would have to provision new infrastructure. This would need to complete before applying any database backups. Even more so if that's occurred across many providers or data centres.
  
  The static mirror gives us some extra time whilst we work on reprovisioning new infrastructure. It means we reduce the time it takes for users to access common areas of the site.
  
Comment by Eula Benham posted on 16 December 2014

I am actually pleased to read this website posts which contains lots of useful facts, thanks for providing such data.
