
https://technology.blog.gov.uk/2015/07/14/modern-http-for-backing-services/

Modern HTTP for backing services

Categories: GOV.UK Performance, Tools

We have previously written about how we have smoke tests running continuously which check that GOV.UK is responding properly. Late last year, we noticed that we were consistently seeing a problem with the part of GOV.UK served by the Performance Platform.

Scenario: Performance Platform dashboards are available
Given I am testing through the full stack
And I force a varnish cache miss
And I am benchmarking
When I visit "/performance/carers-allowance"
Then I should get a 200 status code
And I should see "Carer's Allowance"
And the elapsed time should be less than 2 seconds

This was failing because the elapsed time was over 2 seconds, not meeting our performance budget. This part of GOV.UK updates infrequently (once per day), so we can make it faster by caching the response.

But one of the reasons we have smoke tests with a performance budget is to nudge us into investigating exactly what the problem is.

Diagnosis

To replicate the problem, we wanted the client's view, rather than just the times reported by the application responsible for creating the HTML. So first off, we used curl to look at where the time was being spent.

> URL=https://www-test.performance.service.gov.uk/performance/carers-allowance
> curl -w "dns: %{time_namelookup} tcp: %{time_connect} ssl:%{time_appconnect} \
ttfb:%{time_starttransfer} total:%{time_total}\n" -sk -o /dev/null $URL -o /dev/null $URL
dns: 0.001 tcp: 0.027 ssl:0.182 ttfb:1.434 total:1.582
dns: 0.000 tcp: 0.000 ssl:0.000 ttfb:1.362 total:1.533

From this:

  • DNS looks fine (0.001s)
  • TCP looks fine (0.027s)
  • TLS is OK (0.182s)
  • Time To First Byte (ttfb) is awful (1.4s)
  • After we receive the first byte, it still takes over 100ms to deliver the full response, so the response is presumably quite large
  • SSL session reuse / Keep-Alive looks to be configured properly (second request has 0s TCP/SSL setup cost)
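
As an aside (our suggestion, not something from the original investigation), the same write-out format can be kept in a file and reused between measurements, because curl will read the format from a file when it is given as -w "@filename":

> cat > curl-format.txt <<'EOF'
dns: %{time_namelookup} tcp: %{time_connect} ssl:%{time_appconnect} ttfb:%{time_starttransfer} total:%{time_total}\n
EOF
> curl -w "@curl-format.txt" -sk -o /dev/null $URL -o /dev/null $URL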

What good looks like

For comparison, we set the URL to https://www.google.com/:
> curl -w "dns: %{time_namelookup} tcp: %{time_connect} ssl:%{time_appconnect} \
ttfb:%{time_starttransfer} total:%{time_total}\n" -sk -o /dev/null $URL -o /dev/null $URL
dns: 0.001 tcp: 0.016 ssl:0.128 ttfb:0.142 total:0.142
dns: 0.000 tcp: 0.000 ssl:0.000 ttfb:0.014 total:0.014

Google looks a bit better:

  • DNS looks fine (0.001s)
  • TCP looks fine (0.016s)
  • TLS is okay (0.128s)
  • Time To First Byte is great (0.142s implies a response time of ~10ms)
  • The response looks tiny given that there is no additional time to send it
  • SSL session reuse / Keep-Alive looks to be configured properly (second request has 0s TCP/SSL setup cost)

Looking at the response from our service in more detail:

> curl -I $URL
HTTP/1.1 200 OK
Server: nginx
Date: Tue, 11 Nov 2014 09:26:24 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 132653
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: public, max-age=600
ETag: "cnmbPZfU1LO/hl81Y/JLMg=="
Vary: Accept-Encoding
Strict-Transport-Security: max-age=2628000

We can see that the response is a large HTML page (~130KB). Next, we delved more into why the application was taking 1.5 seconds to render the HTML.
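
One quick check of our own (not from the original post): the Vary: Accept-Encoding header suggests the page can be served gzipped, so the bytes that actually cross the wire may be far fewer than the 130KB Content-Length. curl can report the downloaded size directly:

> curl -sk -H 'Accept-Encoding: gzip' -o /dev/null -w 'bytes downloaded: %{size_download}\n' $URL

Compression helps the transfer time, though, not the roughly 1.5 seconds the application spends rendering the HTML before the first byte arrives.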

Dashboards on the Performance Platform are formed from lots of different data sets which are queried and then aggregated together to form a single HTML page. These data sets are stored in our backing service API, which is accessed internally over HTTPS. To render the Carer’s Allowance dashboard, our application was making 11 HTTP calls to the backing service API. Those 11 calls seemed like a likely candidate for optimisation (were they fast enough, and were they all necessary?), but we went through Brendan Gregg’s USE method to confirm where we should spend our time.

USE method

I used wrk to generate sufficient load on a test server, and then examined the machine.

wrk -d 120 -c 1 -t 1 -R 1 http://www-test.performance.service.gov.uk/performance/carers-allowance

This gave a sustained rate of 0.47 requests per second, after running it 3 times and taking the mean.
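
For reference, one simple way to repeat the measurement (the loop is ours; wrk prints its own requests-per-second summary at the end of each run):

> for i in 1 2 3; do wrk -d 120 -c 1 -t 1 -R 1 http://www-test.performance.service.gov.uk/performance/carers-allowance; done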

For each resource, we checked utilisation, saturation and errors:

  • CPU utilisation: sudo iostat 1 shows low load via the us and sys columns; top -o cpu shows the application process steady at around 50%
  • CPU saturation: sudo iostat 1 shows low load in the load average column
  • CPU errors: none showing
  • Memory utilisation: vm_stat 1 shows lots of free memory; top -o rsize shows memory available for the application process
  • Memory saturation: vm_stat 1 shows 0 pageouts
  • Memory errors: none showing
  • Network utilisation: netstat -i 1 looks fine, and we're not maxing out the network connection for download; sudo fs_usage -f network PID shows a fair bit of network chat
  • Network saturation: netstat -s | egrep 'retrans|overflow|full|out of space|no bufs' shows nothing to worry about
  • Network errors: netstat -s | grep bad shows nothing to worry about
  • Storage IO utilisation: iostat 1 shows low IO (0 to 20 KB/s); sudo fs_usage -f filesystem PID shows the expected reading of HTML template files and nothing else
  • Storage IO saturation: sudo iopending is fine
  • Storage IO errors: none
  • Storage capacity utilisation: df -h shows free space; sysctl vm.swapusage reports "vm.swapusage: total = 2048.00M used = 861.75M free = 1186.25M (encrypted)"
  • Storage capacity saturation: there's free disk space
  • Storage capacity errors: none

This all showed that the server seemed to be behaving normally, and so we could proceed to looking at the performance of the backing service APIs.

Performance of backing services

> URL='https://api-test.performance.service.gov.uk/data/transactional-services/summaries?flatten=true&sort_by=_timestamp%3Adescending&filter_by=service_id%3Adwp-carers-allowance-new-claims&filter_by=type%3Aseasonally-adjusted'
> curl -w "dns: %{time_namelookup} tcp: %{time_connect} ssl:%{time_appconnect} \
ttfb:%{time_starttransfer} total:%{time_total}\n" -sk -o /dev/null $URL -o /dev/null $URL
dns: 0.001 tcp: 0.027 ssl:0.177 ttfb:0.231 total:0.231
dns: 0.000 tcp: 0.000 ssl:0.000 ttfb:0.031 total:0.031

The above requests to the backing service API show that, again, the first request incurs the overhead of setting up a TLS connection, and that the Time To First Byte and total time are reasonably fast.

At this point, we suspected that the application producing the HTML was being slowed down by all the requests to the backing services (and that this would only get worse as dashboards were iterated to show more data).

Looking at the underlying runtime confirmed that we were limited to 5 concurrent connections to a host/port combination. So to serve the 11 HTTP requests to the backing services, our application would:

  • create 5 SSL connections to a single backing service (this happens concurrently, so we see a delay of 0.2 seconds plus a bit, rather than 5 times 0.2 seconds)
  • wait for the response to each request before sending the next one over that SSL connection (head-of-line blocking)
  • keep waiting until it had all 11 responses
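
To get a rough feel for how much that serialisation costs, here is a sketch of our own (not from the original post): it fires the backing service request from above ($URL) 11 times, first one at a time and then up to 5 at a time. It is only an approximation, because each curl process below pays its own TLS handshake, whereas the application reuses pooled connections.

> for i in $(seq 1 11); do echo "$URL"; done > urls.txt
> time xargs -n 1 -P 1 curl -sk -o /dev/null < urls.txt   # one request at a time
> time xargs -n 1 -P 5 curl -sk -o /dev/null < urls.txt   # up to 5 in parallel

Even with 5 connections, each one can only have a single request outstanding at a time, so with 11 requests at least one connection has to make 3 round trips back to back.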

We looked for a simple way to solve this by varying connection pool sizes (the total number of connections allowed to a single host/port) but that seemed to involve either having more delays setting up more SSL connections, or more delays waiting for responses to come back from the backing service. We also felt that this would require monitoring and tuning on an ongoing basis as the design of the dashboards evolved (and the number of backing service calls changed). Fortunately, other people have been thinking about how to improve this situation for web browsers and other contexts.

SPDY is a network protocol developed at Google which is designed to reduce web page latency. It does this by multiplexing requests over a single connection, and can receive responses in any order, so it does not suffer from the head-of-line blocking that we were seeing.

Solution

We were able to change the version of nginx (where we do SSL termination) to one that supported SPDY, and essentially drop a library into the application to give us SPDY client support.
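
To confirm that nginx was now advertising SPDY, a check along these lines should work (our suggestion, not from the original post; it needs an OpenSSL build that still supports NPN, and the protocols the server offers appear in a "Protocols advertised by server" line of the output):

> openssl s_client -connect api-test.performance.service.gov.uk:443 -nextprotoneg '' < /dev/null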

Running our wrk test again showed that the sustained rate had now improved to 0.58 requests per second.

Repeating the original test:

> URL=https://www-test.performance.service.gov.uk/performance/carers-allowance
> curl -w "dns: %{time_namelookup} tcp: %{time_connect} ssl:%{time_appconnect} \
ttfb:%{time_starttransfer} total:%{time_total}\n" -sk -o /dev/null $URL -o /dev/null $URL
dns: 0.001 tcp: 0.033 ssl:0.223 ttfb:1.299 total:1.482
dns: 0.000 tcp: 0.000 ssl:0.000 ttfb:1.047 total:1.203

This consistently gave us lower response latency: around a 20% improvement on the repeated request (total time down from 1.533s to 1.203s), taking us back to within our 2 second performance budget with no more scoldings from our smoke tests.

For the future

HTTP/2 has been standardised, and includes networking improvements that build on what was learned from SPDY; SPDY itself has been deprecated. We expect that as HTTP/2 support becomes more widespread, we’ll be able to take advantage of it seamlessly. HTTP/2 should let us see performance benefits from using HTTP to talk to backing services, while retaining features and behaviour that are widely understood by our developers.
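
When that time comes, one quick way to check whether HTTP/2 is actually being negotiated is a curl built with HTTP/2 support (again our suggestion, not from the original post; --http2 needs a curl built against nghttp2), since the status line reports the protocol version that was agreed:

> curl --http2 -skI https://www-test.performance.service.gov.uk/performance/carers-allowance | head -1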

If work like this sounds good to you, take a look at working for GDS. We’re searching for talented people to come and join our team.

You can follow James on Twitter.

1 comment

  1. Comment by Bob Bratchet:

    Response time was only improved from 1.58s to 1.48s. Woohoo. I've just measured it externally as 3.3s, so clearly something serious is up with it.

    Anything over a second is still rather a long time for a few simple database look-ups.