We have previously written about how we have smoke tests running continuously which check that GOV.UK is responding properly. Late last year, we noticed that we were consistently seeing a problem with the part of GOV.UK served by the Performance Platform.
Scenario: Performance Platform dashboards are available
Given I am testing through the full stack
And I force a varnish cache miss
And I am benchmarking
When I visit "/performance/carers-allowance"
Then I should get a 200 status code
And I should see "Carer's Allowance"
And the elapsed time should be less than 2 seconds
This was failing because the elapsed time was over 2 seconds, breaking our performance budget. This part of GOV.UK updates infrequently (once per day), so an obvious quick fix would be to cache the response.
But part of the reason we have smoke tests with a performance budget is to nudge us into investigating exactly what the underlying problem might be.
Diagnosis
To replicate the problem, we wanted the client's view, rather than just the times reported by the application responsible for creating the HTML. So first off, we looked at where the time was being spent, using curl.
> URL=https://www-test.performance.service.gov.uk/performance/carers-allowance
> curl -w "dns: %{time_namelookup} tcp: %{time_connect} ssl:%{time_appconnect} \
ttfb:%{time_starttransfer} total:%{time_total}\n" -sk -o /dev/null $URL -o /dev/null $URL
dns: 0.001 tcp: 0.027 ssl:0.182 ttfb:1.434 total:1.582
dns: 0.000 tcp: 0.000 ssl:0.000 ttfb:1.362 total:1.533
From this:
- DNS looks fine (0.001s)
- TCP looks fine (0.027s)
- TLS is OK (0.182s)
- Time To First Byte (ttfb) is awful (1.4s)
- After we receive the first byte, it still takes over 100ms to deliver the full response, so the response is presumably quite large
- SSL session reuse / Keep-Alive looks to be configured properly (second request has 0s TCP/SSL setup cost)
What good looks like
For comparison, we set the URL to https://www.google.com/:
> curl -w "dns: %{time_namelookup} tcp: %{time_connect} ssl:%{time_appconnect} \
ttfb:%{time_starttransfer} total:%{time_total}\n" -sk -o /dev/null $URL -o /dev/null $URL
dns: 0.001 tcp: 0.016 ssl:0.128 ttfb:0.142 total:0.142
dns: 0.000 tcp: 0.000 ssl:0.000 ttfb:0.014 total:0.014
Google looks a bit better:
- DNS looks fine (0.001s)
- TCP looks fine (0.016s)
- TLS is okay (0.128s)
- Time To First Byte is great (0.142s implies a response time of ~10ms)
- The response looks tiny, given that there is no additional time to send it
- SSL session reuse / Keep-Alive looks to be configured properly (second request has 0s TCP/SSL setup cost)
Looking at the response from our service in more detail:
> curl -I $URL
HTTP/1.1 200 OK
Server: nginx
Date: Tue, 11 Nov 2014 09:26:24 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 132653
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: public, max-age=600
ETag: "cnmbPZfU1LO/hl81Y/JLMg=="
Vary: Accept-Encoding
Strict-Transport-Security: max-age=2628000
We can see that the response is a large HTML page (~130KB). Next, we delved into why the application was taking 1.5 seconds to render the HTML.
Dashboards on the Performance Platform are formed from lots of different data sets which are queried and then aggregated together to form a single HTML page. These data sets are stored in our backing service API, which is accessed internally over HTTPS. To render the Carer’s Allowance dashboard, our application was making 11 HTTP calls to the backing service API. Those 11 calls seemed like a likely candidate for optimisation (were they fast enough, and were they all necessary?), but we went through Brendan Gregg’s USE method to confirm where we should spend our time.
USE method
We used wrk to generate a sustained load against a test server, and then examined the machine.
wrk -d 120 -c 1 -t 1 -R 1 http://www-test.performance.service.gov.uk/performance/carers-allowance
This gave a sustained rate of 0.47 requests per second, after running it 3 times and taking the mean.
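To make that repeatable, the three runs and the averaging can be wrapped up in a small script. A minimal sketch, assuming a wrk (or wrk2) build that prints its usual Requests/sec summary line:

URL=http://www-test.performance.service.gov.uk/performance/carers-allowance
# run the benchmark 3 times and print the mean Requests/sec
for i in 1 2 3; do
  wrk -d 120 -c 1 -t 1 -R 1 "$URL"
done | awk '/Requests\/sec:/ { total += $2; runs++ } END { print total / runs }'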
Resource | Type | Metrics |
---|---|---|
CPU | utilisation | sudo iostat 1 shows low load via the us and sys columns. top -o cpu shows that the application process is steady at around 50% |
CPU | saturation | sudo iostat 1 shows low load in the load average column |
CPU | errors | None showing |
Memory | utilisation | vm_stat 1 shows lots of free memory. top -o rsize shows memory available for the application process |
Memory | saturation | vm_stat 1 shows 0 pageouts |
Memory | errors | None showing |
Network | utilisation | netstat -i 1 looks fine; not maxing out the network connection for download. sudo fs_usage -f network PID shows a fair bit of network chat |
Network | saturation | netstat -s \| egrep 'retrans\|overflow\|full\|out of space\|no bufs' shows nothing to worry about |
Network | errors | netstat -s \| grep bad shows nothing bad to worry about |
Storage IO | utilisation | iostat 1 shows low IO (0 to 20 KB/s). sudo fs_usage -f filesystem PID shows the expected reading of HTML template files and nothing else |
Storage IO | saturation | sudo iopending is fine |
Storage IO | errors | None |
Storage capacity | utilisation | df -h shows free space; sysctl vm.swapusage shows low swap usage |
Storage capacity | saturation | There's free disk space |
Storage capacity | errors | None |
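None of these checks is exotic, so the point-in-time ones can be bundled into a single snapshot script to re-run while wrk applies load. A rough sketch, using the same macOS tools as the table above:

#!/bin/sh
# snapshot of the USE method checks from the table above
date
df -h                   # storage capacity utilisation
vm_stat                 # memory utilisation and pageouts
netstat -s | egrep 'retrans|overflow|full|out of space|no bufs'   # network saturation
netstat -s | grep bad   # network errors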
This all showed that the server was behaving normally, so we could move on to looking at the performance of the backing service APIs.
Performance of backing services
> URL='https://api-test.performance.service.gov.uk/data/transactional-services/summaries?flatten=true&sort_by=_timestamp%3Adescending&filter_by=service_id%3Adwp-carers-allowance-new-claims&filter_by=type%3Aseasonally-adjusted'
> curl -w "dns: %{time_namelookup} tcp: %{time_connect} ssl:%{time_appconnect} \
ttfb:%{time_starttransfer} total:%{time_total}\n" -sk -o /dev/null $URL -o /dev/null $URL
dns: 0.001 tcp: 0.027 ssl:0.177 ttfb:0.231 total:0.231
dns: 0.000 tcp: 0.000 ssl:0.000 ttfb:0.031 total:0.031
The above requests to the backing service API show that, once again, the first request incurs the overhead of setting up a TLS connection, and that the Time To First Byte and total time are reasonably fast.
At this point, we suspected that the application producing the HTML was being slowed down by all the requests to the backing services (and that this would only get worse as dashboards were iterated on to show more data).
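To put a number on that suspicion, the natural check is to time each backing service call the dashboard makes. A sketch of that loop, noting that BACKING_URLS is hypothetical (only one of the 11 query URLs is shown above):

# BACKING_URLS is a hypothetical list of the dashboard's 11 data set
# queries; only the one above is shown in this post
BACKING_URLS=("$URL")   # ...plus the other ten
for u in "${BACKING_URLS[@]}"; do
  curl -sk -o /dev/null -w "ttfb:%{time_starttransfer} total:%{time_total}\n" "$u"
done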
Looking at the underlying runtime confirmed that we were limited to 5 concurrent connections to a given host/port combination (there's a sketch of how to check this after the list below). So to serve the 11 HTTP requests to the backing services, our application would:
- Create 5 SSL connections to a single backing service; this happens concurrently, so we see a delay of a bit over 0.2 seconds, rather than 5 times 0.2 seconds
- Wait for a response to each request sent over an SSL connection (head-of-line blocking)
- Keep waiting until it had all 11 responses
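As an illustration (assuming a Node.js runtime, which this post doesn't confirm), the limit in question would typically be the HTTP agent's socket pool, which older Node releases capped at 5 sockets per host/port by default. It can be inspected directly:

# assumption: the application runs on Node.js; older releases defaulted
# the global agent to 5 sockets per host/port combination
node -e "console.log(require('https').globalAgent.maxSockets)"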
We looked for a simple way to solve this by varying connection pool sizes (the total number of connections allowed to a single host/port), but that seemed to trade one delay for another: either more time setting up additional SSL connections, or more time waiting for responses to come back from the backing service. We also felt this would need monitoring and tuning on an ongoing basis as the design of the dashboards evolved (and the number of backing service calls changed). Fortunately, other people have been thinking about how to improve this situation for web browsers and other contexts.
SPDY is a network protocol developed at Google which is designed to reduce web page latency. It does this by multiplexing requests over a single connection, and can receive responses in any order, so it does not suffer from the head-of-line blocking that we were seeing.
Solution
We were able to change the version of nginx (where we do SSL termination) to one that supported SPDY, and essentially drop a library into the application to give us SPDY client support.
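As a sketch of what that nginx change involves (version details from nginx's documentation, not from our deployment notes): SPDY support had to be compiled in, and enabling it was a one-word addition to the listen directive.

# confirm the nginx binary was built with SPDY support
# (nginx -V prints the configure arguments on stderr)
nginx -V 2>&1 | grep -o 'http_spdy_module'
# the corresponding nginx.conf change is then a single keyword:
#   listen 443 ssl spdy;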
Running our wrk test again showed that we had now improved our sustained rate to 0.58 requests per second.
Repeating the original test:
> URL=https://www-test.performance.service.gov.uk/performance/carers-allowance
> curl -w "dns: %{time_namelookup} tcp: %{time_connect} ssl:%{time_appconnect} \
ttfb:%{time_starttransfer} total:%{time_total}\n" -sk -o /dev/null $URL -o /dev/null $URL
dns: 0.001 tcp: 0.033 ssl:0.223 ttfb:1.299 total:1.482
dns: 0.000 tcp: 0.000 ssl:0.000 ttfb:1.047 total:1.203
This consistently gave us lower response latency, and a roughly 20% improvement in sustained throughput (0.47 to 0.58 requests per second), taking us back within our 2 second performance budget with no more scoldings from our smoke tests.
For the future
HTTP/2 has been standardised, and includes networking improvements that take what was learned from SPDY; SPDY itself has since been deprecated. We expect that as HTTP/2 support becomes more widespread, we'll be able to take advantage of it seamlessly. HTTP/2 should let us see performance benefits from using HTTP to talk to backing services, while retaining features and behaviour that are widely understood by our developers.
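That negotiation is already easy to observe from the command line. With a curl built against nghttp2 (and new enough to support %{http_version}, so 7.50 or later), a quick sketch:

# prints "2" when the server negotiates HTTP/2
curl -sI --http2 -o /dev/null -w '%{http_version}\n' https://www.google.com/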
If work like this sounds good to you, take a look at working for GDS. We’re searching for talented people to come and join our team.
1 comment
Comment by Bob Bratchet posted on
Response time was only improved from 1.58s to 1.48s. Woohoo. I've just measured it externally as 3.3s, so clearly something serious is up with it.
Anything over a second is still rather a long time for a few simple database look-ups.