One of the GDS design principles is to do the hard work to make things simple. As part of the work we’ve been doing to migrate our publishing platforms, we’ve been simplifying the infrastructure of our search system.
Much of this work to improve search on GOV.UK is hidden from users. I thought I’d summarise what we’ve built so far, and explain how we measure search performance. We’ve blogged about some of these areas before, but never rounded it all up in one place to explain how the components fit together.
Our search infrastructure
Our search infrastructure is made up of 3 parts:
- Elasticsearch is an open source search engine. It’s what actually does the search. It runs queries using the search indexes we define for it, and decides which documents match the user’s query. Back in 2012, we blogged about the reasons we chose Elasticsearch over Solr, another search engine we were using at the time.
- Rummager is a wrapper around Elasticsearch that we built at GDS. It includes an application programming interface (API) that is used by the search function, scoped searches, and finder pages on GOV.UK. Designing the search infrastructure this way simplifies the code for frontend applications that users interact with.
- Search admin is a tool we built to support Rummager. We use it to run manual maintenance on our search index. We can influence the ordering of results for particular search terms, for example.
Together, these tools provide a flexible search system for content published on GOV.UK.
People often ask us why we don’t just use an external search provider such as Google instead of building our own infrastructure. Creating our own infrastructure gives us more flexibility and control. Our search system doesn’t just power the main site search: it uses the metadata stored on documents to power other search and navigation features, like finders and publication lists. We also want control over how the results are presented so we can customise search results based on user needs.
Indexing documents and generating search results
A piece of content on GOV.UK is more than just a chunk of text. It also contains all sorts of metadata that can be used for searching and filtering, such as content type, associated organisations, and the date it was last updated. All this is included in the JSON document Rummager sends to Elasticsearch. We recently integrated Rummager directly with the new publishing platform, so all the metadata about how content is linked is now indexed in the same way, regardless of which application published the content.
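To make this concrete, here is a sketch of the kind of JSON document a wrapper like Rummager might send to Elasticsearch. The field names and values are illustrative assumptions, not the real GOV.UK schema:

```python
import json

# Hypothetical document: text plus the metadata used for
# searching and filtering (field names are assumptions)
document = {
    "title": "Apply for your first provisional driving licence",
    "description": "How to apply for a provisional driving licence",
    "content_type": "answer",                     # used for filtering
    "organisations": ["driver-and-vehicle-licensing-agency"],
    "public_timestamp": "2016-11-01T09:00:00Z",   # date last updated
    "link": "/apply-first-provisional-driving-licence",
}

# The document is serialised to JSON before being indexed
payload = json.dumps(document)
print(payload)
```

Because the metadata travels with the document, the search engine can filter and sort on it as well as match the text.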
Rummager also describes how Elasticsearch should treat each individual metadata field.
With Rummager we can:
- specify the field type, like date, number or text - this means dates and numbers are understood correctly (for example you can do a search for content created within a date range)
- choose which fields to index - some information is only needed to help us display search results, not to match queries
- require search to find an exact match on a field instead of analysing the text
- choose how text is analysed, for example stemming (words with the same stem, like ‘cloud’ and ‘clouded’, are treated the same), synonyms (words with the same meaning are treated the same by the search engine) and stopwords (common words like ‘the’ are ignored by the search engine) - these terms are explained further in a previous blog post on how GOV.UK site search works
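The choices in the list above correspond to settings in an Elasticsearch mapping. The sketch below shows what such a mapping could look like; the field names and analyser details are assumptions for illustration, not the actual GOV.UK configuration:

```python
import json

# Illustrative Elasticsearch-style mapping (assumed field names)
mapping = {
    "properties": {
        # typed field: dates are understood correctly, so range
        # queries like "created within the last month" work
        "public_timestamp": {"type": "date"},
        # exact-match field: the text is not analysed, so the
        # whole value must match
        "content_type": {"type": "keyword"},
        # analysed text field: stemming, synonyms and stopwords
        # apply through the chosen analyser
        "title": {"type": "text", "analyzer": "english"},
        # stored to help display results, but not searchable
        "description": {"type": "text", "index": False},
    }
}

print(json.dumps(mapping, indent=2))
```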
Running and improving search queries
The other part of Rummager is the API for running search queries. If you are comfortable working with JSON APIs, you can use it to get information about content on GOV.UK. GDS performance analysts often use the API to filter GOV.UK content and access the metadata.
The search API is separate from the frontend applications that users interact with. This ensures that the frontend applications are easy to develop: developers don’t need to know any specifics about Elasticsearch to change the user interface. This also lets us experiment with different interfaces and add new search capabilities without duplicating code.
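As a sketch of what querying a JSON search API like this might look like, the example below builds a query URL with a free-text term, a metadata filter and a list of fields to return. The endpoint and parameter names are illustrative assumptions:

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names, for illustration only
BASE = "https://www.gov.uk/api/search.json"

params = {
    "q": "self assessment",                       # free-text query
    "filter_organisations": "hm-revenue-customs", # filter on metadata
    "fields": "title,link,public_timestamp",      # metadata to return
    "count": 10,                                  # number of results
}

url = f"{BASE}?{urlencode(params)}"
print(url)
```

A performance analyst could fetch this URL and work with the returned JSON directly, without needing to know anything about Elasticsearch.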
Sometimes we manually intervene to improve specific queries. Search admin lets us create a ‘best bet’ to give a particular page a higher score for a query. Similarly, we can create a ‘worst bet’ to give a search result a lower score. Tara describes this process in more detail in her blog post about how search works.
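One way to picture how best bets and worst bets could adjust a ranked result list is sketched below. This is a simplified illustration, not the actual mechanism in search admin:

```python
def apply_bets(query, results, best_bets, worst_bets):
    """Pin best-bet links to the top and demote worst-bet links.

    A simplified sketch: real scoring adjustments would happen
    inside the search engine, not by reordering a list.
    """
    best = best_bets.get(query, [])
    worst = set(worst_bets.get(query, []))
    pinned = [link for link in best if link in results]
    demoted = [link for link in results
               if link in worst and link not in pinned]
    rest = [link for link in results
            if link not in pinned and link not in worst]
    return pinned + rest + demoted

# Hypothetical query with one best bet and one worst bet
results = ["/page-a", "/page-b", "/page-c"]
ordered = apply_bets("tax", results,
                     best_bets={"tax": ["/page-c"]},
                     worst_bets={"tax": ["/page-a"]})
print(ordered)  # → ['/page-c', '/page-b', '/page-a']
```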
Understanding search performance
When we deploy changes to our search infrastructure, we need to be confident that we’re really helping users find things more easily. To do this we need to understand search performance.
Some of the tools we use to understand and improve search performance are automated, such as analytics and health checks.
Analytics show us how users respond to a particular page of search results: if the results near the top of the page get more clicks than results further down the page, it means things are working well. If they don’t, we need to do more to improve the search results.
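A minimal sketch of that kind of click analysis, under the assumption that healthy results show clicks concentrated near the top of the page:

```python
def clicks_by_position(clicks):
    """Return each result position's share of clicks (1-indexed)."""
    total = sum(clicks.values())
    return {pos: count / total for pos, count in sorted(clicks.items())}

# Hypothetical click counts for one page of search results
shares = clicks_by_position({1: 70, 2: 20, 3: 10})

# If the share of clicks falls as you go down the page,
# the ranking is probably working well
positions = list(shares)
healthy = all(shares[p] >= shares[q]
              for p, q in zip(positions, positions[1:]))
print(shares, healthy)
```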
Another automated tool is the ‘health check’ in Rummager. This runs a predefined list of queries, and checks that the results contain the links we would expect. We use the health check to ensure that we don’t accidentally change the search results in a significant way without knowing about it.
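The health check idea can be sketched as follows. The queries, expected links and stand-in search function are all made up for illustration:

```python
# Hypothetical predefined queries and the links we expect to see
EXPECTATIONS = {
    "passport": ["/apply-renew-passport"],
    "vehicle tax": ["/vehicle-tax"],
}

def health_check(search, expectations=EXPECTATIONS):
    """Return (query, link) pairs where an expected link is missing."""
    failures = []
    for query, expected_links in expectations.items():
        results = search(query)
        for link in expected_links:
            if link not in results:
                failures.append((query, link))
    return failures

# Fake search function standing in for the real engine
fake_index = {"passport": ["/apply-renew-passport"], "vehicle tax": []}
print(health_check(lambda q: fake_index.get(q, [])))
```

Run after a deployment, a non-empty failure list flags that the results have changed in a way someone should look at.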
Although automated tools are useful, we can’t make assumptions about the things users are looking for through analytics alone. For example, if users click one search result more than others, it only shows the relative importance of that result compared to the others on that page. There may be more appropriate results the search engine didn’t pick up. Ongoing user research ensures improvements we make are based on user needs. During our current work on subject-based taxonomy, user research has identified problems that analytics alone didn’t highlight.
To change the way Elasticsearch queries content, we sometimes use feature flags in Rummager. This lets us quickly integrate a new feature into the Rummager code (for example, to include withdrawn documents in search results), but disable it for users until we’re confident it works well. Feature flags are also useful for running A/B tests and rolling out features gradually, which we’re thinking about doing in the future.
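A feature flag guarding new query behaviour might look something like the sketch below. The flag name and query shape are made up for illustration:

```python
# Hypothetical flag, disabled for users by default
FLAGS = {"include_withdrawn": False}

def build_query(text, flags=FLAGS):
    """Build search parameters, with new behaviour behind a flag."""
    query = {"q": text, "include_withdrawn": False}
    if flags.get("include_withdrawn"):
        # new behaviour: withdrawn documents appear in results
        query["include_withdrawn"] = True
    return query

print(build_query("statistics"))
```

Because the new code path ships disabled, it can be merged early and switched on (for everyone, or for a test group) once we’re confident in it.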
The technical improvements we’ve made to the search infrastructure mean we can now start work on improving the tagging tools in our publishing applications. We’re also going to build a beta for unified navigation on GOV.UK using the content we’ve identified as belonging to the education theme. We’ll continue to update you as we make progress.
Making it easier for users to find things means we need to consider more than just the search system. We’re developing a single subject-based taxonomy for all our content and we’re going to make it easier to tag that content. So improving search is part of a large programme of work we’re doing to improve navigation on GOV.UK.
If this sounds like a good place to work, take a look at Working for GDS - we're usually in search of talented people to come and join the team.