Web scraping public UKSA letters

Luke Shaw

2020/06/19

1 Intro

The UK Statistics Authority (UKSA) and its regulatory arm, the Office for Statistics Regulation (OSR), independently regulate and promote good practice in official statistics.

Part of this work is done through publicly available letters exchanged with producers of such statistics, the most high-profile of which was the letter criticising the claim of ‘£350 million per week’ in relation to the UK leaving the EU.

I am interested in these letters and, having seen them in news stories recently, I set out to see what data I could find.

2 Scraping the site

I used the R package rvest to web scrape the data from their website.

This was quite fun, though fiddly at times. All code is available in the GitHub repository uksa-scrape. I learnt a bit about CSS and webpage construction along the way, and found this tutorial on CSS selectors fun and useful. The word ‘scraping’ feels appropriate, however, as it was not the smoothest of processes and involved a fair few try-and-see attempts until the output looked about right!
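To give a flavour of what selecting elements by CSS class involves, here is a minimal sketch. It is not the code from the repository, which is written in R with rvest; it uses only Python's standard library so that it runs without network access. The HTML snippet and the class name `letter-title` are hypothetical, since the real page markup is not shown in this post. The parser does by hand what the CSS selector `a.letter-title` would do.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration only; the real site's structure may differ.
SAMPLE_HTML = """
<ul class="results">
  <li><a class="letter-title" href="/correspondence/letter-a/">Letter A</a></li>
  <li><a class="letter-title" href="/correspondence/letter-b/">Letter B</a></li>
  <li><a class="other" href="/about/">About</a></li>
</ul>
"""


class LetterLinkParser(HTMLParser):
    """Collect (title, href) for each <a> with class 'letter-title'."""

    def __init__(self):
        super().__init__()
        self.letters = []
        self._in_link = False  # True while inside a matching <a> element
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "letter-title" in classes:
            self._in_link = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a matching link.
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            title = "".join(self._text).strip()
            self.letters.append((title, self._href))
            self._in_link = False


def extract_letters(html):
    """Return a list of (title, href) tuples found in the given HTML string."""
    parser = LetterLinkParser()
    parser.feed(html)
    parser.close()
    return parser.letters


letters = extract_letters(SAMPLE_HTML)
for title, href in letters:
    print(title, href)
```

The try-and-see part of scraping is mostly about finding which tag and class combination picks out the elements you want and nothing else, which is why the third link in the sample is there: it should be ignored.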

I ended up with two datasets: one with the number of letters per theme, and the other a list of all 1244 letters.

It is worth noting that the letters can be from official statistics creators to the UKSA as well as the other way round, so multiple letters can exist as a conversation between regulator and creator.

2.1 Theme table and plot

Here are the first few rows of the theme table, which has 12 distinct themes and one total. Interestingly, nearly half (46%) of the 1244 letters are not assigned a theme. This might be correct, or it might be because themes are a recent addition to the 13-year-long dataset.

| names | values | url | num_letters |
| --- | --- | --- | --- |
| Agriculture, Energy and Environment | agriculture-energy-environment | https://www.statisticsauthority.gov.uk/correspondence-list/?keyword=&theme=agriculture-energy-environment | 30 |
| Business, Trade and International Development | business-trade-international-development | https://www.statisticsauthority.gov.uk/correspondence-list/?keyword=&theme=business-trade-international-development | 20 |
| Children, Education and Skills | children-education-skills | https://www.statisticsauthority.gov.uk/correspondence-list/?keyword=&theme=children-education-skills | 85 |

From plotting the results we can see ‘Health and Social Care’ and ‘Economy’ are the themes with most letters.

2.2 Individual letters - 2020 in progress

I wondered what interesting information I could glean from the individual letters. Here are the three most recent rows from the individual letter dataset that I scraped:

| title | date | subtitle | url |
| --- | --- | --- | --- |
| Ed Humpherson to Neil McIvor: Temporary exemption from Code for DfE attendance statistics | 2020-06-19 | Ed Humpherson, Office for Statistics Regulation to Neil McIvor, Department for Education | https://www.statisticsauthority.gov.uk/correspondence/ed-humpherson-to-neil-mcivor-temporary-exemption-from-code-for-dfe-attendance-statistics/ |
| Ed Humpherson to Ken Roy: User engagement in the Defra Group | 2020-06-18 | Ed Humpherson, Office for Statistics Regulation to Ken Roy, Department for Environment, Food and Rural Affairs | https://www.statisticsauthority.gov.uk/correspondence/ed-humpherson-to-ken-roy-user-engagement-in-the-defra-group/ |
| Letter from Sir David Norgrove to Richard Holden MP | 2020-06-17 | Sir David Norgrove, UK Statistics Authority to Richard Holden MP, House of Commons | https://www.statisticsauthority.gov.uk/correspondence/letter-from-sir-david-norgrove-to-richard-holden-mp/ |

Noticing that there have been 9 letters in June 2020 so far, including one today, I wondered whether the number of letters was increasing year on year. Of course, as 2020 isn’t finished, we need to adjust the other years to see whether this year is on track to be the year with the most letters.

So the answer is yes, although this year is looking similar to the last two.
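One way to make a partial year comparable with complete ones is to count, for every year, only the letters dated on or before the same calendar day (19 June here). The sketch below illustrates that idea in Python; it is an assumption about how such an adjustment could be done, not necessarily the method used in the repository, and the dates in it are made up rather than taken from the scraped data.

```python
from collections import Counter
from datetime import date


def year_to_date_counts(dates, cutoff):
    """Count letters per year, keeping only dates on or before the cutoff's month and day."""
    cutoff_md = (cutoff.month, cutoff.day)
    return Counter(d.year for d in dates if (d.month, d.day) <= cutoff_md)


# Illustrative dates only, NOT the scraped letters.
sample_dates = [
    date(2018, 3, 1), date(2018, 6, 19), date(2018, 11, 30),
    date(2019, 1, 15), date(2019, 6, 20), date(2019, 12, 1),
    date(2020, 6, 17), date(2020, 6, 18), date(2020, 6, 19),
]

counts = year_to_date_counts(sample_dates, cutoff=date(2020, 6, 19))
for year in sorted(counts):
    print(year, counts[year])
```

Comparing (month, day) pairs rather than day-of-year numbers keeps the cutoff on the same calendar date in leap years such as 2020.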

3 Further analysis

This is quite a fun dataset, and there is more that could be done with it. It might be interesting to try natural language processing (NLP) on the titles or the full text of the letters, or to see how frequently different government departments are involved. A way of identifying whether a letter was inbound or outbound, which may well be hidden in the webpage somewhere, would make an interesting additional column in the letters dataset.
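As a starting point for the inbound/outbound idea, the subtitle column shown earlier appears to follow the pattern "sender, organisation to recipient, organisation". The Python sketch below guesses the direction from the sender's organisation. It assumes that pattern holds for every letter, which has not been checked against the full dataset, and the function name and the set of regulator organisations are my own choices for illustration.

```python
# Organisations treated as "the regulator"; an assumption based on the three rows shown above.
REGULATOR_ORGS = {"Office for Statistics Regulation", "UK Statistics Authority"}


def letter_direction(subtitle):
    """Guess 'outbound' (from the regulator), 'inbound', or 'unknown' from a subtitle."""
    # Split at the first " to " into the sender part and the recipient part.
    sender, sep, _recipient = subtitle.partition(" to ")
    if not sep:
        return "unknown"
    # The sender part is expected to look like "Name, Organisation".
    _name, comma, organisation = sender.partition(",")
    if not comma:
        return "unknown"
    return "outbound" if organisation.strip() in REGULATOR_ORGS else "inbound"


subtitles = [
    "Ed Humpherson, Office for Statistics Regulation to Neil McIvor, Department for Education",
    "Sir David Norgrove, UK Statistics Authority to Richard Holden MP, House of Commons",
    # Constructed by reversing the first subtitle; not a real letter.
    "Neil McIvor, Department for Education to Ed Humpherson, Office for Statistics Regulation",
]
directions = [letter_direction(s) for s in subtitles]
print(directions)
```

Splitting on the first " to " would misfire on a name or organisation that itself contains " to ", so the output would need a manual sanity check before being trusted.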

As mentioned above, all the code is on GitHub if anyone out there wants to play around further.

4 Conclusion

The work the UKSA does is vital in promoting and safeguarding the production and publication of official statistics that ‘serve the public good’. OK, I took most of that last sentence from their website, but I do believe it.

Through scraping publicly available information, we can see that the UKSA is on track to publish the most letters this year, though the volume is similar to 2018 and 2019.