Chapter 1 Introduction
This guidance is part of the quality assurance guidance published by the Best Practice and Impact team at the Office for National Statistics. It has been created to support analysis professionals in government who use coding languages in their work, such as for the production of statistics or modelling.
1.1 Who is this guidance for?
This guidance is for analysts who want to make their work reproducible and ensure that they follow best practice. The aim of the guidance is to promote good practice in coding across the government analytical professions. It is aimed primarily at analysts who use code but have not been formally trained in computer science, rather than at data scientists.
The guidance provides an introduction to techniques and methods. It is not a comprehensive learning resource, but neither is it an introduction to coding. You will get the most from this guidance if you are familiar with coding principles, have used a script-based language before, or currently use R or Python for analysis. At the end of each chapter there are links to resources that provide a more comprehensive and in-depth guide to the methods we discuss. These are not the only learning resources available, but we have tried to link to ones that are widely used.
1.2 What is its aim?
Analysis in government is moving away from legacy systems that are largely manual to workflows built using open source languages and tools that encourage rigorous documentation and a full audit trail. This trend, which builds on best practice in software engineering, provides opportunities to use new tools and methods to enhance the quality of the systems used to produce analysis. Many producers of such systems are not data scientists or coding experts, but other analysts or non-analytical staff who may not be familiar with these new approaches. This guidance introduces some of the common practices recommended for implementing analysis in a transparent and fully auditable way, and brings together best practice from across government.
1.3 Why move away from spreadsheets?
Many departments are moving some analysis from spreadsheets to more automated workflows built with open source languages like R and Python. There are several reasons for this.
- Spreadsheet errors
- Spreadsheets intermesh data and code. Because the logic is mixed up with the data, it is hard to quality assure, test, and understand what the code is doing. Splitting the code from the data allows us to write tests to make sure the code is doing what it is supposed to (see the sketch after this list). Spreadsheets are not designed to work with best practice software testing.
- Errors are relatively common because of human error. This list of errors caused by spreadsheets illustrates the problem. One of the causes is the high degree of copying and pasting that spreadsheets encourage, coupled with the intermeshing of data and logic. It is estimated that 20 per cent of genetic research papers contain errors because a spreadsheet converted gene names into calendar dates. This demonstrates the problem of spreadsheet logic performing tasks without informing the user. When this is combined with the difficulty of building proper tests, the resulting workflows are error prone. This, in turn, leads to the requirement for large amounts of time intensive, manual quality assurance.
- Lack of proper version control
- Whilst it is possible to version control spreadsheets, this is difficult and time consuming when compared to the version control functionality of tools like Git. These tools preserve a complete record of every change, along with the reasons and decisions behind those changes. This provides a complete audit trail and allows managers to have confidence in the quality assurance of the analysis without having to go through the work line by line.
- When using Git, it is very difficult to make changes without these being recorded. Along with its ease of use, this makes version control a fundamental part of the analytical workflow, rather than relying on the analyst to remember to record changes and maintain proper version control manually.
- Spreadsheets are less reproducible
- Spreadsheets often involve lots of manual steps or ‘point and clicks’. These steps are not recorded, which makes reproducing work later difficult or impossible. Notes and documentation may help, but even detailed documentation is unlikely to give a comprehensive record of all the steps taken.
- Testing and QA often happen at the end, not throughout
- Quality assurance is time consuming when done manually and can take up a significant proportion of analyst time. Automating quality assurance and code management, alongside building testing into the workflow, frees up time to do more analysis and interpret the results.
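To make the testing point concrete, here is a minimal sketch in Python. The function, values and tests are hypothetical, chosen purely for illustration: the point is that once the logic lives in code rather than in spreadsheet cells, it can be checked automatically with a test runner such as pytest.

```python
# transform.py -- the logic lives in code, separate from any particular dataset
def add_vat(price: float, rate: float = 0.2) -> float:
    """Return a price with VAT applied at the given rate."""
    if rate < 0:
        raise ValueError("rate must be non-negative")
    return price * (1 + rate)


# test_transform.py -- run with `pytest` to check the logic automatically
import pytest


def test_add_vat_applies_default_rate():
    # the expected answer is written down once, and checked on every run
    assert add_vat(100.0) == pytest.approx(120.0)


def test_add_vat_rejects_negative_rate():
    # invalid inputs fail loudly instead of silently producing a wrong cell
    with pytest.raises(ValueError):
        add_vat(100.0, rate=-0.1)
```

Every time the code changes, the same tests can be re-run in seconds, rather than re-checking cells by hand.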
1.4 Why move from SAS or SPSS to an open source tool?
Proprietary tools such as SAS and SPSS have several drawbacks compared with open source alternatives:
- they can act as a black box: you cannot examine the algorithms that are being run
- it is harder to build testing into the workflow
- they are more limited in the range of tasks they can perform
- they cost money to licence, whereas open source tools are free
1.5 Open source
Open source languages like Python and R are increasing in popularity across government. One advantage of using these tools is that we can reduce the number of steps where the data needs to be moved from one program (or format) into another. This is in line with the principle of reproducibility set out in the government's Aqua Book guidance on producing quality analysis, as the entire process can be set out in code, greatly reducing the likelihood of manual transcription errors.
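As a minimal sketch of what "the entire process set out in code" means in practice (the file and column names below are hypothetical, for illustration only), a single Python script can take raw data through to an output with no manual copy-and-paste steps:

```python
# pipeline.py -- every step from raw data to output is written down as code,
# so the whole process can be re-run and audited end to end
import pandas as pd


def main() -> None:
    raw = pd.read_csv("raw_survey_data.csv")     # read: no manual import step
    clean = raw.dropna(subset=["response"])      # transform: the rules are explicit
    summary = clean.groupby("region", as_index=False)["response"].mean()
    summary.to_csv("regional_summary.csv", index=False)  # write: no transcription


if __name__ == "__main__":
    main()
```

Re-running the script reproduces the output exactly, which removes the manual transcription steps where errors tend to creep in.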
The Government Digital Service Technology Code of Practice is a set of criteria to help government design, build and buy technology. It’s a cross-government agreed standard in the spend controls process. Criterion 3 “Be open and use open source” recommends using open source languages to improve transparency, flexibility and accountability.
It is easy to find learning resources and guides for open source languages, as they are supported by large and vibrant communities. This allows for a large degree of self-learning: if an output, method or technique is widely used (and even if it is cutting edge), there is usually a library already in place to support it, and a guide to teach it that is freely available to all. As a result of these large communities of users, open source languages can perform an enormous array of tasks. This means that departments can try innovative new approaches such as interactive dashboards or web scraping quickly and easily. Which language a department chooses will depend on business needs. In general, R and Python are the most widely used open source analysis tools and are good at similar tasks.
1.6 For managers
Managers who lead the production of analysis outputs like statistics or models are responsible for showing that those outputs are fit for purpose and properly quality assured. Whilst spreadsheet-based workflows are familiar to all, moving to an open source workflow will require changes to the process. Because the tools and methods are often unfamiliar at the start, managers may be unsure of the robustness of the process. As a consequence, processes might be slowed down or not deployed, or the manager may feel they need to go through the process line by line. This guidance will support managers and explain what they need to do to be certain that best practice has been followed. If the process follows these best practices, then a manager can have confidence in the output. Managers can ask whether business risks that they own have been properly tested and mitigated. Because workflows are fully audited and transparent, a manager can see every change and how the changes were managed. This allows the analysts to focus on the coding and the managers to focus on managing.