Why I want ‘differential privacy’ for my NHS data

I know the NHS takes steps to protect my data, but these often hold back its ability to learn from that data. This means the care I may need in the future will not be as good as it could be and will cost more.

Perhaps more worryingly, the current energy and enthusiasm for technology mean that my future data may end up sitting in a new silo that may be owned and controlled outside of the NHS.[1]

The NHS needs to be more sophisticated in how it collects and uses data. Differential privacy, a form of privacy protection, is the way forward.

The need for privacy is harming research and development in the NHS

Technology is a central part of the Long-Term Plan. Data is the fuel for this technology to develop and flourish, so it needs to be more accessible to researchers. Many analysts will be based outside of the NHS, so protecting privacy is critical.[2]

The words “Information Governance” and “GDPR” can strike fear into the hearts of analysts who work with health data. They are necessary but come with a growing cost. For example, A&E performance data are summarised in aggregated statistics. You cannot interrogate these data or pour them into a machine learning model, just observe them and comment on trends. Dull.

Redaction (removing names), generalisation (replacing a date of birth with a birth year), and pseudonymisation (replacing the NHS number with an artificial identifier) help make data more secure and shareable, but they do not go anywhere near far enough to give confidence for more open forms of data sharing or publishing. Eager analysts and innovators are left salivating at all the data they cannot legitimately access to train their models.
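To make this concrete, here is a rough sketch of the three steps above (dropping names, coarsening the date of birth, swapping out the NHS number) applied to a single made-up record. The field names, the record and the salted-hash pseudonym are all my own assumptions for illustration, not how any NHS system actually does it.

```python
import hashlib

def deidentify(record: dict, secret_salt: str) -> dict:
    """Apply the three basic protections to one (hypothetical) record."""
    # Swap the NHS number for a salted hash, so the same patient always
    # gets the same artificial identifier.
    digest = hashlib.sha256((secret_salt + record["nhs_number"]).encode())
    return {
        "patient_id": digest.hexdigest()[:12],
        # Keep only the year of birth.
        "birth_year": record["date_of_birth"][:4],
        # The name is simply not copied across.
        # The arrival time passes through untouched.
        "arrival": record["arrival"],
    }

# Entirely fictional example record.
record = {
    "name": "Jane Example",
    "nhs_number": "0000000000",
    "date_of_birth": "1980-06-15",
    "arrival": "2019-03-02T14:37",
}

safe_record = deidentify(record, secret_salt="demo-salt")
print(safe_record)
```

Notice what survives: the exact arrival time is still there. Anyone who knows when I turned up at the hospital could pick my row out again, which is why these steps alone are not enough.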

A more sophisticated approach to data protection is needed

When someone uses the NHS, they leave a unique fingerprint in the data – the date and time of arrival at a hospital for a specific appointment can theoretically allow the person to be reidentified and linked to their historical data.

Famously in 2014, data on taxi journeys in New York were released. The data included the start and end points, times, fare and tip amounts. Someone correlated these data with timestamped photos of celebrities getting into taxis to identify their journey from the data and spot whether they were good tippers – Bradley Cooper and Jessica Alba are apparently not tippers.[3]

The Royal Society’s recent report on “Protecting privacy in practice” covers a range of Privacy Enhancing Technologies (PETs) to help protect data and sets out a sensible approach to be taken.

What is differential privacy and how does it work?

Differential privacy allows the “forest of data to be studied, without permitting the possibility of looking at individual trees”.[4] The equally eloquent definition given by the Royal Society is: “when a dataset or result is released, it should not give much more information about a particular individual than if that individual had not been included in the dataset”.

Back in the 1960s, researchers wanted to estimate the number of communists in society. Few people were willing to give an honest answer. So, when researchers asked, “are you, or have you ever been, a member of the Communist Party?” they gave participants a coin to flip in private before answering. If the coin landed on heads, the participant would answer honestly. If the coin landed on tails, then the participant would flip the coin again and let the coin answer the question, giving the participant plausible deniability.

This is what is known as local differential privacy. Both Apple and Google use a similar approach when collecting data from your device or browser to provide some protection.[5]
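The coin-flip scheme can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions: the 10% true rate and the population of 100,000 are invented, and the function names are hypothetical. The point it shows is that, although no individual answer can be trusted, the researcher can still recover the overall rate, because the noise added by the coins is known and can be subtracted.

```python
import random

def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Answer using the two-coin scheme described above."""
    if rng.random() < 0.5:      # first flip is heads: answer honestly
        return truth
    return rng.random() < 0.5   # first flip is tails: second flip answers

def estimate_true_rate(answers: list[bool]) -> float:
    """Undo the coin noise.

    P(yes) = 0.5 * p + 0.25, where p is the true rate,
    so p = 2 * P(yes) - 0.5.
    """
    observed = sum(answers) / len(answers)
    return 2 * observed - 0.5

rng = random.Random(0)   # fixed seed so the sketch is repeatable
true_rate = 0.10         # assumed for illustration only
population = [rng.random() < true_rate for _ in range(100_000)]
answers = [randomized_response(person, rng) for person in population]

estimate = estimate_true_rate(answers)
print(f"estimated rate: {estimate:.3f}")
```

With fair coins, a “yes” is three times as likely from someone whose true answer is yes (75%) as from someone whose true answer is no (25%). That fixed ratio is what gives each person their deniability.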

There are challenges (mainly mathematical, so solvable) when extending this approach over broader fields of data, particularly unstructured data (e.g. free text descriptions). There are also trade-offs between the amount of privacy and the utility for analysts (i.e. how much noise is introduced).
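The privacy versus utility trade-off is easier to see with numbers. The sketch below uses the Laplace mechanism, a standard way of adding noise to a count, which is my choice of illustration and not the only option. The privacy setting is conventionally called epsilon (smaller means more privacy), and the count of 250 attendances is invented. It is not production-grade code.

```python
import random

def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise of scale 1 / epsilon.

    Assumes one person can change the count by at most 1.
    """
    scale = 1.0 / epsilon
    # The difference of two exponential draws with mean `scale`
    # follows a Laplace distribution centred on zero.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

def mean_abs_error(epsilon: float, trials: int = 20_000, seed: int = 0) -> float:
    """Average distance between the released count and the real one."""
    rng = random.Random(seed)
    true_count = 250  # hypothetical number of A&E attendances
    errors = [
        abs(noisy_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    ]
    return sum(errors) / trials

for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: typical error of about {mean_abs_error(eps):.2f}")
```

The typical error shrinks as epsilon grows: more useful to the analyst, less protective of me. Choosing where to sit on that scale is a judgement call, not a mathematical one.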

Generally, the benefit of using differential privacy would be substantial – more detailed data would be available for safe interrogation. There are also benefits to training machine learning models on these data, as differential privacy can help to avoid a problem known as “overfitting”.

What next?

The Royal Society set out seven recommendations for how the UK could realise the full potential of Privacy Enhancing Technologies and allow their use on a greater scale. The technology is accessible as well – much of the underlying maths is public, and the computer code to implement it is freely downloadable from GitHub. So, the next step is a policy one – perhaps for NHSX?


[1] Read more in the Reform report “Making NHS data work for everyone” (https://reform.uk/research/making-nhs-data-work-everyone).

[2] There is also a strong argument that privacy should be protected when data is accessed within the NHS. For example, after Richard Hammond was admitted to hospital in 2006, his records were accessed inappropriately (https://www.yorkshireeveningpost.co.uk/news/hospital-trust-probes-hammond-scan-spy-1-2073885).

[3] https://gawker.com/the-public-nyc-taxicab-database-that-accidentally-track-1646724546

[4] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42852.pdf

[5] https://www.wired.com/story/apple-differential-privacy-shortcomings/