Secondary Uses of Data – A Poacher's Tale

Early in my career in health informatics I had plans to make myself fabulously rich by selling pseudonymised patient data from GPs for a range of secondary purposes. I managed to spend £15 million of my backers' money giving away 1000 GP systems and established a database of 6 million patient records. It all ended in tears (at least from the financial perspective) in 1992.

In those early days I had a naïve view of the extent to which pseudonymisation could protect patient privacy and of the ease with which data could be used for secondary purposes. In this world I am therefore very much a poacher turned gamekeeper, but one who still believes in the massive benefits that could flow from intelligent secondary use of patient data.

In this blog piece I won’t dwell on issues of patient privacy. Suffice it to say for now that I no longer believe that pseudonymisation of rich datasets is fully effective, but I do believe that with a sophisticated approach we can adequately protect patient privacy when we use their data for secondary purposes. What I want to concentrate on here are the challenges of using data for secondary purposes that have nothing to do with the need to protect patient privacy.

There are two ways in which we might consider a use of data to be secondary: the first is that the use is not directly connected with the care of the individual patient whose data it is, and the second is that it is a use not of direct concern to the person collecting the data. Here I want to concentrate on the second definition: uses with which the collector of the data is not concerned and of which they may even be ignorant. (There are clearly some secondary uses in the sense of the first definition with which the data collector is very concerned, such as their own research interest.)

There are a number of issues that need to be considered when using data for secondary purposes.

• What were the primary purposes for which the data were collected and how do the requirements of these primary purposes fit with the proposed secondary uses?

• Is there a conflict as to how something is best recorded for the primary and the secondary purposes? If so, the requirements of the primary use should and will prevail.

• How aware is the data recorder of the secondary purpose? Awareness may encourage the recorder to take more care that the data is fit for the secondary purpose, or it may result in a range of gaming activities when they have a motivation to “spin” the results of the secondary use, either to their own benefit or to that of the patient. E.g. blood pressure readings clustering just below the QOF cut-off point.

• How important is accuracy in the recording of data to the recorder? This is a particular issue where users are forced to record data by system design or management pressure. If you have to record something but the accuracy of the record has no direct impact on you, then you may guess or make up data, or just type any old rubbish to get past a mandatory field for which you don’t have valid data. E.g. a GP recording prescribing details will take great care to record the information accurately, as this will be used to produce the prescription and errors would create a serious patient risk, whereas they might be tempted to just guess to complete a mandatory dataset where they don’t see value in recording the data.

• Are definitions shared between the primary and secondary purpose and between different recorders? Have the recorders even been told what assumptions about definitions have been made? Researchers are typically much tighter than frontline data recorders, e.g. some clinicians will record a diagnosis of “asthma” on the basis of limited clinical findings, just because it is probably right, while others will want further confirmation and just record it as “wheezing”.

• System design and configuration can have a profound effect on what and how people record, and on the extent to which they code data. Most work using data from multiple GP systems assumes that data across different systems are directly compatible, when the evidence suggests this is often not the case. Work by Professor Simon de Lusignan based on video observation of many consultations shows a four-fold difference between the major systems in the number of consultations with no coded data and a two-fold difference in the average number of codes used. He also found that the way different systems manage picking lists had a significant effect on the data entered. Secondary uses have to take account of system biases.
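
Two of the issues above lend themselves to simple checks before extracted data is trusted: readings heaped just below a target threshold, and differences between systems in how often nothing is coded at all. The following is a minimal Python sketch of both, not a validated method. The function names, the 5 mmHg band, the threshold of 150 and the system labels "A" and "B" are assumptions made purely for illustration; they are not taken from the QOF rules or from any real GP system.

```python
from collections import defaultdict


def counts_around_threshold(readings, threshold, band=5):
    """Count readings just below and just at/above a threshold.

    Returns (below, at_or_above), where
      below       = readings in [threshold - band, threshold)
      at_or_above = readings in [threshold, threshold + band)
    A large excess just below the threshold is a prompt to look
    closer, not proof of gaming.
    """
    below = sum(1 for r in readings if threshold - band <= r < threshold)
    at_or_above = sum(1 for r in readings if threshold <= r < threshold + band)
    return below, at_or_above


def uncoded_rate_by_system(consultations):
    """Proportion of consultations with no coded data, per system.

    `consultations` is an iterable of (system_name, number_of_codes)
    pairs; this input shape is hypothetical.
    """
    totals = defaultdict(int)
    uncoded = defaultdict(int)
    for system, n_codes in consultations:
        totals[system] += 1
        if n_codes == 0:
            uncoded[system] += 1
    return {system: uncoded[system] / totals[system] for system in totals}


# Made-up systolic readings (mmHg) and an illustrative threshold of 150.
systolic = [138, 142, 146, 148, 148, 149, 149, 149, 150, 152, 160]
print(counts_around_threshold(systolic, threshold=150))

# Made-up consultations for two hypothetical systems.
sample = [("A", 0), ("A", 2), ("B", 0), ("B", 0)]
print(uncoded_rate_by_system(sample))
```

Neither check says why a pattern exists. Rounding habits, genuine treatment to target and system picking lists can all produce similar shapes, so the output is best treated as a reason to ask the recorders rather than as a finding.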

This brings us to van der Lei’s law, coined by the eponymous Dutch health informatician: “Data should not be used other than for the purposes for which it was collected.” While I would not take this extreme position (and I suspect van der Lei said it to emphasise the point, rather than to be taken literally), there are significant challenges in using data where the use is not one that was in the mind of the recorder when they recorded it.

There is a massive growth of interest in health analytics based on data extracted from GP systems. Data quality is adequate for many of these purposes, but it is not as good or as consistent as some secondary users seem to assume. While there can be dangers in telling recorders about the secondary uses to which the data they enter will be put, in most cases these are greatly outweighed by the benefits of making recorders aware of those uses and securing their cooperation in making sure that what they enter is fit for them.

Users of data for secondary purposes, beware.