Synthetic data: pharma’s next big thing?
Faster, better trials, enhanced pharmacoepidemiology and more ambitious cross-border research are just some of the likely use cases for synthetic data
Get ready for a new data technology to begin its journey through the hype cycle: synthetic data.
Synthetic data will account for 60% of all data used in all AI development by 2024, according to Gartner. Facebook is getting in on the action, recently acquiring synthetic data startup AI.Reverie.
Broadly applied, the term refers to ways in which data that is otherwise not available in real-world settings, or not easily available, can be created to develop new insights and solutions to real-world problems.
Synthetic data offers a means of generating the missing or rare data whose absence stymies many an AI product or RWE effort. A financial services AI cannot learn to spot rare fraud if it has little or no data to train on. Enter synthetic data. The same goes for the rare but challenging conditions that self-driving cars must be trained on so they can drive in real-world conditions without killing you.
The applications of synthetic data are enticing in the life sciences too. This year, multiple major computing platforms and a range of healthcare data players, including Aetion, Syntegra, MDClone and Phesi, will launch new synthetic data efforts.
Faster, deeper, richer
Synthetic data’s potential to mimic the characteristics of a real dataset, but with private or sensitive information removed, makes it a good alternative to handling large but sensitive samples of real individual-level patient data.
Minimising, or even avoiding altogether, privacy concerns and compliance constraints could open up new avenues of study. Synthetic data differs from de-identified data in that it is created from scratch rather than being based on individual patient records, and so cannot be de-anonymised.
“It opens up the process that pharma companies follow in doing RWE studies,” says Josh Rubel, MDClone’s Chief Commercial Officer. “It allows them to explore, tweak and target what they are doing. It’s useful for scoping projects, site selection, powering a study or determining if inclusion criteria are realistic before you go through the big steps of securing access to patient data or engaging with ethics committees.”
It should also be of use in improving machine learning/deep learning model accuracy by increasing the training dataset size, he adds.
The potential applications of synthetic data for pharma include:
Control arms - Using synthetic data to generate control arms can benefit both operators and patients, says Craig Lipset, Advisor and Founder at Clinical Innovation Partners. “The ability to decrease the size of the control arm by supplementing it with synthetic data or replacing the need for it is appealing for patients and operators.”
When participating in such a trial, patients are therefore going to receive a higher standard of care, he says. “Synthesising data from EHR data and other sources could be a nice solution. We can fill in data gaps and make sure it is better organised to fit our needs.”
Site selection and recruitment - More rapidly identifying promising patient cohorts for new drug development could help improve trial site selection and recruitment, says Rubel. “Instead of working with aggregators or funding a clinical study, life sciences companies could subscribe to health systems and receive RWD from the source and do it in a way that does not compromise patient privacy but which also gives rich, granular data to more quickly understand what is happening at the market level.”
Hypothesis testing - Synthetic data should also offer a quick and inexpensive way to test an idea, says Jon D. Morrow, MD, Senior Vice President, Medical Affairs and Informatics, MDClone. “It’s useful for understanding what’s happening in a population for hypothesis testing, where you don’t know exactly what questions you want to ask. A synthetic environment enables that to happen. It can offer a fast path to finding something interesting.”
Training AI and machine learning algorithms - It’s possible to use synthetic data sets to train AI and then apply the lessons it has learned in real life, says Morrow. “It’s faster, less expensive [and all] without compromising patient privacy. I can take a synthetic population and use it to train a machine learning model, then I apply it to real patients and get an equally valid conclusion. It allows you to get data from different sources and you develop more robust and intelligent models.”
Pharmacoepidemiology - When working at a cohort or population level it’s simpler to leverage synthetic data to analyse the patient journey and the natural history of disease, and to share this between teams without concerns about identifying individuals, says Aaron Kamauu, Advisor and Managing Director at Ikaika Health. “When we have decoupled the issue of patient privacy and confidentiality, we are able to share and use data more broadly across different entities without worrying about regulations relating to patient data.”
Inter-organisational and cross-border research projects - Because the major barriers to large-scale multi-organisation or multi-population research are concerns about patient data security and an unwillingness among HCPs to share data with pharma, synthetic data opens up new possibilities here, says Morrow. “You likely don’t have to go to an IRB or ethics committee to get access to synthetic data before you do your research and you have access to more data because more partners will be willing to participate.”
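The "train on synthetic, apply to real" idea in the list above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "real" patients are themselves simulated, the single biomarker and its distributions are invented, and the generator (one normal distribution fitted per class) and the nearest-mean classifier are deliberately simplistic stand-ins, not the method of any vendor named in this article.

```python
import random
import statistics

rng = random.Random(42)

def draw_real(n):
    """Simulated stand-in for real patients: (biomarker, has_condition)."""
    rows = []
    for _ in range(n):
        case = rng.random() < 0.5
        rows.append((rng.gauss(7.0 if case else 5.0, 1.0), case))
    return rows

def synthesize(rows, n):
    """Sample brand-new rows from a normal fit to each class of `rows`."""
    out = []
    for label in (True, False):
        values = [x for x, y in rows if y == label]
        mu, sd = statistics.mean(values), statistics.stdev(values)
        out += [(rng.gauss(mu, sd), label) for _ in range(n // 2)]
    return out

def train(rows):
    """Nearest-mean classifier: threshold midway between the class means.

    Assumes cases have the higher biomarker values, as in draw_real().
    """
    case_mean = statistics.mean(x for x, y in rows if y)
    control_mean = statistics.mean(x for x, y in rows if not y)
    return (case_mean + control_mean) / 2

def accuracy(threshold, rows):
    """Share of rows where 'biomarker above threshold' matches the label."""
    return sum((x > threshold) == y for x, y in rows) / len(rows)

real_train = draw_real(2000)
real_test = draw_real(2000)   # held-out real patients
synthetic = synthesize(real_train, 2000)

acc_synth = accuracy(train(synthetic), real_test)   # trained on synthetic
acc_real = accuracy(train(real_train), real_test)   # trained on real
print(f"trained on synthetic: {acc_synth:.3f}, trained on real: {acc_real:.3f}")
```

In this toy setting the two accuracies should land close together, because the generator captures the one feature the model relies on. That is not guaranteed in general: a model can only be as good as the structure the synthetic data preserves.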
Both art and science
While the use cases might be broad in theory, synthetic data is an emerging field. The complex methodologies for creating synthetic data sets are still under development and different ‘flavours’ of synthetic data can be generated depending on the method.
Great care is needed here to make sure a synthetic population preserves the properties of the real one where it’s needed, including the inter-relationships between variables seen and unseen. “That is where the art and science of synthetic data lie,” says Morrow. “If you do it wrong you get a population that does not match the individual or you can identify them. If you do it right it mirrors the population without revealing the identity of individual patients.”
As well as the need for the industry to develop robust methods for creating synthetic data, regulators also need to become comfortable understanding how synthetic data sets are created, along with their practical use and reliability. “Today there is no regulatory body that accepts just synthetic data as part of an RWE submission and no consensus,” says Rubel.
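What "preserving the inter-relationships between variables" means can be shown with a minimal sketch. It assumes a made-up cohort with two variables (age and systolic blood pressure), simulated here as a stand-in for real records, and uses a simple bivariate normal fit as the generator. Real synthetic data tools use far more elaborate methods; this only illustrates the check one might run: does the synthetic population show the same correlation as the source, without copying any source record?

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def fit(records):
    """Estimate means, standard deviations and correlation of (x, y) pairs."""
    xs, ys = zip(*records)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return mx, my, sx, sy, pearson(xs, ys)

def sample(params, n, rng):
    """Draw n new (x, y) pairs from a bivariate normal with these parameters."""
    mx, my, sx, sy, r = params
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mixing z1 into y reproduces the fitted correlation r.
        y = my + sy * (r * z1 + math.sqrt(1 - r * r) * z2)
        out.append((x, y))
    return out

rng = random.Random(0)

# Hypothetical "real" cohort, simulated for illustration: (age, systolic BP).
real = []
for _ in range(2000):
    age = rng.gauss(55, 12)
    real.append((age, 90 + 0.6 * age + rng.gauss(0, 10)))

params = fit(real)
synthetic = sample(params, 2000, rng)

real_r = params[4]
synth_r = pearson(*zip(*synthetic))
copied = set(real) & set(synthetic)   # records reproduced verbatim
print(f"real r = {real_r:.3f}, synthetic r = {synth_r:.3f}, copied = {len(copied)}")
```

A generator this simple keeps only the relationships it was told to model. Any dependency it does not capture, the "unseen" variables in Morrow's phrase, is lost, which is why method choice matters so much.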
Lipset agrees that both are pre-requisites for its broad application. “We will need confidence building with the regulator around the algorithms that build it and pharma need to build their own confidence before turning to regulators and expecting them to buy into this data.”
Key to all this is trusted real-world data from which useful synthetic data can be generated, says Kamauu. “To generate reliable synthetic data you need reliable real-world data. If there is messiness in real data sets it is hard to overcome this when generating synthetic data.”
The way forward
Given the nascent status of the discipline, the timeframe over which pharma and health providers can expect to use synthetic data in regulatory submissions is likely to be around a decade, says Lipset. “We are going to see three years’ experimentation and three more years of evidence building and confidence building.”
Before it is ready for broad use, synthetic data can be adopted iteratively, one implementation and disease area at a time, by being trialled and piloted alongside research and development projects and then assessed for its maturity and usefulness. This carries little risk, because the pilot need not be tied to the success of the trial, study or project it runs alongside and so cannot cause it to fail.
“It’s not all or nothing,” says Lipset. “We can start to experiment creating synthetic control arms as third arms almost. The more we do that, the more we can generate confidence.”
But it is already finding early use cases. A current example is the National COVID Cohort Collaborative, or N3C, initiative, which has generated some synthetic data to help researchers study the disease’s potential risk factors, protective factors and long-term health consequences.
Another near-term use is expanding the number of Health Economics and Outcomes Research (HEOR) and pharmacoepidemiology studies that pharma can do, says Kamauu. “These are top of my mind in terms of potential quick wins.”
The scope synthetic data offers to re-imagine how research is done by freeing data and speeding research times should provoke thought and exploration among all pharma companies, adds Kamauu.
“There’s a chance for everyone to engage and evaluate as to whether they have been handcuffed in the past from asking certain questions because of time delay or privacy concerns. It enables you to reconsider assumptions about how you can use real-world evidence. You might be doing 50 projects but want to be doing 100. If the privacy infrastructure that exists today could be removed from the equation, what could be done? The answer is, a lot more.”