UC Davis Researchers Receive $1.2 Million NIH Grant to Study Synthetic Data Use in Healthcare

By Neelanjana Gautam

Three researchers from UC Davis have been awarded a four-year grant totaling $1.2 million from the National Institutes of Health (NIH) to generate high-quality synthetic data using artificial intelligence and machine learning (AI/ML) that may help physicians predict, diagnose and treat diseases.

The interdisciplinary research team comprises principal investigator (PI) Thomas Strohmer, director of the Center for Data Science and Artificial Intelligence Research (CeDAR) and professor in the Department of Mathematics, and two UC Davis Health PIs: Rachael Callcut, professor of surgery and chief research informatics officer, and Jason Adams, associate professor and physician of Pulmonary, Critical Care and Sleep.

Sharing healthcare data is crucial for understanding patterns and trajectories in diseases, which in turn supports the development of personalized medicines and treatments. However, healthcare privacy laws are a barrier to sharing annotated data for analysis. Strohmer believes that while privacy mechanisms are important and can protect patients from unintended disclosure of sensitive information about their health, access to the data is equally critical for research. For Strohmer and his team, the challenge is therefore to balance privacy concerns with data access, and to answer the overarching question: How can privacy-preserving machine learning techniques be developed to make the data accessible for analytics?

Synthetic data is essentially data generated by a computer program using real-world data as a model. Strohmer was inspired to investigate synthetic data when Nick Anderson, associate professor at UC Davis Health, gave a talk at a CeDAR event on the possibilities of synthetic data creation. CeDAR, one of four IMPACT Centers from the Office of Research, provided Strohmer with a platform for collaboration and visibility, and helped him connect with people interested in machine learning technologies. “It truly started with the coming together of different faculty interested in machine learning and data science from different angles,” said Strohmer.

To address the tension between privacy and access, the researchers are turning to synthetic data. Synthetic data can be generated from real-world data in a way that preserves the statistical properties of the original data, but without the risk of exposing sensitive information or violating privacy rules. The original data is typically multimodal, drawing on sources such as images, video, text and speech. The machine learning techniques must be able to analyze the different modalities and combine them in a privacy-preserving way to generate the synthetic data.

Strohmer explains that for medical records, one may first want to preserve the one-dimensional marginals (for example, the number of people who smoke, or the number of people who have diabetes) and then extend to two- or three-dimensional marginals, such as how many people who smoke also have diabetes, or how many people who smoke also have both diabetes and COVID-19. Strohmer warned that this approach has its own pitfalls and may break privacy rules when the questions become too detailed. “The goal, therefore, is to define privacy in a rigorous, mathematical way, known as differential privacy in the literature, and design privacy-preserving machine learning techniques that will not break even when additional information becomes available,” said Strohmer.
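To make the idea of marginals concrete, here is a minimal Python sketch. It is not the team's method. It uses the textbook Laplace mechanism from the differential privacy literature as a stand-in, and the records, the attribute names (`smoker`, `diabetes`, `covid`) and the function names are all invented for illustration. It assumes binary attributes and that two datasets count as "neighbors" when they differ by adding or removing one patient record, in which case each record changes exactly one cell of a marginal table by one.

```python
import random
from collections import Counter
from itertools import product


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def marginal_counts(records, attrs):
    """Exact k-way marginal: counts over the joint values of `attrs`."""
    return Counter(tuple(r[a] for a in attrs) for r in records)


def noisy_marginal(records, attrs, epsilon, rng):
    """Laplace mechanism applied to one marginal table.

    Assumes binary attributes and add/remove-one-record neighbors,
    so the table has L1 sensitivity 1 and the noise scale is 1/epsilon.
    Noise is added to every cell, including empty ones.
    """
    exact = marginal_counts(records, attrs)
    scale = 1.0 / epsilon
    return {
        cell: exact.get(cell, 0) + laplace_noise(scale, rng)
        for cell in product((0, 1), repeat=len(attrs))
    }


# Toy, made-up records (hypothetical attributes, not real patient data).
records = [
    {"smoker": 1, "diabetes": 1, "covid": 0},
    {"smoker": 1, "diabetes": 0, "covid": 0},
    {"smoker": 0, "diabetes": 1, "covid": 1},
    {"smoker": 0, "diabetes": 0, "covid": 0},
    {"smoker": 1, "diabetes": 1, "covid": 1},
    {"smoker": 0, "diabetes": 0, "covid": 1},
]

rng = random.Random(0)

# One-dimensional marginal: how many people smoke.
smokers = marginal_counts(records, ("smoker",))[(1,)]

# Two-dimensional marginal: how many smokers also have diabetes.
smokers_with_diabetes = marginal_counts(records, ("smoker", "diabetes"))[(1, 1)]

# A privacy-protected version of the two-dimensional table.
protected = noisy_marginal(records, ("smoker", "diabetes"), epsilon=1.0, rng=rng)
```

In this toy table the three-way cell for smokers with both diabetes and COVID-19 contains a single person, which illustrates why very detailed questions are risky: the finer the marginal, the closer each count comes to describing one individual. Under differential privacy, every additional noisy table released also draws on the overall privacy budget, so the number and detail of the questions must be limited.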

Extending the generation of multimodal synthetic data into a clinical domain

The team is using acute respiratory distress syndrome (ARDS), a high-risk condition, as a model to test their methods. “About one out of every 10 intensive care unit (ICU) patients, and one out of every four mechanically ventilated patients in the ICU has ARDS,” said Adams.

The physicians chose ARDS because evidence-based, life-saving treatments exist for it, and those treatments can be effective if the condition is diagnosed in time. “The other advantage to using ARDS as a model is that the data that classifies ARDS is multimodal in nature, and so it can be used to test the robustness of the machine learning algorithms,” said Adams.

Both Callcut and Adams have research expertise in clinical outcomes of patients in ICUs. Callcut has worked on all aspects of ARDS detection, treatment and management for over a decade. One of her goals as the chief of the research division is to bring teams together to work on advanced analytics and machine learning.

Adams explains that because ARDS patients in ICUs are extremely sick and their condition is volatile, they tend to be routinely monitored through numerous channels. “As a result, a huge amount of multimodal health data is collected from ICU patients, much more than from typical hospitalized or outpatient clinic patients. Therefore, the ICU presents an ideal opportunity to precisely describe the clinical state of a patient, and then use the data to develop predictive algorithms that can do the same,” said Adams.

Callcut’s role has been to create the clinical use cases to help develop the data sources for the team to utilize. Her training as a data scientist helps her understand computational approaches. “At our lab, we are looking at a panel of almost 40 different markers on patients to try to understand how those pathways are interacting with one another, and our real goal is to try to identify those patients early,” said Callcut. “We can then create novel therapies and interventions that can potentially abate the development or severity of ARDS, and that’s why AI/ML algorithms will be so important in this field.”

In addition to the data that Adams and his group have collected, Callcut has a diverse set of data from patients, ventilators and monitors. One of the compelling aspects of this research is that the team will also examine how well the synthetic data perform compared with real data, in order to understand their efficacy in clinical environments.

“Part of the benefit and the excitement about this particular partnership between the clinical and analytical sides of our university is the opportunity to develop synthetic datasets that reflect the complexity, but also provide a high fidelity, which is what’s required to get useful machine learning algorithms when they go into the clinical environment,” said Callcut.

About CeDAR

Launched in 2019, the Center for Data Science and Artificial Intelligence Research (CeDAR) facilitates breakthroughs in research and innovation to address societal challenges by advancing data science foundations, methods and applications in a concerted effort. As a part of UC Davis, a top research university, CeDAR has the opportunity and resources to bring together world-renowned experts from many fields of study with top data science and artificial intelligence researchers.