Challenge
Access to German health insurance billing data is heavily restricted by data protection regulations. In a joint project with a German health insurance provider, we examined possible solutions to the following problem: How can we ensure that an AI model trained on sensitive data never allows inferences about real patients?
Approach
We followed a multi-step process: First, privacy-preserving generative models were trained on health insurance data to capture statistical patterns without memorizing individual records. These models then generated fully synthetic datasets that contain no real patient information and can be produced at scale. Finally, the resulting datasets can be used safely to train downstream AI models, without the risk of exposing real patients.
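The pipeline above can be illustrated with a minimal, self-contained sketch. It does not reflect the actual models or data used in the project: the "billing data" is a toy two-column sample, and the "privacy-preserving" step is a simplified stand-in (noise added to the fitted sufficient statistics) rather than a formally differentially private mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for sensitive billing data (hypothetical columns: age, annual cost).
real = rng.multivariate_normal(
    [45.0, 3000.0], [[120.0, 800.0], [800.0, 4.0e5]], size=2000
)

def noisy_gaussian_fit(data, noise_scale=0.05):
    """Fit a Gaussian generative model, perturbing the sufficient statistics
    so that no single record is reproduced exactly (a simplified stand-in
    for a formal privacy mechanism, not one itself)."""
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)
    mean_noisy = mean + rng.normal(0.0, noise_scale * np.abs(mean))
    cov_noisy = cov + rng.normal(0.0, noise_scale * np.abs(cov))
    # Symmetrize and project back to a positive semi-definite matrix.
    cov_noisy = (cov_noisy + cov_noisy.T) / 2.0
    w, v = np.linalg.eigh(cov_noisy)
    cov_noisy = v @ np.diag(np.clip(w, 1e-9, None)) @ v.T
    return mean_noisy, cov_noisy

# Step 1: train the generative model on the sensitive data.
mean_s, cov_s = noisy_gaussian_fit(real)

# Step 2: sample a fully synthetic dataset of arbitrary size.
synthetic = rng.multivariate_normal(mean_s, cov_s, size=2000)
```

The synthetic sample preserves the coarse statistics of the real data while every individual record is freshly drawn from the perturbed model; a downstream model (step 3) would be trained on `synthetic` only.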
Outcome
A primary objective of the project was to develop an evaluation framework capable of assessing the privacy and utility of the synthetic data. Rigorous tests confirmed that no synthetic record corresponds to or enables re-identification of any real patient. While the synthetic data retains high analytical value, some degree of information loss is inherent and expected. Altogether, the project demonstrated how strictly regulated health data can be made accessible for AI training.
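One common building block of such privacy evaluations, sketched below under toy assumptions, is the distance-to-closest-record (DCR) heuristic: synthetic records should not sit systematically closer to real records than real records sit to each other, which would hint at memorization. This is an illustrative check only, not the project's actual evaluation framework; the `real` and `synthetic` arrays are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: 'real' would be the insurer's data,
# 'synthetic' the generated dataset; here both are toy samples.
real = rng.normal([50.0, 2500.0], [10.0, 600.0], size=(1000, 2))
synthetic = rng.normal([50.0, 2500.0], [10.0, 600.0], size=(1000, 2))

def nearest_distances(a, b):
    """Standardized distance from each row of a to its nearest row in b."""
    scale = b.std(axis=0)
    d = np.linalg.norm((a[:, None, :] - b[None, :, :]) / scale, axis=-1)
    return d.min(axis=1)

def real_to_real_nn(x):
    """Nearest-neighbor distance within x, excluding each record itself."""
    scale = x.std(axis=0)
    d = np.linalg.norm((x[:, None, :] - x[None, :, :]) / scale, axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1)

# Privacy check: synthetic-to-real DCR should not collapse below the
# real-to-real baseline (0.5 is an illustrative threshold, not a standard).
dcr_syn = nearest_distances(synthetic, real)
baseline = real_to_real_nn(real)
privacy_ok = np.median(dcr_syn) >= 0.5 * np.median(baseline)

# Utility check: key statistics should survive, with some loss expected.
corr_gap = abs(np.corrcoef(real.T)[0, 1] - np.corrcoef(synthetic.T)[0, 1])
```

A real framework would combine many such metrics (membership-inference tests, attribute-disclosure risk, downstream-task accuracy), but the DCR comparison captures the core intuition: privacy is assessed relative to how close real records naturally are to one another.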
Contact
Dr. Anika Hannemann, ZHAW, Institute for Data Science, Winterthur, anika.hannemann@scrai.ch