
Synthetic Clinical Data Generation for Privacy-Sensitive Applications
Using LLMs to create annotated training data for de-identification systems
This research leverages LLMs to generate synthetic clinical datasets with pre-annotated personally identifiable information, addressing the data scarcity challenge in privacy-sensitive domains.
- Domain-adapted LLMs create realistic clinical texts with embedded PHI markers
- Machine-generated annotations remove the need for manual labeling of sensitive information (see the parsing sketch after this list)
- Synthetic corpora enable training of de-identification systems without the privacy risks of handling real records (a second sketch below shows the resulting training-data format)
- A practical solution for medical institutions that need high-quality training data while maintaining compliance
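As a rough illustration of how embedded PHI markers become machine-readable annotations, the sketch below assumes the LLM is prompted to wrap each PHI mention in inline tags such as `[NAME]…[/NAME]`. The marker format, the label names, and the function `parse_annotated_text` are illustrative assumptions, not the paper's exact scheme; the point is only that tags emitted alongside the text can be stripped into character-offset span annotations.

```python
import re

# Assumed inline marker format the generator is prompted to emit, e.g.:
#   "Patient [NAME]Jane Doe[/NAME] was admitted on [DATE]2024-03-01[/DATE]."
# The tag names (NAME, DATE, ID, HOSPITAL, ...) stand in for whatever PHI
# categories the de-identification task uses.
TAG_RE = re.compile(r"\[(?P<label>[A-Z_]+)\](?P<text>.*?)\[/(?P=label)\]", re.DOTALL)

def parse_annotated_text(generated: str):
    """Strip inline PHI markers and return (clean_text, spans).

    Each span is a (start, end, label) character-offset triple into the
    cleaned text, ready to be converted into token-level labels.
    """
    clean_parts, spans = [], []
    cursor, out_len = 0, 0
    for match in TAG_RE.finditer(generated):
        # Copy the text before the marker unchanged.
        before = generated[cursor:match.start()]
        clean_parts.append(before)
        out_len += len(before)
        # Record the entity span at its position in the cleaned text.
        entity = match.group("text")
        spans.append((out_len, out_len + len(entity), match.group("label")))
        clean_parts.append(entity)
        out_len += len(entity)
        cursor = match.end()
    clean_parts.append(generated[cursor:])
    return "".join(clean_parts), spans

if __name__ == "__main__":
    sample = ("Patient [NAME]Jane Doe[/NAME] (MRN [ID]1234567[/ID]) was seen on "
              "[DATE]2024-03-01[/DATE] at [HOSPITAL]St. Mary's Clinic[/HOSPITAL].")
    text, spans = parse_annotated_text(sample)
    print(text)
    for start, end, label in spans:
        print(f"{label}: {text[start:end]!r}")
```

Because the tags are produced together with the text, no human ever has to read or annotate the synthetic notes; downstream tooling only sees the cleaned text plus its span annotations.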
For healthcare organizations, this approach offers a pathway to developing robust de-identification systems without exposing real patient data, potentially accelerating AI adoption in clinical settings while maintaining regulatory compliance.
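To connect the annotations to model training: de-identification systems are commonly trained as token classifiers over BIO labels. The sketch below, again an illustrative assumption rather than the paper's exact pipeline, converts character-offset spans (such as those produced by the parsing sketch above) into whitespace-token BIO tags; the helper `spans_to_bio` and the whitespace tokenization are hypothetical choices.

```python
import re

def spans_to_bio(text: str, spans):
    """Convert (start, end, label) character spans into token-level BIO tags.

    Tokenization here is plain whitespace splitting; a real pipeline would
    use the tokenizer of whatever de-identification model is being trained.
    """
    tokens, labels = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        label = "O"
        for start, end, ent_label in spans:
            # Any overlap between the token and an entity span counts as inside.
            if tok_start < end and tok_end > start:
                label = ("B-" if tok_start <= start else "I-") + ent_label
                break
        tokens.append(match.group())
        labels.append(label)
    return tokens, labels

if __name__ == "__main__":
    # Spans in the format produced by the parsing sketch above.
    text = "Patient Jane Doe was seen on 2024-03-01 at the clinic."
    spans = [(8, 16, "NAME"), (29, 39, "DATE")]
    for token, tag in zip(*spans_to_bio(text, spans)):
        print(f"{token}\t{tag}")
```

From here, the tokens and tags can feed any standard sequence-labeling setup, for example a transformer token-classification head, trained entirely on synthetic notes.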
Source paper: Data-Constrained Synthesis of Training Data for De-Identification