Synthetic Clinical Data Generation for Privacy-Sensitive Applications

Using LLMs to create annotated training data for de-identification systems

This research uses LLMs to generate synthetic clinical datasets with pre-annotated personally identifiable information, addressing the scarcity of labeled data in privacy-sensitive domains.

  • Domain-adapted LLMs generate realistic clinical texts with embedded PHI markers
  • Machine-generated tags remove the need to manually label sensitive information
  • Synthetic corpora allow de-identification systems to be trained without exposing real patient records
  • A practical option for medical institutions that need quality training data while remaining compliant
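
To make the idea of embedded PHI markers concrete, here is a minimal sketch of how machine-generated inline tags could be turned into span annotations for training a de-identification model. The tag format (`<NAME>...</NAME>`), the label names, and the `extract_spans` helper are assumptions chosen for illustration; the summary does not specify the markup the generated texts actually use.

```python
import re

# Hypothetical inline tag format: <LABEL>surface text</LABEL>.
# The real markup produced by the LLM may differ.
TAG_PATTERN = re.compile(r"<([A-Z_]+)>(.*?)</\1>", re.DOTALL)


def extract_spans(tagged_text):
    """Strip inline tags; return (plain_text, [(start, end, label), ...])."""
    plain_parts = []
    spans = []
    cursor = 0  # read position in the tagged input
    offset = 0  # length of the plain text built so far
    for match in TAG_PATTERN.finditer(tagged_text):
        # Copy the untagged text that precedes this marker.
        before = tagged_text[cursor:match.start()]
        plain_parts.append(before)
        offset += len(before)
        # Record the span in plain-text character offsets.
        label, surface = match.group(1), match.group(2)
        spans.append((offset, offset + len(surface), label))
        plain_parts.append(surface)
        offset += len(surface)
        cursor = match.end()
    # Copy whatever follows the last marker.
    plain_parts.append(tagged_text[cursor:])
    return "".join(plain_parts), spans


# Invented example sentence, not taken from any dataset.
example = "Patient <NAME>Jane Doe</NAME> was admitted on <DATE>3 May</DATE>."
text, spans = extract_spans(example)
for start, end, label in spans:
    print(label, repr(text[start:end]))
```

Because the offsets refer to the tag-free text, the resulting spans can be fed to a standard sequence-labeling pipeline without a separate human annotation pass. The sketch does not handle nested or malformed tags.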

For healthcare organizations, this approach offers a pathway to develop robust de-identification systems without exposing real patient data, potentially accelerating AI adoption in clinical settings while maintaining regulatory compliance.

Data-Constrained Synthesis of Training Data for De-Identification
