43.2 Synthetic Data

Synthetic data, which mimics real data while aiming to protect anonymity, is becoming an essential tool in research. By generating artificial datasets that retain key statistical properties of the original data, researchers can better protect privacy, improve data accessibility, and facilitate replication. However, synthetic data also introduces complexities and should be used with caution.

43.2.1 Benefits of Synthetic Data

Privacy Preservation
- Protects sensitive or proprietary information while enabling research collaboration.
Data Fairness and Augmentation
- Helps mitigate biases by generating more balanced datasets.
- Can supplement real data when sample sizes are limited.
Acceleration in Research
- Allows for data sharing in environments where access to real data is restricted.
- Enables large-scale simulations with fewer legal or ethical constraints than working with real data would impose.

43.2.2 Concerns and Limitations

Misconceptions About Privacy
- Synthetic data does not guarantee absolute privacy: re-identification risks remain if synthetic records too closely resemble individual records in the real dataset.
Challenges with Data Outliers
- Rare but important data points may be poorly represented or excluded.
Risks of Solely Relying on Synthetic Data
- Models trained exclusively on synthetic data may lack generalizability.
- Differences between real and synthetic distributions can introduce biases.
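One common way to screen for the re-identification concern above is to measure how close each synthetic record lies to its nearest real record. The following is a minimal sketch, not a complete privacy audit: it assumes purely numeric features on comparable scales, and the function names `distance_to_closest_record` and `flag_near_copies`, the toy data, and the threshold are all hypothetical choices made for illustration.

```python
import math

def distance_to_closest_record(synthetic, real):
    """For each synthetic row, return the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]

def flag_near_copies(synthetic, real, threshold):
    """Return indices of synthetic rows that lie within `threshold` of some real row."""
    distances = distance_to_closest_record(synthetic, real)
    return [i for i, d in enumerate(distances) if d <= threshold]

# Toy data (illustrative only): two numeric features per record.
real = [(34.0, 52.0), (45.0, 61.0), (29.0, 48.0)]
synthetic = [(34.0, 52.0), (40.0, 57.5), (60.0, 80.0)]

# The first synthetic row is an exact copy of a real row, so it is flagged.
print(flag_near_copies(synthetic, real, threshold=0.5))  # prints [0]
```

A flagged row is not proof of a privacy breach, and an unflagged dataset is not proof of safety; the check only highlights synthetic records that deserve a closer look before release.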

43.2.3 Further Insights on Synthetic Data

Synthetic data acts as a bridge between model-centric and data-centric perspectives, making it a vital tool in modern research. An analogy can be drawn to viewing a replica of the Mona Lisa—the essence remains, but the original is securely stored.

For a deeper treatment of synthetic data and its applications, see Jordon et al. (2022).

43.2.4 Generating Synthetic Data

When generating synthetic data, the approach depends on whether researchers have full access to the original dataset or are working under restricted conditions.

43.2.4.1 When You Have Access to the Original Dataset

If researchers can directly use the dataset, various techniques can be employed to generate synthetic data while preserving the statistical properties of the original:

Statistical Approaches
- Parametric models (e.g., Gaussian Mixture Models)
  - Fit statistical distributions to real data and sample synthetic observations.
Machine Learning-Based Methods
- Variational Autoencoders (VAEs) – Learn a compressed latent representation of the data and decode samples from it into synthetic records.
- Generative Adversarial Networks (GANs) – Train a generator against a discriminator; effective for high-dimensional data such as images, with variants for tabular and text data.
- CTGAN (Conditional Tabular GAN) – Specifically designed for structured, tabular datasets; handles categorical columns and imbalanced categories.
Differential Privacy Techniques
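- Noise Addition – Adds calibrated random noise (for example, to released statistics or during model training) so that any single record has only a bounded influence on the output, while keeping the overall statistical structure approximately intact.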
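Two of the approaches listed above can be sketched with the Python standard library alone. The block below fits a parametric model to one numeric column and samples synthetic values from it, then releases a mean with Laplace noise. It is a minimal sketch under stated assumptions: a single Gaussian stands in for the Gaussian Mixture Model mentioned above, the function names `fit_and_sample` and `laplace_noisy_mean` and the toy data are hypothetical, and the noisy mean assumes that values are clipped to known bounds and that the sample size is public.

```python
import random
from statistics import NormalDist

def fit_and_sample(real_values, n, seed=0):
    """Fit a single Gaussian to a numeric column and draw n synthetic values."""
    model = NormalDist.from_samples(real_values)  # estimates mean and std. dev.
    return model.samples(n, seed=seed)

def laplace_noisy_mean(values, lower, upper, epsilon, seed=0):
    """Release the mean of clipped values with Laplace noise of scale sensitivity/epsilon."""
    rng = random.Random(seed)
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Replacing one record moves the clipped mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponential draws follows Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_mean + noise

# Toy column (illustrative only).
real_ages = [23, 35, 31, 48, 52, 29, 41, 37, 44, 26]

synthetic_ages = fit_and_sample(real_ages, n=5)
private_mean = laplace_noisy_mean(real_ages, lower=18, upper=90, epsilon=1.0)
print(synthetic_ages)
print(private_mean)
```

Smaller values of `epsilon` add more noise and give stronger protection at the cost of accuracy. Note that sampling from a fitted model does not by itself provide a differential privacy guarantee; only the noisy mean in this sketch is calibrated that way.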

43.2.4.2 When You Have a Restricted Dataset

In cases where data cannot be exported due to security, privacy, or proprietary constraints, researchers must rely on alternative strategies to generate synthetic data:

Summarization and Approximation
- Extract summary statistics (e.g., means, variances, correlations) to approximate the dataset’s structure.
- If permitted, share aggregated or anonymized data instead of raw observations.
Server-Based Computation
- Run analyses on the secure server so that raw data never leaves it; only synthetic outputs generated on that system are exported.
Synthetic Data Generation with Preserved Properties
- Use models trained on the secure dataset to produce synthetic data without directly copying real observations.
- Ensure that key statistical relationships are maintained, even if individual values differ.
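The summary-statistics route above can be illustrated with a small sketch: if only means, standard deviations, and a correlation may leave the secure environment, synthetic rows can be drawn from a distribution that reproduces them. The example assumes two numeric variables that are reasonably described by a bivariate normal distribution; the function names `sample_bivariate_normal` and `pearson` and the summary values are hypothetical and chosen only for illustration.

```python
import math
import random

def sample_bivariate_normal(mean_x, mean_y, sd_x, sd_y, corr, n, seed=0):
    """Draw n (x, y) rows whose means, std. devs., and correlation match the inputs in expectation."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Cholesky construction: y shares the component z1 with x to induce correlation.
        x = mean_x + sd_x * z1
        y = mean_y + sd_y * (corr * z1 + math.sqrt(1.0 - corr ** 2) * z2)
        rows.append((x, y))
    return rows

def pearson(xs, ys):
    """Sample Pearson correlation, used here to check the synthetic output."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Summary statistics as they might be exported from a secure server (made-up values).
rows = sample_bivariate_normal(mean_x=50.0, mean_y=100.0, sd_x=10.0, sd_y=20.0,
                               corr=0.6, n=1000)
xs = [r[0] for r in rows]
ys = [r[1] for r in rows]
print(round(pearson(xs, ys), 2))  # should land close to the requested 0.6
```

Because the synthetic rows are built from aggregates alone, no individual value is copied from the real data. Anything not captured by the exported statistics, such as skewness, outliers, or nonlinear relationships, is lost.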

References

Jordon, James, Lukasz Szpruch, Florimond Houssiau, Mirko Bottarelli, Giovanni Cherubin, Carsten Maple, Samuel N Cohen, and Adrian Weller. 2022. “Synthetic Data–What, Why and How?” arXiv Preprint arXiv:2205.03257.