Fair AI Data

Synthetic data for Open Science

Synthetic data play an important role in Open Science: they help scientists address pressing
questions when real data are insufficient or missing entirely. They can counteract known
biases in real data, and they can be safe to use and share in sensitive domains. However, for
synthetic data to be reproducible, reliable and transparent, they need to be labelled and
described in specific ways.


This page is a resource for people who want to make their synthetic data open and valuable in a responsible way. It provides suggestions and guidelines (developed in collaboration with the Swedish National Data Service), further readings that help you delve into the complexities of synthetic data, and a toolkit for testing your synthetic data and labelling the innate artefacts of its production. Labelling these artefacts is what makes synthetic data valuable; left undocumented, they can lead to model collapse and misrepresentative hallucinations.


The right tool for the job

The kind of AI you use to generate your synthetic data will affect how well it represents
your original data, and how much you can control where in your dataset the differences
occur. How you make the data matters, and it matters that you document it (a minimal
provenance sketch follows the reading list below). Here are some good overviews you might
find useful:

  • Chen W, Yang K, Yu Z, et al. (2024) A survey on imbalanced learning: latest research, applications and future directions. Artificial Intelligence Review 57(6): 137.
  • Rajabi A and Garibay OO (2022) TabFairGAN: Fair Tabular Data Generation with Generative Adversarial Networks. Machine Learning and Knowledge Extraction 4(2): 488–501.
  • Miceli M, Posada J and Yang T (2021) Studying Up Machine Learning Data. arXiv:2109.08131.
  • Jacobsen BN (2023) Machine learning and the politics of synthetic data. Big Data & Society 10(1).
  • Offenhuber D (2024) Shapes and Frictions of Synthetic Data. Big Data & Society 11(2). https://doi.org/10.1177/20539517241249390.
  • Shumailov I, Shumaylov Z, Zhao Y, et al. (2024) AI models collapse when trained on recursively generated data. Nature 631: 755–759.
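
Whichever generator you choose, documenting how the data was made is what lets others judge and reproduce it. Below is a minimal sketch, in Python, of the kind of provenance record we have in mind; the field names (generator, source_data, known_limitations and so on) are our own suggestions, not an established metadata standard, and the values are placeholders.

    import json
    from datetime import datetime, timezone

    # A minimal provenance record to publish alongside a synthetic dataset.
    # Field names and values are illustrative suggestions, not a standard.
    provenance = {
        "dataset_name": "my_synthetic_table_v1",           # placeholder name
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "generator": {
            "method": "CTGAN",                # e.g. GAN, copula, Bayesian network
            "software": "sdv 1.x",            # library and version actually used
            "random_seed": 42,
            "training_epochs": 300,
        },
        "source_data": {
            "description": "register extract, 2010-2020",  # describe, never include
            "n_rows": 50000,
            "n_columns": 18,
        },
        "known_limitations": [
            "Rare category combinations may be under-represented.",
            "Not evaluated for re-identification risk.",
        ],
    }

    # Ship this file with the synthetic data and point to it from the README.
    with open("synthetic_data_provenance.json", "w") as f:
        json.dump(provenance, f, indent=2)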


Normals and edge-cases

When you use certain AI methods, there can be a tendency to amplify the ‘normal’ in your
data and to minimize edge cases (a quick check for this is sketched after the list below).
Here are some interesting articles about this:


  • Mehrabi N, et al. (2022) A Survey on Bias and Fairness in Machine Learning. ACM Computing Surveys 54(6): 1–35.
  • Grace-Martin K (2008) Outliers: To Drop or Not to Drop. The Analysis Factor. Available at: https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/ (accessed 17 March 2023).
  • Bhanot K, et al. (2021) The Problem of Fairness in Synthetic Healthcare Data. Entropy 23(9): 1165.
  • Axelrod B, Garg S, Sharan V and Valiant G (2020) Sample amplification: Increasing dataset size even when learning is impossible. International Conference on Machine Learning, pp. 442–451. PMLR.
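
A quick way to see whether a generator has amplified the ‘normal’ at the expense of edge cases is to compare how often rare categories occur in the real and the synthetic table. The sketch below uses pandas and assumes both tables share the same column names; the 1% rarity threshold and the example column name are arbitrary choices for illustration, not a recommendation.

    import pandas as pd

    def rare_category_shrinkage(real: pd.DataFrame,
                                synthetic: pd.DataFrame,
                                column: str,
                                rarity_threshold: float = 0.01) -> pd.DataFrame:
        """Compare the frequency of rare categories in one column.

        Categories below `rarity_threshold` in the real data are treated as
        edge cases; the ratio shows whether the generator shrank or dropped them.
        """
        real_freq = real[column].value_counts(normalize=True)
        synth_freq = synthetic[column].value_counts(normalize=True)

        rare = real_freq[real_freq < rarity_threshold]
        report = pd.DataFrame({
            "real_freq": rare,
            "synthetic_freq": synth_freq.reindex(rare.index).fillna(0.0),
        })
        report["ratio"] = report["synthetic_freq"] / report["real_freq"]
        return report.sort_values("ratio")

    # Ratios well below 1 mean the generator is minimizing that edge case;
    # 0 means it has vanished entirely. The column name is a placeholder:
    # print(rare_category_shrinkage(real_df, synthetic_df, "diagnosis_code"))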




Intersectional hallucinations and fidelities

There is more to ensuring representative synthetic structured data than worrying about
edge-case representation. Tabular, structured data is interesting and useful because it tells
us something about the relations between different attributes in the data. Making sure
these relations are preserved in your synthetic data keeps it useful, both for ML models
and for other purposes, and labelling how well they are preserved is important when
providing data for open science (a simple comparison is sketched after the reading list).
Here are some interesting reflections on intersectional and inter-attribute aspects of
synthetic data:


  • Johnson E and Hajisharif S (2024) The intersectional hallucinations of synthetic data. AI & Society. https://doi.org/10.1007/s00146-024-02017-8.
  • Lee, Hajisharif and Johnson (2025) The ontological politics of synthetic data: normalities, outliers, and intersectional hallucinations. Big Data & Society.
  • Varley T and Kaminski P (2021) Intersectional Synergies: Untangling Irreducible Effects of Intersecting Identities via Information Decomposition. PNAS, 26 October 2021.
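
As a rough starting point, before or alongside a dedicated tool, you can compare which combinations of attribute values exist in the real and synthetic tables. The sketch below treats combinations that appear only in the synthetic data as candidate intersectional hallucinations and combinations that appear only in the real data as lost intersections; the column names in the example are placeholders, and counting distinct combinations says nothing yet about how well their frequencies match.

    import pandas as pd

    def intersection_check(real: pd.DataFrame,
                           synthetic: pd.DataFrame,
                           columns: list) -> dict:
        """Compare joint value combinations across the chosen columns."""
        real_combos = set(map(tuple, real[columns].drop_duplicates().to_numpy()))
        synth_combos = set(map(tuple, synthetic[columns].drop_duplicates().to_numpy()))

        return {
            # Present only in the synthetic data: candidate hallucinations.
            "hallucinated": synth_combos - real_combos,
            # Present only in the real data: intersections the generator lost.
            "lost": real_combos - synth_combos,
            "shared": real_combos & synth_combos,
        }

    # Example with placeholder column names:
    # result = intersection_check(real_df, synthetic_df, ["sex", "age_group", "occupation"])
    # print(len(result["hallucinated"]), "combinations exist only in the synthetic data")
    # print(len(result["lost"]), "combinations from the real data were lost")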


You can test your data for intersectional hallucinations with our tool; contact us for access.


Privacy

Sometimes synthetic data is produced to ensure privacy and protect sensitive data. This can
be fraught (a rough distance-based check is sketched after the list below). Read more here:

  • Guépin F, et al. (2024) Synthetic Is All You Need: Removing the Auxiliary Data Assumption for Membership Inference Attacks Against Synthetic Data. In: Katsikas S, et al. (eds) Computer Security. ESORICS 2023 International Workshops. Lecture Notes in Computer Science, vol 14398. Springer, Cham. https://doi.org/10.1007/978-3-031-54204-6_10.
  • Yoon J, Mizrahi M, Ghalaty NF, et al. (2023) EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records. npj Digital Medicine 6: 141.
  • Ganev G, Oprisanu B and De Cristofaro E (2022) Robin Hood and Matthew Effects: Differential Privacy Has Disparate Impact on Synthetic Data. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA. PMLR 162.
  • Cheng V, et al. (2021) Can You Fake It Until You Make It? Impacts of Differentially Private Synthetic Data on Downstream Classification Fairness. FAccT ’21, 3–10 March 2021.
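
As the papers above show, privacy guarantees for synthetic data are subtle, and no single check is sufficient. One crude first heuristic is to measure how close each synthetic record lies to its nearest real record: exact or near-exact copies deserve attention. The sketch below uses scikit-learn on numeric columns only; it illustrates that heuristic and is emphatically not a formal privacy guarantee.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.preprocessing import StandardScaler

    def distance_to_closest_record(real: np.ndarray,
                                   synthetic: np.ndarray) -> np.ndarray:
        """Distance from each synthetic row to its nearest real row (numeric data).

        Very small distances can indicate memorized or copied records, which is
        a privacy red flag; large distances are not, on their own, a guarantee.
        """
        scaler = StandardScaler().fit(real)
        nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(real))
        distances, _ = nn.kneighbors(scaler.transform(synthetic))
        return distances.ravel()

    # Example: flag synthetic rows that sit suspiciously close to a real row.
    # d = distance_to_closest_record(real_numeric, synth_numeric)
    # print("Share of synthetic rows within 0.01 of a real row:", np.mean(d < 0.01))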


Finally, there is a movement afoot to create a shared vocabulary and useful practices
for labelling synthetic data with appropriate metadata and READMEs. Here are two pieces
that suggest how to do it:


LINKS TO SND and AIPEX texts here.


We can also assist in generating representative synthetic tabular data. You are welcome to contact us.


