Synthetic data plays an important role in Open Science: it helps scientists address pressing questions when real data is either insufficient or missing entirely. It can address known biases in real data, and it can be safe to use and share in sensitive domains. However, for synthetic data to be reproducible, reliable and transparent, it needs to be labelled and described in specific ways.
This page is a resource for people interested in making their synthetic data open and valuable in a responsible way. It provides suggestions and guidelines (developed in collaboration with the Swedish National Data Service), further reading to help you delve into the complexities of synthetic data, and a toolkit for testing your synthetic data and labelling the artefacts of its production. These artefacts can make your synthetic data valuable, or they can lead to model collapse and misrepresentative hallucinations.
The kind of AI you use to generate your synthetic data affects how well it represents your original data, and how much control you have over where in your dataset the differences occur. How you make it is important (and important to document). Here are some good overviews you might find useful:
Lautrup, A. D., Hyrup, T., Zimek, A., & Schneider-Kamp, P. (2024). Systematic review of generative modelling tools and utility metrics for fully synthetic tabular data. ACM Computing Surveys, 57(4), 1-38.
Wang, A. X., Chukova, S. S., Simpson, C. R., & Nguyen, B. P. (2024). Challenges and opportunities of generative models on tabular data. Applied Soft Computing, 112223.
When you use certain AI methods, there can be a tendency to amplify the ‘normal’ in your
data and minimize edge cases. Here are some interesting articles about this:
Foulds, J. R., Islam, R., Keya, K. N., & Pan, S. (2020, April). An intersectional definition of fairness. In 2020 IEEE 36th international conference on data engineering (ICDE) (pp. 1918-1921). IEEE.
Gohar, U., & Cheng, L. (2023). A survey on intersectional fairness in machine learning: Notions, mitigation, and challenges. arXiv preprint arXiv:2305.06969.
You can test your data for intersectional hallucinations with our tool; contact us for access.
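To make the edge-case concern concrete, here is a minimal sketch of one possible check. It is not the tool mentioned above. It assumes one simple reading of the problem: a combination of attribute values that appears in the synthetic data but never in the real data is treated as "invented", and a combination that appears in the real data but not in the synthetic data is treated as "dropped". The function names, attribute names and toy records are all hypothetical.

```python
from collections import Counter


def intersection_counts(rows, attributes):
    """Count how often each combination of attribute values occurs."""
    return Counter(tuple(row[a] for a in attributes) for row in rows)


def compare_intersections(real_rows, synthetic_rows, attributes):
    """Return (invented, dropped) combinations of attribute values.

    invented: present in the synthetic data, absent from the real data.
    dropped:  present in the real data, absent from the synthetic data.
    """
    real = intersection_counts(real_rows, attributes)
    synth = intersection_counts(synthetic_rows, attributes)
    invented = sorted(set(synth) - set(real))
    dropped = sorted(set(real) - set(synth))
    return invented, dropped


# Toy records; the attributes and values are made up for illustration.
real_rows = [
    {"sex": "F", "age_group": "60+", "employed": "yes"},
    {"sex": "F", "age_group": "18-29", "employed": "yes"},
    {"sex": "M", "age_group": "18-29", "employed": "no"},
]
synthetic_rows = [
    {"sex": "F", "age_group": "18-29", "employed": "yes"},
    {"sex": "M", "age_group": "18-29", "employed": "no"},
    {"sex": "M", "age_group": "60+", "employed": "no"},
]

invented, dropped = compare_intersections(
    real_rows, synthetic_rows, ["sex", "age_group", "employed"]
)
print("Invented combinations:", invented)
print("Dropped combinations:", dropped)
```

In this toy example the employed woman aged 60+ disappears from the synthetic data, while an unemployed man aged 60+ appears who has no counterpart in the real data. A fuller check would also compare how often each combination occurs, not only whether it occurs.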
There is more to ensuring representative synthetic structured data than just worrying about
edge-case representation. Tabular, structured data is interesting and useful because it tells
us something about the relations between different attributes in the data. Making sure
these are represented in your synthetic data also ensures its usefulness, both for ML models
and for other uses of the data. It is also important to label this when you provide data for open science. Here are some interesting reflections on intersectional and inter-attribute aspects of synthetic data:
You can test your data for intersectional hallucinations with our tool; contact us for access.
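As a rough illustration of what checking inter-attribute relations can look like, the sketch below compares the Pearson correlation of every pair of numeric columns in the real and the synthetic data and reports the absolute difference. This is only one of many possible measures, and it is an assumption of this example rather than a method recommended by the sources on this page. The column names and values are invented, and the code assumes numeric columns of equal length with non-zero variance.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Assumes both lists have non-zero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_gaps(real, synthetic):
    """Absolute difference in pairwise correlation, per column pair.

    real and synthetic map a column name to a list of numbers.
    """
    gaps = {}
    for a, b in combinations(sorted(real), 2):
        real_r = pearson(real[a], real[b])
        synth_r = pearson(synthetic[a], synthetic[b])
        gaps[(a, b)] = abs(real_r - synth_r)
    return gaps


# Toy columns; in the synthetic data the age-income relation is reversed.
real = {"age": [20, 30, 40, 50], "income": [20, 30, 40, 50]}
synthetic = {"age": [20, 30, 40, 50], "income": [50, 40, 30, 20]}

gaps = correlation_gaps(real, synthetic)
for pair, gap in gaps.items():
    print(pair, round(gap, 3))
```

Both toy datasets have identical values in each column taken on its own, so a column-by-column comparison would not notice any difference. Only the pairwise check shows that the relation between age and income has been reversed.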
Sometimes synthetic data is produced to ensure privacy and protect sensitive data. This can be fraught. Read more here:
Finally, there is a movement afoot to create a shared vocabulary and useful practices for labelling synthetic data with appropriate metadata and READMEs. Here are two pieces that suggest how to do it:
LINKS TO SND and AIPEX texts here.
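To give a feel for what a machine-readable label could contain, here is a small sketch that writes a metadata record as JSON to sit alongside a README. Every field name and value in it is a hypothetical placeholder chosen for this example; none of them is taken from the SND or AIPEX texts, which should be consulted for the actual recommended vocabulary.

```python
import json


def build_label(generator, source_description, tests_run, known_limitations):
    """Assemble a metadata record for a synthetic dataset.

    All field names are hypothetical, not an established standard.
    """
    return {
        "is_synthetic": True,
        "generator": generator,
        "source_description": source_description,
        "tests_run": tests_run,
        "known_limitations": known_limitations,
    }


# Example values are invented for illustration only.
label = build_label(
    generator={"method": "example-generative-model", "version": "0.0"},
    source_description="Survey table, 3 attributes, not publicly shared",
    tests_run=["intersectional combinations", "pairwise correlations"],
    known_limitations=["rare attribute combinations are under-represented"],
)

label_json = json.dumps(label, indent=2)
print(label_json)
```

Recording how the data was generated, which tests were run and which limitations are known in one structured record makes it easier for others to judge whether the dataset suits their purpose.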
We can also help you generate representative synthetic tabular data; you are welcome to contact us.
© Fair AI Data