
How reliable is synthetic data?

Synthetic data can do a lot of things. It is going to be an important complement to existing data, augmenting small data sets so they can be useful for our future AI and ML algorithms. Synthetic data will help us work with sensitive data without exposing real people to privacy risks. And it can create portable data where policies and regulations otherwise prevent collaborative work and knowledge sharing.

But, for that to happen, the synthetic data needs to be representative of the complexities of our world, whether we are working with biological data, sensor data, social data, or any of the other multitudes of data that we need to complement, anonymize or share. Data is about relations and those relations need to be represented.

This is one of the promises of AI – that it will be able to read relations and pinpoint the important elements of a dataset so that it can then synthesize a completely new, fake dataset which is just as relevant and representative as the original. But does that always work? And how do you know if the relations and components you need are represented? This is the tricky problem we are working with.

One of the most common ways to use AI to synthesize data right now is with GANs (Generative Adversarial Networks), a type of machine learning that purports to read and understand the essential elements of a dataset and then reproduce them. Sometimes, and for some elements of a dataset, this works. But sometimes the outcomes are more complex and more problematic than they may appear at first sight. Take, for example, some of the work we have been doing in our research* at Linköping University with population data.

Using an open access census dataset**, we ran a couple of different GANs to generate a synthetic version of it. With the first GAN we tried, we saw some problematic, if expected, events occur. Because of the way they work, GANs can accentuate the majority elements of a dataset and minimize minority elements. This happened with our first attempt at synthesizing. Among other things, we noticed that the original dataset listed 41 countries of origin, while our synthetic version listed only 32. Nine countries disappeared, probably because they appeared in so few records in the original data that the GAN didn’t reproduce them.
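A check like the one described above can be sketched in a few lines of Python. This is only an illustration, not the code used in our research: the function name, the column name `native_country` and the toy rows are assumptions made up for the example.

```python
from collections import Counter


def missing_categories(real_rows, synth_rows, column):
    """Return the categories of `column` that occur in the real data but
    never in the synthetic data, with their counts in the real data."""
    real_counts = Counter(row[column] for row in real_rows)
    synth_values = {row[column] for row in synth_rows}
    return {
        value: count
        for value, count in real_counts.items()
        if value not in synth_values
    }


# Toy rows for illustration only (hypothetical, not the census data).
real = [
    {"native_country": "United-States"},
    {"native_country": "United-States"},
    {"native_country": "Mexico"},
    {"native_country": "Scotland"},
]
synthetic = [
    {"native_country": "United-States"},
    {"native_country": "United-States"},
    {"native_country": "United-States"},
    {"native_country": "Mexico"},
]

lost = missing_categories(real, synthetic, "native_country")
print(lost)  # {'Scotland': 1}
```

Reporting the original counts alongside the missing categories makes it easier to see whether the categories that vanished were the rare ones.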

But we didn’t give up. We moved on to another GAN (yes, there are different ones, and you can train and tweak your own; this is tricky, and important) and created a new synthetic dataset. This time all the original countries of origin were represented. And when we looked at other single-variable comparisons, things looked pretty good. For example, in the original data, there were 30,527 males and 14,695 females. In the synthetic data, the numbers were close: 30,485 males and 14,737 females. Other elements were pretty close, too. If you look at the diagonal in the grid below, you can see how the single category representation in the synthetic data maps onto the same categories in the original data. It’s not exactly the same, but in most categories, it isn’t too far off.

[Figure: Comparing the intersections of population data, real vs synthetic]
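One way to put a number on "pretty close" for a single variable is to compare the category shares in the two datasets. The sketch below uses the total variation distance, which is our choice for illustration and not necessarily the measure behind the grid. The gender counts are the ones quoted above; the function and variable names are hypothetical.

```python
def total_variation_distance(real_counts, synth_counts):
    """Half the sum of absolute differences between category shares.
    0.0 means identical distributions, 1.0 means no overlap at all."""
    real_total = sum(real_counts.values())
    synth_total = sum(synth_counts.values())
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts.get(c, 0) / real_total
            - synth_counts.get(c, 0) / synth_total)
        for c in categories
    )


# Gender counts quoted in the text above.
real_gender = {"Male": 30527, "Female": 14695}
synth_gender = {"Male": 30485, "Female": 14737}

tvd = total_variation_distance(real_gender, synth_gender)
print(f"{tvd:.4f}")  # 0.0009
```

A value this close to zero says the single-variable distribution is well preserved. It says nothing about how that variable combines with others.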

But we, and the world, are not made up of single categories. We, and the world, are complex. In social science terms, this is called intersectionality. Now, population data is not a great representation of intersectionality (for that we would need to have a lot more information about power dynamics and context), but it does tend to identify some of the important complexities that define individuals for those gathering the data.

For example, a person may be a man, a husband, over 70 years old, a Swedish citizen, a person with a university degree, a recipient of a high salary, a person born in a foreign country… and populations are made up of individuals with various combinations of these elements. How is that complexity represented by synthetic data? Let’s look back at that grid. Each of the squares below the diagonal is a visual representation of three elements found in the original population data.

In some of these, the synthetic data does a pretty good job of reproducing those relationships, like between income, age, and gender. But in others, there are some pretty significant changes, both in what data is produced and how elements like gender are distributed amongst the other categories. Look, for example, at the differences in gender at the intersection of marital status (x-axis) and occupation (y-axis) between the real and synthetic data:

In the synthetic data, there are a lot more women – more pink. When we checked that intersection, we saw that in the original data there was one ‘male’ who was also listed as ‘wife’. But in the synthetic data, there were about 1300 data points that were listed as both ‘male’ and ‘wife’. Today and in some countries, that would be fine. But given that the synthetic data is trying to represent 1990 US census data, it seems a little off.
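The kind of intersection check described above can be sketched as counting how often each combination of values occurs in the real and in the synthetic data, then looking at the differences. This is a minimal illustration, not our analysis tool: the function names, the column names `sex` and `relationship`, and the toy rows are assumptions. The toy rows are built so that every single-variable count matches while the combinations do not.

```python
from collections import Counter


def intersection_counts(rows, columns):
    """Count how often each combination of values in `columns` occurs."""
    return Counter(tuple(row[c] for c in columns) for row in rows)


def intersection_shift(real_rows, synth_rows, columns):
    """Synthetic count minus real count for every observed combination."""
    real = intersection_counts(real_rows, columns)
    synth = intersection_counts(synth_rows, columns)
    return {combo: synth[combo] - real[combo] for combo in set(real) | set(synth)}


def make_rows(counts):
    """Expand {(sex, relationship): n} into a list of row dictionaries."""
    return [
        {"sex": sex, "relationship": relationship}
        for (sex, relationship), n in counts.items()
        for _ in range(n)
    ]


# Toy data (hypothetical): both sets have 6 males, 4 females,
# 5 husbands and 5 wives, so every single-variable check passes.
real = make_rows({("Male", "Husband"): 5, ("Female", "Wife"): 4,
                  ("Male", "Wife"): 1})
synthetic = make_rows({("Male", "Husband"): 3, ("Female", "Wife"): 2,
                       ("Male", "Wife"): 3, ("Female", "Husband"): 2})

shift = intersection_shift(real, synthetic, ["sex", "relationship"])
print(shift[("Male", "Wife")])  # 2
```

The same functions accept any list of columns, so a three-way intersection such as marital status, occupation and gender can be checked by passing three column names.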


Why does this matter? Suppose you were going to use this synthetic data to run a policy model or create a marketing plan. Or say you had synthesized the maintenance data of your industrial products and wanted to use ML to predict future maintenance needs over their lifecycle. In each case, your synthetic data would have to represent the complexity of the world you actually want to address.


For us, this is a big problem. And one we want to help solve. Thanks to funding from the Swedish Innovation Agency, we are taking our research results out of the ivory tower by developing software tools to help you analyze the intersectional complexity of your synthetic data. Whether you are creating synthetic data from population data, customer data or the complexity of an industrial product, complex relationships in your synthetic data are important. That is what makes our world so varied and interesting. Our mission is to hold synthetic data accountable to the complexity of the real world.

* This work was supported by the Wallenberg AI, Autonomous Systems and Software Program – Humanities and Society (WASP-HS) and Linköping University, Sweden.

** We used the 1990 Adult US census data.