Reliable synthetic data and intersectional hallucinations

Synthetic datasets can do a lot of things – they are going to be an important complement to existing data, augmenting small datasets so they can be useful for our future AI and ML algorithms. Synthetic data will help us work with sensitive data without exposing real people to privacy breaches. And synthetic data can create portable data where policies and regulations otherwise prevent collaborative work and knowledge sharing.

But, for that to happen, synthetic datasets need to be representative of the complexities of our world, whether we are working with biological data, sensor data, social data, or any of the other multitudes of data that we need to complement, anonymize or share. Data is about relations and those relations need to be represented.

Intersectional hallucinations occur when a synthetic dataset combines values in ways that are illogical. A 6-year-old doctor, for example. Or a soybean field that produces 7 tonnes of potatoes. These intersections in a synthetic dataset are so off the mark that the synthetic data cannot be used to represent real data. Read on for a concrete example from our research* at Linköping University, Sweden.

You can access our recent article in AI & Society here, and you can also find an easy read about it in The Conversation.

One of the most common ways to use AI to synthesize data right now is to use GANs – Generative Adversarial Networks – a type of Machine Learning that purports to read and understand the essential elements of a dataset and then reproduce them. Sometimes – and for some elements of a dataset – this works. But sometimes the outcomes are more complex and more problematic than they may appear at first sight.

Working with an open access census dataset**, we used several different GANs to generate synthetic versions of it. With the first GAN we tried, we saw some problematic – if expected – events occur. Because of the way they work, GANs can accentuate the majority elements of a dataset and minimize minority elements. This happened with our first attempt at synthesizing: some edge cases disappeared. Among other things, we noticed that the original dataset listed 40 countries of origin, while our synthetic version had only 31. Nine countries disappeared, probably because they were represented so rarely in the original data that the GAN didn’t reproduce them.
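A check for this kind of disappearing category can be sketched in a few lines of Python. This is a minimal illustration, not the tooling we used in the research: the function name missing_categories, the column name "country" and the toy records are all assumptions made up for the example.

```python
def missing_categories(real_rows, synthetic_rows, column):
    """Return values of `column` that appear in the real data
    but never appear in the synthetic data."""
    real_values = {row[column] for row in real_rows}
    synthetic_values = {row[column] for row in synthetic_rows}
    return real_values - synthetic_values


# Toy records, invented for illustration (not the census data).
real = [
    {"country": "Sweden"},
    {"country": "Sweden"},
    {"country": "Sweden"},
    {"country": "Mexico"},
    {"country": "Mexico"},
    {"country": "Laos"},  # a rare category with a single record
]
synthetic = [
    {"country": "Sweden"},
    {"country": "Sweden"},
    {"country": "Sweden"},
    {"country": "Sweden"},
    {"country": "Mexico"},
    {"country": "Mexico"},
]

lost = missing_categories(real, synthetic, "country")
print(f"{len(lost)} category value(s) lost: {sorted(lost)}")
```

Run against a real and a synthetic table, the size of the returned set corresponds to the kind of gap described above (40 countries in the original, 31 in the synthetic version).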

But we didn’t give up. Going on to another GAN, we created a new synthetic dataset. (There are different GANs, and you can train and tweak your own. This is tricky, and important.) This time all the original countries of origin were represented. And when we looked at other single-variable comparisons, things looked pretty good. For example, the original data contained 30,527 males and 14,695 females. The synthetic data was pretty close: 30,485 males and 14,737 females.
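A single-variable comparison like this one boils down to comparing the share of each category in the two datasets. The sketch below reuses the counts quoted above. The labels "Male" and "Female" and the helper name marginal_shares are assumptions for illustration, and the approach is one simple way to compare marginals, not necessarily the one used in our study.

```python
from collections import Counter


def marginal_shares(values):
    """Return the share of each category in a single column."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


# Columns rebuilt from the counts reported in the text.
real_sex = ["Male"] * 30527 + ["Female"] * 14695
synthetic_sex = ["Male"] * 30485 + ["Female"] * 14737

real_shares = marginal_shares(real_sex)
synthetic_shares = marginal_shares(synthetic_sex)

# Absolute difference in share for every category seen in the real data.
share_gap = {
    category: abs(real_shares[category] - synthetic_shares.get(category, 0.0))
    for category in real_shares
}

for category, gap in share_gap.items():
    print(f"{category}: real {real_shares[category]:.4f}, "
          f"synthetic {synthetic_shares.get(category, 0.0):.4f}, gap {gap:.4f}")
```

A small gap here only says that each variable looks right on its own. It says nothing about how the variables combine.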

But we – and the world – are not made up of single categories. We – and the world – are complex. In social science terms, this is called intersectionality. Now, population data is not a great representation of intersectionality (for that we would need a lot more information about power dynamics and context), but it does tend to identify some of the important complexities that define individuals for those gathering the data.

For example, a person may be a man, a husband, over 70 years old, a Swedish citizen, a person with a university degree, a recipient of a high salary, a person born in a foreign country… and populations are made up of individuals with various combinations of these elements. How was that complexity represented by our synthetic data?

Some of the intersections were fairly well represented. They had intersectional fidelity. But in others, there were some pretty significant changes, both in what data was produced and how elements like gender were distributed amongst the other categories.

For example, the synthetic data had 333 datapoints labeled both husband/wife and single – an intersectional hallucination. Of these, over 100 datapoints were never-married husbands earning under 50,000 USD a year, an intersection that did not exist in the original data. On the other hand, we found widowed females working in tech support in the original data, but they were missing in the synthetic version.

These are intersectional hallucinations.
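One way to look for such cases is to count every combination of values across a set of columns in both datasets and compare the two sets of combinations. The sketch below is a minimal illustration under stated assumptions: the column names "relationship" and "marital_status", the function names and the toy records are invented for the example, and it is not the software we are developing.

```python
from collections import Counter


def intersection_counts(rows, columns):
    """Count how often each combination of values across `columns` occurs."""
    return Counter(tuple(row[c] for c in columns) for row in rows)


def compare_intersections(real_rows, synthetic_rows, columns):
    """Return (hallucinated, missing) combinations with their counts.

    hallucinated: combinations in the synthetic data that never occur
                  in the real data.
    missing:      combinations in the real data that never occur in the
                  synthetic data.
    """
    real = intersection_counts(real_rows, columns)
    synthetic = intersection_counts(synthetic_rows, columns)
    hallucinated = {c: n for c, n in synthetic.items() if c not in real}
    missing = {c: n for c, n in real.items() if c not in synthetic}
    return hallucinated, missing


# Toy records, invented for illustration (not the census data).
real = [
    {"relationship": "Husband", "marital_status": "Married"},
    {"relationship": "Wife", "marital_status": "Married"},
    {"relationship": "Unmarried", "marital_status": "Widowed"},
]
synthetic = [
    {"relationship": "Husband", "marital_status": "Married"},
    {"relationship": "Wife", "marital_status": "Married"},
    {"relationship": "Husband", "marital_status": "Never-married"},
    {"relationship": "Husband", "marital_status": "Never-married"},
]

hallucinated, missing = compare_intersections(
    real, synthetic, ["relationship", "marital_status"]
)
print("Only in synthetic:", hallucinated)
print("Only in real:", missing)
```

A combination that appears only in the synthetic data is a candidate, not proof of a hallucination. Deciding whether it is actually illogical, like a never-married husband, still takes knowledge of what the categories mean.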

Why does this matter? Say you are going to use this synthetic data to create a marketing plan, or you have synthesized the maintenance data of your industrial products and want to use ML to predict future maintenance needs over their lifecycle. In either case, your synthetic data has to represent the complexity of the world you actually want to address.

This is a problem we want to help solve. Thanks to funding from the Swedish Innovation Agency, we are taking our research results out of the ivory tower by developing software tools to help you analyze the intersectional complexity of your synthetic data. Whether you are creating synthetic data from population data, customer data or the complexity of an industrial product, complex relationships in your synthetic data are important. That is what makes our world so varied and interesting. Our mission is to hold synthetic data accountable to the complexity of the real world.

* This research is supported by the Wallenberg AI, Autonomous Systems and Software Program – Humanities and Society (WASP-HS), Vinnova, and Linköping University, Sweden.

** We used the 1990 Adult US census data.