November 19, 2024

Ask an FCAT Researcher: David Bracken on Synthetic Data

Generative AI models have been trained on enormous amounts of data scraped from the internet, but as new data becomes scarce, companies are increasingly experimenting with synthetic options.

FCAT researcher David Bracken focuses on New Business Foundations. He digs into the newest ideas that companies are leveraging to grow revenue and has a special interest in emerging technologies, including the ways in which customers will use them. Through his work, he has researched everything from the impact of memes on our culture to blockchain technologies and social ties in the digital age.

Lately, he has been exploring the opportunities and challenges surrounding synthetic data — a version of existing data that has been altered to remove private and/or personally identifying information.

Q: Why is synthetic data a hot topic right now?

A: The foundational generative AI models currently on the market have largely been trained on the enormous amount of data that companies have scraped from the internet. Those companies are now running out of new data to use, which has led to increasing experimentation with synthetic data to address some of these data scarcity issues.

Synthetic data is not new. Autonomous-driving companies have been using it for some time, and interest also picked up significantly when more stringent privacy laws were passed in Europe about six years ago. Companies began looking into whether synthetic data could help them get around some of these regulations, but generative AI has triggered a new, growing wave of interest in the technology.

Q: Who is most interested in using synthetic data?

A: One of the main reasons that synthetic data is attractive — particularly to companies that are heavily regulated — is that some standard ways of scrubbing data can be reverse-engineered. They are not foolproof. So, organizations are interested in finding better approaches to strip out identifying factors, but in such a way that the data remains valuable for their purposes.

Synthetic data vendors can create new, fully anonymous datasets by training models on the statistical properties of the data without having them memorize any personal information.
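To make the idea of learning statistical properties concrete, here is a minimal Python sketch. It is not any vendor's method. It assumes a tiny, made-up table of (age, balance) rows, reduces it to means, standard deviations and one correlation, and then samples new rows from a bivariate normal distribution with those properties. The names fit_summary and sample_synthetic and all values are hypothetical, and a simple model like this carries no formal privacy guarantee.

```python
import math
import random
import statistics

# Hypothetical "real" records: (age, account balance). Values are made up.
real_rows = [(34, 52000.0), (45, 61000.0), (29, 48000.0),
             (52, 75000.0), (41, 58000.0), (38, 60500.0)]

def fit_summary(rows):
    """Reduce the table to a few statistics; individual rows are not stored."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample_synthetic(summary, n, seed=0):
    """Draw n new rows from a bivariate normal with the fitted statistics."""
    rng = random.Random(seed)
    rho = summary["rho"]
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = summary["mx"] + summary["sx"] * z1
        # Mix z1 into the second column so the fitted correlation is preserved.
        y = summary["my"] + summary["sy"] * (
            rho * z1 + math.sqrt(max(0.0, 1 - rho * rho)) * z2
        )
        rows.append((x, y))
    return rows

summary = fit_summary(real_rows)
synthetic_rows = sample_synthetic(summary, n=5000)
```

The synthetic rows follow the same overall pattern as the originals (similar averages, similar relationship between age and balance) without copying any single record. Production tools typically use much richer models and add explicit privacy checks.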

Q: Once they have the synthetic data, how do they apply it?

A: There are a lot of different use cases to consider. It can be implemented in places where companies aren’t able to use traditional data or there isn’t enough data to do what they need to do. Fighting fraud is a good example. Organizations might not have much data on a particular kind of fraud, but they want to be able to train their models so that it can be automatically identified in their systems. So, one option is to create synthetic data that looks like the fraud they are trying to catch, which will help their models get better at uncovering potentially fraudulent activity.
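As a rough illustration of the fraud example, the Python sketch below creates extra fraud-like records by interpolating between pairs of known fraud cases, loosely in the spirit of oversampling techniques such as SMOTE. The records, the two features (transaction amount and hour of day) and the function name interpolate_rows are all assumptions made for illustration, not a description of how any organization actually does this.

```python
import random

# Hypothetical fraud examples: (transaction amount, hour of day). Values are made up.
fraud_rows = [(920.0, 2.0), (1150.0, 3.0), (870.0, 1.5), (1010.0, 2.5)]

def interpolate_rows(rows, n, seed=0):
    """Create n synthetic rows, each lying between two randomly chosen real rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        a, b = rng.sample(rows, 2)   # two distinct known fraud cases
        t = rng.random()             # position along the line from a to b
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

synthetic_fraud = interpolate_rows(fraud_rows, n=20)
training_fraud = fraud_rows + synthetic_fraud   # 4 real cases plus 20 synthetic ones
```

A detection model trained on training_fraud sees many more examples of the rare pattern than the four real cases alone would provide. Whether those extra examples are realistic enough to help is exactly the quality question that has to be checked in practice.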

Synthetic data can also be used for customer acquisition and onboarding, as well as software testing. Firms are exploring whether they can use synthetic data to help get their software to market faster, since it can expedite access to the production data software engineers need to move projects forward.

Q: Are there examples where synthetic data is not just faster but better?

A: A lot of traditional datasets are problematic because they are not representative of society or of the marketplace for a product. This can lead to biased analysis and decision-making. Synthetic data representing underrepresented groups can be used to help correct for these imbalances.

Q: Where can synthetic data use go wrong?

A: Researchers have started to explore what happens when large language models are trained on significant amounts of synthetic data. Some of this research has found that synthetic data can cause the models to rapidly deteriorate — often referred to as model collapse.
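A toy simulation can give some intuition for this effect, though it is only an analogy and not a language model. In the Python sketch below, each generation fits a normal distribution to a small sample drawn from the previous generation's fit, so every model after the first learns only from synthetic data. The function name simulate_generations, the sample size and the number of generations are arbitrary assumptions.

```python
import random
import statistics

def simulate_generations(generations=200, sample_size=10, seed=0):
    """Repeatedly refit a normal distribution to samples drawn from the previous fit."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0          # generation 0: the "real" data distribution
    history = [std]
    for _ in range(generations):
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)   # maximum-likelihood fit to synthetic samples only
        history.append(std)
    return history

spread = simulate_generations()
print(f"std at generation 0: {spread[0]:.3f}, "
      f"at generation {len(spread) - 1}: {spread[-1]:.3g}")
```

In this setup the fitted spread tends to shrink from one generation to the next, because each small sample slightly underrepresents the tails of the distribution it came from and the loss compounds. Published model collapse results involve far more complex models, so this should be read as intuition only.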

Others have been exploring whether synthetic data can introduce more bias into AI models, and whether it might complicate how we understand and interpret generative AI decision-making.

Q: More companies may be experimenting with synthetic data, but how common is it in the overall data market?

A: Most of the forecasts currently peg this market at about $300M annually, which is still relatively small, and there are a few reasons for that. The main one is that there just aren’t a lot of standards around creating synthetic data, so it’s hard to gauge the accuracy of the data being produced. The vendors that help companies create synthetic data each have their own way of evaluating quality. As standards improve, I think we can expect this market to grow significantly.
