November 19, 2024
Ask an FCAT Researcher: David Bracken on Synthetic Data
FCAT researcher David Bracken focuses on New Business Foundations. He digs into the newest ideas that companies are leveraging to grow revenue and has a special interest in emerging technologies, including the ways in which customers will use them. Through his work, he has researched everything from the impact of memes on our culture to blockchain technologies and social ties in the digital age.
Lately, he has been exploring the opportunities and challenges surrounding synthetic data — artificially generated data that mirrors the statistical properties of real data while stripping out private and/or personally identifying information.
Q: Why is synthetic data a hot topic right now?
A: The foundational generative AI models currently in-market have largely been trained on the enormous amounts of data that companies have scraped from the internet. Now, they are running out of new data to use, which has led to increasing experimentation with synthetic data to solve some of these data scarcity issues.
Synthetic data is not new. Autonomous-driving companies have been using it for some time, and interest also picked up significantly when more stringent privacy laws were passed in Europe about six years ago. Companies began looking into whether synthetic data could help them get around some of these regulations, but generative AI has triggered a new, growing wave of interest in the technology.
Q: Who is most interested in using synthetic data?
A: One of the main reasons that synthetic data is attractive — particularly to companies that are heavily regulated — is that some standard ways of scrubbing data can be reverse-engineered. They are not foolproof. So, organizations are interested in finding better approaches to strip out identifying factors, but in such a way that the data remains valuable for their purposes.
Synthetic data vendors can create new, fully anonymous datasets by training models on the statistical properties of the data without having them memorize any personal information.
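To make the idea concrete, here is a deliberately minimal sketch of statistics-based generation: learn only aggregate properties (mean and covariance) of a toy numeric dataset, then sample brand-new rows from them. Real vendors use far more sophisticated generative models; the dataset and column meanings here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy "real" dataset: two numeric columns (say, age and account balance).
real = np.column_stack([
    rng.normal(45, 12, size=1000),    # ages
    rng.lognormal(8, 1, size=1000),   # balances
])

# Learn only aggregate statistics -- no individual record is stored,
# so no single person's row can be recovered from the model.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample a brand-new dataset with the same overall statistical shape.
synthetic = rng.multivariate_normal(mean, cov, size=1000)

print(synthetic.shape)  # (1000, 2)
```

The synthetic rows preserve column means and correlations well enough for many analyses, while no row corresponds to a real individual, which is the essence of the anonymity claim.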
Q: Once they have the synthetic data, how do they apply it?
A: There are a lot of different use cases to consider. It can be implemented in places where companies aren’t able to use traditional data or there isn’t enough data to do what they need to do. Fighting fraud is a good example. Organizations might not have much data on a particular kind of fraud, but they want to train their models so that such fraud can be automatically identified in their systems. So, one option is to create synthetic data that looks like the fraud they are trying to catch, which will help their models get better at uncovering potentially fraudulent activity.
Synthetic data can also be used for customer acquisition and onboarding, as well as software testing. Firms are exploring whether synthetic data can help get their software to market faster, since it can stand in for the production data that software engineers would otherwise have to wait to access before moving projects forward.
Q: Are there examples where synthetic data is not just faster but better?
A: A lot of traditional datasets are problematic because they are not representative of society or the marketplace for a product. This can lead to biased analysis and decision-making. Synthetic data from underrepresented groups can be implemented to help correct for imbalances.
Q: Where can synthetic data use go wrong?
A: Researchers have started to explore what happens when large language models are trained on significant amounts of synthetic data. Some of this research has found that synthetic data can cause the models to rapidly deteriorate — often referred to as model collapse.
Others have been exploring whether synthetic data can introduce more bias into AI models and whether it might complicate how we understand and interpret generative AI decision-making.
Q: More companies may be experimenting with synthetic data, but how common is it in the overall data market?
A: Most of the forecasts currently peg this market at only about $300M annually, and there are a few reasons it remains that small. The main one is that there just aren’t a lot of standards around creating synthetic data. It’s hard to gauge the accuracy of the data being produced. The vendors that help companies create synthetic data each have their own way of evaluating quality, and as standards improve, I think we can expect this market to grow significantly.