Synthetic data has its limitations – why human-derived data can help prevent AI model collapse

My goodness, how quickly the tables turn in the tech world. Just two years ago, AI was being hailed as "the next transformational technology to rule them all." Now, instead of reaching Skynet levels and taking over the world, AI is, ironically, degrading.

Once the harbinger of a new era of intelligence, AI is now stumbling over its own code and struggling to deliver on its promised brilliance. But why exactly? The simple fact is that we are depriving AI of the only thing that makes it truly intelligent: human-generated data.

To feed these data-hungry models, researchers and organizations are increasingly turning to synthetic data. While this practice has long been a staple of AI development, we are now entering dangerous territory by relying on it too heavily, causing the gradual degradation of AI models. And this isn't just a minor concern about ChatGPT producing subpar results – the consequences are far more dangerous.

When AI models are trained on output from previous iterations, they tend to propagate errors and introduce noise, leading to declining output quality. This recursive process turns the familiar "garbage in, garbage out" cycle into a self-perpetuating problem, steadily reducing the system's effectiveness. As AI drifts further from human-like understanding and accuracy, this not only undermines performance but also raises critical concerns about the long-term viability of relying on self-generated data for further AI development.

But this isn't just a technological decline; it is a degradation of reality, identity and data authenticity – and it poses serious risks to humanity and society. The impact could be profound, leading to a rise in critical errors. If these models lose accuracy and reliability, the consequences could be devastating – think medical misdiagnoses, financial losses and even life-threatening accidents.

Another significant consequence is that AI development could stall entirely, leaving AI systems unable to absorb new data and effectively "frozen in time." This stagnation would not only hinder progress but also trap AI in a cycle of diminishing returns, with potentially disastrous effects on technology and society.

But what can companies actually do to keep their customers and users safe? Before we answer that question, we need to understand how all of this works.

When models collapse, reliability is lost

The more AI-generated content spreads online, the faster it infiltrates data sets and, in turn, the models themselves. And this is happening at an accelerating pace, making it increasingly difficult for developers to filter out anything that is not pure, human-generated training data. The fact is that using synthetic content in training can trigger a harmful phenomenon known as "model collapse" or "Model Autophagy Disorder (MAD)."

Model collapse is the degenerative process by which AI systems gradually lose their grip on the true underlying data distribution they are meant to model. It typically occurs when AI is trained recursively on content it generated itself, leading to a number of problems (a toy simulation follows this list):

  • Loss of nuance: Models begin to forget outlier or under-represented data, which is critical to a full understanding of any data set.
  • Reduced diversity: The variety and quality of the models' outputs noticeably decline.
  • Amplification of biases: Existing biases, particularly against marginalized groups, can be exacerbated as the model misses the nuanced data that could mitigate them.
  • Generation of nonsensical outputs: Over time, models can start producing outputs that are completely unrelated or nonsensical.
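
To see the mechanism in miniature, here is a toy simulation in Python. It is only an illustrative sketch, not the setup of the Nature study cited below: each "generation" fits a simple Gaussian model to the current data, then replaces the data with the model's own samples, dropping tail events to mimic how generative models under-represent rare data.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Generation 0: "human" data with genuine spread, including rare tail events.
data = rng.normal(loc=0.0, scale=1.0, size=5_000)

for generation in range(1, 10):
    # "Train" a model on the current data set: here, simply fit a Gaussian.
    mu, sigma = data.mean(), data.std()

    # The next generation trains only on samples drawn from the fitted model.
    # Generative models tend to under-sample rare events, mimicked here by
    # discarding samples in the distribution's tails.
    samples = rng.normal(loc=mu, scale=sigma, size=5_000)
    data = samples[np.abs(samples - mu) < 2.0 * sigma]

    print(f"generation {generation}: std = {data.std():.3f}")
```

Run it and the standard deviation falls generation after generation: outliers and nuance disappear first, which is exactly the loss of nuance and reduced diversity described above.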

A case in point: a study published in Nature highlighted the rapid degeneration of language models trained recursively on AI-generated text. By the ninth iteration, these models were found to be producing entirely irrelevant and nonsensical content, underscoring how quickly data quality and model usefulness decline.

Securing the Future of AI: Steps Companies Can Take Today

Companies are in a unique position to responsibly shape the future of AI, and there are clear, actionable steps they can take to ensure the accuracy and trustworthiness of AI systems:

  • Invest in data lineage tools: Tools that track where each piece of data comes from and how it changes over time give companies confidence in their AI inputs. With clear visibility into data lineage, companies can avoid feeding models unreliable or biased information.
  • Use AI-powered filters to detect synthetic content: Advanced filters can catch AI-generated or low-quality content before it is incorporated into training data sets. These filters help ensure that models learn from authentic, human-generated information rather than synthetic data that lacks real-world complexity (a minimal sketch combining this step with data lineage follows this list).
  • Work with trusted data providers: Through close relationships with verified data providers, companies receive a steady supply of authentic, high-quality data. This means AI models receive real, granular information that reflects actual scenarios, increasing both performance and relevance.
  • Promote digital literacy and awareness: By educating teams and customers about the importance of data authenticity, companies can help people identify AI-generated content and understand the risks of synthetic data. Raising awareness of responsible data use fosters a culture that values accuracy and integrity in AI development.
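
To make the first two steps concrete, here is a minimal sketch in Python. The schema and heuristic are entirely hypothetical: the `ProvenanceRecord` fields and the `looks_synthetic` stub stand in for whatever lineage metadata and trained synthetic-content detector a company actually deploys.

```python
from dataclasses import dataclass

@dataclass
class ProvenanceRecord:
    """Minimal lineage metadata attached to a training example (hypothetical schema)."""
    text: str
    source: str           # e.g. a verified provider name or a crawl URL
    human_verified: bool  # did a trusted pipeline confirm human origin?

def looks_synthetic(text: str) -> float:
    """Stand-in for a real synthetic-content classifier.

    A production system would call a trained detector here; this stub only
    scores one crude signal (unnaturally low vocabulary diversity).
    """
    words = text.lower().split()
    if not words:
        return 1.0
    return 1.0 - len(set(words)) / len(words)

def admit_to_training_set(record: ProvenanceRecord, threshold: float = 0.5) -> bool:
    # Gate 1: lineage - reject data without a verified human origin.
    if not record.human_verified:
        return False
    # Gate 2: content filter - reject text the detector scores as synthetic.
    return looks_synthetic(record.text) < threshold

examples = [
    ProvenanceRecord("The quick brown fox jumps over the lazy dog.", "verified-provider", True),
    ProvenanceRecord("data data data data data data", "unknown-crawl", False),
]
training_set = [r for r in examples if admit_to_training_set(r)]
print(f"admitted {len(training_set)} of {len(examples)} examples")
```

The point of the design is that the two gates are independent: lineage answers "where did this come from?" while the filter answers "does this look machine-made?", and a record must pass both before it can influence a model.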

The future of AI depends on responsible action. Companies have a real opportunity to keep AI grounded in accuracy and integrity. By prioritizing real, human-derived data over shortcuts, investing in tools that track and filter out low-quality content, and promoting awareness of digital authenticity, companies can set AI on a safer, smarter path. Let's focus on building a future where AI is both powerful and genuinely useful to society.

Rick Song is CEO and co-founder of Persona.
