AI has been integrated into almost every aspect of the web. However, while sorting the real from the AI-generated may be merely frustrating for humans, it could cause serious problems for new and existing AI models. This phenomenon, called model collapse, could pose a threat to the long-term viability of AI-generated content and the systems that rely on it.
Model collapse refers to the narrowing of an AI model's knowledge down to a handful of ideas, or even a single one. It happens when AI models consume AI-generated content, as public data is scraped to train new models. While some AI-generated data may be viable as training material, models lack the ability to reject bad or impossible data. When that data is folded into the larger training set, it reinforces incorrect patterns, leading to repetition, declining quality, and even outright false information.
Researchers tested this with both a set of handwritten numbers and AI-generated images. The numbers slowly devolved, becoming blurry and eventually collapsing into a single shape. Meanwhile, in the image set, mistakes such as extra fingers or odd proportions weren't corrected, because the data the model received never indicated that those features were abnormal. Because decisions in AI are governed by sets of probabilities, or statistical distributions, the model is simply predicting the next feature to add, not judging whether that feature should logically be there. While this inability to recognize mistakes the way a human would has always been present in models, the flood of AI-generated data has compounded many of the problems once thought gone from generated results. That means artifacts such as extra fingers, repeating words, and morphing faces could once again become commonplace. Worse, by the time these features appear, the data set has often become corrupted to the point of near irreparability.
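The feedback loop can be sketched with a toy simulation (an illustration only, not the researchers' actual experiments): imagine a "model" that simply memorizes the frequency of everything in its training data, and each new generation trains only on the previous generation's output. Rare ideas eventually miss a finite sample, get assigned zero probability, and can never return, so diversity only shrinks.

```python
import random
from collections import Counter

random.seed(42)

# "Human" training data: 10 distinct ideas, evenly represented.
VOCAB = list(range(10))
data = [random.choice(VOCAB) for _ in range(50)]

def retrain_and_generate(training_data, n):
    """A stand-in 'model': learn the empirical distribution of its
    training data, then generate n new samples from it. Anything the
    model never saw gets probability zero and can never come back."""
    counts = Counter(training_data)
    items = list(counts)
    weights = [counts[i] for i in items]
    return random.choices(items, weights=weights, k=n)

diversity = [len(set(data))]
for generation in range(5000):
    data = retrain_and_generate(data, 50)  # train only on the last model's output
    diversity.append(len(set(data)))

print("distinct ideas remaining:", len(set(data)))
```

Because sampling noise can only remove ideas, never reinvent them, the number of distinct ideas drifts downward until the model produces near-identical output, mirroring the blurred numbers collapsing into one shape.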
The only current long-term solution is to ensure that the majority of a large training set is real, human-created data. That is easier said than done. AI companies need high-quality data in large quantities, and that data must be publicly available online. Private data, while fitting the requirements, is often expensive and comes with trickier legal territory around copyright, something that has already proved difficult for current models. It is also another cost on top of the immense energy and hardware costs of building a proprietary model. Moreover, certain items, such as paintings by a specific artist, simply are not available in the quantity an AI model requires. This does, however, incentivize companies to invest in AI early, as any new model will have to contend with this issue.
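Why a majority-human data set helps can be shown with a toy frequency-counting "model" (a sketch under simplified assumptions, not any production pipeline): if a fixed pool of human data covering every idea is mixed into each generation's training set, rare ideas always retain some probability and can be relearned, so the output never collapses.

```python
import random
from collections import Counter

random.seed(0)

# A fixed pool of human data covering all 10 ideas; it never changes.
VOCAB = list(range(10))
human_data = VOCAB * 6                                   # 60 human samples
model_out = [random.choice(VOCAB) for _ in range(40)]    # 40 model samples

def retrain_and_generate(training_data, n):
    """Stand-in 'model': learn the empirical frequencies of the
    training data, then generate n new samples from them."""
    counts = Counter(training_data)
    items = list(counts)
    return random.choices(items, weights=[counts[i] for i in items], k=n)

for generation in range(5000):
    # Majority-human training mix: 60 human samples + 40 model samples.
    model_out = retrain_and_generate(human_data + model_out, 40)

final_sample = retrain_and_generate(human_data + model_out, 1000)
print("distinct ideas still generated:", len(set(final_sample)))
```

Because every idea keeps at least a 6% share of each training mix, none can be forgotten, even after thousands of generations. The contrast with the pure-feedback case is the whole argument for keeping human data in the majority.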