Apple, NVIDIA, Salesforce & more accused of scraping YouTube videos to train AI models

As the controversial practices surrounding artificial intelligence development, use, and maintenance remain a hot-button issue, a recent investigation may have revealed a sweeping and unapproved use of YouTube videos in the training of AI models at Apple, NVIDIA, Salesforce, and other tech giants. The investigation suggests that content from a massive number of third-party YouTube channels may have been scraped by data collectors and used to train AI without approval.

The large-scale investigation was carried out by Proof News, as reported by Wired. It looked into the materials and datasets used in AI model training, which included subtitles and transcripts ripped from an estimated 173,536 YouTube videos, representing more than 48,000 different channels. This data is said to have been used by a number of tech giants, including Apple, NVIDIA, Salesforce, and Anthropic.

The sources in this “YouTube Subtitles” dataset included educational and informational channels such as MIT and Harvard, news media groups such as BBC and the Wall Street Journal, and even entertainment sources like The Late Show With Stephen Colbert and Last Week Tonight With John Oliver. Massive YouTube content creators like MrBeast, Jacksepticeye, and PewDiePie also appeared prominently in the dataset. Creators such as Marques Brownlee of the MKBHD Podcast said they never gave permission for their videos to be used in such a manner, but their content was used anyway.

With tech giants feverishly chasing down any data they can get for training AI, it remains to be seen whether the outcry prompts a change or a halt in the scraping of unapproved videos. Stay tuned as we watch for further updates to this story on our Artificial Intelligence topic.