Apple, NVIDIA, Salesforce & more accused of scraping YouTube videos to train AI models
An investigation found that tech giants had sourced data from over 170,000 YouTube videos across more than 48,000 channels not affiliated with the companies.
As the controversial practices of artificial intelligence development, use, and maintenance remain a hot-button issue, a recent investigation may have revealed a gross and unapproved use of YouTube videos in the training of AI models at Apple, NVIDIA, Salesforce, and other tech giants. The investigation suggests that content from a massive number of third-party YouTube channels may have been scraped by data collectors and used without approval in the training of AI.
A large-scale investigation was carried out by Proof News, as reported by Wired. The investigation looked into materials and datasets utilized in AI model training, which included subtitles and transcripts ripped from an estimated 173,536 YouTube videos, representing more than 48,000 different channels. This data is said to have been utilized by a number of tech giants, including Apple, NVIDIA, Salesforce, and Anthropic.
Apple has sourced data for their AI from several companies
One of them scraped tons of data/transcripts from YouTube videos, including mine
Apple technically avoids "fault" here because they're not the ones scraping
But this is going to be an evolving problem for a long time https://t.co/U93riaeSlY
— Marques Brownlee (@MKBHD) July 16, 2024
Among the sources of the data used in this “YouTube Subtitles” dataset were materials from various educational and informational channels such as MIT and Harvard, news media groups such as BBC and the Wall Street Journal, and even entertainment sources like The Late Show With Stephen Colbert and Last Week Tonight With John Oliver. Massive YouTube content creators like MrBeast, Jacksepticeye, and PewDiePie also appeared prominently in the dataset. Creators such as Marques Brownlee of the MKBHD Podcast shared that they never gave permission for use of their videos in such a manner, but their content was used anyway.
With tech giants feverishly chasing down any data they can get for training AI, it remains to be seen whether the outcry prompts an adjustment or a stop to the scraping of unapproved videos. Stay tuned as we watch for further updates to this story on our Artificial Intelligence topic.
-
TJ Denzer posted a new article, Apple, NVIDIA, Salesforce & more accused of scraping YouTube videos to train AI models
In theory, there is a symbiotic relationship between the search engine and the publishers of content. The search engine can scrape content and advertise next to its results, then send the user to the content publisher. The publisher can then monetize the traffic the search engine sends them.
With AI models, they scrape the content, refactor it, and republish it themselves without exchanging anything with the publisher.
-
-
I mean, the publishers are wrong. They can opt out of being included in search. It’s always bad for their business. Search results increase their revenue, not detract from it.
If the government thinks news needs to be subsidized or nationalized, then just do it with general funding. There’s no need to manufacture some special search engine tax, let alone some of the particularly bad implementations that subsidize Murdoch outlets.