Apple, NVIDIA, Salesforce & more accused of scraping YouTube videos to train AI models

An investigation found that tech giants had sourced data from over 170,000 YouTube videos across more than 48,000 channels not affiliated with the companies.


As the controversial practices of artificial intelligence development, use, and maintenance remain a hot-button issue, a recent investigation may have revealed a gross and unapproved use of YouTube videos in the training of AI models at Apple, NVIDIA, Salesforce, and other tech giants. The investigation suggests that a massive number of third-party YouTube channels and their content may have been scraped by data collectors and used without approval in the training of AI.

A large-scale investigation was carried out by Proof News, as reported by Wired. The investigation looked into materials and datasets utilized in AI model training, which included subtitles and transcripts ripped from an estimated 173,536 YouTube videos, representing more than 48,000 different channels. This data is said to have been utilized by a number of tech giants, including Apple, NVIDIA, Salesforce, and Anthropic.

Among the sources of the data in this “YouTube Subtitles” dataset were materials from various educational and informational channels such as MIT and Harvard, news media groups such as BBC and the Wall Street Journal, and even entertainment sources like The Late Show With Stephen Colbert and Last Week Tonight With John Oliver. Massive YouTube content creators like MrBeast, Jacksepticeye, and PewDiePie also appeared prominently in the dataset. Creators such as Marques Brownlee of the MKBHD Podcast shared that they never gave permission for use of their videos in such a manner, but their content was used anyway.

With tech giants feverishly chasing down any data they can get for training AI, it remains to be seen whether the outcry prompts an adjustment or a stop to the scraping of unapproved videos. Stay tuned as we watch for further updates to this story on our Artificial Intelligence topic.

Senior News Editor

TJ Denzer is a player and writer with a passion for games that has dominated a lifetime. He found his way to the Shacknews roster in late 2019 and has worked his way to Senior News Editor since. Between news coverage, he also aids notably in livestream projects like the indie game-focused Indie-licious, the Shacknews Stimulus Games, and the Shacknews Dump. You can reach him at tj.denzer@shacknews.com and also find him on Twitter @JohnnyChugs.

From The Chatty
    • reply
      July 16, 2024 11:52 AM

      Oh no!

    • reply
      July 16, 2024 1:32 PM

      Tech giants give precisely zero fucks. More on News at Ten.

    • reply
      July 16, 2024 4:58 PM

      Isn't that what search engines do? Why is it considered wrong?

      • reply
        July 16, 2024 5:40 PM

        Really?

      • reply
        July 16, 2024 6:54 PM

        Companies that monetize generative AI that uses copyrighted source material open themselves up to legal liabilities.

      • reply
        July 17, 2024 6:44 AM

        In theory there is a symbiotic relationship between the search engine and the publishers of content. The search engine can scrape content and advertise next to its results, then send the user to the content publisher. The publisher can then monetize the traffic the search engine sends them.

        With AI models, they scrape the content, refactor it, and republish it themselves without exchanging anything with the publisher.

        • reply
          July 17, 2024 8:51 AM

          A bunch of publishers (and governments) believe the relationship isn’t symbiotic and that publishers deserve extra money beyond the extra traffic generated to their sites by inclusion in search.

          • reply
            July 17, 2024 8:52 AM

            Yeah, that’s why I said in theory. Google has gradually manufactured a situation where less and less traffic reaches content publishers. Hence the whole Google Zero concept.

            • reply
              July 17, 2024 8:57 AM

              I mean, the publishers are wrong. They can opt out of being included in search. It’s always bad for their business. Search results increase their revenue, not detract from it.

              If the government thinks news needs to be subsidized or nationalized then just do it with general funding. There’s no need to manufacture some special search engine tax, let alone some of the particularly bad implementations that subsidize Murdoch outlets.

    • reply
      July 16, 2024 9:10 PM

      I’m angry at them not for the scraping but that they scraped such a shitty source.
