
Apple, NVIDIA, Salesforce & more accused of scraping YouTube videos to train AI models

An investigation found that tech giants had sourced data from over 170,000 YouTube videos across more than 48,000 channels not affiliated with the companies.

TJ Denzer

July 16, 2024 10:33 AM


As the controversial practices surrounding artificial intelligence development, use, and maintenance remain a hot-button issue, a recent investigation may have revealed a gross and unapproved use of YouTube videos in the training of AI models at Apple, NVIDIA, Salesforce, and other tech giants. The investigation suggests that a massive number of third-party YouTube channels and their contents may have been scraped by data collectors and used without approval to train AI.

A large-scale investigation was carried out by Proof News, as reported by Wired. The investigation looked into materials and datasets utilized in AI model training, which included subtitles and transcripts ripped from an estimated 173,536 YouTube videos, representing more than 48,000 different channels. This data is said to have been utilized by a number of tech giants, including Apple, NVIDIA, Salesforce, and Anthropic.

Apple has sourced data for their AI from several companies

One of them scraped tons of data/transcripts from YouTube videos, including mine

Apple technically avoids "fault" here because they're not the ones scraping

But this is going to be an evolving problem for a long time https://t.co/U93riaeSlY
— Marques Brownlee (@MKBHD) July 16, 2024

Among the sources of the data in this “YouTube Subtitles” dataset were materials from educational and informational channels such as MIT and Harvard, news media groups such as BBC and the Wall Street Journal, and even entertainment sources like The Late Show With Stephen Colbert and Last Week Tonight With John Oliver. Massive YouTube content creators like MrBeast, Jacksepticeye, and PewDiePie also appeared prominently in the dataset. Creators such as Marques Brownlee of the MKBHD Podcast shared that they never gave permission for their videos to be used in such a manner, but their content was used anyway.

With tech giants feverishly chasing down any data they can get to train AI, it remains to be seen whether the outcry prompts a change or a halt in the scraping of unapproved videos. Stay tuned as we watch for further updates to this story on our Artificial Intelligence topic.

TJ Denzer

Senior News Editor

TJ Denzer is a player and writer with a passion for games that has dominated a lifetime. He found his way to the Shacknews roster in late 2019 and has worked his way to Senior News Editor since. Between news coverage, he also aids notably in livestream projects like the indie game-focused Indie-licious, the Shacknews Stimulus Games, and the Shacknews Dump. You can reach him at tj.denzer@shacknews.com and also find him on BlueSky @JohnnyChugs.


From The Chatty


Shacknews

July 16, 2024 10:33 AM

TJ Denzer posted a new article, Apple, NVIDIA, Salesforce & more accused of scraping YouTube videos to train AI models
- node
  
  July 16, 2024 1:32 PM
  
  Tech giants give precisely zero fucks. More on News at Ten.
- quazar
  
  July 16, 2024 4:58 PM
  
  Isn't that what search engines do? Why is it considered wrong?
  - duncandun
    
    July 16, 2024 5:40 PM
    
    Really?
  - the man with the briefcase
    
    July 16, 2024 6:54 PM
    
    Companies that monetize generative AI that uses copyrighted source material open themselves up to legal liabilities.
  - disembodied potato
    
    July 17, 2024 6:44 AM
    
    In theory there is a symbiotic relationship between the search engine and the publishers of content. The search engine can scrape content and advertise next to its results, then send the user to the content publisher. The publisher can then monetize the traffic the search engine sends them.
    
    With AI models, they scrape the content, refactor it, and republish it themselves without exchanging anything with the publisher.
    - derelict515
      
      July 17, 2024 8:51 AM
      
      a bunch of publishers (and governments) believe the relationship isn’t symbiotic and publishers deserve extra money beyond the extra traffic generated to their sites by inclusion in search
      - disembodied potato
        
        July 17, 2024 8:52 AM
        
        Yeah, that’s why I said in theory. Google has gradually manufactured a situation where less and less traffic reaches content publishers. Hence the whole Google Zero concept.
        
        - derelict515
          
          July 17, 2024 8:57 AM
          
          I mean, the publishers are wrong. They can opt out of being included in search. It’s always bad for their business. Search results increase their revenue, not detract from it.
          
          If the government thinks news needs to be subsidized or nationalized then just do it with general funding. There’s no need to manufacture some special search engine tax, let alone some of the particularly bad implementations that subsidize Murdoch outlets.
- Amusatron
  
  July 16, 2024 9:10 PM
  
  I’m angry at them not for the scraping but that they scraped such a shitty source