Tech Companies Under Fire for Using Swiped YouTube Videos to Train AI Models
As generative artificial intelligence (AI) spreads, tech companies are constantly seeking training data to improve their models. A recent investigation by Proof News has revealed that some companies, including Apple, Nvidia, and Anthropic, used transcripts of YouTube videos to train their AI models without the creators' permission.
The investigation found that these companies relied on a dataset called YouTube Subtitles, which contains transcripts of over 173,000 YouTube videos from various channels. The videos range from educational content to news outlets to popular creators like MrBeast and Marques Brownlee. The companies used the data for their AI models even though YouTube’s rules prohibit downloading and using content without permission.
Marques Brownlee, a popular tech YouTuber, addressed the issue on social media, stating that Apple had sourced data from companies that scraped transcripts from YouTube videos, including his own. While Apple may not have done the scraping itself, the revelation raises concerns about the ethics of training AI on data collected without authorization.
Proof News also built a search tool that lets creators check whether their videos were included in the dataset without permission. The dataset contains no imagery from the videos, but it does include translated subtitles in multiple languages.
YouTube Subtitles is part of the Pile, a larger compilation released by EleutherAI, a non-profit AI research lab focused on promoting open science norms. The Pile also includes material from various other sources, including the European Parliament and English Wikipedia, and was released under a permissive license for academic and research purposes.
The investigation highlights the ongoing challenges surrounding data ethics and consent in the AI industry. Companies must be held accountable for their data practices and must ensure that the data they use is obtained ethically and with proper permissions. As the use of AI continues to grow, tech companies will need to prioritize transparency and ethical data usage to build trust with users and creators.