YouTube Videos Used to Train AI Models Without Permission

Eve Harrison

Major tech companies are training generative AI (GenAI) models on large amounts of YouTube content without the creators' permission. An investigation has revealed that Apple, Nvidia, Anthropic, and Salesforce used transcripts from YouTube videos, raising questions about copyright infringement and the ethics of data usage.

What’s Happening & Why This Matters

The Dataset and Companies Involved

An investigation by Proof News and Wired found that companies including Apple, Nvidia, Anthropic, and Salesforce used an AI dataset called “YouTube Subtitles.” The dataset comprises transcripts from 173,000 YouTube videos across nearly 50,000 channels. Videos from well-known creators and hosts such as MrBeast, John Oliver, Jimmy Kimmel, and Stephen Colbert are included. Copyrighted music videos from artists such as Katy Perry and Taylor Swift were also found in the dataset.

The Origin and Scope of the Dataset

The YouTube Subtitles dataset is part of a larger 800GB dataset known as “The Pile,” released by AI startup EleutherAI in 2021. The Pile includes a wide array of sources such as PubMed, FreeLaw, Wikipedia, HackerNews, and GitHub. Approximately one-third of The Pile’s data comes from academic sources, while another third is scraped from the broader internet, including YouTube.

Responses and Controversies

EleutherAI claims that its dataset creation process does not significantly increase harm, but the ethical concerns remain substantial. A Salesforce representative said the company used The Pile because it was publicly available and released under a permissive license. That does not answer whether the creators of the underlying copyrighted work ever gave permission. Nvidia did not respond to a request for comment; it has faced a lawsuit for using parts of The Pile to train its NeMo AI without consent from authors.

YouTube’s Position

YouTube’s CEO, Neal Mohan, stated that using YouTube videos to train AI without explicit permission is a clear violation of the platform’s terms. Despite this, Google, YouTube’s parent company, has trained its own AI tools on YouTube videos, claiming that existing creator agreements allow it. YouTube’s policies state that content cannot be downloaded or used for anything other than personal, non-commercial purposes without express authorization.

TF Summary: What’s Next

The discovery that tech giants are using unlicensed YouTube content to train AI models raises significant questions about copyright, ethics, and data usage. As the debate continues, content creators and platforms must work out how to protect intellectual property while still fostering innovation. Possible developments include stricter regulations, clearer guidelines, and new agreements between tech companies and content creators.

By Eve Harrison “TF Gadget Guru”
Background:
Eve Harrison is a staff writer for TechFyle's TF Sources. With a background in consumer technology and digital marketing, Eve brings a unique perspective that balances technical expertise with user experience. She holds a degree in Information Technology and has spent several years working in digital marketing roles, focusing on tech products and services. Her experience gives her insights into consumer trends and the practical usability of tech gadgets.