[TF Opinion] High Cost of AI Training Data Limits Access to Big Tech Only

Li Nguyen

The high cost of training data is putting advanced AI systems out of reach for all but the wealthiest tech companies. When it comes to training AI models, the quantity and quality of data matter more than the model’s design, architecture, or other characteristics. Models trained on more data tend to perform better: Meta’s Llama 3, for example, outperforms AI2’s OLMo model, a gap attributed to its significantly larger training dataset.

Many AI models rely on human-annotated data to learn associations between labels and other observed characteristics of the data. However, the emphasis on large, high-quality datasets favors tech giants with big budgets, who then lock up this data and keep others from catching up.

Acquiring these large datasets often involves unethical or even illegal behavior, raising questions about copyright infringement and the risk of legal reprisals. Even the more transparent deals are not fostering an equal and open AI ecosystem, since users do not share in the revenue from data licensing. With the cost of AI training data expected to rise from $2.5 billion to almost $30 billion within a decade, many data brokers are exploiting this growing demand to the detriment of the wider AI research community.

By Li Nguyen “TF Emerging Tech”
Liam ‘Li’ Nguyen is a persona characterized by his deep involvement in the world of emerging technologies and entrepreneurship. With a Master's degree in Computer Science specializing in Artificial Intelligence, Li transitioned from academia to the entrepreneurial world. He co-founded a startup focused on IoT solutions, where he gained invaluable experience in navigating the tech startup ecosystem. His passion lies in exploring and demystifying the latest trends in AI, blockchain, and IoT.
