The high cost of training data is putting advanced AI systems out of reach for all but the wealthiest tech companies. When it comes to training AI models, the quantity and quality of data matter more than the model’s design, architecture, or other characteristics. Models trained on more data perform better: Meta’s Llama 3, for example, outperformed AI2’s OLMo model largely because it was trained on a significantly larger dataset.
AI models need human-annotated data to learn the associations between labels and other characteristics of the data. But the emphasis on large, high-quality datasets favors deep-pocketed tech giants, who then lock up that data and prevent others from catching up.
Acquiring these large datasets often involves ethically or legally dubious practices, raising questions about copyright infringement and the threat of legal reprisals. Even the more transparent licensing deals do little to foster an equal and open AI ecosystem: the users who generated the data see none of the licensing revenue. With the cost of AI training data expected to rise from $2.5 billion to almost $30 billion within a decade, data brokers are exploiting this growing demand to the detriment of the wider AI research community.