Large language models are already trained on most of the internet. But they’ll keep getting better

It’s been 12 months since ChatGPT, a chatbot built on a large language model, took the world by storm. It can generate content and ideas - from business plans, to recipes, to images and music - in seconds.

Since then, it has taken another major leap forward with the launch of GPT-4, the latest release, which improves the model’s ability to generate authentic, human-like responses.

But can it keep getting better? The answer isn’t so straightforward.

How large language models learn

ChatGPT, alongside Google’s rival chatbot Bard and Microsoft’s Bing Chat, is built on a large language model (LLM). LLMs are a type of AI algorithm that learns from very large quantities of data, enabling them to produce outputs remarkably similar to what a human would create.

To replicate human outputs, LLMs use deep learning techniques in neural networks, which are designed to mimic the structure and function of the human brain. During training, LLMs learn to recognize the patterns in the examples they are shown, and then produce new and unique outputs with similar characteristics to the data they have been exposed to. Training data is drawn from publicly available content on the internet - from social media, to news articles, to ebooks and blogs.
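
To make that idea concrete, here’s a toy sketch in Python. Real LLMs use deep neural networks trained on billions of documents; this sketch uses simple word-pair counts instead, but it illustrates the same principle: absorb statistical patterns from text, then generate new output with similar characteristics. The tiny corpus and all names are illustrative assumptions.

```python
import random
from collections import defaultdict, Counter

# Toy "training" corpus - real models learn from billions of documents.
corpus = (
    "large language models learn patterns from text . "
    "language models produce new text from learned patterns ."
).split()

# "Training": count how often each word follows each other word.
transitions = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word][next_word] += 1

# "Generation": repeatedly sample a plausible next word.
word = "language"
output = [word]
for _ in range(8):
    candidates = transitions[word]
    if not candidates:
        break
    word = random.choices(list(candidates), weights=candidates.values())[0]
    output.append(word)

print(" ".join(output))  # e.g. "language models learn patterns from text ."
```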

But training data may soon run out 

Training an LLM effectively requires enormous quantities of high-quality, long-form, factually accurate text. GPT-3, the model on which the first release of ChatGPT was based, was trained on 45TB of content - approximately 293 million pages. GPT-4, the latest release, was trained on 1 petabyte of text, more than 20 times as much as GPT-3.
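
As a quick back-of-envelope check on that comparison, using only the figures quoted above:

```python
# Sanity-check the ratio between the two training-set sizes quoted above.
GPT3_BYTES = 45e12   # 45 TB, as reported for GPT-3
GPT4_BYTES = 1e15    # 1 PB, as reported for GPT-4

print(f"GPT-4's corpus is ~{GPT4_BYTES / GPT3_BYTES:.0f}x larger")  # ~22x
```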

The internet is vast, but it’s still finite. In fact, 1 petabyte of data is a large proportion of all text that has ever been published online. A report by Epoch AI, a research organization, suggests that by 2026 technology companies will run out of high-quality text to train their models. 

Time to take a new approach 

As a result, LLM developers are making significant investments to enhance the quality of their existing inputs. Several AI labs have hired armies of data annotators to label images and assess their models’ responses to questions and requests, improving the quality of outputs without adding new raw training data.

Although some of these roles require specialist knowledge - a master's or even a doctorate - much of the work is routine, so developers are looking to regions where labor costs are low to annotate data at scale.
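
For illustration, a single annotation task might look something like the hypothetical record below; the field names are assumptions, not any lab’s actual schema.

```python
# A hypothetical example of a data-annotation task: an annotator compares
# two model responses to the same prompt and records which is better.
annotation = {
    "prompt": "Explain what a large language model is.",
    "response_a": "An LLM is an AI system trained on large volumes of text...",
    "response_b": "LLM is big AI text thing...",
    "preferred": "a",  # the annotator's judgement
    "reason": "Response A is accurate and clearly written.",
}
# Many thousands of comparisons like this can teach a model which kinds
# of responses humans prefer, without adding new raw training text.
```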

AI companies are also amassing more data through users’ interactions with their tools. Many AI tools incorporate feedback mechanisms that let users indicate how useful an output is. For instance, Adobe Firefly’s text-to-image generator, a ‘diffusion model’, presents users with four generated images to choose between; Bard lets users double-check its responses against Google Search; and ChatGPT users can give each output a thumbs-up or thumbs-down. These feedback signals can then be fed back into the underlying models, effectively turning users into trainers.
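
As a rough illustration of how such feedback might be captured for later training, here is a minimal sketch; the record format, function name and file path are assumptions for illustration, not any vendor’s real implementation.

```python
import json
from datetime import datetime, timezone

def log_feedback(prompt: str, response: str, thumbs_up: bool,
                 path: str = "feedback.jsonl") -> None:
    """Append one user-feedback record as a line of JSON."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "response": response,
        "label": 1 if thumbs_up else 0,  # 1 = helpful, 0 = unhelpful
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Example: a user gives one chatbot answer a thumbs-up.
log_feedback(
    prompt="Summarize my meeting notes.",
    response="The team agreed to ship the new feature in May...",
    thumbs_up=True,
)
```

Records like these can later be used to nudge the model toward the kinds of responses users rated as helpful.
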
An even stronger signal of the quality of a chatbot's responses, though, is whether users copy the text and paste it elsewhere. This particular data point drove significant improvements in Google's translation tool.

And there’s another significant data source that remains largely underutilized: the data stored in the databases of tech firms’ corporate clients. Many businesses possess substantial and valuable quantities of data, including call-center transcripts and customer spending records, often without realizing its potential. This text, audio and other data is particularly valuable because it can be used to fine-tune models for specific business applications, such as supporting call-center employees in handling inquiries or helping analysts identify ways to boost sales.
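
As a simplified illustration of what fine-tuning on business data can involve, the sketch below converts call-center transcripts into prompt/completion training examples in JSONL, a common format for fine-tuning pipelines. The transcripts, field names and file name are illustrative assumptions, not a specific vendor’s API.

```python
import json

# Hypothetical historical transcripts drawn from a company's own records.
transcripts = [
    {
        "customer": "My order arrived damaged. Can I get a replacement?",
        "agent": "I'm sorry to hear that. I've arranged a free replacement, "
                 "which should arrive within three working days.",
    },
    # ...more transcript pairs
]

# Write one prompt/completion training example per line.
with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    for t in transcripts:
        example = {
            "prompt": f"Customer inquiry: {t['customer']}\nSuggested reply:",
            "completion": " " + t["agent"],
        }
        f.write(json.dumps(example) + "\n")
```

A model fine-tuned on examples like these could then suggest draft replies to call-center employees handling similar inquiries.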

Unlocking that information will almost certainly be key to further improving LLMs and keeping their outputs human-like. Amazon, Microsoft and Google all now offer tools to help companies better manage their unstructured datasets.

Looking to the future

We’ve come a long way already - and there’s still huge potential for LLMs to improve. And although AI won’t be taking many of our jobs any time soon, tools like ChatGPT will get even better at helping us automate manual work - freeing up time to focus on the tasks that genuinely require human input.

Want to learn how generative AI can transform your organization, improve efficiencies and meet your business goals? SBM can help. Talk to us. 
