🙊 Why Small Language Models are better than LLMs in 90% of cases

Last week, I was at the GAIAS (Generative AI Application Summit). As a co-chair of that event, I invited Julien Simon, Chief Evangelist of Hugging Face. (Photo of us at the end.)

His keynote on the state of LLMs was insightful. But above all, it opened my eyes to the fact that you don’t use a sledgehammer to crack a nut.

Let’s unpack.

Enjoy reading it in 4:30 min. 

🙊 Why Small Language Models (SLMs) are better than LLMs in 90% of cases

It is roughly accurate that the larger a model is, the better it understands the world and the more emergent capabilities it shows (e.g., emulating a persona, reasoning, etc.).

However, is a large model always the best choice? No. Considering all requirements (performance, latency, costs, etc.), 9 out of 10 times there is a better-fitting model.

Example

If you build an AI that answers your clients’ calls, you chain three models: 1x speech-to-text (STT), 1x language model (LM), and 1x text-to-speech (TTS).

That means three AI models run in sequence for every message exchanged.
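To make the sequencing concrete, here is a minimal Python sketch of such a pipeline. The function bodies are placeholders, not any specific library’s API:

    # Placeholder pipeline: three models run back to back,
    # so their latencies add up for every single exchange.

    def transcribe(audio: bytes) -> str:
        # STT step; in practice a model like Whisper would run here.
        return "What are your opening hours?"

    def generate_reply(transcript: str) -> str:
        # LM step; this is where an LLM - or an SLM - answers.
        return "We are open from 9 to 5, Monday to Friday."

    def synthesize(reply: str) -> bytes:
        # TTS step; returns audio to play back to the caller.
        return b"<audio bytes>"

    def handle_message(audio: bytes) -> bytes:
        return synthesize(generate_reply(transcribe(audio)))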

In live client calls, low latency is critical for a good interaction experience.

An LLM like GPT-4o, even though it is much faster now than GPT-4 Turbo, is A) too large to respond very fast (<1 sec.) and B) pretty costly.

🤔 Fact: For a client, I once built a solution with GPT-4, and the volume of calls incurred half a million dollars per month. Too much, even for a global corporation.

SLMs

Meet SLMs. These are models with roughly 3B parameters - about a 100th the size of an LLM.

As always, the first question is: Which is the best? 

No big surprise: you can find the answer on the Open LLM Leaderboard, filtered for 3B models. Keep Mixture of Experts (MoE) models visible in the filter.

What is an MoE?
It is an architecture that employs a divide-and-conquer strategy by using multiple specialized sub-models, known as experts, to handle different parts of a task.
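A toy sketch of the routing idea (illustrative only, not any production model’s code): a small gate scores the experts for each input, and only the top-k of them actually run.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def moe_forward(x, experts, gate, k=2):
        scores = softmax(gate @ x)        # one score per expert
        top_k = np.argsort(scores)[-k:]   # keep only the k best experts
        # The unchosen experts stay idle, which is why an MoE can have
        # many parameters yet stay cheap to run per token.
        return sum(scores[i] * experts[i](x) for i in top_k)

    # Example: 4 tiny "experts", each just a random linear map.
    rng = np.random.default_rng(0)
    experts = [lambda x, W=rng.normal(size=(3, 3)): W @ x for _ in range(4)]
    gate = rng.normal(size=(4, 3))
    print(moe_forward(rng.normal(size=3), experts, gate))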

As of today, you will discover that Microsoft’s new Phi-3 model is the best - and it is not even an MoE. Let’s use this model going forward.

It has 3.8B parameters, a context window of up to 128k tokens, and it is a model that you can fine-tune.

💡 This model is so tiny that you can download it and host it on your laptop.

SHORT DEMO: how to run an SLM on your laptop
    1. Download Ollama at ollama.com
    2. Open a terminal and type: ollama run phi3
    3. After installation, you can use it even offline
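Once the model is pulled, Ollama also serves a local REST API (port 11434 by default), so you can call Phi-3 from code, e.g. in Python:

    import requests

    # Ask the locally hosted Phi-3 a question via Ollama's REST API.
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "phi3",
              "prompt": "Explain in one sentence why small models are fast.",
              "stream": False},
    )
    print(r.json()["response"])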

When you specialize Phi-3 for your task (here: talking to clients between an STT and a TTS model) through prompt engineering or fine-tuning (now an affordable and quick option for a 3B model), you reach performance very comparable to a model 100x its size.

Fine-tuning example from the Generative AI Book: https://a.co/d/2RwB5ak

📌 I only recommend fine-tuning when there is a specific linguistic style, domain specialization, or some task refinement.
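For orientation, here is a minimal fine-tuning setup sketch with Hugging Face transformers and peft (LoRA). The hyperparameters and target modules are illustrative, and you would still add a Trainer plus your own client-dialogue dataset:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    model_id = "microsoft/Phi-3-mini-4k-instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

    # LoRA trains small adapter matrices instead of all 3.8B weights,
    # which is what makes fine-tuning an SLM quick and affordable.
    lora = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["qkv_proj"],  # attention projection in Phi-3; verify for your checkpoint
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of the base model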

Because SLMs like Phi-3 are so much smaller than LLMs, their response time is a tiny fraction of an LLM’s. In real-time scenarios, that is a big plus.
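You can check this on your own machine with a quick-and-dirty timing of the local setup from the demo above (numbers will depend on your hardware):

    import time
    import requests

    # Measure the round-trip time of one local Phi-3 call.
    start = time.perf_counter()
    requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "phi3", "prompt": "Hello!", "stream": False},
    )
    print(f"Round trip: {time.perf_counter() - start:.2f}s")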

Their smaller size also means lower resource requirements.

Why does it work?

LLMs are good generalists that we almost always prompt-engineer to solve specific tasks for us.

Analogy for Prompt Engineering
Prompt engineering can be seen as focusing the model to “access the part of its training” that is needed to solve the task at hand.

The majority of its skills are not needed.

We don’t need to write in Elvish, generate a recipe for a dragon stew, or create a step-by-step guide for time travel.
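In practice, that focusing is often just a system prompt. A small sketch against the local setup from above (the company name is a placeholder):

    import requests

    # The system prompt narrows the generalist down to one job.
    system = ("You answer customer phone calls for ACME Corp. "
              "Reply in at most two short sentences, suitable for text-to-speech.")
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "phi3", "system": system,
              "prompt": "When do you open tomorrow?", "stream": False},
    )
    print(r.json()["response"])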

Other benefits

  • As you can host an SLM easily by yourself, you have full control over the model.

  • There is no dependency on an external API, giving you full flexibility.

  • There are no API costs, and because your data never leaves your own infrastructure, you can also reach a higher security grade.

  • It has a smaller ecological footprint.

📌 Consider an SLM; it might save you a fortune. If only performance matters, choose the best model available.

This is a snapshot of the current state. In a year, we'll see both larger models and lower latencies, thanks to constant progress on every element of the AI pipeline.

✍️ Bluedot has built a lean, bot-free AI notetaker

Bluedot is an AI-powered Chrome extension for Google Meet.

(Source) Which language models are hallucinating the least? This Hugging Face Leaderboard has the answer

(Source) Andrej Karpathy’s 4h video on reproducing GPT-2 (124M parameters) makes you an immediate AI expert

He literally explains everything.🙆 

(Source) More mind-blowing videos from the Chinese AI video generation model KLING are appearing

(Source) Fresh out of WWDC: Apple Intelligence in 5 min.

I am still making up my mind.

There he is …

Martin

Did you like the voice memo?

😊
