LLMs on CPU? The 1-bit framework is a Masterpiece

+ Grok 2 + news on robots + AI Agents and Blockchain 🤷‍♀️

100 seconds, packed with extraordinary tech advancements. ✅ 

  • Tech Deep Dive: Now you can run up to 100B LLMs locally on your CPU (No GPUs!) and get 5-7 words/second

  • Grok 2's unparalleled vision capabilities

  • Latest on humanoid robots

  • Learn best practices for building products and code with AI!

  • Based AI Agents → on blockchain

Now you can run up to 100B LLMs locally on your CPU (No GPUs!) and get 5-7 words/second

Microsoft has open-sourced BitNet, an ultra-efficient LLM framework that delivers groundbreaking performance by shrinking how much data each parameter needs.

It’s slightly technical, but the concept isn’t hard. Let me explain.

[Chart: model inference on an Apple M2 Ultra, comparing llama.cpp with bitnet.cpp (ternary) across model sizes from 125M to 100B parameters. bitnet.cpp is faster at every size, with the speedup growing from 1.37x at 125M to roughly 5x at 70B-100B, and it still reaches human reading speed (5-7 tokens/sec) at 100B. Inset charts show energy cost per token dropping by 55.4% for the 700M model and 70.0% for the 70B model.]

Traditional approaches use at least 16 bits to store the trained parameters, which are decimal numbers, such as 6.5.

Staying with the example, storing 6.5 in 16-bit floating point uses 1 bit for the sign (+/−), 5 bits for the exponent (order of magnitude), and 10 bits for the mantissa (the number's actual digits). 16 bits in total.
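If you want to see that concretely, here is a tiny Python snippet (just the standard library, nothing BitNet-specific) that unpacks 6.5 into exactly those three fields:

```python
import struct

# Pack 6.5 as an IEEE 754 half-precision (16-bit) float and pull the fields apart.
bits = struct.unpack("<H", struct.pack("<e", 6.5))[0]

sign     = bits >> 15            # 1 bit:  0 means positive
exponent = (bits >> 10) & 0x1F   # 5 bits: order of magnitude (biased by 15)
mantissa = bits & 0x3FF          # 10 bits: the number's significant digits

print(f"{bits:016b}")            # 0100011010000000
print(sign, exponent, mantissa)  # 0 17 640
print((-1) ** sign * 2 ** (exponent - 15) * (1 + mantissa / 1024))  # 6.5
```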

BitNet, the 1-bit framework, is much simpler: it limits each trained parameter to just three values, -1, 0, or +1. No other values are possible. There is no 6.5.
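For the curious, here is a minimal NumPy sketch of the kind of ternary "absmean" quantization the BitNet b1.58 paper describes; the function name and example weights are my own, not code from the framework:

```python
import numpy as np

def ternarize(weights: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Map full-precision weights to {-1, 0, +1} via absmean scaling:
    divide by the mean absolute value, then round and clip."""
    scale = np.abs(weights).mean() + eps
    return np.clip(np.round(weights / scale), -1, 1)

w = np.array([0.42, -1.30, 0.05, 6.50, -0.70])
print(ternarize(w))   # [ 0. -1.  0.  1.  0.] -- only -1, 0, and +1 survive
```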

Why would they do this?

BitNet runs 4.1 times faster and has 8.9 times the throughput of models using the 16-bit representation. Period. Basta.

By relying on integer additions instead of costly floating-point multiplications, BitNet is not only faster but also requires significantly less memory.
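Here is a toy NumPy illustration (my own sketch, not the framework's actual kernel) of why ternary weights remove the multiplications: every weight either adds an activation, subtracts it, or skips it.

```python
import numpy as np

def ternary_matvec(w: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Matrix-vector product with weights restricted to {-1, 0, +1}:
    each weight adds, subtracts, or skips the activation, so no
    multiplications are needed. (Illustrative only; real kernels like
    bitnet.cpp pack the weights into a few bits and vectorize this.)"""
    out = np.zeros(w.shape[0], dtype=x.dtype)
    for i, row in enumerate(w):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

w = np.array([[ 1, 0, -1],
              [-1, 1,  0]])
x = np.array([2.0, 3.0, 5.0])
print(ternary_matvec(w, x))   # [-3.  1.]
print(w @ x)                  # same answer from an ordinary matmul
```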

It is genius!

And this is huge because …

…, with this, we can do more in less time, which is vital for AI's progression. Look at reasoning models such as o1 (Claude 3.5 Opus might be one as well): you want them to process thousands of ideas in under a second. Perfect!

Further, this means lower energy usage and reduced infrastructure costs. That is especially helpful for edge and mobile devices, where resources are limited, and for real-time applications like OpenAI's Realtime API (I wrote about it and how you can apply it).

How do you get 1-bit LLMs running on your laptop with BitNet? Here is the GitHub Repo plus the steps to do it.

⚠️ If you absolutely want me to do a video on how to implement it, let me know in the comments or reply to this email.

Be an everyday genius 🧑‍🔬 


Learning a little every day can have a huge impact—especially if you're learning on Brilliant. Explore thousands of bite-sized, interactive lessons on everything from math and data analysis to programming, AI, and beyond.


xAI’s Grok 2 (the AI model) now has vision capabilities that are incredibly detailed (my demo)

I have worked with Anthropic, Google, OpenAI, and other top-notch AI companies, but xAI’s vision capabilities are unparalleled. Look for yourself in my demo below.

Latest updates on humanoid robots 🤖 

Clone has built Torso, an upper-body robot actuated solely by artificial muscles. With that, robot anatomy is getting closer to human-like biology.

Finally, a humanoid robot with a natural, human-like walking gait. Chinese company EngineAI just unveiled their life-size general-purpose humanoid SE01.

Learn best practices for building products and code with AI!

The AI Summit Seoul has been at the forefront of tech for years.

Only signal, no noise.

This year, I am happy to share GenerativeAI.net's self-developed framework for using current AI tools most effectively in product development.

Additionally, I'll host a hands-on workshop this year. Participate, learn the frameworks, build your own agent, and leave with a working agent at your fingertips.

[Banner: AI Summit Seoul 2024, December 10-11, 2024, COEX Grand Ballroom, Seoul. Session: "How AI informs product development, especially co-development with AI Agents" with Martin Musiol, GenAI Leader | GenerativeAI.net.]

AI Agents + Blockchain = Based Agent ⛓️ 

Create AI Agents with full on-chain functionality in less than 3 minutes.

The era of Autonomous Onchain Agents is here, built with the Coinbase SDK, OpenAI, and Replit.

To get started, you need an API key from https://cdp.coinbase.com, a key from OpenAI, and to fork the Replit template. It couldn't be easier to start adding whatever functionality you want to these agents.

I can’t wait to see what you build. 🙂 

⚠️ If you want me to demo how to implement it, let me know in the comments or reply to this email.


That’s a wrap! I hope you enjoyed it.

Martin

Our webpage