Generative AI - Short & Sweet
9 Areas Where Humans Still Outperform AI

... and a big update from Mistral

Martin Musiol
November 19, 2024 • Estimated Reading Time: 8 minutes

AI hasn’t fully taken over yet. Humans still excel in certain areas - of course.

These 9 benchmarks cover vital skills and show how AI measures up against humans.

(Don’t miss out on the quick AI highlights at the bottom.)

✅ Before we start: we launched a premium version of the newsletter. Subscribing gives you full access to all content, exclusive demos, and an ad-free experience. I plan to host AMAs and develop more subscriber-requested demos.

Run Your Sales on Autopilot

Increase the output of your sales team without buying more tools or hiring new SDRs. Onboard Agent Frank, Salesforge’s AI SDR, to fully automate prospecting, personalized outreach and booking meetings while your team focuses on closing deals. Get a personalised email from Agent Frank to see his work in action:

Show me the magic

(✨ If you don’t want ads like these, Premium is the solution. It is like you are buying me a Starbucks Iced Honey Apple Almondmilk Flat White a month.)

9 areas where humans still have an edge compared to AI

It might feel like humans are losing their edge against AI systems that keep getting better. What value can humans still capture compared with AI systems?

Turns out there are several fields.

In my research, I stumbled upon these 9 datasets/evaluations showing that humans still have a clear edge over AI systems, at least for now.

What is WorkArena++?

These are 682 tasks that simulate workflows typical for knowledge workers, testing planning, problem-solving, reasoning, info retrieval, and context understanding. Humans outperform AI thanks to more robust reasoning and contextual grasp.

To get really powerful AI agents, we want them to excel on this benchmark. In 2025, we will see great progress here, to the point where the average human might no longer have a competitive edge.

What is Simple-bench?

Multiple-choice tasks (200+ questions) test spatio-temporal reasoning and social intelligence. High schoolers currently outperform state-of-the-art models.

Humans will maintain a competitive edge for the foreseeable future.

What is ARC-AGI?

Assesses AI's ability to learn new skills and solve open-ended problems via patterns and abstract reasoning. Humans excel due to better generalization and abstract thinking.

A simple concept—covered in a past episode. Humans will likely outperform computers for another 2–3 years.

What is MiniWob?

Web-based tasks test reinforcement learning agents in navigation and interaction. Humans currently lead due to better understanding and adaptability. However, with AI gaining access to web pages via visual, textual, and API channels, the margin is narrowing quickly. By 2025, AI will match or surpass humans in these tasks, and it is already starting to take over here.

What is WebArena?

Evaluates complex web tasks like information retrieval and form filling. The gap between average human capabilities and AI is shrinking rapidly, similar to MiniWob. While opportunities remain for now, AI will soon close the gap entirely.

What is Putnam Bench?

Tests theorem-proving algorithms with problems from the Putnam Mathematical Competition. While the average human doesn’t have an edge here, human experts (PhDs) excel. Interestingly, the AI-human baseline is often mislabeled. Starting next year, AI collaborators will reach parity with human PhDs, significantly accelerating scientific progress.

What is NOCHA?

Evaluates object classification and hierarchical annotation. Humans still outperform AI due to sharper visual perception and contextual understanding. Visual AI has evolved gradually over decades—from early convolutional neural networks to current LLM integrations. For at least the next year, AI won’t surpass the average human in these tasks.

What is GAIA?

Tests generalization across tasks and environments, especially for Internet research. Humans currently excel with natural adaptability. However, AI agents are likely to surpass the average human within 2–3 years. Progress depends not only on smarter AI but also on larger context windows, better comprehension, and improvements in model architecture.

What is Lab-Bench?

Focuses on biology-related lab tasks like experimental design and data analysis. Humans excel with expertise and intuition, but the role is shifting. In the coming years, scientists—biologists, chemists, and physicists—will evolve into research project managers, supported by teams of AI agents handling routine tasks.

Updates from Mistral - Pixtral Large (open weights) & Le Chat

Pixtral Large: 124B Parameters of Power

This model is crushing benchmarks.

Top scores on MathVista, DocVQA, and VQAv2
Maintains the strong text skills of Mistral Large 2
Built with a 123B decoder + 1B vision encoder
128K-token context window for long documents
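
If you are wondering what it takes to actually run it, here is a rough back-of-the-envelope sketch. It is my own arithmetic, not something from Mistral: it adds up the two parameter counts quoted above and estimates the memory needed for the weights alone at a few common precisions. The bytes-per-parameter table is a standard rule of thumb and an assumption here; it ignores activations, the KV cache, and framework overhead, so treat the results as a lower bound.

```python
# Back-of-the-envelope estimate of weight memory for a model with the
# parameter counts quoted in the list above. Illustrative only.

DECODER_PARAMS = 123e9         # 123B decoder (from the list above)
VISION_ENCODER_PARAMS = 1e9    # 1B vision encoder (from the list above)

# Assumed storage cost per parameter at common precisions (rule of thumb).
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def total_params() -> float:
    """Total parameter count: decoder plus vision encoder."""
    return DECODER_PARAMS + VISION_ENCODER_PARAMS


def weight_memory_gb(precision: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params() * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    print(f"Total parameters: {total_params() / 1e9:.0f}B")
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: ~{weight_memory_gb(precision):.0f} GB of weights")
```

At 16-bit precision that works out to roughly 248 GB for the weights alone, so this is a multi-GPU model rather than something for a laptop.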

Want it? It’s free to download on Hugging Face.

Le Chat Also Just Leveled Up

It now does:

Web search with sources cited for fact-checking
Canvas for brainstorming: Edit, export, create seamlessly
Vision upgrades: Reads images & documents
Flux Pro for stunning image generation
Speculative editing: Predicts & refines text faster than you

And yes, it’s still free. → Le Chat

NVIDIA ALCHEMI Accelerates Sustainable Materials Discovery for EV Batteries and Solar Panels

Ian Buck announces NVIDIA Alchemi, an AI-driven digital chemistry service to discover new compounds for medicine, chemistry and materials design, speeding up the discovery process by 100x
— Tsarathustra (@tsarnick)
8:24 PM • Nov 18, 2024