[I think this is the biggest achievement so far in recurring self-improving AI: autoresearch, by Andrej Karpathy.]
When Andrej Karpathy released autoresearch, the part that stuck with me was not the model improvements.
It was the loop.
Make a change. Measure it. Keep it if it helps. Throw it away if it does not. Repeat. All night. No human in the loop.
In his published run, the agent executed 125 experiments on a single GPU and improved the model while Karpathy slept.
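The loop itself is simple enough to sketch in a few lines. This is a minimal illustration of the greedy keep/discard idea, not Karpathy's actual code; `propose_change`, `evaluate`, `apply`, and `revert` are hypothetical hooks standing in for real experiment machinery.

```python
import random

def autoresearch_loop(baseline_score, propose_change, evaluate,
                      apply, revert, n_cycles=125):
    """Greedy keep/discard loop: mutate, measure, keep only improvements."""
    best = baseline_score
    kept = []
    for _ in range(n_cycles):
        change = propose_change()   # candidate mutation (e.g. a code diff)
        apply(change)
        score = evaluate()          # run the benchmark
        if score > best:            # keep strict improvements
            best = score
            kept.append(change)
        else:
            revert(change)          # discard everything else
    return best, kept

# Toy demo: the "model" is a single number we nudge toward a target.
state = {"x": 0.0}
random.seed(0)
best, kept = autoresearch_loop(
    baseline_score=-3.0,
    propose_change=lambda: random.uniform(-1, 1),
    evaluate=lambda: -abs(state["x"] - 3.0),       # closer to 3.0 is better
    apply=lambda c: state.update(x=state["x"] + c),
    revert=lambda c: state.update(x=state["x"] - c),
    n_cycles=200,
)
```

The point is that nothing in the loop requires judgment at runtime: all the human taste lives in `evaluate`, and everything else is search.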

That is already interesting for ML research, and might lead toward Artificial Superintelligence (ASI).
But I kept thinking: what if you point this at a product instead of a model?
We Tried It
We started applying this inside FlowCursor, where one of the hardest problems is getting the right context to the AI model so it can answer the right question.
Think YouTube comments instead of the sidebar. The correct calendar block instead of the navigation. The right code region instead of random visible text.
We scoped an agent to the files that influence context quality. We built a benchmark around real user situations: docs, task boards, calendars, YouTube, PDFs, code editors. We built a scoring function that penalizes broken builds and rewards better extraction.
Then we ran the loop.
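A scoring function with those properties can be very small: a broken build is disqualifying, and everything else is scored by extraction quality across the benchmark cases. This is a toy sketch under assumed inputs, not our production scorer; `build_ok` and `case_results` are hypothetical names.

```python
def score_candidate(build_ok, case_results):
    """Toy fitness function for a product autoresearch loop.

    build_ok: did the candidate change still compile/build?
    case_results: per-benchmark-case extraction quality, each in [0, 1].
    """
    if not build_ok:
        return float("-inf")   # never keep a change that breaks the build
    if not case_results:
        return 0.0             # no evidence, no credit
    return sum(case_results) / len(case_results)
```

The hard part is not this function; it is making sure the per-case scores actually track what users experience.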
The Honest Result
The gains so far are modest. Early signals, not a victory lap.
But what we did see is that better context selection and better wrong-surface filtering can move the product in the right direction on internal cases. A few of those changes survived the keep/discard gate. Most did not.
That is actually the interesting part.
Because once you have a working loop and an honest benchmark, improvement stops being only about taste or guesswork. It starts becoming a search problem.
And search scales with compute.
Why Compute Is the Scarce Resource Now
Once the loop is real, the bottleneck is not ideas. It is compute.
More cycles. More candidate changes. More benchmark cases. More parallel agents. More analysis of what worked and what regressed.
For years, product improvement was bottlenecked by engineering time and good instincts. With a loop like this, part of it shifts to how much honest search you can afford to run.
This is one reason tools like Cursor Pro matter in practice for this kind of work. Pro includes extended Agent limits, Cloud Agents, unlimited Tab completions, and maximum context windows. For agent-heavy workflows like this, that headroom is the difference between running 5 experiments and running 50. Not because more compute replaces judgment, but because it lets you search more around the judgment you already have.
What Changes When You Apply This to Products
The core idea from Karpathy stays the same. The setup changes.
For model training, he had one file to mutate, one metric, and one clear target.
For products, you need to get four things right:
A tight mutation surface.
An honest fitness function.
Real benchmark cases.
Safe cycle infrastructure.
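As a rough sketch of how those four pieces might hang together, here is one possible shape for the configuration. All names here are hypothetical, assumed for illustration, not a prescribed structure.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ProductLoopConfig:
    # 1. Tight mutation surface: the only files the agent may edit.
    mutation_surface: List[str]
    # 2. Honest fitness function: maps a candidate run to a single score.
    fitness: Callable[[float], float]
    # 3. Real benchmark cases: captured user scenarios, not synthetic prompts.
    benchmark_cases: List[dict] = field(default_factory=list)
    # 4. Safe cycle infrastructure: isolate each candidate run so failed
    #    experiments never touch the main codebase.
    sandbox_dir: str = "/tmp/loop-sandbox"

cfg = ProductLoopConfig(
    mutation_surface=["context/select.py"],  # hypothetical file path
    fitness=lambda s: s,
)
```

Keeping these four concerns explicit and separate is what makes each experiment reviewable after the fact.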
Getting each of those right for a product, not a training script, is where the real work lives. And it is where most teams will get stuck if they try this without a guide.
I wrote a PDF guide that walks through exactly how to set up a product autoresearch loop in your own business.
It covers:
how to scope the mutation surface so the agent stays focused and experiments stay reviewable
how to design a fitness function that actually reflects user value, not just code changes
how to build benchmark cases from real user scenarios
how to structure the cycle scripts so failed experiments never touch your main codebase
how to think about compute budgets, agent tooling, and search strategy
the mistakes we made and what we would do differently
This is the kind of applied, practical document I plan to keep producing for premium subscribers. Not just ideas, but the actual setup for applying them in products and businesses.
If you subscribe to premium today, you get the PDF guide when it ships.
And because compute is one of the scarce resources in this kind of work: the first 3 premium subscribers who reply to the welcome email will also get one of my Cursor Pro invites. I only have 3 of these. They are for new Cursor users only.
Unlock the Autoresearch Guidance (+ Cursor Pro)
Get the full PDF guide on setting up product autoresearch in your business, plus all future applied deep dives. The first 3 premium subscribers who reply to the welcome email also get a 2-week Cursor Pro invite (new Cursor users only, I only have 3).
Subscribe to Premium
Benefits:
- PDF guide: product autoresearch from scratch
- Applied deep dives on evals, benchmarks, and agent workflows
- First 3 subscribers get a 2-week Cursor Pro invite


