Why Most Agent AIs Don't Work (Yet)

And what to do about it.

Way back in March 2023, which feels like an eternity ago in AI years, excitement around agents first seemed to spark.

That was around the time that AutoGPT launched — and it was all the rage. Then came BabyAGI and other similar projects.

These were all very cool — but had one teeny, tiny problem. They didn’t work. Not really. And for the most part, they still don’t.


How Agent AIs Do What They Do

Let’s first dig into how projects like AutoGPT, BabyAGI and their descendants work.

You supply the agent with a high-level goal (by way of a prompt). It then uses an LLM (Large Language Model) like GPT to break that goal down into a series of tasks. It completes those tasks, usually by invoking the LLM again, doing a web search, or invoking other tools. It has a working “memory” so that it knows which tasks have been completed and can take the output of one task and use it in other tasks.
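
Here is a rough sketch of that loop in Python. It is only an illustration of the idea, not how AutoGPT or BabyAGI are actually built. Every name in it (fake_llm, plan_tasks, run_agent) is made up for this example, and the LLM is a canned stub so the code runs offline.

```python
from typing import Callable


# Hypothetical stand-in for a real LLM call, so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("PLAN:"):
        # A real model would derive this task list from the goal.
        return "research topic\nwrite summary"
    return f"done({prompt})"


def plan_tasks(goal: str, llm: Callable[[str], str]) -> list[str]:
    """Ask the LLM to break the goal into tasks, one per line."""
    reply = llm(f"PLAN: {goal}")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def run_agent(goal: str, llm: Callable[[str], str]) -> dict[str, str]:
    """Plan the tasks, then run each one, feeding earlier outputs forward."""
    memory: dict[str, str] = {}  # working "memory": task -> output
    for task in plan_tasks(goal, llm):
        context = "; ".join(memory.values())
        memory[task] = llm(f"TASK: {task} | CONTEXT: {context}")
    return memory


results = run_agent("summarize agent frameworks", fake_llm)
```

Even this toy version calls the LLM three times for a two-task goal: once to plan, and once per task.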

To do what it needs to do, the agent uses multiple invocations of an LLM.

And that’s where we start to have a problem.

The Chaos Of Compounding Errors

The problem is that LLMs have an error rate. They’re great, but not perfect. Also, as it turns out, the quality of the results can vary widely based on the quality of the prompt supplied. When you use ChatGPT as a human, you have a general sense of what a good answer looks like, and will often tweak a prompt to cajole the LLM into giving you what you want. With agents, the agent is writing the prompt, or getting an LLM to write the prompt (it’s all very meta). That’s fine, until it’s not.

Let’s walk through this.

Let’s say that, for a given problem domain, an LLM will generate a “right-ish” output 95% of the time, on average (I’m just making that up, so work with me here). This is for some definition of “right-ish”. That’s pretty good, and certainly usable for many use cases. It’s why ChatGPT is so darn useful.

Now, let’s say an agent needs to invoke the LLM a dozen times to accomplish a goal: everything from creating the list of tasks or sub-tasks, to generating output for a particular task, to tracking progress.

Mathematically, if each invocation has a 95% success rate (a 5% error rate) and every invocation has to succeed, the success rate of the final result is 0.95 to the 12th power (you can just type that into Google), which is about 54%. That assumes the errors are independent of one another.

So basically a coin toss, in that example. About half the time you’ll get something right-ish and the other half you’ll get something wrong-ish.
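
If you’d rather not type it into Google, here is the same arithmetic as a few lines of Python. It’s a back-of-the-envelope sketch: the function name is made up for this post, the 95% figure is still invented, and the model assumes each call succeeds or fails independently of the others, which real agents won’t strictly obey.

```python
def chain_success_rate(per_call_success: float, num_calls: int) -> float:
    """Probability that every call in a chain succeeds, assuming each
    call succeeds or fails independently of the others."""
    return per_call_success ** num_calls


# How the odds fall as the number of LLM invocations grows.
for calls in (1, 5, 12, 30):
    print(f"{calls:>2} calls: {chain_success_rate(0.95, calls):.0%}")
```

At a dozen calls you’re at roughly 54%, and the number keeps dropping as the chain gets longer.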

The problem is that for general-purpose agent platforms, that error rate can be higher, because they’re trying to be all things to all people. And, depending on the kinds of tools/functions a platform supports, the error rate can be really high. Further, depending on the complexity of the initial goal, the agent may need to invoke the LLM dozens of times.

So, more often than not, your curiosity and excitement will wane as you realize that Agent AIs just don’t work.

What We Can Do To Make Agent AIs Work

There’s good news though.

Though Agent AIs don’t quite work yet in the abstract, general sense, it is possible to create special-purpose agents that do work. And that’s precisely what developers are working on.

Instead of creating a broad “this can do anything!” agent like AutoGPT, builders are building agents for specific types of goals in a particular domain. This means you can reduce the error rate considerably.

Here are example ways to increase the success rate of these custom agents:

  1. Provide a more focused, use-case specific system prompt to kick things off. Since you know what kinds of goals the agent supports, you can improve the odds of coming up with the right set of tasks initially. In fact, there’s nothing wrong with actually defining the set of tasks up-front, if you know what needs to be done.

  2. Provide better tools/functions for the agent to access. So, instead of just doing a generic Google search, you could give the agent access to a set of proprietary APIs that are built to retrieve specific kinds of data or perform specific kinds of tasks.


  3. Get human assistance along the way. There is nothing wrong with the agent recognizing that a particular type of task is better done by a human. Or, the agent can ask a human a question that is difficult for the agent to get an answer to, but easy for the human to answer.


  4. Allow human feedback. Instead of just running from beginning to end and letting the error rate compound, the agent could intermittently ask a human for feedback to make sure it’s on the right track, and adjust if not. The really good agents would learn from this feedback, in the spirit of RLHF (Reinforcement Learning from Human Feedback), so that the agent gets “smarter” over time.
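
To make a few of these ideas concrete, here is a small sketch of a special-purpose agent with a task list defined up front, purpose-built tools, and a human checkpoint after every step. Everything in it is hypothetical: the order-status scenario, the tool functions, and the review callbacks are invented for illustration, and the “human” is simulated by a plain function so the example runs on its own.

```python
from typing import Callable, Optional


# Hypothetical domain-specific tools; real ones would wrap proprietary APIs.
def lookup_order(order_id: str) -> str:
    return f"order {order_id}: shipped"


def draft_reply(status: str) -> str:
    return f"Hello! Your {status}."


# The tasks are fixed in advance instead of being planned by an LLM.
PIPELINE: list[Callable[[str], str]] = [lookup_order, draft_reply]


def run_with_checkpoints(
    start: str,
    steps: list[Callable[[str], str]],
    review: Callable[[str, str], Optional[str]],
) -> str:
    """Run each step in order, letting a reviewer correct its output.

    review(step_name, output) returns a corrected output, or None to accept.
    """
    value = start
    for step in steps:
        value = step(value)
        correction = review(step.__name__, value)
        if correction is not None:
            value = correction  # a fix here keeps the error from compounding
    return value


# Simulated reviewers; a real one would prompt a person.
def auto_approve(step_name: str, output: str) -> Optional[str]:
    return None


def fix_status(step_name: str, output: str) -> Optional[str]:
    if step_name == "lookup_order":
        return "order A1: delayed"
    return None


approved = run_with_checkpoints("A1", PIPELINE, auto_approve)
corrected = run_with_checkpoints("A1", PIPELINE, fix_status)
```

The point of the checkpoint is that a mistake caught after the first step never reaches the second one, so the error rate stops compounding at that point.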

Where We Go From Here

A few things are happening.

  1. New agent frameworks like Microsoft’s AutoGen are starting to get really good.

  2. LLMs like GPT-4 Turbo are getting better at calling tools/functions, making the “core” more stable and predictable for certain tasks.

  3. Software companies are getting more focused and solving discrete customer problems using available tools.

I think we’re going to see a lot of progress on agents in 2024.

Progress will come from builders reducing scope (taking a more pragmatic approach) and LLM providers increasing accuracy.

On the scope-reduction front, one question you might have is: “When is an agent not a real agent?”

Or, What’s the Minimum Viable Agent?

That’s the topic of my next post. Stay tuned.

Meanwhile, I’d appreciate any thoughts, comments, or feedback.