
The AI Agent Hype is Ahead of the Product. Here's What's Still Broken.

AI agents like Devin look great in demos but fail at real work. I tried them. Here's what's broken with their memory and error recovery, and why I can't trust them.

By Rohan Mehta · 6 min read

I remember watching the Devin demo video, the one where it supposedly completes a whole project from Upwork. As an editor writing for AI builders here in India, and as someone who still spends weekends tinkering with code, I felt that familiar jolt of excitement. It’s the feeling that the future has just lurched forward. This wasn’t just a better chatbot; it was a vision of an autonomous colleague. So, naturally, I had to get my hands on everything that claimed to be an 'AI agent'. I've spent the last month trying to get real, useful work out of Devin, Manus, Claude’s experimental agent mode, and even an early peek at what OpenAI is building with 'Operator'. My verdict? The hype is dangerously ahead of the product.

Let’s start with Devin, the poster child for the agent revolution. I gave it what I thought was a straightforward, real-world task for Pulse AI. "Take our existing Ghost blog template, which is a standard Handlebars.js setup. I need you to add a persistent dark mode toggle. It should respect the user's system preference on first load, but then remember their choice in local storage. Once you're done, deploy the updated theme to our staging site on Netlify." This is a classic junior developer task. Devin started strong, correctly identifying the files to modify. It even installed the necessary packages. But then, it hit a wall. It couldn't properly handle the Netlify CLI's authentication flow. It kept trying the same failed command, caught in a loop of its own making. After an hour of watching it spin its wheels and generate error logs, I gave up. The demo showed an agent acing a job interview; my reality was an agent that couldn't figure out how to log in.
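
For context, here is roughly the behaviour I was asking for, as a minimal sketch: system preference on first load, explicit choice remembered in local storage afterwards. The `theme-toggle` id and `data-theme` attribute are my assumptions, not our actual template's markup.

```ts
// A minimal sketch of the toggle logic: OS preference on first load,
// explicit choice persisted in localStorage afterwards. The button id
// "theme-toggle" and the data-theme attribute are hypothetical.
const STORAGE_KEY = "theme";

function currentTheme(): "dark" | "light" {
  const saved = localStorage.getItem(STORAGE_KEY);
  if (saved === "dark" || saved === "light") return saved;
  // No saved choice yet: fall back to the system preference.
  return window.matchMedia("(prefers-color-scheme: dark)").matches
    ? "dark"
    : "light";
}

function applyTheme(theme: "dark" | "light"): void {
  document.documentElement.dataset.theme = theme;
}

applyTheme(currentTheme());

document.getElementById("theme-toggle")?.addEventListener("click", () => {
  const next = currentTheme() === "dark" ? "light" : "dark";
  localStorage.setItem(STORAGE_KEY, next); // remember the explicit choice
  applyTheme(next);
});
```

That is the entire hard part of the brief. The agent got this far and then drowned in the deployment step.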

Next up was Manus, a less-hyped but powerful-looking agent I got access to. I decided to give it a content-related task instead of a coding one. "Scan the 'Articles' directory on my local machine," I instructed. "Inside, you'll find the last 15 posts I've written for Pulse AI. Read them, identify the three most common themes I keep returning to, and then draft a new article outline that synthesizes my perspective on the future of AI agents." This felt like a perfect job for an LLM with legs. At first, it seemed to be working, listing the files correctly. But then the reasoning fell apart. It hallucinated themes that weren't there, mashing together half-sentences from different articles into incoherent 'key points'. It claimed I wrote extensively about 'quantum AI accelerators' – a topic I've never touched. It failed to grasp the context that connected the posts, treating them as a disconnected bag of words. The agent couldn't see the forest for the trees, and the trees it did see were imaginary.
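
For contrast, the mechanical half of that task doesn't need a model at all. Here is a deliberately boring word-frequency baseline, a sketch that assumes the posts are Markdown files in an `Articles` directory (my assumption); it can't identify themes by itself, but it also can't invent 'quantum AI accelerators':

```ts
// A deterministic baseline for the "most common themes" half of the task:
// count word frequencies across the posts instead of guessing. Assumes
// Markdown files in ./Articles; the path and stopword list are mine.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

const dir = "./Articles";
const stopwords = new Set([
  "this", "that", "with", "have", "from", "about", "which",
  "their", "would", "what", "when", "more", "they", "will",
]);
const counts = new Map<string, number>();

for (const name of readdirSync(dir).filter((f) => f.endsWith(".md"))) {
  const words = readFileSync(join(dir, name), "utf8")
    .toLowerCase()
    .match(/[a-z]{4,}/g) ?? [];
  for (const w of words) {
    if (!stopwords.has(w)) counts.set(w, (counts.get(w) ?? 0) + 1);
  }
}

// Top candidate theme words. A human still has to turn these into themes,
// but at least none of them will be imaginary.
const top = [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, 20);
console.log(top);
```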

Then there are the OS-level agents, the ones that promise to operate your computer for you. I tried both Claude's latest mode and a tool similar to what's been demonstrated of OpenAI's Operator. My task was simple digital hygiene. "Go to my 'Downloads' folder. Create three subfolders: 'Images', 'Documents', and 'Archives'. Sort all the existing files into these folders based on their file type. Finally, delete any file in any of those folders that is older than six months." What happened was a masterclass in frustration. The agent started by asking me for permission for every single file move. 'Can I move `screenshot-2024-05-12.png` to `Images`?' it would ask, completely defeating the purpose of autonomy. After I told it to proceed with all future similar actions, it encountered a `.webp` file, panicked because it wasn't a `.jpg` or `.png`, and gave up on the entire sorting task. It never even got to the part about deleting old files. I'm glad it didn't, frankly. I wouldn't trust it to delete my shopping list, let alone my files.
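
The bitter joke is that this entire chore is a short deterministic script. Here is a sketch of what I actually wanted done, assuming a standard `~/Downloads` layout and my own guess at an extension-to-folder map:

```ts
// What I asked the agent to do, as a plain script: sort Downloads by
// extension, then delete anything older than six months. The folder
// names and the extension map below are my assumptions.
import { readdirSync, statSync, mkdirSync, renameSync, rmSync } from "node:fs";
import { join, extname } from "node:path";

const downloads = join(process.env.HOME ?? ".", "Downloads");
const buckets: Record<string, string> = {
  ".png": "Images", ".jpg": "Images", ".webp": "Images", // the file type that broke the agent
  ".pdf": "Documents", ".docx": "Documents", ".txt": "Documents",
  ".zip": "Archives", ".tar": "Archives", ".gz": "Archives",
};
const sixMonthsAgo = Date.now() - 182 * 24 * 60 * 60 * 1000; // ~6 months

for (const folder of ["Images", "Documents", "Archives"]) {
  mkdirSync(join(downloads, folder), { recursive: true });
}

for (const name of readdirSync(downloads)) {
  const src = join(downloads, name);
  const info = statSync(src);
  if (!info.isFile()) continue; // skip the folders we just created
  const bucket = buckets[extname(name).toLowerCase()];
  if (!bucket) continue; // unknown type: skip it, don't abandon the task
  const dest = join(downloads, bucket, name);
  renameSync(src, dest);
  if (info.mtimeMs < sixMonthsAgo) rmSync(dest); // the step the agent never reached
}
```

Note the two lines the agents effectively lacked: skip what you don't recognise, and keep going.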

These failures aren't just bugs; they point to three fundamental, architectural problems. The first is **memory**. And I don't just mean a bigger context window. These agents suffer from a kind of task amnesia. They can hold the immediate goal in mind, but they lose the plot during a long-horizon task. The agent sorting my files forgot the final, crucial instruction—deleting old files—as soon as it hit a snag with the sorting part. It couldn’t hold a multi-step plan in a resilient way. It’s like having a conversation with someone who forgets the beginning of your sentence by the time you've reached the end. For any task that takes more than five minutes, this is a fatal flaw.

The second, and perhaps most critical, broken piece is **error recovery**. A human developer who sees `Authentication Failed` doesn't just run the same command 50 more times. They stop. They think. They might google the error message, realize they need to generate an API token, or check their environment variables. The agents I tested do none of this. They are incredibly brittle. When faced with an unexpected output—a slightly different UI element, a new version of a library, a permissions error—their logic shatters. They lack the common sense to try a different approach or even to stop and ask for specific, targeted help. They just spiral, creating a mess that's often harder to clean up than just doing the task myself from the start.
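
To make that concrete, here is a sketch of the minimal scaffolding I'd expect: never retry an action that just failed in exactly the same way, try a genuinely different approach, and escalate with a specific question. The `Action` and `escalate` interfaces are hypothetical, not any vendor's API:

```ts
// Sketch of missing error-recovery scaffolding: try an action, and if the
// identical failure repeats, stop looping and ask a targeted question.
type Action = () => Promise<string>;

async function withRecovery(
  action: Action,
  alternatives: Action[],
  escalate: (question: string) => void,
): Promise<string | null> {
  let lastError = "";
  for (const attempt of [action, ...alternatives]) {
    try {
      return await attempt();
    } catch (err) {
      const message = String(err);
      if (message === lastError) {
        // Same failure twice in a row: another blind retry won't help.
        // Hand a human a concrete question instead of spiralling.
        escalate(`Repeated failure: "${message}". What should I try instead?`);
        return null;
      }
      lastError = message;
    }
  }
  escalate(`All strategies failed; last error: "${lastError}"`);
  return null;
}
```

Even the crude rule in this sketch, never repeating an action that just failed identically, would have saved Devin from its Netlify login loop.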

This all leads to the third, most human problem: **trust**. The whole point of an agent is to delegate a task with confidence. But after my experiences, I simply don't trust them. The cost of supervising the agent—constantly watching it, correcting its mistakes, worrying about what it might break—is far higher than the value of the automation it provides. It’s like hiring an intern you have to watch over for every single keystroke. You can't turn your back on it. How can I possibly let an agent work on a client's production codebase or organize my personal file system when I've seen it get stuck in a login loop and hallucinate entire documents? The trust just isn't there, and without it, agents remain a novelty, not a tool.

So, what's the path forward? Shoving more parameters into a model won't solve this. We need a fundamental shift in focus. First, we need to solve state management and long-term memory. This might look like dedicated memory architectures that exist outside the context window, allowing an agent to maintain and update its 'world model' and 'plan' over hours or days. Second, we must build robust scaffolding for error handling and self-correction. Agents need tools to introspect their own failures, form hypotheses about why something went wrong, and plan alternative routes—just like a real engineer. And their ability to ask for help needs to be smarter, escalating to a human with a well-defined question, not just giving up. Finally, we need to move beyond simplistic benchmarks like SWE-Bench. We need new tests that measure resilience, task-completion over long horizons, and the ability to recover from unexpected setbacks. These are the real metrics that matter.

I want to be clear: I'm not an AI pessimist. I'm a builder, and I believe the promise of agentic AI is one of the most exciting frontiers in technology. But as a community, we need to be honest about where we are. We're in the flashy demo phase. To get to the useful product phase, we have to stop being mesmerized by the parlor tricks and get serious about solving the boring, fundamental, and incredibly difficult problems of memory, resilience, and trust. The autonomous AI colleague is coming, but it's not here yet. It's time to roll up our sleeves and actually build it.


Why it matters

1. Current AI agents excel in short-burst demos but consistently fail at multi-step, real-world tasks due to fundamental limitations.
2. The key blockers aren't just about model intelligence, but poor long-term memory, brittle error-recovery logic, and the resulting lack of user trust.
3. The path to useful agents requires a shift from demo-driven development to building robust scaffolding for state management, self-correction, and trustworthy human-in-the-loop systems.