Pilot vs Production with LLMs

A recent MIT report says 95% of organizations are getting zero return on their generative AI investments.

Generic LLM chatbots like ChatGPT appear to show high pilot-to-implementation rates (~83%). Even custom agents built on SOTA models are great for small workflows or single-system tasks, but when you push them into multi-system interactions they tend to lose context and deliver poor-quality results. The solutions often feel brittle, over-engineered, and unnecessarily complicated.

Well-designed custom agents still outperform generic tools like ChatGPT or Claude, but most of the time they fail to match the user experience people expect. That creates friction: avid ChatGPT users distrust internal custom tools that aren’t faster, easier, or as familiar.

What custom tools need is better UX with familiar elements, stronger domain fluency, and a tight feedback loop. We’ve seen this in coding: Claude Code and other tools use human-in-the-loop patterns to get reliable, high-quality outputs. Those same lessons transfer across domains.

High ROI often hides in overlooked functions like operations and finance. We can now build small back-office tools very quickly, work that used to take long engineering cycles, and these tools can improve customer retention and recover 2–5% of lost revenue.

MIT’s Project Iceberg estimates about $2.3 trillion in potential value from automation.

Read the full report here: https://assets.73ai.org/v0.1_State_of_AI_in_Business_2025_Report.pdf/