We're using these tools well,
and leaving most of their power on the table where it counts.
My team and I were running a large modernization that moved our infrastructure to the cloud. We lacked much of the experience it called for.
The chat tools turned out to be invaluable. They helped us understand the options and the current best practices, and analyze the tradeoffs between them. They also helped me present the work: telling the story of how the new architecture served the business, with artwork, music, and a clean, consistent set of data-driven slides. We took on transformative work with more confidence, and produced work I could actually stand behind: checked, sourced, and defensible.
All of that came in bites. Each piece was useful on its own and disconnected from the next, and I corrected and stitched by hand the whole way. When I tried to thread the next multi-week effort through a single chat tool, end to end, it fell apart fast. Not because the tools were weak. Because nothing held the parts together, and holding them together with care is a full-time job on its own. That is the analyst's job, and it has always been mine.
I know this work. Systems analysis under waterfall first, then agile. Waterfall gave the analyst full authority: document everything, analyze thoroughly, build nothing until the plan was sound. Agile broke that open — faster delivery, less documentation, the analyst's work chopped into sprints and given fewer resources. The discipline didn't disappear. It just got squeezed. These tools landed in that same gap.
That is the piece missing from how most of us use these tools. We have been handed extraordinary tools for analysis and knowledge work, and nothing that connects them to each other or to the exact project we are running.
The prompt is the first thing you learn to control. Most people stop there, and for a lot of work, that is exactly right.
The first good prompt is a paragraph. Then it grows to a full page, maybe two. You learn to structure it, put a goal at the top, feed it good context, and teach it to write the way you write. The chat tools are built for this whole spectrum of engagement, from a quick question to a carefully constructed page that produces a strong piece of research. This is the kind of ask they do well.
A requirements document for a large system build is not that kind of ask. Neither is research into industry best practice for a modernization plan. You do not write one prompt, however good, and trust what comes back. The way it gets done now is familiar. A trusted analyst, who is often you, runs a long process: research, meetings, interviews, compiling, reviewing, analyzing, then producing a document and explaining and selling it. There can be days or weeks of accumulating work toward one deliverable that someone will act on.
When faced with that kind of multi-week effort, most people reach for LLMs in one of two ways — and both fall short.
The first is to work in fragments. You ask the tool to summarize this, draft that, react to the other thing, then you assemble the pieces by hand, editing as you go — maybe feeding pieces back in for suggestions, or relying on Word for corrections. The continuity lives in your head, not the tool. The tools add real speed here. You cover more ground and synthesize more than you could alone. But you are stepping in and out and between tools, handling much of it entirely outside of them. When you try to run a full project this way, the failures will stop you.
The second is more aligned with application development than with analysis, research, and writing. You craft one thorough prompt, or build a single long agent, and send it winding through the whole research in one pass to produce the thing. The better version keeps a person in the middle, the approach the field is converging on now, supervising the run rather than turning it loose. That will certainly produce higher-quality individual research artifacts than a fancy prompt. But building a highly governed, flexible agent is real engineering, and most writers and analysts will never build one for the same reason most writers never learn LaTeX: the investment only makes sense if you treat the tool itself as a specialty. Every project has its own shape, and engineering one from scratch each time is its own job.
What I am building uses aspects of both but is neither. A framework that stays in place across the whole project, so the LLM's research, analysis, and writing run through the full process instead of in fragments you assemble or an agent you rebuild each time. The framework holds the continuity in a way that the chat tools by themselves cannot.
Sounds like an agent? The difference lies in where the judgment sits. An agent decides its own next steps and runs on its own. Here the person decides and the tool executes the parts handed to it. It borrows how an agent works under the hood: persistent memory and a structured workspace it reads and writes, without the autonomy. The person is not checking in at the edges of an automated run. The person is on site, directing the build: the tools do the heavy lifting, and the next move is always theirs to call.
A better prompt does not fix these problems. No instruction tells the model what happened in a previous session, and no phrasing compensates for information buried in a long context.
Some will say the tools will catch up: bigger context windows, better integrations, assistance built into the software you already use. They will keep improving. But serious work scales to fill every capability the tool can offer, and without a framework governing what the tools are working on and what the goals are, you cannot stand behind what they produce.
The evidence base: see Research on Current Limits to AI Context in the supporting material below.
I am early in this work. The framework I've assembled in Claude.ai is a solid toolbelt for the analyst, but it has too much duct tape and super glue on it for me to ask anyone else to use it. The next version will be in Cowork and will be closer to something others would enjoy using. The framework is the subject of a separate paper in progress.
For years, a good analyst ran every piece of a project alone. Research, first draft, rewrite, source check, revise again. Capable, but always limited by what one person can hold and how much time they have. That analyst now has a crew: one trade for research, another for drafting, another for fact-checking and revision. The crew can document, synthesize, analyze, write, and refine at a pace no single person matches. Most people with that crew do not yet know how to use them. They are running the tools the way they used a search engine.
The general contractor does not commission the job and come back in three months. They are on site while the work is done: checking each piece as it goes in, sending back what is not right, standing behind what is. The analyst who runs this crew the same way can stand behind what the crew produces, because they were there while it was built.
As the tools get cheaper at producing words, the words stop being the scarce part. What stays scarce is knowing what the work should say, deciding when to pivot, losing nothing along the way, and standing behind it at the end.
That was always the analyst's job, and it still is. What changed is the crew. The trade knowledge moved into the tools. The judgment did not move with it.
This paper was directed across sessions. It began as a research brief, then moved through structured drafts, editorial review against a voice guide, and revision passes logged throughout, with every instruction and critique preserved in the transcript. Session handoff documents captured the state of the work each time context shifted.
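A session handoff document can be as simple as a small structured file written at the end of each session and read at the start of the next. The following is a minimal sketch of that idea, not the framework itself: the `Handoff` fields, the file naming, and the helper functions are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path
from typing import List, Optional


@dataclass
class Handoff:
    """Hypothetical handoff record; every field name here is an assumption."""
    session: int
    stage: str  # e.g. "research brief" or "structured draft"
    decisions: List[str] = field(default_factory=list)
    open_questions: List[str] = field(default_factory=list)
    next_steps: List[str] = field(default_factory=list)


def save_handoff(handoff: Handoff, folder: Path) -> Path:
    """Write one file per session so earlier handoffs are never overwritten."""
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"handoff_{handoff.session:03d}.json"
    path.write_text(json.dumps(asdict(handoff), indent=2), encoding="utf-8")
    return path


def load_latest_handoff(folder: Path) -> Optional[Handoff]:
    """Return the most recent handoff, or None if no handoff exists yet."""
    files = sorted(folder.glob("handoff_*.json")) if folder.is_dir() else []
    if not files:
        return None
    return Handoff(**json.loads(files[-1].read_text(encoding="utf-8")))
```

In this sketch the person still decides what goes into the record. The file only guarantees that a settled decision or an open question is written down where the next session can read it.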
Working this way made the expensive things cheap. Reframing the argument, testing a different way to tell the story, pinning down where this sits in the industry: each would have been a day's work on its own, so alone I would have done one of them, and done it worse. Here each cost an afternoon, so I did all of them, and the placement got sharper for it.
The same process caught something quieter. This piece makes five citations. When a second pass from a different LLM checked the draft against the research document it was built from, only one of the five citations was actually there. The other four had to be found and verified before the piece could stand behind them. A single confident pass would have shipped them unchecked, and they would have looked authoritative.
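The mechanical part of that cross-check can be sketched in a few lines. This is an illustration under stated assumptions, not the process actually used for this piece: citations are treated as short text keys, the research document is plain text, and the function names are hypothetical. A match only shows that a citation appears in the research document. It does not verify that the source exists or says what the draft claims.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_citations(citations: List[str], research_text: str) -> Dict[str, bool]:
    """Map each citation key to whether it appears in the research document."""
    haystack = normalize(research_text)
    return {c: normalize(c) in haystack for c in citations}


def unsupported(citations: List[str], research_text: str) -> List[str]:
    """List the citations that still need to be found and verified by hand."""
    results = check_citations(citations, research_text)
    return [c for c, found in results.items() if not found]
```

Anything returned by `unsupported` goes back to the person for sourcing, which keeps the judgment where it belongs.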
The provenance block below is not a formality. It is the record of that work. The output is inseparable from its lineage, and the lineage is verifiable.
None of this is an argument against using the tools for your everyday work. On a memo, a quick analysis, a first draft, they are excellent, and the framework described here would be wasted overhead. The case for it is narrow. It shows up only where the work runs long and the stakes are real. There the failure modes are not opinion. They have been measured.
| Who | The long work | What breaks across a project | What it costs |
|---|---|---|---|
| Social science and policy researchers | A paper or report over months, many drafts | Citations invented. Settled decisions reopen. | Roughly 20% of AI-generated citations fabricated in one review study (Linardon et al. 2025, JMIR Mental Health / Deakin). Credibility on the line. |
| Business and systems analysts | Requirements and specs across dozens of sessions | Terminology drifts. A requirement no one stated appears. Context lost between threads. | Requirements failures drive rework that can reach 40% of a project budget (PMI Pulse of the Profession). Poor requirements are the most cited cause of project failure (BA Times). |
| Legal analysts | Case prep and briefs built over weeks | Citations hallucinated. Argument loses consistency across sessions. | Leading legal AI tools hallucinated 17–34% of the time in Stanford testing (Magesh et al. 2025, Stanford HAI/RegLab). Over 1,600 court cases now involve AI hallucinations (Charlotin, HEC Paris) — many involving pro se litigants and consumer-grade tools used without professional oversight. |
Segment detail and full sourcing: companion research paper (the evidence base; access on request).
Three different jobs, the same failures. The tool makes things up. It loses the thread across a long context. It forgets the project between sessions. These are properties of how the tools work today. Models keep improving, but real projects run longer than any single context and span more sessions than any single thread, so the need to manage the work does not disappear with the next release. The analyst row carries the thinnest direct measurement. No one has yet run a controlled study on requirements drift in AI-assisted work the way Stanford ran one on legal citations. But if you push the tools on this kind of long, high-stakes work, you will see it yourself.