A Team You Have to Build and Lead

LLM Evolution: From Query to Context
Hugh McCutchen  |  Data & Analytics Platform Leader  |  April 2026
Movement One

Better Than Productivity

Most writing about LLM tools focuses on productivity: faster drafts, quicker summaries, writing feedback. Those are real steps forward, and they bring value. The more useful question for me is whether these tools can help a researcher, analyst, or leader get out of their own head: find the argument inside the material, communicate it graphically and in writing, and build a story worth following.

I spent twenty-two years building enterprise analytical data platforms for a Fortune 10 healthcare company. Those platforms served hundreds of external organizations and thousands of users through multiple reporting solutions and data warehouse products, on premises and in the cloud. I was never able to define and communicate some of the most important things I learned.

The problem was the time and effort required to extract all of that experience, sort it out, and then figure out how to communicate it in a way people would actually follow. My need to close that gap was rising at the same time LLM tools were rising to meet it. That alignment is what this paper is about: accumulated experience looking for a way out, and tools finally capable of helping find it.

The tools have improved. I have also learned how to use them for greater impact.

The Writing Problem

Every attempt to write about the platform came out as an internal briefing. Emails outlining technical decisions. Project plans. Twenty-page analyses comparing reporting tool capabilities, licensing models, embedded versus user-based cost structures by product. Dense, context-dependent, built for people already inside the problem. Useless for anyone trying to decide whether the problem was worth solving.

The Tool Problem

I was asking the wrong question of the wrong tool in the wrong way. "Summarize this" produces a summary. "Research this topic" produces a research report. Neither is thinking. Neither is writing. Neither is storytelling.

The tool was doing what I asked it to do. LLM tools can do far more, especially today. I just had not figured out what to ask for.

Movement Two

What My Use of LLMs Got Wrong

The early sessions produced research. What I needed was clarity and argument.

"Research X for industry best practices" produces a well-organized, thorough research report. But the report is flat, with no judgment about what actually matters. I got that output reliably.

The tools were not failing. I was using them as a research partner, not a thinking partner or a storyteller. The failure was mine.

Part of what I was running into was the limit of what the tools could actually do at that stage. They could not produce reliable diagrams. They could barely maintain consistent formatting across a long output. And when sessions got long and complex, they would lose context. Something we had worked out earlier would surface again as an open question. A conclusion we had reached would quietly disappear. It was not that the tools were wrong. They were hitting a ceiling I did not yet understand. Some of this was tool maturity. The products were evolving under me in real time. What felt like a methodology problem was sometimes the tools catching up. That distinction matters for where the methodology actually starts.

The Writing Track

The white paper drafts from this period had no opening argument, no narrative pull, and no reason for a stranger to keep reading.

The Tool Track

Built reusable prompts in a notepad. Attached context documents to carry state across sessions when the context window ran out. Started asking the tool to embody a specific role: analyst, skeptic, domain expert. The outputs changed when the question changed. Something was beginning to shift.

Movement Three

Evolution in Five Parts

Looking back, my use of these tools moved through distinct phases. The phases are not a ladder everyone climbs at the same speed. In my observation, most people get stuck early. The gap between Phase 1 and Phase 5 is not about your skills outside of the tool. It is about what you expect from it and what you are willing to bring to it.

Evolution of LLM methodology: from Fancy Googling to Leading the Team
Five phases mapped to the arc of anyone who has used these tools long enough to stop querying and start leading.
Phase | What Changed in My Practice | Tool / Period

Phase 1: Fancy Googling
What changed in my practice: Type a question. Get a formatted answer. Copy it. The tool is a retrieval device. This is also where most people lose trust: they treat the tool as an oracle, get one wrong answer, and conclude it is useless. Both responses miss the same thing. The tool knows nothing about you, your problem, or what you actually want. It is a large group of moderately capable strangers. They need to be led, not consulted.
Tool / Period: Late 2024. Gemini. First-generation Gemini Pro. Single-turn Q&A. One-shot questions, no session architecture. Cross-domain exploration to build baseline fluency.

Phase 2: Learning to Configure
What changed in my practice: You discover that how you ask matters. Reusable prompts. Context attachments. The sessions still drift and the output still disappoints more often than not. You are investing more thought in how you ask. But something has changed: you are treating it as something you configure, not just query. That is the door.
Tool / Period: Early 2025. Gemini + ChatGPT. Gemini 1.5 Pro / ChatGPT GPT-4o. Prompt reuse begins. Context docs attached manually. ADF compared with QLIK and QLIK alternatives. First sustained analytical session work.

Phase 3: Character, Comparison, and the Delusional Thread
What changed in my practice: You start asking the tool to play a specific role and to be critical. You move into Projects to hold context. You run Gemini for depth, ChatGPT for story structure. You are managing a team now. But the team is unreliable on complex problems. The delusional thread appears. Long-running requests end in a total freeze with no way to return to them. Rules defined and successfully applied early in a session vanish by the end: a decision reached in turn three is quietly replaced by a different decision in turn fifteen, with no acknowledgment that anything changed. You learn to cross-check not because any one tool is right, but because the disagreements reveal something. The products are also maturing in real time around you.
Tool / Period: Mid 2025. ChatGPT (primary). GPT-4o / GPT-4 Turbo. Switched for quality: better multi-turn reasoning, stronger analytical writing. Deep research on Databricks capabilities and governance architecture. Embedded AI cost modeling. SOX compliance framing, executive communications.

Phase 4: Reports That Tell a Story
What changed in my practice: The output starts to feel written, not generated. Internal reports have a narrative arc. You start asking a different question: not "create a data table with these columns" but "what assertions are not backed up" or "what could I provide that would make my goal more clear." The tool can help you find that answer, but only if you treat the brainstorming as a separate session from the drafting. Story first. Content second. Running them together produces worse work in both.
Tool / Period: Late 2025 – Feb 2026. Claude (entry) + Copilot. Claude Sonnet 3.5 / Copilot with Sonnet mode. SVG architecture diagrams (13+ versions), Word docs, HTML briefings, SOX ITGC mapping. Writing Style Guide built from email analysis in Copilot.

Phase 5: Repeatable, With My Voice
What changed in my practice: The work is now consistent, not occasional. A session protocol. Separate roles for brainstorming, drafting, and editing. Sonnet for generation, Opus for critique. A project charter that holds the thread across sessions. Google Docs as handoff between projects. Two style guides: one that removes the bot, one that adds the person. This is not a tool anymore. The question is not how to use it. The question is how to use it better, every time.
Tool / Period: Mar 2026 – present. Claude (primary). Claude Sonnet 4.5 / Sonnet 4.6 / Opus 4.6. Projects with persistent knowledge base, custom system prompt, provenance discipline, multi-mode sessions. Force Multiplier Report, White Paper series, Career Transition work.

Five phases of methodology maturity mapped to actual tool adoption. The third column is not illustrative. It is the timeline.

Six months ago, I don't think any of the tools would have supported the work I do now, and certainly not nine or twelve months ago. That work depends on sustained context across a complex multi-session project, coherent pushback on argument structure, and professional-grade artifact production. The methodology matters. But the capability had to exist first.

The Writing Track

Phase 4 is when the question shifted from "create a data table with these columns" to "what assertions are not backed up?" That question only gets a useful answer if you ask it before you start drafting.

The first white paper in this series took shape when I stopped loading source material and asking for a summary. I started with a blank brainstorming session: what is the argument, what is the story, what does the reader need to walk away knowing. Content came after.

The Tool Track

Phase 5 required two style guides, not one. The first was extracted from my own writing. It removed the patterns that make output sound generated. The second encoded my voice. It put me back in.

Those two guides changed the output more than any other single investment. But they only work because the session protocol keeps the tool from drifting away from them between turns.

Movement Four

The Infrastructure Behind the Output

Most people think the methodology is in the prompts. It is not. The prompts are the visible part. The methodology is in the infrastructure you build around them before you ever start drafting.

Phase 5 output looks different from Phase 2 output because of habits, mindset, and the infrastructure you build for your work.

The industry has a name for this now. What is crystallizing in 2026 as "context engineering" focuses on the full information environment a model operates in: not individual prompts, but the systems that manage context across sessions, roles, and workflows. The session protocol, the project knowledge base, the system prompt, and the provenance framework described in this paper are a working version of what that discipline looks like in practice for a knowledge worker. The industry named it recently. The practice came first.

The Writing Track

A simple statement of the goal or argument is needed before any real writing starts. Sometimes I arrived with it. More often I found it by telling the tool I wanted to write something about X for some purpose Y, and that I wanted help working it out. The tool is good at pulling the argument out of you. It asks who the audience is. It asks what someone should leave knowing. It presses for clarity when the answer is vague. That session is worth running. Just do it before the drafting begins.

Added top-level instructions to the tool itself: challenge my ideas, be a stoic, keep it brief, do not flatter me. These run before any project context loads. They set the default stance.

Used Copilot to extract my own writing style from work emails without pulling sensitive content into the project. The prompt asked for twenty examples of long-form writing from a period before I used LLM tools at all, on topics complex enough that I would have written carefully. For each, it pulled a representative paragraph and assessed the style patterns across all of them. That report, with the examples included, became the foundation of the style guide here.

Built a Writing Style Guide as a project file. Named six voice modes. Made the tool audit every draft for mode drift.

Built a Banned Patterns document across five categories: words (delve, pivotal, multifaceted); openers (in an era of); structures (three-adjective chains, not only X but also Y); closings (in conclusion); and tone patterns like manufactured urgency and inflated authority. The em dash gets its own line. Updated after every session.
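The mechanical part of a banned-patterns audit can be pictured as a short script. What follows is a minimal sketch under stated assumptions, not the Banned Patterns document itself or the process used for this paper: the `BANNED` table and the `audit` function are hypothetical names, and only the quoted examples (delve, pivotal, multifaceted, "in an era of", "not only X but also Y", "in conclusion", the em dash) come from the text above. Three-adjective chains and tone patterns such as manufactured urgency are left out because a regular expression cannot judge them reliably.

```python
import re

# Hypothetical pattern table. Only the example phrases come from the paper;
# the regular expressions themselves are assumptions made for this sketch.
BANNED = {
    "words": [r"\bdelve\b", r"\bpivotal\b", r"\bmultifaceted\b"],
    "openers": [r"^\s*in an era of\b"],
    "structures": [r"\bnot only\b.*?\bbut also\b"],
    "closings": [r"\bin conclusion\b"],
    "em_dash": ["\u2014"],
}


def audit(draft: str) -> list[tuple[int, str, str]]:
    """Return (line_number, category, matched_text) for every banned hit."""
    hits = []
    for number, line in enumerate(draft.splitlines(), start=1):
        for category, patterns in BANNED.items():
            for pattern in patterns:
                for match in re.finditer(pattern, line, flags=re.IGNORECASE):
                    hits.append((number, category, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = "In an era of change, we delve into data.\nPlain sentence."
    for number, category, text in audit(sample):
        print(f"line {number}: {category}: {text!r}")
```

A script like this only catches the literal patterns. The judgment calls, such as whether a paragraph sounds inflated, still belong to the reviewer or to a tool session run in an editing role.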

The Tool Track

Stopped asking "research this" with no context. Started by creating projects with instructions and using research to build context, then asking "how do we solve this?" or "does X make sense given what we know?"

Separated research sessions from drafting sessions from editing sessions. Each requires a different mode. Running them together produces worse output in all three.

Pushed back when the output drifted. Too hyperbolic, too generic, too diplomatic. The tool learns the correction within the session. Accepting drift and editing around it produces worse work than refusing the drift outright.

Treated the provenance block as a discipline, not a formality. If I could not populate all five fields, something about the work was unresolved.
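The provenance rule above amounts to a completeness check, which can be sketched in a few lines. This is an illustration only. The text does not list the five fields at this point, so the names in `REQUIRED_FIELDS` are placeholders borrowed from headings in the Provenance section at the end of the paper, and which five are meant is an assumption. The function name `unresolved_fields` is also hypothetical.

```python
# Assumed field names: the paper says "five fields" without naming them here.
# These are placeholders taken from the Provenance headings, not a confirmed list.
REQUIRED_FIELDS = (
    "model",
    "sources",
    "writing_process",
    "assumptions",
    "scope_exclusions",
)


def unresolved_fields(provenance: dict) -> list:
    """Return the required fields that are missing or blank."""
    return [
        name
        for name in REQUIRED_FIELDS
        if not (provenance.get(name) or "").strip()
    ]


if __name__ == "__main__":
    draft_block = {"model": "Claude Sonnet 4.6", "sources": ""}
    missing = unresolved_fields(draft_block)
    if missing:
        print("Unresolved before publishing:", ", ".join(missing))
```

An empty result means every field has something in it. It does not mean the content of each field is accurate, which remains a human check.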

This is less about prompting technique than about the infrastructure you build around the session and the discipline you bring to how sessions are structured. Two categories matter most.

Infrastructure

Tool-level instructions: persistent rules the tool applies in every session, regardless of project.

Project manifest: the charter, context files, and session protocol that hold the thread across sessions.

Style guide for formatting: structure, layout, and visual consistency rules.

Style guide for writing: voice modes, banned patterns, argument discipline.

Google Docs as shared layer: style guides and reference documents that need to travel across projects live here, not inside any single project.

Prompting Patterns

Separate projects for research: one project collects data, builds context, and assembles the raw material. It does not draft.

Separate projects for writing: a different project takes that material and works toward voice, purpose, and story. The two do not run together.

Thread hygiene: recognize when a thread is getting tired. Long sessions lose context. Ask the tool for a transition block to copy into the next thread rather than assuming context held.
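A transition block is easier to verify when its shape is known in advance. The sketch below shows one possible shape, assuming the block should carry the things this paper describes going missing in long threads: decisions already made, rules in force, and questions still open. The `TransitionBlock` class and its field names are hypothetical, not a format any tool produces by default.

```python
from dataclasses import dataclass, field


@dataclass
class TransitionBlock:
    """Hypothetical handoff note to paste at the top of a fresh thread."""

    goal: str
    decisions: list = field(default_factory=list)
    rules: list = field(default_factory=list)
    open_questions: list = field(default_factory=list)

    def render(self) -> str:
        """Render the block as plain text with one bullet per item."""
        sections = [
            ("Decisions already made", self.decisions),
            ("Rules in force", self.rules),
            ("Open questions", self.open_questions),
        ]
        lines = ["TRANSITION BLOCK", f"Goal: {self.goal}"]
        for title, items in sections:
            lines.append(f"{title}:")
            shown = items if items else ["(none)"]
            lines.extend(f"- {item}" for item in shown)
        return "\n".join(lines)


if __name__ == "__main__":
    block = TransitionBlock(
        goal="Finish the phase table",
        decisions=["Phases are numbered 1 to 5"],
        rules=["No em dashes in body prose"],
    )
    print(block.render())
```

Comparing what the tool hands back against a fixed shape like this makes a dropped decision or a vanished rule visible before the new thread starts.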

I've built my infrastructure in Claude, but other tools are still your best critics and serve a specific function in this methodology: second and third opinions from something with no context and no investment in the argument. Ask them directly to find where the ideas are not well supported, where there are inconsistencies, or what angles you have not considered. "Read this" produces a reaction. "How can I make this compelling to my boss who is the VP of technology?" will light you on fire.

Movement Five

Managing the Team

The most useful reframe I found was this: stop thinking about an LLM as a tool you use and start thinking about it as a team you have to build and lead.

At Phase 1, the team is a room full of capable strangers who know nothing about you or your problem. They will research anything. They will produce output on demand. None of it will have judgment because you did not ask for judgment. Ask them for everything they know about a topic and they will give you everything. It will be organized and structured. But no one will want to read it, and if they do they will not know why they read it. A colleague told me he uses Copilot to tell him what is important in my writing. That is the failure mode this paper is about.

By Phase 4, the team has matured enough to push back on your framing. They can hold a complex problem across a long session without losing the thread. They can produce professional-grade documents without requiring you to manually correct every output. The product itself changed to make this possible. In my experience, this level of collaborative coherence was not reliably available even six months ago.

When the output disappoints, stay in the manager role. Tell the team specifically what needs to change, ask how we got there, and let them try again. The output on the third attempt is consistently better than what I would have written on the first.

Redirecting only works when you know where you are going. Arrive knowing the argument you want to make. When the argument is not ready, use the session to find it before touching any prose. You can still get there from a pile of material and a vague sense that something is worth saying. It requires more patience and more iteration, but the process holds.

What the Tool Cannot Manufacture

The infrastructure matters. The session protocol matters. But none of it produces value without what the practitioner brings. The tool will sound confident either way. It may not tell you whether the argument is real, and it may not challenge any claim you make. By default, it will not tell you when a conclusion you are drawing is unsupported.

What cannot be outsourced: a real argument, or the intellectual honesty to find one and discard the weak version. Domain expertise the tool cannot fake or verify. Standards high enough to recognize drift when it happens. The willingness to let the tool destroy weak sections rather than edit around them. The discipline to hear "this doesn't hold" and act on it.

If all you do is feed the tool your style guides, your project context, and your own writing examples, you risk building a high-fidelity echo chamber. The tool reflects your framing back with more polish. A good methodology can address this by design. The counter-measures: adversarial modes that require the tool to argue against your position; Devil's Advocate review on every major structural decision; second-opinion passes through Gemini and ChatGPT with no project context. These are not optional flourishes. They are the mechanism that keeps the process from becoming a mirror. A tool that only agrees with you is not a thinking partner. It is a transcription service with better vocabulary.

Leading the Team Is Work

This paper started as a chat that I knew was the germ of a paper. Yesterday afternoon I opened a session with Sonnet and said something like: most people have no idea how powerful these tools can be, or how much work is actually required to get the fullest value from them.

Nine hours of active work later, with less than 24 hours elapsed, what you have read here was built across more than five threads, three tools, and at least four models. There were well over 200 prompt exchanges. Seven named versions. ChatGPT rewrote the entire introduction. Gemini questioned whether the methodology was just a sophisticated echo chamber, which forced a better answer than I had. The cartoon is a play on the evolution march: Claude wrote the prompt, Gemini generated the image, and I coached it with more details to make it funnier. We went through three titles and four subtitles before landing here. I rewrote dozens of Claude's sentences. Claude rewrote dozens of mine.

The tools will get better. The prompting will get easier. But the big ideas do not change with the model version: research, context, knowledge and intent, how to build the session, how to manage the team, how to stay in the director role.

Movement Six

18 Months In

Most writing about LLM tools focuses on what the tools can do. That framing is too small. The tools will work as hard as you do. They will be as smart as you are about how you build your team, how you train it, and what policies and guardrails you give it.

This paper has been about that investment. The tools provided structure, pushback, and form. The argument, the standards, and the judgment were not theirs to supply. That division of labor is the thing worth understanding.

I have been figuring this out over the last 18 months, and the last six have been the most dramatic in terms of the quality of the team. I have gone from herding a mob of tenth graders to forming and leading a well-trained team of college students. I do not know yet how much of what I have learned will transfer to new audiences and new topics. But the early results are interesting enough to keep going, and interesting enough to write about.

Provenance

Model
Claude Sonnet 4.6 | April 2026
Sources
Writing Style Guide v1 (April 14, 2026) | Banned_Patterns_v1.md | White_Paper_Session_Playbook_v1.md | Personal_Context_for_White_Papers.md | Project_Charter_v1 | Session transcript and elicitation exchanges, April 2026 | white_paper_v3.html (Paper A, HTML structure and CSS palette reference) | Final Revision Instructions v6→v7 (GPT-5 and Gemini editorial reviews, April 2026) | Gemini stoic-analytic editorial review (April 21, 2026) | ChatGPT alternate ending and editorial brief (April 21, 2026) | Opus 4.6 career project review (April 2026)
Writing Process
Built through approximately 8 hours of iterative collaboration over a 24-hour period. Hugh McCutchen described his methodology, tools, and the arc from Phase 1 to Phase 5 through an elicitation session; no pre-existing draft. Argument structure, phase taxonomy, panel content, and all prose developed through iterative exchange before any HTML was written. Banned Patterns audit conducted before drafting. Voice checked against Writing Style Guide six-mode framework.
Version History
v1: Initial draft. Full five-movement structure. Phase arc diagram, tool comparison table, and six sidebar panels (Mechanic/Shift pairs). Disclosure block in conclusion.
v2: Section 1 rewritten: foreword tone replaced with article opening. Parallel two-track panel structure (Writing Track / Tool Track) introduced from Section 1 and carried through all movements. "Comprehensive" banned-word violation fixed in Section 2. Em dashes in cover and footer replaced. Section 5 student metaphor tightened. Disclosure block opening paragraph rewritten. Section 3 bridge paragraph split. Panel labels unified across all sections.
v3: Phase numbering changed from 0–3b to 1–5. "More work than it saves" hyperbole removed from Phase 2. "Tool is the same" false claim corrected in Section 3 Tool Track. Mind reader → oracle. "What comes back is still you" replaced with practitioner-accurate version. Pullquote updated. "Will tell you when flat" → ask-for-the-analysis framing. "Correct every output" → "manually correct." Callout rewritten in first person. "Writing for yourself" sentence replaced with directing/doing distinction. Claude-is-new acknowledgment added to Phase 5 description. Conclusion phase refs and manager/director language updated.
v4: Print button text added. Minor refinements across sessions.
v5: Section 4 fully rewritten: "The Mirror, the Microscope, the X-Ray" replaced with "The Infrastructure Behind the Output." Ungrounded mirror/microscope/X-ray/oracle metaphor chain removed. Section now leads with infrastructure claim and lets panels demonstrate it. Pullquote relocated from Section 4 to Section 6 before disclosure block. Section 2 seam fixed: two paragraphs merged, "I want to be honest" intention sentence removed. Section 5: 56-word sentence split. Bridge sentence added between directing/doing point and vision-clarity point. Redundant "session protocol" panel item cut from Writing Track (now covered in Section 4). Redundant "vision clarity as input variable" panel item cut from Tool Track. Section 6 closing tightened: two paragraphs restating team metaphor and phase distribution removed. Section 3 Tool Track panel: redundant methodology-vs-product restatement cut. All em dashes removed from body prose.
v6: Opening sentence reframed to writing partnership framing. "I could not do anything useful with it" removed. Phase 4 rewritten: story-first/brainstorming-before-content replaces diagram focus. Phase 5 rewritten: voice/repeatability framing; Sonnet/Opus alternation, charter, Google Docs handoff added. Phase 4/5 panel pair rewritten around de-botting/voice-injection framing. Section 5 compressed: director/doer restatement paragraphs removed (already established in Section 4); Claude Projects absolute claim softened to lived-experience framing. "Most people at Phase 1" softened to observation. Section 6 fully replaced: pullquote and "does this sound like AI" challenge removed; ChatGPT-sourced ending adopted and voiced to match paper; boundary paragraph added (writing ambiguity vs. enterprise metrics ambiguity; LLMs do not remove the need for governed data); disclosure rewritten to resolve "twenty years vs. Claude wrote it" tension. s7 appendix removed entirely. Nav and scroll observer updated. Sources for this pass: Gemini stoic-analytic review (April 21, 2026); ChatGPT alternate ending and editorial brief (April 21, 2026). Model: Claude Sonnet 4.6, April 21, 2026.
v7: Title changed to "A Team You Have to Build and Lead" with "LLM as a Thinking Tool" as subtitle; updated in HTML tag, cover, and Section 5 body. Section 1 opening rewritten: productivity framing replaced with thinking-partner claim; practitioner identity established without relying on Paper A link. New subsection "What the Tool Cannot Manufacture" added to Section 5: names non-delegable practitioner inputs and addresses mirror/echo-chamber risk explicitly. Disclosure block flattened: "Not outsourced thinking. Directed thinking." replaced with methodological statement; conditional framing removed. Closing line expanded with specific DE topic examples. Sources: GPT-5 and Gemini editorial reviews via Final Revision Instructions document. Model: Claude Sonnet 4.6, April 22, 2026.
v7.1: Section 1 title updated to "Better Than Productivity." Section 3 title updated to "Evolution in Five Parts." Origin paragraph in Section 2 replaced with single bridge sentence. "What I was bumping up against" softened. Tool disagreement sentence rewritten. Context engineering paragraph added to Section 4. Model: Claude Sonnet 4.6, April 22, 2026.
v8: Last two panel pairs in Section 5 (Writing Track / Tool Track after echo chamber section) removed. "Leading the Team Is Work" subheading and story paragraph added to Section 5: production stats, multi-tool collaboration narrative, cartoon origin, titles iterated. "Twelve months ago" corrected to "six months." Subtitle updated to "LLM Evolution: From Query to Context." 82% stat removed from Section 4; context engineering paragraph stands without it. Phase 2 output sentence reframed to habits/mindset/infrastructure. Gemini/ChatGPT paragraph rewritten with Claude infrastructure framing. One-sentence argument opener replaced. Nav title label removed. Google Fonts calls removed; system font stack substituted. All anchor nav links converted to span+data-target pattern. Opus review from career project incorporated. Sources this pass: Opus 4.6 career project review (April 2026). Model: Claude Sonnet 4.6, April 22, 2026.
Assumptions
Phase descriptions reflect Hugh's direct experience and are not generalized survey data. The claim that most users remain at Phase 1 or 2 is an observation, not a measured statistic. [UNVERIFIED: no external research cited on LLM methodology maturity distribution.] Tool characterizations (Gemini for depth, ChatGPT for narrative, Claude for sustained context) reflect Hugh's working experience as of early 2026 and may not reflect current product capabilities.
Scope Exclusions
Does not cover prompt engineering as a technical discipline. Does not cover enterprise AI deployment, fine-tuning, or RAG architectures. Does not evaluate LLM tools for any purpose other than research, writing, and analytical thinking. Does not describe the content of the BI platform paper (Paper A) in detail; that paper stands alone.