This is the third article in a series on agentic engineering and AI-driven development. Read part one here, part two here, and look for the next article on April 15 on O’Reilly Radar.
The toolkit pattern is a way of documenting your project’s configuration so that any AI can generate working inputs from a plain-English description. You and the AI create a single file that describes your tool’s configuration format, its constraints, and enough worked examples that any AI can generate working inputs from that description. You build it iteratively, working with the AI (or, better, multiple AIs) to draft it. You test it by starting a fresh AI session and trying to use it, and every time that fails you grow the toolkit from those failures. When you build the toolkit well, your users will never have to learn how your tool’s configuration files work, because they describe what they want in conversation and the AI handles the translation. That means you don’t have to compromise on the way your project is configured, because the config files can be more complex and more complete than they would be if a human had to edit and understand them.
To understand why all of this matters, let me take you back to the mid-1980s.
I was 12 years old, and our family got an AT&T PC 6300, an IBM compatible that came with a user’s guide nearly 160 pages long. Chapter 4 of that manual was called “What Every User Should Know.” It covered things like how to use the keyboard, how to care for your diskettes, and, memorably, how to label them, complete with hand-drawn illustrations and genuinely helpful advice, like how you should only use felt-tipped pens, never ballpoint, because the pressure could damage the magnetic surface.

I remember being fascinated by this manual. It wasn’t our first computer. I’d been writing BASIC programs and dialing into BBSs and CompuServe for a couple of years, so I knew there were all kinds of wonderful things you could do with a PC, especially one with a blazing-fast 8MHz processor. But the manual barely mentioned any of that. That seemed really weird to me, even as a kid: You’d give someone a manual with an entire page on using the backspace key to correct typing mistakes (really!), but not actually tell them how to use the thing to do anything useful.
That’s how most developer documentation works. We write the stuff that’s easy to write—installation, setup, the getting-started guide—because it’s a lot easier than writing the stuff that’s actually hard: the deep explanation of how all the pieces fit together, the constraints you only discover by hitting them, the patterns that separate a configuration that works from one that almost works. This is yet another “looking for your keys under the streetlight” problem: We write the documentation we write because it’s easiest to write, even when it’s not really the documentation our users need.
Developers who came up through the Unix era know this well. Man pages were thorough, accurate, and often completely impenetrable if you didn’t already know what you were doing. The tar man page is the canonical example: It documents every flag and option in exhaustive detail, but if you just want to know how to extract a .tar.gz file, it’s almost useless. (The answer is -xzvf, if you’re curious.) Stack Overflow exists largely because man pages like tar’s left a gap between what the documentation said and what developers actually needed to know.
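To make the gap concrete, here is the whole incantation in action, with the flag meanings the man page scatters across dozens of entries (the file names are invented for illustration):

```shell
# Create a sample .tar.gz so the example is self-contained.
mkdir -p demo && echo "hello" > demo/notes.txt
tar -czf demo.tar.gz demo    # c = create, z = gzip, f = archive file name
rm -rf demo

# The answer buried in the man page: x = extract, z = gunzip,
# v = verbose (list the files), f = read from the named archive.
tar -xzvf demo.tar.gz
cat demo/notes.txt           # prints "hello"
```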
And now we have AI assistants. You can ask Claude or ChatGPT about, say, Kubernetes, Terraform, or React, and you’ll actually get useful answers, because those are all established projects that have been written about extensively, and the training data is everywhere.
But AI hits a hard wall at the boundary of its training data. If you’ve built something new—a framework, an internal platform, a tool your team created—no model has ever seen it. Your users can’t ask their AI assistant for help, because the AI doesn’t know your thing even exists.
There’s been a lot of great work moving AI documentation in the right direction. AGENTS.md tells AI coding agents how to work in your codebase, treating the AI as a developer. llms.txt gives models a structured summary of your external documentation, treating the AI as a search engine. What’s been missing is a practice for treating the AI as a support engineer. Every project needs configuration: input files, option schemas, workflow definitions, usually in the form of a whole bunch of JSON or YAML files with cryptic formats that users have to learn before they can do anything useful.
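For contrast with the toolkit pattern, here is roughly what an llms.txt file looks like under the proposed convention — an H1 title, a blockquote summary, and sections of links to Markdown docs (the project name and URLs here are made up):

```markdown
# ExampleTool

> ExampleTool is a build orchestrator for data pipelines.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): install and first run
- [Configuration reference](https://example.com/docs/config.md): every option

## Optional

- [Changelog](https://example.com/changelog.md)
```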
The toolkit pattern solves the problem of getting AIs to write configuration files for a project that isn’t in their training data. It consists of a documentation file that teaches any AI enough about your project’s configuration that it can generate working inputs from a plain-English description, without your users ever having to learn the format themselves. Developers have been arriving at this same pattern (or something very similar) independently, from different directions, but as far as I can tell, nobody has named it or described a methodology for doing it well. This article distills what I learned from building the toolkit for Octobatch pipelines into a set of practices you can apply to your own projects.
Build the AI its own manual
Traditionally, developers face a trade-off with configuration: keep it simple and easy to understand, or let it grow to handle real complexity and accept that it now requires a manual. The toolkit pattern emerged for me while I was building Octobatch, the batch-processing orchestrator I’ve been writing about in this series. As I described in the previous articles, “The Accidental Orchestrator” and “Keep Deterministic Work Deterministic,” Octobatch runs complex multistep LLM pipelines that generate data or run Monte Carlo simulations. Each pipeline is defined by a complex configuration that consists of YAML, Jinja2 templates, JSON schemas, expression steps, and a set of rules tying it all together. The toolkit pattern let me sidestep that traditional trade-off.
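To give a feel for the moving parts, here is a hypothetical sketch in the same spirit — not Octobatch’s actual syntax; every key and file name below is invented for illustration:

```yaml
# Hypothetical pipeline config sketch -- not Octobatch's real format.
pipeline:
  name: example-simulation
  stages:
    - id: generate_step
      prompt_template: templates/step.j2   # Jinja2 template for the LLM prompt
      output_schema: schemas/step.json     # JSON Schema the reply must satisfy
    - id: update_state
      type: expression                     # deterministic expression step
      expr: "position + (1 if direction == 'ship' else -1)"
  rules:
    retries: 3
    on_validation_failure: halt
```

A format like this is easy for an AI to emit and validate, but tedious for a human to learn — which is exactly the trade-off the toolkit pattern dissolves.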
As Octobatch grew more complex, I found myself relying on the AIs (Claude and Gemini) to build configuration files for me, which turned out to be genuinely useful. When I developed a new feature, I would work with the AIs to come up with the configuration structure to support it. At first I defined the configuration myself, but by the end of the project I relied on the AIs to produce the first cut, and I’d push back when something seemed off or not forward-looking enough. Once we all agreed, I would have an AI produce the actual updated config for whatever pipeline we were working on. This shift to having the AIs do the heavy lifting of writing the configuration was really useful, because it let me create a very robust format very quickly without having to spend hours updating existing configurations every time I changed the syntax or semantics.
At some point I realized that every time a new user wanted to build a pipeline, they faced the same learning curve and implementation challenges that I’d already worked through with the AIs. The project already had a README.md file, and every time I changed the configuration I had an AI update it to keep the documentation current. But by this time, the README.md was doing way too much work: It was genuinely comprehensive but a real headache to read. It had eight separate subdocuments showing the user how to do pretty much everything Octobatch supported, the bulk of it focused on configuration, and it was becoming exactly the kind of documentation nobody ever wants to read. That particularly bothered me as a writer; I’d produced documentation that was genuinely painful to read.
Looking back at my chats, I can trace how the toolkit pattern developed. My first instinct was to build an AI-assisted editor. About four weeks into the project, I described the idea to Gemini:
I’m thinking about how to provide some kind of AI-assisted tool to help people create their own pipeline. I was thinking about a feature we could call “Octobatch Studio” where we make it easy to prompt for editing pipeline stages, possibly assisting in creating the prompts. But maybe instead we include a lot of documentation in Markdown files, and expect them to use Claude Code, and give lots of guidance for creating it.
I can actually see the pivot to the toolkit pattern happening in real time in this later message I sent to Claude. It had sunk in that my users could use Claude Code, Cursor, or another AI as interactive documentation to build their configs exactly the same way I’d been doing:
My plan is to use Claude Code as the IDE for creating new pipelines, so people who want to create them can just spin up Claude Code and start generating them. That means we need to give Claude Code specific context files to tell it everything it needs to know to create the pipeline YAML config with asteval expressions and Jinja2 template files.
The usual trade-off between simplicity and flexibility comes from cognitive overhead: the cost of holding all of a system’s rules, constraints, and interactions in your head while you work with it. It’s why many developers opt for simpler config files, so they don’t overload their users (or themselves). Once the AI was writing the configuration, that trade-off disappeared. The configs could get as complicated as they needed to be, because I wasn’t the one who had to remember how all the pieces fit together. At some point I realized the toolkit pattern was worth standardizing.
That toolkit-based workflow—users describe what they want, the AI reads TOOLKIT.md and generates the config—is the core of the Octobatch user experience now. A user clones the repo and opens Claude Code, Cursor, or Copilot, the same way they would with any open source project. Every configuration prompt starts the same way: “Read pipelines/TOOLKIT.md and use it as your guide.” The AI reads the file, understands the project structure, and guides them step by step.
To see what this looks like in practice, take the Drunken Sailor pipeline I described in “The Accidental Orchestrator.” It’s a Monte Carlo random walk simulation: A sailor leaves a bar and stumbles randomly toward the ship or the water. The pipeline configuration for that involves multiple YAML files, JSON schemas, Jinja2 templates, and expression steps with real mathematical logic, all wired together with specific rules.
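The underlying simulation is simple to sketch outside any pipeline. Here’s a minimal Python version of that random walk; the positions and trial counts are invented for illustration, and the real pipeline expresses this kind of logic as expression steps rather than Python:

```python
import random

def drunken_sailor(start=5, ship=10, water=0, max_steps=1000, rng=random):
    """One random walk: the sailor starts between the water (0) and the
    ship (10) and takes unit steps in a random direction until he
    reaches one or the other."""
    pos = start
    for _ in range(max_steps):
        pos += rng.choice([-1, 1])
        if pos >= ship:
            return "ship"
        if pos <= water:
            return "water"
    return "lost"  # walk did not terminate within max_steps

def monte_carlo(trials=10_000, seed=42):
    """Estimate the probability the sailor makes it to the ship."""
    rng = random.Random(seed)
    wins = sum(drunken_sailor(rng=rng) == "ship" for _ in range(trials))
    return wins / trials

print(monte_carlo())  # starting midway between water and ship, roughly 0.5
```

Starting halfway between the two endpoints, symmetry says the true probability is 0.5, which makes this a nice sanity check for a pipeline that computes the same thing.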

Here’s the prompt that generated all of that. The user describes what they want in plain English, and the AI produces the complete configuration by reading TOOLKIT.md. This is the actual prompt I gave Claude Code to generate the Drunken Sailor pipeline—notice the first line, telling it to read the toolkit file.

But configuration generation is only half of what the toolkit file does. Users can also add TOOLKIT.md and PROJECT_CONTEXT.md (which has information about the project) to any AI assistant—ChatGPT, Gemini, Claude, Copilot, whatever they like—and use it as interactive documentation. A pipeline run finished with validation failures? Upload the two files and ask what went wrong. Stuck on how retries work? Ask. You can even paste in a screenshot of the TUI and say, “What do I do?” and the AI will read the screen and give specific advice. The toolkit file turns any AI into an on-demand support engineer for your project.

What the Octobatch project taught me about the toolkit pattern
Building the generative toolkit for Octobatch produced more than just documentation an AI could use to create working configuration files; it also yielded a set of practices, and those practices turn out to be fairly consistent regardless of what kind of project you’re building. Here are the five that mattered most:
- Start with the toolkit file and grow it from failures. Don’t wait until the project is finished to write the documentation. Create the toolkit file first, then let each real failure add one principle at a time.
- Let the AI write the config files. Your job is product vision—what the project should do and how it should feel. The AI’s job is translating that into valid configuration.
- Keep guidance lean. State the principle, give one concrete example, move on. Every guardrail costs tokens, and bloated guidance makes AI performance worse.
- Treat every use as a test. There’s no separate testing phase for documentation. Every time someone uses the toolkit file to build something, that’s a test of whether the documentation works.
- Use more than one model. Different models catch different things. In a three-model audit of Octobatch, three-quarters of the defects were caught by only one model.
I’m not proposing a standard format for a toolkit file, and I think trying to create one would be counterproductive. Configuration formats vary wildly from tool to tool—that’s the whole problem we’re trying to solve—and a toolkit file that describes your project’s building blocks is going to look completely different from one that describes someone else’s. What I found is that the AI is perfectly capable of reading whatever you give it, and is probably better at writing the file than you are anyway, because it’s writing for another AI. These five practices should help you build an effective toolkit regardless of what your project looks like.
Start with the toolkit file and grow it from failures
You can start building a toolkit at any point in your project. The way it happened for me was organic: After weeks of working with Claude and Gemini on Octobatch configuration, the knowledge about what worked and what didn’t was scattered across dozens of chat sessions and context files. I wrote a prompt asking Gemini to consolidate everything it knew about the config format—the structure, the rules, the constraints, the examples, everything we’d talked about—into a single TOOLKIT.md file. That first version wasn’t great, but it was a starting point, and every failure after that made it better.
I didn’t plan the toolkit from the beginning of the Octobatch project. It started because I wanted my users to be able to build pipelines the same way I had—by working with an AI—but everything they’d need to do that was spread across months of chat logs and the CONTEXT.md files I’d been maintaining to bootstrap new development sessions. Once I had Gemini consolidate everything into a single TOOLKIT.md file and had Claude review it, I treated it the way I treat any other code: Every time something broke, I found the root cause, worked with the AIs to update the toolkit to account for it, and verified that a fresh AI session could still use it to generate valid configuration.
That incremental approach worked well for me, and it let me test my toolkit the way I test any other code: try it out, find bugs, fix them, rinse, repeat.
You can do the same thing. If you’re starting a new project, you could plan to create the toolkit at the end. But it’s more effective to start with a simple version early and let it emerge over the course of development. That way you’re dogfooding it the whole time instead of guessing what users will need.
Let the AI write the config files (but stay in control!)
Early Octobatch pipelines had simple enough configuration that a human could read and understand it, but not because I was writing it by hand. One of the ground rules I set for the Octobatch experiment in AI-driven development was that the AIs would write all of the code, and that included writing all of the configuration files. The problem was that although they were doing the writing, I was unconsciously constraining the AIs: pushing back on anything that felt too complex, steering toward structures I could still hold in my head.
At some point I realized my pushback was placing an artificial limit on the project. The whole point of having AIs write the config was that I didn’t have to keep every single line in my head—it was okay to let the AIs handle that level of complexity. Once I stopped constraining them, the cognitive overhead limit I described earlier went away. I could have full pipelines defined in config, including expression steps with real mathematical logic, without needing to hold all the rules and relationships in my head.
Once the project really got rolling, I never wrote YAML by hand again. The cycle was always the same: need a feature, discuss it with Claude and Gemini, push back when something seemed off, and one of them produces the updated config. My job was product vision. Their job was translating that into valid configuration. And every config file they wrote was another test of whether the toolkit actually worked.
This division of labor, however, meant inevitable disagreements between me and the AIs, and it’s not always easy to find yourself disagreeing with a machine, because they’re surprisingly stubborn (and occasionally shockingly stupid). It took patience and vigilance to stay in control of the project, especially once I handed large responsibilities over to the AIs.
The AIs consistently optimized for technical correctness—separation of concerns, code organization, effort estimation—which was fine, because that’s the job I asked them to do. I optimized for product value. I found that keeping product value as my north star, and always focusing on building useful features, consistently helped resolve these disagreements.
Keep guidance lean
Once you start growing the toolkit from failures, the natural progression is to overdocument everything. Generative AIs are biased toward generating, and it’s easy to let them get carried away. Every bug feels like it deserves a warning, every edge case feels like it needs a caveat, and before long your toolkit file is bloated with guardrails that cost tokens without adding much value. And since the AI is the one writing your toolkit updates, you have to push back on it the same way you push back on architecture decisions. AIs love adding WARNING blocks and exhaustive caveats. The discipline you have to bring is telling them when not to add something.
The right level is to state the principle, give one concrete example, and trust the AI to apply it to new situations. When Claude Code made a choice about JSON schema constraints that I might have second-guessed, I had to decide whether to add more guardrails to TOOLKIT.md. The answer was no—the guidance was already there, and the choice it made was actually correct. If you keep tightening guardrails every time an AI makes a judgment call, the signal gets lost in the noise and performance gets worse, not better. When something goes wrong, the impulse—for both you and the AI—is to add a WARNING block. Resist it. One principle, one example, move on.
Treat every use as a test
There was no separate “testing phase” for Octobatch’s TOOLKIT.md. Every pipeline I created with it was a new test. After the very first version, I opened a fresh Claude Code session that had never seen any of my development conversations, pointed it at the newly minted TOOLKIT.md, and asked it to build a pipeline. The first time I tried it, I was surprised at how well it worked! So I kept using it, and as the project rolled along, I updated it with every new feature and tested those updates. When something failed, I traced it back to a missing or unclear rule in the toolkit and fixed it there.
That’s the practical test for any toolkit: Open a fresh AI session with no context beyond the file, describe what you want in plain English, and see if the output works. If it doesn’t, the toolkit has a bug.
Use more than one model
When you’re building and testing your toolkit, don’t just use one AI. Run the same task through a second model. A pattern that worked well for me was consistently having Claude generate the toolkit and Gemini check its work.
Different models catch different things, and this matters for both creating and testing the toolkit. I used Claude and Gemini together throughout Octobatch development, and I overruled both when they were wrong about product intent. You can do the same thing: If you work with multiple AIs throughout your project, you’ll start to get a feel for the different kinds of questions each is good at answering.
When you have multiple models generate config from the same toolkit independently, you find out fast where your documentation is ambiguous. If two models interpret the same rule differently, the rule needs rewriting. That’s a signal you can’t get from using only one model.
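A cheap way to surface that kind of divergence is to diff the two outputs structurally rather than eyeballing them. A minimal sketch, assuming both models emit config you can parse (the keys and values here are invented):

```python
import json

# Hypothetical outputs from two models, generated from the same toolkit.
config_a = json.loads('{"retries": 3, "on_failure": "halt", "timeout": 30}')
config_b = json.loads('{"retries": 3, "on_failure": "skip", "timeout": 30}')

# Keys where the models disagree mark the ambiguous rules in your toolkit.
diverged = sorted(k for k in config_a.keys() | config_b.keys()
                  if config_a.get(k) != config_b.get(k))
print(diverged)  # ['on_failure']
```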
The manual, revisited
That AT&T PC 6300 manual dedicated a full page to labeling diskettes, which may have been overkill, but it got one thing right: It described the building blocks and trusted the reader to figure out the rest. It just had the wrong reader in mind.
The toolkit pattern is the same idea, pointed at a different audience. You write a file that describes your project’s configuration format, its constraints, and enough worked examples that any AI can generate working inputs from a plain-English description. Your users never have to learn YAML or memorize your schema, because they have a conversation with the AI and it handles the translation.
If you’re building a project and you want AI to be able to help your users, start here: Write the toolkit file before you write the README, grow it from real failures instead of trying to plan it all upfront, keep it lean, test it by using it, and use more than one model, because no single AI catches everything.
The AT&T manual’s Chapter 4 was called “What Every User Should Know.” Your toolkit file is “What Every AI Should Know.” The difference is that this time, the reader will actually use it.
In the next article, I’ll start with a statistic about developer trust in AI-generated code that turned out to have been fabricated by the AI itself—and use it to explain why I built a quality playbook that revives the traditional quality practices most teams cut decades ago. The playbook explores an unfamiliar codebase, generates a complete quality infrastructure—tests, review protocols, validation rules—and finds real bugs in the process. It works across Java, C#, Python, and Scala, and it’s available as an open source Claude Code skill.

