Our earlier article framed the Model Context Protocol (MCP) as the toolbox that gives AI agents tools, and Agent Skills as the materials that teach AI agents how to complete tasks. This is different from pre- or post-training, which determine a model's general behavior and expertise. Agent Skills don't "train" agents. They soft-fork agent behavior at runtime, telling the model how to perform specific tasks it may need.
The term soft fork comes from open source development. A soft fork is a backward-compatible change that doesn't require upgrading every layer of the stack. Applied to AI, this means skills modify agent behavior through context injection at runtime rather than by altering model weights or refactoring AI systems. The underlying model and AI systems stay unchanged.
The architecture maps cleanly to how we think about traditional computing. Models are CPUs: they provide raw intelligence and compute capability. Agent harnesses like Anthropic's Claude Code are operating systems: they manage resources, handle permissions, and coordinate processes. Skills are applications: they run on top of the OS, specializing the system for particular tasks without modifying the underlying hardware or kernel.
You don't recompile the Linux kernel to run a new application. You don't rearchitect the CPU to use a different text editor. You install a new application on top, using the CPU's intelligence as exposed and orchestrated by the OS. Agent Skills work the same way. They layer expertise on top of the agent harness, using the capabilities the model provides, without updating models or altering harnesses.
This distinction matters because it changes the economics of AI specialization. Fine-tuning demands significant investment in talent, compute, and data, plus ongoing maintenance every time the base model updates. Skills require only Markdown files and resource bundles.
How soft forks work
Skills achieve this through three mechanisms: the skill package format, progressive disclosure, and execution context modification.
The skill package is a folder. At minimum, it contains a SKILL.md file with frontmatter metadata and instructions. The frontmatter declares the skill's name, description, allowed-tools, and version, followed by the actual expertise: context, problem-solving approaches, escalation criteria, and patterns to follow.
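Concretely, a minimal SKILL.md might look like the following sketch. The frontmatter fields mirror the ones described above; the skill itself, a hypothetical quarterly-report formatter, is purely illustrative:

```markdown
---
name: quarterly-report
description: Formats quarterly financial summaries. Use when the user asks for a quarterly report.
version: 0.1.0
allowed-tools: Read, Write
---

# Quarterly report

1. Read the raw figures from the file the user provides.
2. Summarize revenue, costs, and margin, one section each, in that order.
3. If any figure is missing, ask the user rather than estimating it.
```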

The skill-creator package. The frontmatter lives at the top of the Markdown file. Agents choose skills based on their descriptions.

The folder can also include reference documents, templates, resources, configurations, and executable scripts. It contains everything an agent needs to perform expert-level work for the specific task, packaged as a versioned artifact that you can review, approve, and deploy as a .zip or .skill file bundle.

skill-creator contains SKILL.md, LICENSE.txt, Python scripts, and reference files.

Because the skill package format is just folders and files, you can use all the tooling we have built for managing code: track changes in Git, roll back bugs, maintain audit trails, and apply the rest of the software engineering best practices of the development life cycle. The same format is also used to define subagents and agent teams, meaning a single packaging abstraction governs individual expertise, delegated workflows, and multi-agent coordination alike.
Progressive disclosure keeps skills lightweight. Only the frontmatter of SKILL.md loads into the agent's context at session start. This respects the token economics of limited context windows. The metadata contains name, description, model, license, version, and, most importantly, allowed-tools. The full skill content loads only when the agent determines relevance and decides to invoke it. This is similar to how operating systems manage memory: applications load into RAM when launched, not all at once. You can have dozens of skills available without overwhelming the model's context window, and the behavioral modification is present only when needed, never permanently resident.
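The two-stage loading can be sketched in a few lines of Python. This is an illustrative toy, not the real harness code: only the frontmatter is parsed at session start, and the full body is read only on invocation.

```python
# Progressive disclosure sketch: metadata loads eagerly, instructions lazily.
def load_frontmatter(text: str) -> dict:
    """Parse the key: value pairs between the '---' frontmatter fences."""
    _, raw, _body = text.split("---", 2)
    meta = {}
    for line in raw.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta

def load_body(text: str) -> str:
    """Return the full instructions that follow the frontmatter."""
    return text.split("---", 2)[2].strip()

# A hypothetical skill file, inlined here for the example.
skill_text = """---
name: pdf-extractor
description: Extracts tables from PDF files
allowed-tools: Read, Bash
---
Use pdfplumber to pull tables, then validate row counts."""

# Session start: only the metadata enters the context window.
meta = load_frontmatter(skill_text)
print(meta["name"])  # pdf-extractor

# Invocation time: the full instructions load on demand.
print(load_body(skill_text))
```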

Execution context modification controls what skills can do. When an agent invokes a skill, the permission system changes to the scope of the skill's definition, specifically the model and allowed-tools declared in its frontmatter. It reverts after execution completes. A skill might use a different model and a different set of tools from the parent session. This sandboxes the permission environment so skills get only scoped access, not arbitrary system control, ensuring the behavioral modification operates within boundaries.
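The scope-then-revert behavior resembles a context manager. The sketch below is an assumption about the mechanism, not Claude Code's actual implementation; the model names and tool sets are made up:

```python
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class Session:
    model: str = "opus"
    allowed_tools: set = field(default_factory=lambda: {"Read", "Write", "Bash"})

@contextmanager
def skill_scope(session: Session, frontmatter: dict):
    """Swap in the skill's declared model and tools, then restore the parent's."""
    saved = (session.model, session.allowed_tools)
    session.model = frontmatter.get("model", session.model)
    session.allowed_tools = set(frontmatter["allowed-tools"])
    try:
        yield session
    finally:
        # Revert even if the skill raised an error mid-execution.
        session.model, session.allowed_tools = saved

session = Session()
with skill_scope(session, {"model": "haiku", "allowed-tools": ["Read"]}):
    assert session.allowed_tools == {"Read"}  # scoped down inside the skill
assert session.allowed_tools == {"Read", "Write", "Bash"}  # restored after
```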
This is what separates skills from earlier approaches. OpenAI's custom GPTs and Google's Gemini Gems are useful but opaque, nontransferable, and impossible to audit. Skills are readable because they're Markdown. They're auditable because you can apply version control. They're composable because skills can stack. And they're governable because you can build approval workflows and rollback capability. You can read a SKILL.md to understand exactly why an agent behaves a certain way.
What the data shows
Building skills is easy with coding agents. Knowing whether they work is the hard part. Traditional software testing doesn't apply. You cannot write a unit test asserting that expert behavior occurred. The output might be correct while the reasoning was shallow, or the reasoning might be sophisticated while the output has formatting errors.
SkillsBench is a benchmarking effort and framework designed to address this. It uses a paired evaluation design in which the same tasks are evaluated with and without skill augmentation. The benchmark contains 85 tasks, stratified across domains and difficulty levels. By comparing the same agent on the same task with the only variable being the presence of a skill, SkillsBench isolates the causal effect of skills from model capability and task difficulty. Performance is measured using normalized gain, the fraction of possible improvement the skill actually captured.
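Normalized gain as described above can be written as a one-line formula: the improvement over baseline divided by the headroom that was available. The function and score values below are illustrative, not taken from SkillsBench itself:

```python
def normalized_gain(baseline: float, with_skill: float, max_score: float = 1.0) -> float:
    """Fraction of the remaining headroom that the skill actually captured."""
    headroom = max_score - baseline
    if headroom == 0:
        return 0.0  # already at ceiling; no improvement was possible
    return (with_skill - baseline) / headroom

# An agent scoring 0.40 without the skill and 0.70 with it captured
# half of the possible improvement.
print(round(normalized_gain(0.40, 0.70), 3))  # 0.5
```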
The findings from SkillsBench challenge the presumption that skills universally improve performance.
Skills improve average performance by 13.2 percentage points. But 24 of 85 tasks got worse. Manufacturing tasks gained 32 points. Software engineering tasks lost 5. The aggregate number hides variance that domain-level analysis reveals. This is precisely why soft forks need evaluation infrastructure. Unlike hard forks, where you commit fully, soft forks let you measure before you deploy widely. Organizations should segment evaluations by domain and by task and test for regressions, not just improvements. For instance, what improves document processing might degrade code generation.
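Segmenting a paired evaluation by domain takes only a few lines. The sketch below uses made-up scores, not SkillsBench data; the point is that a positive average can coexist with a regressing domain:

```python
from statistics import mean

# Paired (baseline, with_skill) scores per domain; values are illustrative.
results = {
    "manufacturing": [(0.30, 0.62), (0.40, 0.75)],
    "software_eng":  [(0.55, 0.50), (0.60, 0.55)],
}

for domain, pairs in results.items():
    delta = mean(with_skill - baseline for baseline, with_skill in pairs)
    status = "REGRESSION" if delta < 0 else "gain"
    print(f"{domain}: {delta:+.2f} ({status})")
```

Here the overall average is positive even though software engineering regresses, which is exactly the pattern the aggregate 13.2-point figure can hide.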
Compact skills outperform comprehensive ones by nearly 4x. Focused skills with dense guidance showed a +18.9 percentage point improvement. Comprehensive skills covering every edge case showed +5.7 points. Using two to three skills per task is optimal, with four or more showing diminishing returns. The temptation when building skills is to include everything: every caveat, every exception, every piece of related context. Resist it. Let the model's intelligence do the work. Small, targeted behavioral changes outperform comprehensive rewrites. Skill builders should start with minimal viable guidance and add detail only when evaluation shows specific gaps.
Models cannot reliably self-generate effective skills. SkillsBench tested a "bring your own skill" scenario in which agents were prompted to generate their own procedural knowledge before attempting tasks. Performance stayed at baseline. Effective skills require human-curated domain expertise that models cannot reliably produce for themselves. AI can help with packaging and formatting, but the insight has to come from people who actually have the expertise. Human-curated insight is the bottleneck of building effective skills, not the packaging or deployment.

Skills can partially substitute for model scale. Claude Haiku, a small model, with well-designed skills achieved a 25.2% pass rate. This slightly exceeded Claude Opus, the flagship model, without skills at 23.6%. Packaged expertise compensates for model intelligence on procedural tasks. This has cost implications: smaller models with skills may outperform larger models without them at a fraction of the inference cost. Soft forks democratize capability. You don't need the biggest model when you have the right expertise packaged.

Open questions
Many challenges remain unresolved. What happens when multiple skills conflict with each other during a session? How should organizations govern skill portfolios when teams each deploy their own skills onto shared agents? How quickly does encoded expertise become outdated, and what refresh cadence keeps skills effective without creating maintenance burden? Skills inherit whatever biases exist in their authors' expertise, so how do you audit for that? And as the industry matures, how should evaluation infrastructure such as SkillsBench scale to keep pace with the growing complexity of skill-augmented systems?
These aren't reasons to avoid skills. They're reasons to invest in evaluation infrastructure and governance practices alongside skill development. The capability to measure performance must evolve in lockstep with the technology itself.
The Agent Skills advantage
Fine-tuning models for a single use case is no longer the only path to specialization. It demands significant investment in talent, compute, and data, and it creates a permanent divergence that requires reevaluation and potential retraining every time the base model updates. Fine-tuning across a broad set of capabilities to improve a foundation model remains sound, but fine-tuning for one narrow workflow is exactly the kind of specialization that skills can now achieve at a fraction of the cost.
Skills aren't maintenance free. Just as applications sometimes break when operating systems update, skills need reevaluation when the underlying agent harness or model changes. But the recovery path is lighter: update the skill package, rerun the evaluation harness, and redeploy rather than retrain from a new checkpoint.
Mainframes gave way to client-server. Monoliths gave way to microservices. Specialized fine-tuned models are now giving way to agents augmented by specialized expertise artifacts. Models provide intelligence, agent harnesses provide the runtime, skills provide specialization, and evaluation tells you whether it all works together.

