So Long, and Thanks for All the Context – O’Reilly

I got a really interesting question last week from Mike Loukides, my editor at Radar, after he read the third part of this trilogy on context management. “Another issue I’ve read about,” Mike asked, “is the tendency for a model to ignore the middle of the context. I’ve seen that particularly for the models with very large context windows. Is there anything to be said about that?”

Excellent question, Mike, and yes, there is. In that same email he pointed out that clearing the context and reloading it with just what’s important does a pretty good job dealing with this “ignore the middle” problem when it happens, but that’s clearly a stopgap.

It’s worth a deeper dive into what’s actually happening when an AI starts forgetting what’s in the middle of its context, because the problem is deeper (and more interesting!) than it might seem at first. It turns out that there’s a basic problem that’s fundamental to how LLMs manage context, and we’re still learning about it as an industry. That problem is called the U-shape. There’s been a lot of really interesting research into the U-shape problem recently, and several useful techniques have emerged that can help you manage it. And it’s probably not a coincidence that I’ve had to use all of them in my ongoing experiments with AI-driven development and agentic engineering (even if I didn’t always realize that’s what I was doing at the time).

A few weeks ago, in fact, I ran into the exact failure mode that Mike described. I was running the Quality Playbook, my open source code quality engineering skill, and ran into trouble with one of its phases: the one that writes up the bugs the earlier phases find. There’s a part of the bug writeup process where it had just created a file called BUGS.md that had an overview of each of the bugs, and needed to create individual writeups for each bug it found. But instead of filling in the details correctly, it produced skeletal-looking stub files, with a generic template that had blank values instead of populated ones.

The thing is, the instructions for how to write a populated writeup were in the prompt. The actual bug data was in BUGS.md. I was absolutely certain that everything the agent needed was sitting in its context window, because I could see that it hadn’t compacted yet, and the skill’s intermediate artifacts (which I talked about in my last article in this series) let me see that earlier phases had read and reasoned about both files. But the agent was producing stubs anyway. It really looked like the agent had everything it needed sitting in plain sight, and just wasn’t using the information it had. Frustrating!

I assumed at the time that the model was just an idiot (which, arguably, was true but irrelevant). It turns out that I had run straight into the U-shaped context problem.

In the previous three articles I covered what context is and why it disappears, how to keep important information in files instead of leaving it in the agent’s context window, and how to detect and recover when context has been compacted out from under you. All three were about losing context: through fragmentation, through compaction, through long sessions that overrun the window. This article is about the entirely different U-shaped failure mode, where the context is still sitting in the window and the model just isn’t using it.

The U-shape failure, and why bigger windows don’t fix it

The U-shape is an active area of academic investigation, so I’m going to start by going into a little bit of that research, because I think it’ll actually help us pin down what’s going on. I’ll start with an experiment run by Nelson Liu, an AI researcher at Stanford, who tested how language models actually use the contents of long inputs by giving them documents with the relevant answer placed at different positions and measuring whether the model could still find it. An interesting thing his findings show is that the U-shape didn’t appear to be a quirk of a single model. The U-shape showed up across model families, and even models with larger context windows still exhibited it.

If you have time, it’s actually worth taking a look at the paper that Liu and his team wrote, called “Lost in the Middle: How Language Models Use Long Contexts.” (It’s surprisingly readable for an academic paper.) The result they reported was a robust U-shape: The model performed best when the relevant information was at the beginning of its context window or at the most recent end, and worst when it was in the middle. Performance on questions where the answer was buried mid-context fell off sharply, even when the answer was sitting right there in plain sight. The field now uses the terms primacy bias and recency bias for these two preferences, and the U-shape is what you get when you plot them together against position.

I’m going to lean a little into academia here, because a lot of researchers are still learning about how LLM context actually works and what behavior has emerged in it.

One reason the U-shape matters more than “just another LLM quirk” is that recent research has started showing it’s a structural property of how transformers work, not a learned artifact. A 2025 ICML paper called “On the Emergence of Position Bias in Transformers” explained it as the equilibrium between two opposing forces inside the model: The causal mask amplifies the influence of the first few tokens (the primacy bias), while position encodings like RoPE heavily weight the tokens closest to where the model is generating (the recency bias). The middle is where these two forces cancel out. A 2026 paper by Borun Chowdhury, a researcher at Meta, called “Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias,” took the argument even further by proving mathematically that the U-shape exists at the moment of initialization, before any training has happened, with random weights.

That matters because the natural assumption about large context windows is that more room means fewer problems. Most of today’s frontier models give you a million tokens or more, with some pushing well past two million, and some have made real progress on the simplest version of the lost-in-the-middle test, the needle-in-a-haystack benchmark, where the model has to retrieve a single sentence buried in a long document. Google’s Gemini 1.5 Pro reported near-perfect single-needle recall at 1M tokens, and current Gemini 3 models are similar.

So the accurate version of “bigger windows don’t fix it” is this: Bigger windows have made simple single-fact retrieval much better. They haven’t made long-context agent work reliable by default. A two-million-token window means a bigger middle to fall into.

The important idea that’s emerging here is that it’s increasingly looking like the U-shape isn’t just a bug in today’s models that will eventually be worked out or trained away by more data or better fine-tuning. Instead, it seems like the U-shape may actually be a geometric property of the LLM architecture itself.

In other words, we’re all going to have to deal with the U-shape. That means we need techniques for managing it, and any effective technique we use isn’t likely to become obsolete any time soon. And that’s my goal in this article: to show you the techniques that have emerged for managing U-shaped context memory loss that you can use today in your own work.

Five techniques to help with U-shaped context problems

The previous article in this series laid out a pattern for detecting and recovering from context loss, which I called externalize-recognize-rehydrate. The techniques below extend the same discipline to the lost-in-the-middle problem. The principle I keep coming back to is that working memory is untrustworthy, and the discipline that follows from it is to externalize what matters, curate what stays in context, and verify what the agent claims to know against what’s on disk. The five techniques are how I do that in practice, and each one is drawn from a real moment in the Quality Playbook’s development.

Curate, don’t accumulate

This is the technique which, in its most brute-force form, is exactly what Mike mentioned in his email to me: clear the context and reload it with just what matters, periodically and deliberately. In other words, don’t trust an accumulated session to stay coherent; build the artifact, then start fresh against it. And if you have the AI write down the important parts of the context (like we’ve talked about throughout this series), then you can start a new session with a refreshed AI that has a more targeted, curated context as a starting point.

I ran into this during the v1.5.2 release prep for the Quality Playbook. I was using a long Claude Code session that had been working through a series of fixes. But I noticed that it was starting to show its age: It had forgotten a few things it should know, and its thinking times were starting to grow.

When it came time to land the final four fixes for the release, I worked with the AI to write a context brief: a separate document with everything the implementing session needed. The question was whether to keep using the current session, which already “knew” the codebase from the earlier work, or open a fresh CLI session and point it at the brief. I asked another session what to do:

Should we run that in a new cli session rather than continue my current
claude code session that has the current context?

The AI gave me a good answer (start a fresh session, using a starting prompt to read the brief), and it gave three reasons that have stuck with me. First, the brief was self-contained, including file paths, line numbers, exact diffs, regression test bodies, and preflight greps. Anything the new session needed to know was already there, and continuing context bought nothing. Second, fresh context is stricter about adherence. A session that already “knows” the codebase tends to skim the new instructions and improvise from prior assumptions. Surgical fixes are exactly the case where you want the agent to read the brief carefully rather than rely on memory of what felt right last round. And third, the audit trail: The brief is the artifact, and the implementing session is reproducible from just the brief. If the same work has to be redone in six months by a different model, you point at the brief and say, “This is the input.”

The approach worked really well. I was able to pick up development seamlessly, and the model’s memory problems disappeared.

Place critical information at the edges

The U-shape says the model attends best to the beginning and end of its context. The natural move is to put your most load-bearing information in those positions and keep the middle for things you don’t need the model to focus on. Anything important that lives only in the middle of an accumulated context tends to slide out of attention.

The other side of this technique is what not to put in the middle. If something matters, don’t bury it in a long preamble of context you’ve been accumulating; move it to the edges, restate it where the model will act on it, and let the middle absorb the less important material. Luckily, there’s a useful tool that can help with this.

In Claude Code, for example, one really clean way to put information at the beginning of context is to use the system prompt. The CLI gives you --append-system-prompt for exactly this. (Most of the other providers’ CLI tools have similar options.) If you put your brief (or selected parts of it) there, the agent will attend to it strongly throughout the session, and that in turn will help keep the per-turn user prompt focused on the action you want the agent to take right now.
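When you are assembling a single prompt yourself rather than relying on a CLI flag, the same placement idea can be sketched in a few lines of Python. This is a minimal illustration, not anything from the Quality Playbook: the function name `build_prompt` and its three parameters are hypothetical, and it simply puts the brief first, the immediate action last, and lower-priority material in between.

```python
def build_prompt(brief: str, background: list[str], action: str) -> str:
    """Assemble a prompt with the load-bearing text at the edges.

    Hypothetical helper: the brief goes first, the action the agent should
    take right now goes last, and less important background fills the middle.
    """
    middle = "\n\n".join(chunk.strip() for chunk in background if chunk.strip())
    parts = [brief.strip(), middle, action.strip()]
    # Drop empty sections so an empty background doesn't leave blank gaps.
    return "\n\n".join(part for part in parts if part)


if __name__ == "__main__":
    prompt = build_prompt(
        brief="BRIEF: apply only the four fixes listed in the context brief.",
        background=["Earlier discussion notes...", "Old log excerpts..."],
        action="ACTION: implement fix 1 now, exactly as written in the brief.",
    )
    print(prompt)
```

The design choice is deliberately simple: nothing important is allowed to live only in the middle section, so anything that matters has to be promoted into the brief or the action.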

Short sessions over long ones

Don’t run one long session. Run many short ones, each reading fresh from disk. This will help you iterate on your brief and your external development context, so instead of relying on an opaque context window, you have a visible and constantly changing set of documents that give you a lot more visibility into (and control over) your AI’s context.

Something useful I started doing was taking all my chat history from Gemini, ChatGPT, Claude, and Cowork and putting it into a single folder I could keep updated and indexed for fast search. I built out a whole system to manage this, which turns out to be a great tool when I’m writing articles like this, because I can search through my development history for specific examples and techniques that I’ve used. The system uses Haiku 4.5 to read through chat history, summarize what happened, and create an index. Haiku turned out to be a smart enough model to read each individual interaction in a chat and write a useful index entry for it. But the model being smart enough to do one summary didn’t mean its context management could keep up across all 18,000 records. I ran smack into the U-shape problem.

The first attempt tried to keep dedupe state and progress counts in the model’s head, and it failed spectacularly. The model really didn’t want to keep track of specific deterministic things like accurate numbers or the current state. Haiku 4.5, in particular, seems especially bad at this. What worked was reframing the architecture entirely. Here’s the exact prompt that I gave it to fix the problem:

ok, so we need context management. it doesn't need to remember things,
it just needs to write them down as they go. we had this same context
management problem with Quality Playbook, when it was running out of
context. Just write down after each message.

The protocol I greenlit for the full run made the short-session discipline explicit:

1. Resume processing from the cursor recorded in progress.json, working through each input file in order.
2. Update progress.json after every line.
3. Expect to run out of context well before finishing; that’s fine. Just stop cleanly after each step (or a group of steps), then spin up a fresh session that reads progress.json and continues.
4. When all files are complete, set status: “complete” in progress.json and report back.

Item 3 is the technique in one line: expect context loss, so make sure you’ve written your state down, and build fresh restarts into the process. The technical details, like spinning up subagents, orchestrating with scripts, and so on, will change, but the core idea stays the same. In a lot of ways, you can think of it as treating the agent like a pipe, not a database. The state lives on disk, and the session is something you throw away and replace.
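The protocol above can be sketched as an ordinary resumable loop. This is an illustrative Python sketch under stated assumptions, not the actual summarizer: the function names, the `budget` parameter (standing in for "stop cleanly before context runs out"), the zero-based cursor, and the exact shape of progress.json are all hypothetical.

```python
import json
from pathlib import Path


def load_progress(path: Path) -> dict:
    """Read state from disk; a missing file means a brand-new run."""
    if path.exists():
        return json.loads(path.read_text())
    return {"cursor": 0, "status": "running"}


def save_progress(path: Path, progress: dict) -> None:
    """Write to a temp file, then rename, so a crash can't leave half a file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(progress))
    tmp.replace(path)


def run_session(records, progress_path: Path, out_path: Path, summarize, budget: int) -> dict:
    """Process at most `budget` records, then stop cleanly.

    Nothing is remembered between calls: each call rereads progress.json,
    which is the whole point of treating the session as disposable.
    """
    progress = load_progress(progress_path)
    done = 0
    with out_path.open("a") as out:
        while progress["cursor"] < len(records) and done < budget:
            idx = progress["cursor"]
            out.write(f"[{idx}] {summarize(records[idx])}\n")
            out.flush()
            progress["cursor"] = idx + 1
            save_progress(progress_path, progress)  # update after every line
            done += 1
    if progress["cursor"] >= len(records):
        progress["status"] = "complete"
        save_progress(progress_path, progress)
    return progress
```

Calling `run_session` repeatedly with a small budget plays the role of spinning up fresh sessions: each call picks up from the cursor on disk, and the run ends when the status flips to "complete".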

Restate key information close to the point of use

When the model needs a constraint to apply right now, repeat it right now. Don’t trust an instruction from earlier in the session to carry forward through the middle of the context.

This is the technique that fixed the problem I opened the article with, where the Quality Playbook seemed to forget everything it had just written into a file called BUGS.md. When it needed to write the same information into more detailed files, it produced stubs instead: generic templates with the bug-specific fields left blank.

The fix was to restate the read-the-source rule right before the action that needed it, using this prompt:

Before writing BUG-NNN.md, re-read the BUG-NNN entry in BUGS.md.
Copy the Spec basis, Minimal reproduction, Location, Expected behavior,
Actual behavior, Regression test name, and Patches fields
from that entry into the writeup. Don't paraphrase from memory.

“Don’t paraphrase from memory” is the line that did the actual work. The instruction couldn’t trust the agent’s memory of what BUGS.md said, even though BUGS.md was sitting right there in the context window. So the instruction forced a fresh read of the file at the moment of writing. The restatement and the fresh read together fixed the bug.

The same pattern applies any time a rule was stated earlier in the session and the model needs to act on it now. Restate the rule next to the action, and force the model back to the source rather than letting it work from memory.
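If an orchestrating script builds the per-bug prompt, it can do the fresh read itself and paste the source entry right next to the instruction. The Python below is a rough sketch of that idea. It assumes, purely for illustration, that each entry in BUGS.md starts with a heading like `## BUG-001`; the real file layout isn't described here, and `extract_entry` and `writeup_prompt` are hypothetical names.

```python
import re


def extract_entry(bugs_md: str, bug_id: str) -> str:
    """Return one bug's section from the text of BUGS.md.

    Assumes (hypothetically) that entries begin with '## BUG-NNN' headings.
    """
    pattern = rf"^## {re.escape(bug_id)}\b.*?(?=^## |\Z)"
    match = re.search(pattern, bugs_md, flags=re.MULTILINE | re.DOTALL)
    if match is None:
        raise KeyError(f"{bug_id} not found in BUGS.md")
    return match.group(0).strip()


def writeup_prompt(bugs_md: str, bug_id: str) -> str:
    """Build a prompt that restates the rule beside the freshly read source."""
    entry = extract_entry(bugs_md, bug_id)
    return (
        f"Write {bug_id}.md.\n\n"
        f"Source entry, read fresh from BUGS.md:\n{entry}\n\n"
        "Copy the fields from the entry above into the writeup. "
        "Don't paraphrase from memory."
    )
```

The entry and the rule both land at the end of the prompt, closest to the point where the model starts writing, instead of depending on whatever it retained from an earlier read.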

Test the middle

The previous four techniques are about avoiding lost-in-the-middle failures. This one is about catching them. If you don’t know whether the agent is actually using the information you think it’s using, find out, with a deterministic check rather than a judgment call.

The pattern is the one I used in the Haiku summarizer that I described earlier: compare what the agent claims to know against what’s on disk. You have something the agent claims to know (its progress, its current state, the latest version of a rule), and you have something on disk that’s the ground truth (a file, a log, a database record). At the moment the agent’s claim has to be trusted, you check it.

In the summarizer’s resume protocol, every new session started by cross-checking progress.json against the actual last line written to the summary file, and the agent printed a checkpoint report when it did, both at session start and periodically through the run. A representative one looked like this:

Checkpoint Report:
✓ progress.json confirmed: cursor for cowork_04_06 is at 238, status is
"running"
✓ Disk state verified: Last line in summaries/cowork_04_06.md is [237]
assistant: Tool invocation repeating chat file read.
⚠ Discrepancy noted: The prior session left a bulk note claiming records
238–296 are duplicates but didn't write individual lines for them. Per
your instructions, I must write one line per record, even for duplicates,
in the format [idx] <sender>: Duplicate of record [X] (<note>).
Status: Cursor matches disk state. Ready to resume from record 238.

The agent doesn’t need to introspect about whether it lost context, only to compare two files. When they agree, the agent proceeds; when they disagree, the agent flags the discrepancy and stops before adding any new work on top of a broken state. Disagreement is the signal.

You can build this kind of check into any agent that does multistep work. Pick something the agent has to track, pick the file that’s the source of truth for it, and have the agent compare the two at every session start. When the agent’s view of the world drifts from the file, you find out before the drift becomes a buried bug.
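The same comparison can also live in a script outside the model, which makes it fully deterministic. The Python below is a minimal sketch, assuming a progress.json with a `cursor` field and a summary file whose lines start with `[idx]`, as in the report above. The function names are hypothetical, and the rule that the cursor should equal the last written index plus one is inferred from that report (cursor 238, last line [237]).

```python
import json
import re
from pathlib import Path
from typing import Optional, Tuple


def last_index_on_disk(summary_path: Path) -> Optional[int]:
    """Return the index of the last '[idx] ...' line, or None if there is none."""
    if not summary_path.exists():
        return None
    for line in reversed(summary_path.read_text().splitlines()):
        match = re.match(r"\[(\d+)\]", line)
        if match:
            return int(match.group(1))
    return None


def verify_checkpoint(progress_path: Path, summary_path: Path) -> Tuple[bool, str]:
    """Compare the claimed cursor with what was actually written to disk."""
    cursor = json.loads(progress_path.read_text())["cursor"]
    last = last_index_on_disk(summary_path)
    expected = 0 if last is None else last + 1  # assumed rule: cursor = last index + 1
    if cursor == expected:
        return True, f"Cursor matches disk state. Ready to resume from record {cursor}."
    return False, (
        f"Discrepancy: progress.json says {cursor}, disk implies {expected}. "
        "Stop and reconcile before writing."
    )
```

A wrapper script could run `verify_checkpoint` before launching each session and refuse to start the agent when the first element of the result is False.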

The discipline behind these techniques

When I built the Quality Playbook’s multi-phase architecture, I was solving the compaction problem. Long pipeline runs were filling the context window and triggering silent compaction in the middle of work. Breaking the pipeline into separate phases that read fresh from disk and stopped after each phase fixed it.

What I didn’t realize until later was that the same architecture also helps with the lost-in-the-middle problem. Each phase has its own short, focused context, with the phase brief at the beginning and the latest progress update at the end, so there’s almost no middle for information to fall into. The architectural move that helped with working memory disappearing turns out to also help with working memory being there and unused.

That’s the lesson I want to land. Both failure modes, context loss and lost-in-the-middle, are problems of working-memory unreliability, and the discipline that addresses them is the same: keep the working set small, put the load-bearing information at the edges of the window, and check the agent’s claims against ground truth on disk when it matters.

Context windows will keep getting bigger, and compaction will get smarter. Some of the techniques in these four articles may eventually be unnecessary. But the underlying constraint won’t disappear. After all, we’ve added a lot more RAM to our computers since the 1MB 286 I wrote about in the last article, and memory management has gotten far more complex since then. And many of these problems are structural; for example, it’s increasingly looking like the U-shape itself is a geometric property of the transformer architecture, not a training artifact that more compute will smooth out.

The bottom line is that if your agent’s ability to do its job depends on information, that information needs to live somewhere more durable than working memory. That was true for my dad’s 32 kilobytes of core memory at Princeton in the 1970s, it was true for my 640 kilobytes of conventional RAM on my 286 in the 1980s, it was true for the 200K-token windows in last year’s models, and it will be true for whatever comes next.
