Why Context Windows Won’t Save Us
Memory is the Moat
There is an emerging consensus that for AI companies to retain healthy margins and avoid the downward spiral of commoditization, memory is essential. Beyond just a data moat, memory captures knowledge of the individual user or customer: user preferences, working context, communication patterns, past decisions, project histories, and relationships. In a world where building software becomes easier and intelligence is sold at the lowest price by competing open-source players, memory is one of the few durable competitive advantages remaining.
Yet it remains an unsolved technical problem. Current approaches range from RAG systems to fine-tuning to memory tools, each with significant tradeoffs in cost, interpretability, and effectiveness.
Exponential growth is further than we think
Overall, the AI community has placed an enormous bet on context windows as the solution. The logic seems sound: if we just make them big enough, we won’t need to think about memory at all. Scaling will do for context what it did for RAM and processing power. We’ve gone from 4K to 128K to 1M tokens in just a few years, with the implicit promise that infinite context is just around the corner.
But a back-of-envelope calculation based on existing trends suggests otherwise.
Today’s commercially available LLMs operate with ~1M token context windows, which translates to roughly 1 hour of continuous operation. Extrapolating current trends (assuming context windows continue to 10x every 2 years and token-to-time ratios remain constant), we’d need roughly 4 orders of magnitude of improvement (a year is about 8,760 hours) to support truly long-running agents that run for a year.
That’s 8 years minimum before we reach the scale needed for persistent AI assistants that operate over meaningful timescales.
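The arithmetic behind that estimate can be checked in a few lines. This is a minimal sketch that uses only the figures stated above (about 1 hour of operation per ~1M-token window, and 10x growth every 2 years); the variable names are illustrative and the inputs are assumptions, not measurements.

```python
import math

# Assumptions taken from the text above, not measured values.
HOURS_PER_WINDOW_TODAY = 1.0   # ~1M tokens treated as ~1 hour of operation
YEARS_PER_10X = 2.0            # context windows assumed to grow 10x every 2 years
TARGET_HOURS = 365 * 24        # one year of continuous operation

# How much larger the window must be, in raw terms and in powers of ten.
growth_needed = TARGET_HOURS / HOURS_PER_WINDOW_TODAY
orders_of_magnitude = math.log10(growth_needed)

# Each order of magnitude is assumed to take YEARS_PER_10X years.
years_needed = orders_of_magnitude * YEARS_PER_10X

print(f"growth needed: {growth_needed:,.0f}x")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
print(f"years at 10x per 2 years: {years_needed:.1f}")
```

Under these assumptions the script reports about 3.94 orders of magnitude and about 7.9 years, which rounds to the 4 orders of magnitude and 8 years quoted above.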
Interestingly, this timeline roughly aligns with extrapolating from METR’s blog post on AI’s ability to complete long-horizon tasks, which suggests a ~7.5 year timeline for agents capable of year-long engagements. The convergence of these estimates should give us pause. Given the pace of AI development, can we really wait nearly a decade for context windows to solve memory?
Even if we could wait, there’s the migration problem. When your AI agent hits its context limit after operating for weeks or months, what happens? At those time scales, you don’t just restart with a blank slate; you lose continuity, relationships, and accumulated knowledge. The handoff between agent instances becomes a critical failure point. Someone will need to own the problem of knowledge migration and session continuity. That’s both a technical challenge and a massive opportunity.
The future is this: memory techniques are increasingly critical regardless of where context windows end up. Just as we still have hard drives in a world where you can buy 128GB of RAM for a few hundred dollars, memory beyond context will be critical for agents. The winners in the next phase of AI won’t be those who simply wait for bigger context windows. They’ll be those who crack the memory problem first.
Our Bet: Productivity Apps for AI Agents
Given these constraints, we’re betting on text-based memory systems and agentic search over traditional approaches like RAG or fine-tuning. This bet is informed by several converging signals:
Recent results suggest reasoning + search is sufficient. Letta’s benchmarking work on AI agent memory and Claude’s recent memory tools demonstrate that agentic retrieval combined with reasoning capabilities can match or exceed more complex approaches. You don’t need sophisticated vector databases or fine-tuned models. Smart search strategies and the ability to reason over retrieved text are often sufficient.
The bitter lesson argues for scaling simplicity. The bitter lesson points to simple, scalable methods consistently winning over hand-crafted complexity. Text-based search is simple. It scales. It’s debuggable. Clever RAG techniques add layers of embedding complexity that may not be necessary, and fine-tuning bakes knowledge into opaque weights that are difficult to update or audit.
Interpretability is non-negotiable in production. When memory systems fail, you need to understand why. Text-based memory is human-readable. You can inspect what’s stored, trace retrieval decisions, and debug systematically. Vector embeddings are black boxes; cosine similarity scores don’t tell you why the wrong information surfaced.
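To make the inspectability point concrete, here is a minimal sketch of a text memory whose retrieval explains itself. Everything in it is hypothetical: the `Hit` record, the `search_memory` function, and the sample notes are invented for illustration and do not describe any particular product. The idea is only that each result carries the terms that caused it to surface.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    line_no: int        # 1-based position of the note in the memory text
    text: str           # the stored note, verbatim
    matched: list[str]  # query terms found in the note (the "why")


def search_memory(memory: str, query: str) -> list[Hit]:
    """Return notes containing any query term, with the terms that matched."""
    terms = [t.lower() for t in query.split()]
    hits = []
    for line_no, line in enumerate(memory.splitlines(), start=1):
        lowered = line.lower()
        matched = [t for t in terms if t in lowered]
        if matched:
            hits.append(Hit(line_no, line, matched))
    # Rank by number of matched terms; the sort is stable, so ties keep file order.
    hits.sort(key=lambda h: len(h.matched), reverse=True)
    return hits


# Made-up sample notes, stored as plain human-readable text.
MEMORY = """\
2024-03-01 User prefers weekly summaries by email.
2024-03-04 Project Atlas deadline moved to April 15.
2024-03-09 User asked to avoid meetings on Fridays.
"""

for hit in search_memory(MEMORY, "Atlas deadline"):
    print(hit.line_no, hit.matched, hit.text)
```

When the wrong note surfaces, the `matched` field shows which terms pulled it in, and the memory itself can be opened and corrected in any text editor.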
In-context learning converges on fine-tuning performance through implicit dynamics. Recent research on transformer architectures suggests that in-context learning can be understood as implicitly modifying model weights at inference time. When a transformer processes context, the self-attention layer effectively transforms that context into low-rank weight updates to the MLP layers, performing a kind of implicit fine-tuning without explicit parameter changes. Rather than forcing information into vector embeddings or performing explicit fine-tuning, we should lean into this mechanism: store memory as text, provide it in context, and let the learning dynamics do the rest without expensive fine-tuning infrastructure.
In some sense, we’re building productivity apps for AI (in contrast to building productivity apps with AI). The same way Notion and Slack help teams organize knowledge, AI systems need infrastructure to store, organize, retrieve, and synthesize information over time.
What makes text-based systems particularly powerful:
Composability and evolution. Text can be version-controlled, merged, edited, summarized, and expanded. It can be shared across agents, compressed for efficiency, organized hierarchically. The flexibility of text as a memory substrate means we can experiment rapidly with different organizational schemas, compression techniques, and retrieval strategies without rebuilding infrastructure.
Agentic search over passive retrieval. Traditional RAG is fundamentally passive: embed everything, retrieve by similarity, hope for the best. Agentic text search is active: the AI decides what to search for, when to search, how to refine its query based on what it finds. It can follow threads of reasoning, make connections across disparate pieces of information, and update its search strategy dynamically. This mirrors how humans actually use memory, not as a static database lookup, but as an active process of reconstruction and synthesis.
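The difference between a single lookup and an active search loop can be sketched as follows. This is a toy under stated assumptions: `keyword_search`, `agentic_search`, and `follow_references` are hypothetical names, the notes are made up, and the `next_query` callback stands in for the model's reasoning step, which in a real system would be an LLM deciding what to look for next.

```python
import re
from typing import Callable, Optional

Notes = list[str]


def keyword_search(notes: Notes, query: str) -> Notes:
    """Return notes containing every term in the query (case-insensitive)."""
    terms = query.lower().split()
    return [n for n in notes if all(t in n.lower() for t in terms)]


def agentic_search(
    notes: Notes,
    first_query: str,
    next_query: Callable[[list[str], Notes], Optional[str]],
    max_steps: int = 5,
) -> tuple[Notes, list[str]]:
    """Search, inspect the results, and let the agent pick a follow-up query.

    `next_query` receives the queries issued so far and the notes found so far,
    and returns a new query, or None to stop.
    """
    found: Notes = []
    trail: list[str] = []
    query: Optional[str] = first_query
    while query is not None and len(trail) < max_steps:
        trail.append(query)
        for note in keyword_search(notes, query):
            if note not in found:
                found.append(note)
        query = next_query(trail, found)
    return found, trail


def follow_references(trail: list[str], found: Notes) -> Optional[str]:
    """Toy policy: follow the first "see: <topic>" pointer not yet searched."""
    searched = {q.lower() for q in trail}
    for note in found:
        for topic in re.findall(r"see: (\w+)", note):
            if topic.lower() not in searched:
                return topic
    return None


# Made-up notes that point at one another.
NOTES = [
    "Atlas launch slipped; see: vendor",
    "Vendor contract renewal is blocked on legal; see: legal",
    "Legal review is scheduled for next Tuesday.",
    "Team lunch is on Thursday.",
]

found, trail = agentic_search(NOTES, "atlas", follow_references)
print(trail)
for note in found:
    print("-", note)
```

Starting from "atlas", the loop follows the thread to the vendor and legal notes and never touches the unrelated lunch note. The `trail` of queries doubles as a record of how the agent got to its answer.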
What Memory Actually Means: Beyond Storage and Retrieval
In the short term, we’re betting on pure text approaches over fine-tuning or RAG. In the long term, we believe two critical types of memory, procedural memory and taste, need to be solved for AI agents to reach their full potential.
Cognitive science distinguishes between three types of memory:
Semantic memory: facts and concepts (“Paris is the capital of France”)
Episodic memory: personal experiences (“I visited Paris in 2019”)
Procedural memory: skills and know-how (“how to ride a bike”)
Current AI systems are heavily optimized for semantic and episodic memory. RAG systems retrieve facts. LLMs generate based on patterns in text. Memory benchmarks such as LoCoMo measure the ability to remember past conversation history across many sessions. But the frontier is in procedural memory, the ability to remember what you did, learn from experience, and build skills over time.
Procedural memory is where the gold is. Whether you call it planning, reinforcement learning, or continual learning, we view them as fundamentally the same thing. They’re all about developing policies (rules for choosing actions based on the current state of the world) that improve through experience. When an RL agent learns to play chess, it’s building procedural memory. When a language model is trained to plan across multiple steps, it’s developing a policy. When a system does continual learning, it’s updating its procedures based on new experience. This matters because virtually every valuable task in the world requires procedural memory: editing a slide deck, deploying servers, or managing projects.
But there’s a related challenge that’s adjacent to procedural memory and is harder to define. To execute procedures effectively, agents need to anticipate whether an action will be good or bad before taking it. This requires something beyond just knowing what to do. It requires judgment about quality, appropriateness, and effectiveness. We view this as taste: a distinct form of memory that emerges from experience but serves a different purpose than procedural knowledge alone.
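As a toy illustration of a policy that improves through experience, the sketch below keeps a running average reward for each (state, action) pair and acts on it. It is deliberately simplified and entirely hypothetical: the `ExperiencePolicy` class, the two states, the three actions, and the reward rule are invented for this example, and real agents face far richer states and noisier feedback.

```python
from collections import defaultdict


class ExperiencePolicy:
    """Tabular policy: pick the action with the best average reward so far."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.total = defaultdict(float)  # (state, action) -> summed reward
        self.count = defaultdict(int)    # (state, action) -> times tried

    def value(self, state, action):
        n = self.count[(state, action)]
        return self.total[(state, action)] / n if n else 0.0

    def act(self, state):
        # Explore: try each action once per state before trusting estimates.
        for action in self.actions:
            if self.count[(state, action)] == 0:
                return action
        # Exploit: choose the action with the highest average reward.
        return max(self.actions, key=lambda a: self.value(state, a))

    def learn(self, state, action, reward):
        self.total[(state, action)] += reward
        self.count[(state, action)] += 1


# Hypothetical environment: one correct action per state, reward 1 or 0.
BEST_ACTION = {"tests_failing": "fix_code", "tests_passing": "deploy"}


def reward_for(state, action):
    return 1.0 if BEST_ACTION[state] == action else 0.0


policy = ExperiencePolicy(["deploy", "fix_code", "rollback"])
for step in range(20):
    state = "tests_failing" if step % 2 == 0 else "tests_passing"
    action = policy.act(state)
    policy.learn(state, action, reward_for(state, action))

print(policy.act("tests_failing"))  # fix_code
print(policy.act("tests_passing"))  # deploy
```

Nothing about which action is correct is written into the policy. After a handful of trials it has learned from outcomes alone what to do in each state, which is the sense of "procedural memory" used here.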
Consider Mary’s Room, a classic philosophical thought experiment. Mary is a scientist who knows everything physical about color: wavelengths, neural processing, the complete physics of light. But she’s lived her whole life in a black-and-white room. When she finally sees red for the first time, does she learn something new?
The thought experiment points to a deeper truth about knowledge: there’s a difference between information and experience, between semantic knowledge and embodied understanding. An AI can “know” that coffee tastes bitter and energizing, but without the qualia of actually drinking coffee, does it understand coffee?
This matters because AI agents will need to develop taste, genuine aesthetic judgment born from experience, to complete some tasks. The ability to recognize good code, elegant prose, and effective strategy requires something more than retrieval and cannot be summed up in words alone. Consider the task of creating good design or making a sound strategic decision. These capabilities require an agent to build internal models through experience, not just retrieve examples. Taste is what allows humans to navigate ambiguous situations where there’s no clear right answer. AI agents will need the same capability to be truly useful in complex, open-ended domains.
The Path Forward
The AI industry stands at an inflection point in how we think about memory. Context windows will continue to grow, but they won’t solve memory alone. RAG and vector databases will continue to improve, but they won’t create the interpretable, evolvable memory systems that long-running agents need. Fine-tuning will remain useful for specific domains, but it won’t provide the flexibility required for agents that must adapt continuously.
The short-term bet for us is clear: text-based memory systems with agentic search. Build productivity infrastructure for AI agents that’s interpretable and composable, and that leverages the implicit learning dynamics already built into transformers.
Our long-term vision is richer and more ambitious. We’re working toward agents that don’t just remember facts but develop long-term procedural knowledge. We anticipate that agents of the future will resemble a master craftsman who has accumulated decades of experience at their trade. Agents that build taste through repeated exposure, developing aesthetic judgment that can’t be encoded in rules. Agents that compound their capabilities over time, not by accumulating more data, but by refining their policies and sharpening their judgment.
This is the future of AI memory: not bigger context windows, but smarter learning systems. Memory isn’t just about what you store. It’s about what you learn, what you forget, and what you become.
That’s the future we’re building toward.
Back-of-envelope caveat: These calculations assume context window growth rates hold and token-to-time ratios remain constant. METR’s timeline measures task length by human time, not AI completion time. The directional insight holds despite these simplifications.



