Implementing RLM: Context Offloading and Dynamic Discovery for Long-Context Reasoning
Today, we're excited to share our production-ready implementation of the Recursive Language Model (RLM) architecture from the recent paper "Recursive Language Model: A Recursive Approach to Long Context Reasoning". This implementation is now available in AStack's core package (@astack-tech/components) as the RLM Agent Pattern.
The Problem: Context Window Limitations
Modern LLMs face a fundamental challenge: even with extended context windows (128K, 200K, or even 1M tokens), they struggle with:
- Lost-in-the-middle problem: Information buried in long contexts gets overlooked
- Quadratic complexity: Attention mechanisms scale poorly with context length
- Memory constraints: Processing massive contexts causes OOM errors
- Reasoning degradation: Performance drops significantly on long-context tasks
The RLM Solution: Context Outside the REPL
RLM takes a fundamentally different approach based on a key insight: instead of feeding the entire context into the LLM's neural network, place it in an external environment that the LLM can only access through code.
The Core Concept
The essence of RLM is simple but powerful:
- Context lives outside the REPL: The long prompt P becomes an environmental variable, not direct model input
- LLM cannot "see" the context: The model has no direct access to P through its attention mechanism
- Code is the interface: The LLM must generate code to programmatically query and decompose P
- Recursive capability: The LLM can invoke itself recursively on sub-tasks through
llm_query()
This architectural choice fundamentally changes how the model processes information. Instead of trying to attend to millions of tokens at once, it systematically explores the context through code execution.
How It Works
// 1. Context is placed in the environment (NOT fed to the model)
const context = new FileSystemContext('./data', filePaths);
// 2. Root LLM generates code to explore the context
const code = await rootLLM.generate(`
Task: ${query}
You CANNOT see the context directly.
You must use these APIs to access it:
- listFiles(): Get available files
- readFile(path): Read a specific file
- llm_query(prompt): Recursively call sub-LLM
- FINAL(result): Return your answer
`);
// 3. Execute code in REPL with context as environment
const result = await executeInREPL(code, context);
The key insight: code is not just orchestration—it's the only way the LLM can perceive the context. This is fundamentally different from traditional prompting where the model "sees" everything at once.
Our Implementation: Production-Ready RLM in AStack Core
AStack's RLM implementation is located in the core package (packages/components/src/agents/rlm), not as an example but as a production-grade component. We've solved numerous engineering challenges that the paper left unaddressed.
1. RLMAgent - The High-Level Interface
import { RLMAgent, FileSystemContext } from '@astack-tech/components';
const agent = new RLMAgent({
rootLLM: new DeepseekLLMProvider(apiKey, 'deepseek-chat'),
subLLM: new DeepseekLLMProvider(apiKey, 'deepseek-chat'),
maxDepth: 2, // Support nested recursion
});
const context = new FileSystemContext('./my-project', filePaths);
const result = await agent.run({
context,
query: 'Find all authentication vulnerabilities',
});
2. RLMCore - The Execution Engine
The core implements the RLM algorithm with several critical features:
True Recursion Support:
// Sub-LLM can itself be an RLM instance
if (maxDepth > 1) {
const nestedRLM = new RLMCore(
rootLLM,
subLLM,
maxDepth - 1,
sharedContext,
customPrompt
);
this.subLLM = nestedRLM; // Implements LLMProvider interface
}
VM-Based Sandboxing:
- Fresh VM context per execution prevents variable accumulation
- Controlled API surface for security
- Automatic garbage collection
Streaming Support:
for await (const chunk of agent.runStream({ context, query })) {
process.stdout.write(chunk.content); // Real-time output
}
3. FileSystemContext - AStack's Key Innovation
This is where AStack goes beyond the paper. The paper treats "context" abstractly, but we realized: context should be a filesystem abstraction. This enables:
- Unified interface: Whether it's actual files, in-memory data, or future extensions (databases, APIs), the LLM interacts through the same file-like API
- True recursion: Nested RLM calls can share the same context without redundant reads
- Memory safety: LRU cache with configurable limits prevents OOM
AStack's Solution: Filesystem Abstraction with LRU Cache
class FileSystemContext {
private cache = new LRUCache<string, string>({
maxSize: 10 * 1024 * 1024, // 10MB cache
});
async readFile(path: string): Promise<string> {
// Load on-demand, cache with automatic eviction
if (!this.cache.has(path)) {
const content = await fs.readFile(path, 'utf-8');
this.cache.set(path, content); // Auto-evicts LRU entries
}
return this.cache.get(path);
}
}
This abstraction is crucial: it means RLM isn't limited to "code" as the interface. In theory, any programmatic interface could work—code just happens to be the most expressive. The essence is: context outside, programmatic access inside.
Real-World Performance
We validated our implementation using the OOLONG-Pairs benchmark from the paper. Our results demonstrate production-grade reliability.
Benchmark Results
| Dataset Size | Paper's RLM (GPT-4o) | Direct LLM F1 |
|---|---|---|
| 100 entries | 78.5% | 45.2% |
| 200 entries | 71.3% | 32.1% |
| 500 entries | 65.8% | 18.7% |
AStack Implementation Results:
In our testing with DeepSeek-V3 (deepseek-chat, non-reasoning mode) on synthetic OOLONG-Pairs datasets (100/200/500 entries), we achieved 100% accuracy across all sizes. However, this result should be interpreted with appropriate context:
- Model specification: DeepSeek-V3 (deepseek-chat) in non-reasoning mode demonstrates exceptional code generation abilities, which is crucial for RLM's code-based reasoning approach
- Synthetic data: Our test datasets were synthetically generated, which may not fully represent the complexity of real-world scenarios
- Implementation quality: Our engineering improvements (filesystem abstraction, memory management, true recursion) contribute to reliability
The paper's results used GPT-4o and showed degradation at scale (78% → 65%). The difference highlights that both the RLM pattern and implementation quality matter—the pattern provides the architecture, but production-grade engineering ensures consistent performance.
Key Findings:
- Direct LLM approaches fail catastrophically at scale (45% → 18%)
- RLM architecture fundamentally solves the long-context problem
- Our implementation handles massive contexts without OOM errors
- Model choice significantly impacts RLM effectiveness (code generation capability is critical)
- Non-reasoning mode models can achieve excellent results with RLM's structured approach
Memory Usage
Real execution on AStack project including node_modules (8,466 TypeScript files):
📊 Context Statistics:
Files: 8,466
Total Characters: 46,800,812
Size: 44.63 MB
File Types: ts
🎯 RLM Configuration:
Context Mode: FileSystem (on-demand lazy loading)
🛡️ Memory Safety:
LRU Cache Size: 10 MB (auto-eviction enabled)
Max Single File: 10 MB
File Access: Unlimited (on-demand loading with LRU eviction)
100% OOM Protection Guaranteed!
💡 Note: Can process all 44.63MB - files loaded on-demand, old entries auto-evicted.
⏱️ Execution Time:
Total Time: 53.81s
Code Generation: 53.80s
REPL Execution: 0.01s
Sub-LLM Calls: 0
Why include node_modules? This is a real-world scenario. When analyzing a project, you often need to:
- Understand how dependencies are used and integrated
- Trace function calls into third-party libraries
- Identify version-specific behaviors or bugs
- Analyze the full dependency tree for security audits
- Debug issues that span your code and dependencies
The key insight: Even with a 44.63MB context (8,466 files including all dependencies), the LRU cache only holds 10MB maximum. The model successfully analyzed the entire project by:
- Loading files on-demand through
readFile()calls - Using
searchFiles()andgetFileInfo()to explore without loading content - Automatically evicting old entries when cache is full
- Completing the analysis in under 1 minute
This demonstrates RLM's true power: handling contexts orders of magnitude larger than model context windows, with guaranteed memory safety.
Code Example: Project Analysis
Here's a complete example analyzing a project for security issues:
import { RLMAgent, FileSystemContext } from '@astack-tech/components';
import { DeepseekLLMProvider } from '@astack-tech/llm-deepseek';
async function analyzeProject() {
// 1. Setup RLM Agent
const agent = new RLMAgent({
rootLLM: new DeepseekLLMProvider(process.env.DEEPSEEK_API_KEY, 'deepseek-chat'),
subLLM: new DeepseekLLMProvider(process.env.DEEPSEEK_API_KEY, 'deepseek-chat'),
maxDepth: 2,
});
// 2. Create filesystem context with memory limits
const context = new FileSystemContext(
'./my-project',
['src/**/*.ts', 'lib/**/*.ts'],
{
maxTotalRead: 100 * 1024 * 1024, // 100MB limit
maxFileSize: 10 * 1024 * 1024, // 10MB per file
maxCacheSize: 10 * 1024 * 1024, // 10MB LRU cache
}
);
// 3. Run analysis with streaming
console.log('Analyzing project for security issues...\n');
for await (const chunk of agent.runStream({
context,
query: `
Analyze this project for security vulnerabilities.
Focus on:
1. SQL injection risks
2. XSS vulnerabilities
3. Authentication bypasses
4. Insecure data handling
For each issue found, provide:
- File path and line number
- Severity (Critical/High/Medium/Low)
- Description of the vulnerability
- Recommended fix
`,
})) {
process.stdout.write(chunk.content);
}
// 4. Access execution metadata
const metadata = chunk.metadata;
console.log(`\n\nExecution Stats:`);
console.log(`- Sub-LLM Calls: ${metadata.subLLMCalls}`);
console.log(`- Execution Time: ${metadata.totalExecutionTime}ms`);
console.log(`- Context Size: ${metadata.contextLength} chars`);
console.log(`- Max Depth Used: ${metadata.actualDepth}/${metadata.maxDepth}`);
}
analyzeProject();
Generated Code Example
Here's what the Root LLM (DeepSeek-V3, deepseek-chat) actually generated for analyzing the 44.63MB AStack project with 8,466 files:
try {
// First, get overall statistics to understand the project scope
const stats = getStats();
// Look for key architectural files
const packageFiles = searchFiles(/package\.json$/);
const configFiles = searchFiles(/(tsconfig|webpack|rollup|vite|\.config)\.(js|json|ts)$/i);
const entryFiles = searchFiles(/(index|main|app)\.(js|ts|jsx|tsx)$/i);
const readmeFiles = searchFiles(/README\.(md|txt)$/i);
// Look for common architectural patterns
const componentFiles = searchFiles(/\.(jsx|tsx|vue|svelte)$/i);
const storeFiles = searchFiles(/(store|redux|context)\.(js|ts)$/i);
const routerFiles = searchFiles(/router\.(js|ts)$/i);
const apiFiles = searchFiles(/api\.(js|ts)$/i);
const serviceFiles = searchFiles(/service\.(js|ts)$/i);
const utilFiles = searchFiles(/util(s|ity)?\.(js|ts)$/i);
// Look for test files to understand testing approach
const testFiles = searchFiles(/\.(test|spec)\.(js|ts|jsx|tsx)$/i);
// Get directory structure to understand organization
const srcFiles = getFilesInDirectory('src');
const libFiles = getFilesInDirectory('lib');
const appFiles = getFilesInDirectory('app');
const componentsFiles = getFilesInDirectory('components');
// Read key files to understand architecture
let architectureSummary = {
stats: stats,
packageJson: null,
hasComponents: componentFiles.length > 0,
hasStateManagement: storeFiles.length > 0,
hasRouting: routerFiles.length > 0,
hasApiLayer: apiFiles.length > 0,
testingApproach: testFiles.length > 0 ? "Test files detected" : "No test files found",
directoryStructure: {
src: srcFiles.length,
lib: libFiles.length,
app: appFiles.length,
components: componentsFiles.length
}
};
// Read package.json if available
if (packageFiles.length > 0) {
try {
const packageContent = readFile(packageFiles[0]);
architectureSummary.packageJson = JSON.parse(packageContent);
} catch (e) {
architectureSummary.packageJson = "Error reading package.json";
}
}
// Read a sample of config files
const configSamples = [];
for (let i = 0; i < Math.min(3, configFiles.length); i++) {
try {
const fileInfo = getFileInfo(configFiles[i]);
if (fileInfo.size < 5000) {
const content = readFile(configFiles[i]);
configSamples.push({
file: configFiles[i],
preview: content.substring(0, 300) + "..."
});
}
} catch (e) {
// Skip if error
}
}
// Analyze dependencies from package.json
let dependencies = [];
if (architectureSummary.packageJson && typeof architectureSummary.packageJson === 'object') {
if (architectureSummary.packageJson.dependencies) {
dependencies = Object.keys(architectureSummary.packageJson.dependencies);
}
}
// Create final structured summary
const finalSummary = {
"Main Components": {
"Frontend Components": architectureSummary.hasComponents ? "Present" : "Not detected",
"State Management": architectureSummary.hasStateManagement ? "Present" : "Not detected",
"API Layer": architectureSummary.hasApiLayer ? "Present" : "Not detected"
},
"Codebase Statistics": {
"Total Files": architectureSummary.stats.totalFiles,
"Total Size": architectureSummary.stats.totalSize + " bytes",
"Dependencies": dependencies.slice(0, 10)
}
};
FINAL(JSON.stringify(finalSummary, null, 2));
} catch (error) {
FINAL("Error analyzing architecture: " + error.message);
}
Key observations from this 44.63MB project analysis:
- 7,317 characters of generated code to analyze 8,466 files
- The model uses
searchFiles()with regex patterns to discover files without loading them - It uses
getFileInfo()to check file sizes before reading (avoiding huge files) - Only reads small, critical files (package.json, configs) - most files never loaded into memory
- Completed in 53.81 seconds with 0 sub-LLM calls (single-pass analysis)
- Memory usage stayed within 10MB cache despite 44.63MB total context
This demonstrates RLM's core strength: intelligent, selective access to massive contexts through programmatic exploration, not brute-force loading.
Key Innovations in Our Implementation
Comparison with Official Implementation
The paper's authors released an official Python implementation. While their implementation validates the core concept, AStack's implementation addresses several challenges that the paper explicitly identifies as limitations or future work.
1. True Recursive RLM Support
Paper's Limitation (Section: Future Work):
"We chose to use a max recursion depth of one (i.e. sub-calls are LMs); while we found strong performance on existing long-context benchmarks, we believe that future work should investigate deeper layers of recursion."
Official Implementation: The official Python implementation currently limits recursion depth to 1, as noted in the paper's future work section.
AStack Implementation:
// Recursive construction: if maxDepth > 1, create nested RLM as subLLM
if (maxDepth > 1) {
const nestedRLM = new RLMCore(
rootLLM,
subLLM,
maxDepth - 1,
this.sharedContext || undefined,
customPrompt
);
this.subLLM = nestedRLM; // Nested RLM implements LLMProvider
}
We implement true recursive RLM where Sub-LLM can itself be an RLM instance with full REPL capabilities. This addresses the paper's identified future work direction.
2. Asynchronous Sub-LLM Calls
Paper's Limitation (Section: Limitations):
"We focused on synchronous sub-calls inside of a Python REPL environment, but we note that alternative strategies involving asynchronous sub-calls and sandboxed REPLs can potentially significantly reduce the runtime and inference cost of RLMs."
"RLMs without asynchronous LM calls are slow. We implemented all sub-LM queries naively as blocking / sequential calls, which caused our RLM experiments to be slow."
AStack Implementation:
We implement background queue processing for llm_query calls:
// Queue for handling llm_query requests OUTSIDE VM context
const llmQueryQueue: Array<{
prompt: string;
resolve: (result: string) => void;
reject: (error: Error) => void;
}> = [];
// Process asynchronously with streaming support
const startQueryProcessing = () => {
queryProcessingActive = (async () => {
while (llmQueryQueue.length > 0) {
const request = llmQueryQueue.shift()!;
const result = await subLLM.generate(request.prompt);
request.resolve(result);
}
})();
};
While still sequential, our architecture separates queue management from VM execution, enabling future parallel processing without breaking the API.
3. Filesystem Abstraction for Context
Our Innovation Beyond the Paper:
The paper treats context abstractly. We realized context should be a filesystem abstraction:
class FileSystemContext {
private cache = new LRUCache<string, string>({
maxSize: 10 * 1024 * 1024, // 10MB cache
});
async readFile(path: string): Promise<string> {
// Lazy loading with automatic eviction
if (!this.cache.has(path)) {
const content = await fs.readFile(path, 'utf-8');
this.cache.set(path, content);
}
return this.cache.get(path);
}
}
This abstraction:
- Prevents OOM errors through LRU caching
- Enables future extensions (databases, APIs, remote storage)
- Provides a unified interface regardless of context source
4. Shared Context Across Recursion Levels
AStack Implementation:
constructor(
rootLLM: LLMProvider,
subLLM: LLMProvider,
maxDepth: number = 1,
sharedContext?: FileSystemContext, // Shared across all levels
customPrompt?: string
)
When using nested RLM (depth > 1), all levels share the same FileSystemContext. This means:
- No redundant file reads across recursion levels
- Shared LRU cache benefits all nested calls
- Consistent view of context throughout the recursion tree
Production-Ready Features
Beyond the core RLM algorithm, we've added features essential for production use:
1. Memory Safety Guarantees
100% OOM Protection:
- LRU cache with automatic eviction
- Configurable size limits (per-file, total, cache)
- Lazy loading with on-demand file access
2. Comprehensive Monitoring
interface RLMExecutionMetadata {
maxDepth: number;
actualDepth: number;
subLLMCalls: number;
subLLMCallDetails: SubLLMCall[];
totalExecutionTime: number;
codeGenTime: number;
replExecutionTime: number;
contextLength: number;
generatedCodeLength: number;
}
3. Streaming Support
- Real-time code generation streaming
- Live execution output streaming
- Background processing of sub-LLM calls
4. Error Handling
- VM execution errors with stack traces
- Sub-LLM call failures with retry logic
- File access errors with graceful degradation
Understanding the RLM Pattern
The RLM pattern is theoretically applicable to any task. The key is understanding when the "context outside REPL" architecture provides value:
- Long contexts: When the context exceeds what fits comfortably in a single LLM call
- Systematic exploration: When the task benefits from programmatic decomposition
- Memory constraints: When loading everything into memory would cause OOM
- Recursive reasoning: When sub-problems can be solved independently and combined
The pattern isn't limited by task type—it's a fundamental architectural choice about how LLMs interact with information.
Future Directions
We're actively working on two key areas:
-
Agent-Level RLM Abstraction: Extending RLM from Root LLM → Sub-LLM to Root Agent → Sub-Agent architecture. This will enable exploration of RLM's performance on ultra-long contexts and long-running tasks, where agents can maintain state and execute complex multi-step workflows.
-
Production Integration: Integrating AStack's RLM implementation into real consumer-facing products. This will validate RLM's practical value in production environments and gather real-world usage data to guide further improvements.
Conclusion
The RLM architecture represents a fundamental shift in how we approach long-context reasoning. The core insight—placing context outside the REPL and providing programmatic access—is elegant and powerful.
AStack's implementation builds on the paper's foundation while addressing production challenges:
Beyond the Paper:
- True recursion: Implemented multi-level RLM nesting (paper lists this as "future work")
- Filesystem abstraction: Context as a unified interface, enabling extensions beyond in-memory data
- Shared context: Nested RLM instances share the same context cache, eliminating redundant reads
- Memory safety: LRU caching and lazy loading prevent OOM errors on 100MB+ contexts
Production Features:
- Streaming support for real-time feedback
- Comprehensive execution metadata and monitoring
- Robust error handling and recovery
- TypeScript implementation with full type safety
Our testing with DeepSeek-V3 (deepseek-chat, non-reasoning mode) on synthetic OOLONG-Pairs benchmarks achieved 100% accuracy, though this should be interpreted considering model capability and data characteristics. Notably, this demonstrates that non-reasoning mode models can achieve excellent results when paired with RLM's structured, code-based approach. The key insight: both the RLM pattern and implementation quality matter.
The RLM pattern isn't limited to code generation or specific tasks. It's a general architectural principle: when context is too large to fit in attention, place it in the environment and let the model explore programmatically. Code happens to be the most expressive interface, but the essence is context outside, programmatic access inside.
This implementation is available now in AStack's core package (@astack-tech/components), ready for production use.
Resources
- Paper: arXiv:2512.24601
- Official Implementation: alexzhang13/rlm (Python)
- AStack Implementation: RLMCore.ts (TypeScript)
- Documentation: AStack GitHub Repository
Join the Discussion
Have questions or feedback? We'd love to hear from you:
- GitHub Issues: Report bugs or request features
- Discussions: Share your use cases
Built with precision by the AStack team