Afu OS

AI Memory · Context Engineering · Personal AI

Overview

Afu OS is a personal operating system for working with AI. It externalizes projects, preferences, state, and operating rules into a structured vault that different AI agents can read, operate on, and accumulate context from — so they understand the same person over time instead of starting from zero every session.

It was not designed top-down. It grew from 19 consecutive days of real daily use: 632 Markdown files across 15 top-level directories, 19 routing preference rules extracted from actual decisions, and six issue-to-root-cause-to-protocol-fix loops. The system now handles morning page-turns, inbox routing, evening close, and cross-session memory through explicit protocols rather than prompts. A companion retrieval layer — the Afu Memory Engine — adds semantic search over 3,456 vector chunks, but its real contribution was confirming where RAG adds value and where it doesn't.


The Problem

Every AI tool starts from zero. Open a new chat, and you explain who you are, what you're working on, and what decisions you've already made. Close it, and everything resets. Knowledge bases solve part of this — they store content — but they don't capture operating state: what's in progress, what's blocked, what was decided yesterday and why, which preferences should carry forward and which were one-off.

The deeper problem is that personal context isn't just a collection of facts. It's a live system: today's focus, open loops, pending decisions, routing rules that get refined through use, and protocols that turn recurring judgment calls into executable steps. Without this layer, even a well-stocked knowledge base leaves the AI guessing about what matters right now.

Product Judgment

Three convictions shaped the system.

First, protocols over prompts. Telling an AI to "be helpful" in a system prompt is fragile. Writing a structured checklist that the AI must execute — with hard constraints, stop conditions, and explicit verification steps — is reliable. The page-turn protocol, routing, and evening close are not suggestions. They're executable specifications with pass/fail criteria.

Second, freeze deliberately. When a subsystem works well enough, freeze it. Phase 04a and 04b were frozen not because they were perfect, but because the cost of continued redesign exceeded the value of additional refinement. Stability is a feature when you're the only user.

Third, retrieval is a chain, not a single step. The system tries four paths in fixed order: direct read of known files → directory search → keyword grep → AME semantic retrieval. Whichever path produces a candidate, the chain then goes back to the source files for verification. No single retrieval method is trusted alone. The AME RAG experiment showed that semantic similarity often retrieves what sounds right rather than what proves right, so AME is positioned as a discovery layer, not the default entry point.

How It Works

Afu OS is structured as a file-system-native vault with five functional layers.

Data & Knowledge includes inbox, notes, projects, external file indexes, and daily logs. Everything that enters the system is routed through explicit intake protocols.

Personal Context includes a daily cockpit refreshed each morning with calendar events, inbox state, top outcomes, and active focus. Open loops, pending decisions, and routing preferences persist across sessions.

Memory & Retrieval uses a four-path retrieval chain: direct read of known files, directory search, keyword grep, and AME semantic retrieval using local BGE-M3 embeddings. Every retrieval path ends with source verification — no answer is trusted without checking the original file.
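The ordering described above can be sketched in a few lines of Python. This is a minimal illustration, not the vault's actual code: the names `retrieve`, `verify_against_source`, `direct_read`, `directory_search`, and `keyword_grep` are hypothetical, the "all terms present" check is an assumed stand-in for source verification, and the semantic path is passed in as a plain callable so that no embedding library is assumed.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, Optional

@dataclass
class Hit:
    path: Path  # source file that survived verification
    via: str    # name of the retrieval path that proposed it

# A retrieval path maps a query to candidate source files (hypothetical interface).
RetrievalPath = Callable[[str], Iterable[Path]]

def verify_against_source(path: Path, evidence_terms: list[str]) -> bool:
    """Re-open the original file and confirm every evidence term is really there.
    Treating verification as 'all terms present' is an assumption of this sketch."""
    try:
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
    except OSError:
        return False
    return all(term.lower() in text for term in evidence_terms)

def direct_read(known: dict[str, Path]) -> RetrievalPath:
    # Path 1: the query names a file whose location is already known.
    return lambda query: [known[query]] if query in known else []

def directory_search(vault: Path) -> RetrievalPath:
    # Path 2: match the query against file names only.
    return lambda query: sorted(
        p for p in vault.rglob("*.md") if query.lower() in p.name.lower()
    )

def keyword_grep(vault: Path) -> RetrievalPath:
    # Path 3: match the query against file contents.
    def run(query: str) -> list[Path]:
        needle = query.lower()
        return sorted(
            p for p in vault.rglob("*.md")
            if needle in p.read_text(encoding="utf-8", errors="ignore").lower()
        )
    return run

def retrieve(query: str, evidence_terms: list[str],
             chain: list[tuple[str, RetrievalPath]]) -> Optional[Hit]:
    """Try each path in fixed order; accept only candidates that pass verification."""
    for name, path_fn in chain:
        for candidate in path_fn(query):
            if verify_against_source(candidate, evidence_terms):
                return Hit(candidate, name)
    return None  # nothing verified: report a miss rather than trust an unverified hit
```

A chain would be assembled as `[("direct", direct_read(known)), ("directory", directory_search(vault)), ("grep", keyword_grep(vault)), ("semantic", ame_search)]`, where `ame_search` is whatever callable wraps the semantic index. Placing it last means it only runs when the cheaper, exact paths find nothing that verifies.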

Action Protocols govern the daily rhythm: page-turn, prompted routing, and evening close. Each protocol has explicit completion criteria; skipped steps must be logged with reasons.
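One way to picture a protocol with completion criteria and logged skips is the sketch below. It is illustrative only: `Step`, `ProtocolRun`, and `run_protocol` are hypothetical names, and the rule that mandatory steps cannot be skipped at all is an assumption of this sketch, not something the vault is confirmed to enforce in code.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Step:
    name: str
    check: Callable[[], bool]  # returns True when the step's completion criterion holds
    mandatory: bool = True     # assumption: mandatory steps may not be skipped

@dataclass
class ProtocolRun:
    protocol: str
    passed: list[str] = field(default_factory=list)
    failed: list[str] = field(default_factory=list)
    skipped: dict[str, str] = field(default_factory=dict)  # step name -> logged reason

    @property
    def complete(self) -> bool:
        # The run passes only if no step failed.
        return not self.failed

def run_protocol(protocol: str, steps: list[Step],
                 skip_reasons: Optional[dict[str, str]] = None) -> ProtocolRun:
    """Execute every step in order and record pass, fail, or skip-with-reason."""
    skip_reasons = skip_reasons or {}
    run = ProtocolRun(protocol)
    for step in steps:
        if step.name in skip_reasons:
            reason = skip_reasons[step.name].strip()
            if step.mandatory or not reason:
                # Skipping a mandatory step, or skipping without a reason, is a failure.
                run.failed.append(step.name)
            else:
                run.skipped[step.name] = reason
            continue
        (run.passed if step.check() else run.failed).append(step.name)
    return run
```

For example, an evening close could be modelled as three `Step` objects, with an optional step skipped via `skip_reasons={"tomorrow seed": "no carry-forward items today"}`; the resulting `ProtocolRun` then doubles as the log entry for that day.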

Operating Memory stores stable cross-session context. Personal Portrait provides semantic understanding of the user. Both are loaded at session start, creating a minimal persistent procedural memory layer that doesn't depend on any single model or chat history.

What I Tested

The system has been validated through 19 consecutive days of real use, not synthetic benchmarks. Six issue-to-root-cause-to-protocol-fix loops are documented: page-turn failing to refresh inbox state, page-turn dropping tomorrow-seed carry-forward, evening close skipping system activity cleanup, and others. Each fix is traceable to a specific protocol amendment with a dated issue log.

The AME retrieval layer was evaluated separately. A small-sample query comparison tested five real-use questions across three retrieval modes: direct-only, vector-only, and hybrid. The hash-local embedding passed the relevance check on 4 of 5 questions; BGE-M3 passed 3 of 5 under a stricter evaluation that flagged forbidden signals. More importantly, the experiment revealed that the retrieval objective itself was wrong: optimizing for semantic similarity retrieved self-reflective content while missing hard evidence like PRDs, demos, and eval data. This led to a fundamental correction: AME is a discovery layer, not the primary entry point.
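The difference between a plain relevance pass and a stricter pass that flags forbidden signals can be sketched as follows. The interpretation here is an assumption: a question passes the relevance check when all required source types are retrieved, and fails the strict check if any forbidden source type also appears. `Question`, `passes`, and `pass_count` are hypothetical names, and no real evaluation data is reproduced.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Question:
    text: str
    required: frozenset   # source types that must be among the retrieved items
    forbidden: frozenset  # source types that count as a forbidden signal

def passes(q: Question, retrieved_types: set[str], strict: bool) -> bool:
    """Relevance pass: every required type was retrieved.
    Strict pass: additionally, no forbidden type was retrieved."""
    relevant = q.required <= retrieved_types
    if not strict:
        return relevant
    return relevant and not (q.forbidden & retrieved_types)

def pass_count(questions: list[Question],
               retrieved_by_question: dict[str, set[str]],
               strict: bool) -> int:
    # Number of questions that pass under the chosen evaluation mode.
    return sum(passes(q, retrieved_by_question[q.text], strict) for q in questions)
```

Under this reading, the same retrieval run can score lower on the strict check than on the plain relevance check, so a pass count is only comparable to another one produced under the same mode.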

What Failed and Changed

The most instructive failure was the AME RAG experiment. The engineering pipeline ran perfectly — chunking, embedding, vector indexing, incremental updates, query routing — every component worked at production level. But the answers were worse than direct file access because the system optimized for the wrong objective. It retrieved semantically similar content while missing the highest-evidence material. The Memory Pack looked complete, which made it more dangerous than an obviously incomplete result — it gave the agent permission to stop searching.

The fix was not better embeddings. It was repositioning AME from the default entry point to the fourth and last path in the four-path chain, always followed by source verification. The retrieval objective was also reformulated from semantic similarity to a composite of evidence strength, source authority, recency, and task fit.
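A composite objective of this kind could look like the weighted sum below. This is a sketch under stated assumptions: the four factor names come from the text, but the 0-to-1 scales, the weights in `WEIGHTS`, and the names `Candidate`, `composite_score`, and `rank` are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    path: str
    similarity: float  # semantic similarity from the vector index, 0..1
    evidence: float    # evidence strength, 0..1 (assumed scale)
    authority: float   # source authority, 0..1 (assumed scale)
    recency: float     # 1.0 = most recent, 0..1 (assumed scale)
    task_fit: float    # fit to the current task, 0..1 (assumed scale)

# Hypothetical weights for illustration; the source does not state any.
WEIGHTS = {"evidence": 0.4, "authority": 0.25, "recency": 0.15, "task_fit": 0.2}

def composite_score(c: Candidate, weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum over the four factors; similarity is deliberately left out."""
    return sum(w * getattr(c, name) for name, w in weights.items())

def rank(candidates: list[Candidate],
         weights: dict[str, float] = WEIGHTS) -> list[Candidate]:
    # Highest composite score first.
    return sorted(candidates, key=lambda c: composite_score(c, weights), reverse=True)
```

With these weights, a PRD that scores high on evidence and authority outranks a reflective note that merely has the higher `similarity`, which is the behaviour the reformulated objective is meant to produce.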

Earlier protocol failures followed a different pattern: the evening close protocol was documented but not consistently executed, leading to stale state carrying into the next morning's page-turn. The fix was hardening mandatory system activity cleanup and inbox actionable scan into non-negotiable checklist items with explicit pass/fail criteria.

Current Limits

This is a single-user system validated through one person's daily use. It has not been tested with other people, other vault structures, or other AI tools beyond Claude. The AME retrieval layer uses a small-sample index; production-scale retrieval with the full vault has not been benchmarked. Cross-agent context transfer — the original motivating vision — has been demonstrated between Claude Code sessions but not across fundamentally different agent architectures. The protocols are stable for the current operating mode but have not been stress-tested under higher tempo or multi-project parallel execution.

These limits are acceptable because Afu OS is not a product being sold. It's a living system that earns its complexity through daily use, and every feature that exists has been paid for by a specific problem that actually occurred.

Technical Debt

The AME retrieval layer runs locally with BGE-M3 embeddings loaded from HuggingFace — this works but requires manual model caching and has no automated index rebuild. The four-path retrieval chain is manually sequenced by the agent rather than orchestrated by a unified retrieval router. Protocol compliance is enforced by agent instruction-following, not by a runtime validator — which means enforcement quality depends on the agent, not on guaranteed execution. There is no automated test suite for protocol execution; verification is through manual audit of daily logs and issue tracking.

Next Step

The system is stable in its current operating mode. Three directions are worth exploring when the tempo allows.

Can the protocol enforcement layer move from agent-read checklists to a runtime validator that guarantees execution regardless of which model is driving? Can the AME retrieval chain be productized — not as a better search engine, but as a context packaging layer that teaches agents when to trust retrieval and when to go back to source files? And can the system be generalized beyond one person without losing the specificity that makes it work?

The most important next step, though, is simpler: keep using it. Every protocol fix so far came from a real failure, not a hypothetical one. The system gets better by being used, breaking, and being repaired — not by being redesigned in advance.