SharedGround

Human–Agent Collaboration · Agent Protocol · Shared State

Overview

SharedGround is a shared task environment where a human and an AI agent can observe, decide, and act on the same workspace without waiting for turns. It turns research materials — sources, evidence, claims, and briefs — into structured shared state, then uses an explicit action protocol to handle permissions, conflicts, and stale updates when both sides can act continuously.

The project began with a question that most AI collaboration tools don't ask: what changes when the human keeps modifying the task while the agent is working? Not "how can AI produce a better report," but "how should a product work when both the human and the agent can act on the same evolving task?" V0.1 built a shared workspace and proved it was possible. V0.2 rebuilt the architecture around browser-owned state after discovering that API-owned state could silently overwrite human edits. A 20-step demo with a mock agent validates the full collaboration loop, and 125 deterministic tests verify every permission boundary, conflict path, and degradation mode.


The Problem

Most AI collaboration tools today are fundamentally turn-based: you ask, it answers, the conversation ends, and state resets. Even when an AI reads all your materials, you still have to wait for it to finish, your modifications exist only as subsequent messages, and the AI has no reliable way to know which of its earlier judgments are still valid after you've changed something upstream.

Chat is good for exchanging messages. It is not good for expressing a task environment that both sides can continuously modify. In a chat interface you cannot clearly answer: which claim is the current valid version, which evidence supports which claim, what the human modified and whether the agent respected that modification, which version the brief was generated from, and who had control at what time.

This matters because complex knowledge work (research, strategy, analysis) is not a sequence of questions and answers. It's an evolving structure where the human changes goals, content, and constraints mid-stream. The agent needs to keep working without interruption, without overstepping, and without losing track of what changed. Turn-based tools are not built for this.

Product Judgment

The project is anchored by one conviction that cuts against the grain of most AI product thinking: the goal is not making the agent more autonomous — it's designing how human and agent share state and exchange control.

First, the human always wins. When the human edits something the agent is currently working on, the edit takes effect immediately. The agent's stale writes are rejected at the reducer level: not hidden by the UI, but structurally blocked. This isn't a design preference; it's the product's core contract with the user.

Second, structured actions over natural language. The agent doesn't say "I've updated the claim." It outputs a typed JSON action with a schema, an expected version number, and a reasoning trail. This makes every action validatable, version-checkable, auditable, and — critically — rejectable by the system without parsing natural language.
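
A minimal sketch of what such an action could look like, with a hand-written type guard standing in for whatever schema validation the project actually uses. Only expectedVersion is named in this write-up; the other field names, the action type strings, and the function isProposedAction are assumptions made for illustration.

```typescript
// Hypothetical action shape. Only expectedVersion is named in the write-up;
// the other field names and action types are assumptions for illustration.
type ProposedAction = {
  type: "propose_claim" | "challenge_claim" | "edit_brief";
  targetId: string;
  expectedVersion: number; // object version the agent saw when it decided
  payload: Record<string, unknown>;
  reasoning: string; // short trail explaining why the action was proposed
};

const KNOWN_ACTION_TYPES = ["propose_claim", "challenge_claim", "edit_brief"];

// Type guard: validates untrusted agent output field by field,
// so a bad action can be rejected without reading any prose.
function isProposedAction(value: unknown): value is ProposedAction {
  if (typeof value !== "object" || value === null) return false;
  const { type, targetId, expectedVersion, payload, reasoning } =
    value as Record<string, unknown>;
  return (
    typeof type === "string" &&
    KNOWN_ACTION_TYPES.indexOf(type) !== -1 &&
    typeof targetId === "string" &&
    typeof expectedVersion === "number" &&
    Number.isInteger(expectedVersion) &&
    expectedVersion >= 0 &&
    typeof payload === "object" &&
    payload !== null &&
    typeof reasoning === "string" &&
    reasoning.trim().length > 0
  );
}

const wellFormed: unknown = {
  type: "challenge_claim",
  targetId: "claim-2",
  expectedVersion: 3,
  payload: { note: "A newer source contradicts this claim." },
  reasoning: "Evidence added in the last step conflicts with the claim text.",
};

// Unknown action type and missing fields: rejected before it reaches any state.
const malformed: unknown = { type: "confirm_claim", targetId: "claim-2" };

console.log(isProposedAction(wellFormed), isProposedAction(malformed));
```

The point of the guard is that rejection happens on structure alone: an action with a missing version or an unknown type never needs to be interpreted.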

Third, a mock agent first, then a real model. The demo works without any API key. The mock agent provides a deterministic, stable trajectory that lets you validate the collaboration protocol independently of model reliability. The real agent path is an optional upgrade, not a dependency.

Fourth, evaluation by deterministic rules, not LLM-as-judge. The evaluation framework computes ratios — grounded claim rate, citation integrity, stale write rejection rate, human modification acknowledgement — using rule-based calculations. No model subjectivity. Results are reproducible by anyone who runs the test suite.
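
As a rough illustration of what rule-based means here, two of these ratios can be computed with plain counting. The metric definitions below are assumptions (the write-up names the metrics but does not define them), as are the type and function names.

```typescript
// Assumed shapes for illustration; the real workspace objects may differ.
type MetricClaim = { id: string; evidenceIds: string[] };
type MetricWrite = {
  actor: "human" | "agent";
  stale: boolean; // true if expectedVersion no longer matched the object
  outcome: "applied" | "rejected";
};

// Returns null when there is nothing to measure, rather than inventing a score.
function ratio(numerator: number, denominator: number): number | null {
  return denominator === 0 ? null : numerator / denominator;
}

// Assumed definition: share of claims citing at least one evidence item
// that actually exists in the workspace.
function groundedClaimRate(claims: MetricClaim[], evidenceIds: string[]): number | null {
  const known = new Set(evidenceIds);
  const grounded = claims.filter((c) => c.evidenceIds.some((id) => known.has(id)));
  return ratio(grounded.length, claims.length);
}

// Assumed definition: share of stale agent writes that were rejected.
// A value of 1 means no stale write slipped through.
function staleWriteRejectionRate(writes: MetricWrite[]): number | null {
  const stale = writes.filter((w) => w.actor === "agent" && w.stale);
  const rejected = stale.filter((w) => w.outcome === "rejected");
  return ratio(rejected.length, stale.length);
}

const sampleClaims: MetricClaim[] = [
  { id: "c1", evidenceIds: ["e1"] },
  { id: "c2", evidenceIds: [] },
  { id: "c3", evidenceIds: ["e9"] }, // e9 does not exist
  { id: "c4", evidenceIds: ["e2", "e9"] },
];
const sampleWrites: MetricWrite[] = [
  { actor: "agent", stale: true, outcome: "rejected" },
  { actor: "agent", stale: true, outcome: "rejected" },
  { actor: "agent", stale: false, outcome: "applied" },
  { actor: "human", stale: false, outcome: "applied" },
];

const groundedRate = groundedClaimRate(sampleClaims, ["e1", "e2"]); // 0.5
const staleRejected = staleWriteRejectionRate(sampleWrites); // 1
console.log({ groundedRate, staleRejected });
```

Because every input is a count over logged objects, two runs over the same workspace and event log give the same numbers.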

How It Works

The architecture is built around a single hard boundary: the browser owns the true workspace state. The server never writes directly to it.

A Zustand store holds six structured object types — research task, sources, evidence, claims, brief, and activity events — each with version numbers. When the agent runs, the browser sends a workspace snapshot to the API. The API returns only proposed actions, never a full replacement state. Those proposed actions pass through a pure-function reducer that validates schema, checks permissions, compares expected versions against current versions, and either applies or rejects each action. Rejection always produces a logged event with a reason.
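
The reducer's order of checks can be sketched as a small pure function over claims only. This is an illustration under stated assumptions, not the project's code: the type names, the action names, and every rejection code except STALE_OBJECT_VERSION are invented for the example, and the Zustand store is left out so the logic stands alone.

```typescript
// Illustrative names; only expectedVersion and STALE_OBJECT_VERSION
// appear in the write-up itself.
type Actor = "human" | "agent";
type WsClaim = { id: string; text: string; status: "proposed" | "confirmed"; version: number };
type WsAction =
  | { type: "edit_claim"; targetId: string; text: string; expectedVersion?: number }
  | { type: "confirm_claim"; targetId: string; expectedVersion?: number };
type WsEvent = {
  actor: Actor;
  action: string;
  targetId: string;
  outcome: "applied" | "rejected";
  reason: string | null;
};
type WsState = { claims: Record<string, WsClaim>; events: WsEvent[] };

// Pure function: returns a new state and never mutates its input.
function reduce(state: WsState, actor: Actor, action: WsAction): WsState {
  const log = (outcome: "applied" | "rejected", reason: string | null = null): WsEvent[] => [
    ...state.events,
    { actor, action: action.type, targetId: action.targetId, outcome, reason },
  ];
  // Every rejection leaves the claims untouched and appends a logged reason.
  const reject = (reason: string): WsState => ({ claims: state.claims, events: log("rejected", reason) });

  const claim = state.claims[action.targetId];
  // 1. Schema and reference checks.
  if (claim === undefined) return reject("UNKNOWN_TARGET");
  if (action.type === "edit_claim" && action.text.trim() === "") return reject("INVALID_PAYLOAD");
  // 2. Permission check: the agent may not confirm.
  if (actor === "agent" && action.type === "confirm_claim") return reject("PERMISSION_DENIED");
  // 3. Version check: agent writes must match the current version.
  if (actor === "agent" && action.expectedVersion !== claim.version) return reject("STALE_OBJECT_VERSION");

  const updated: WsClaim =
    action.type === "edit_claim"
      ? { ...claim, text: action.text, version: claim.version + 1 }
      : { ...claim, status: "confirmed", version: claim.version + 1 };
  return { claims: { ...state.claims, [claim.id]: updated }, events: log("applied") };
}

let ws: WsState = {
  claims: { c1: { id: "c1", text: "Draft claim", status: "proposed", version: 1 } },
  events: [],
};
// The agent took its snapshot at version 1; the human then edits the claim.
ws = reduce(ws, "human", { type: "edit_claim", targetId: "c1", text: "Human revision" });
// The agent's write still expects version 1, so it is rejected and logged.
ws = reduce(ws, "agent", { type: "edit_claim", targetId: "c1", text: "Agent rewrite", expectedVersion: 1 });
console.log(ws.claims["c1"].text, ws.events[1].reason);
```

In this sketch human writes skip the version check, which matches the behavior described under Technical Debt.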

The agent operates on a continuous step loop: observe current workspace → understand recent human changes → judge the situation → propose up to three structured actions → reducer applies or rejects → result feeds back to the next step. This is not "user asks, AI answers." It's a persistent Observe-Decide-Act-Feedback cycle where human and agent can interleave actions indefinitely.
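
The loop shape can be sketched independently of any model. Everything named below (runSteps, StepAgent, StepReducer, the toy counter state) is hypothetical; the sketch only assumes what the paragraph above states: the agent sees the current state plus the previous step's results, at most three actions are taken per step, and each result feeds the next step.

```typescript
// Hypothetical names throughout; the write-up specifies the loop, not this API.
type StepAction = { kind: string; expectedVersion: number };
type StepResult = { action: StepAction; applied: boolean; reason: string | null };
type StepAgent<S> = (snapshot: S, feedback: StepResult[]) => StepAction[];
type StepReducer<S> = (state: S, action: StepAction) => { state: S; result: StepResult };

const MAX_ACTIONS_PER_STEP = 3;

function runSteps<S>(
  initial: S,
  agent: StepAgent<S>,
  reducer: StepReducer<S>,
  steps: number
): { state: S; results: StepResult[] } {
  let state = initial;
  let feedback: StepResult[] = [];
  const results: StepResult[] = [];
  for (let i = 0; i < steps; i++) {
    // Observe and decide: the agent sees current state and last step's outcomes.
    const proposed = agent(state, feedback).slice(0, MAX_ACTIONS_PER_STEP);
    feedback = [];
    // Act: each proposal is applied or rejected against the latest state.
    for (const action of proposed) {
      const out = reducer(state, action);
      state = out.state;
      feedback.push(out.result);
      results.push(out.result);
    }
  }
  return { state, results };
}

// Toy stand-ins: a versioned counter, a reducer that rejects stale writes,
// and an agent that over-proposes once and then retries only what failed.
type Counter = { version: number };
const toyReducer: StepReducer<Counter> = (state, action) =>
  action.expectedVersion === state.version
    ? { state: { version: state.version + 1 }, result: { action, applied: true, reason: null } }
    : { state, result: { action, applied: false, reason: "STALE_OBJECT_VERSION" } };

const toyAgent: StepAgent<Counter> = (snapshot, feedback) => {
  const rejected = feedback.filter((r) => !r.applied).length;
  const count = feedback.length === 0 ? 4 : rejected;
  const actions: StepAction[] = [];
  for (let i = 0; i < count; i++) actions.push({ kind: "bump", expectedVersion: snapshot.version });
  return actions;
};

const run = runSteps<Counter>({ version: 0 }, toyAgent, toyReducer, 3);
console.log(run.state.version, run.results.length);
```

In the toy run the agent's fourth proposal is dropped by the cap, and each later step shrinks as rejections are fed back and retried against a fresh version.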

Three control modes govern the relationship. The agent can REQUEST human input when it hits the boundary of its authority, blocking itself until the human responds. The agent can WAIT — voluntarily yielding control, signaling "I've done what I can, you take it from here." And the human can PAUSE at any time, which terminates the current agent run, invalidates stale in-flight responses, and returns full control.
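
One way to picture these modes is as a small transition function. The status names, the RESPOND signal, and the rule that a WAIT issued during an unanswered request is ignored are all assumptions made for this sketch; the project may handle that last case differently.

```typescript
// Assumed status and signal names; REQUEST, WAIT, and PAUSE come from the text.
type ControlStatus = "agent_running" | "blocked_on_request" | "agent_waiting" | "paused";
type ControlSignal =
  | { by: "agent"; signal: "REQUEST" }
  | { by: "agent"; signal: "WAIT" }
  | { by: "human"; signal: "PAUSE" }
  | { by: "human"; signal: "RESPOND" }; // hypothetical: the human answers a request

function nextStatus(current: ControlStatus, s: ControlSignal): ControlStatus {
  // PAUSE is honored from any status: the human can always take control back.
  if (s.signal === "PAUSE") return "paused";
  // Agent signals only count while the agent is actually running.
  if (s.by === "agent") {
    if (current !== "agent_running") return current;
    return s.signal === "REQUEST" ? "blocked_on_request" : "agent_waiting";
  }
  // RESPOND unblocks a pending request; in any other status it changes nothing.
  return current === "blocked_on_request" ? "agent_running" : current;
}

const signals: ControlSignal[] = [
  { by: "agent", signal: "REQUEST" },
  { by: "agent", signal: "WAIT" }, // ignored here: the request is still unanswered
  { by: "human", signal: "RESPOND" },
  { by: "agent", signal: "WAIT" },
  { by: "human", signal: "PAUSE" },
];

let controlStatus: ControlStatus = "agent_running";
const trace: ControlStatus[] = [];
for (const s of signals) {
  controlStatus = nextStatus(controlStatus, s);
  trace.push(controlStatus);
}
console.log(trace.join(" -> "));
```

Keeping the transitions in one pure function makes the asymmetry easy to check: there is no signal the agent can send that overrides a human PAUSE.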

The permission matrix is deliberately asymmetric. The agent can propose claims, challenge claims, and edit briefs — but it cannot confirm, finalize, or mark anything as human-reviewed. Only the human can close a claim or finish the task. Every boundary is enforced at the reducer level, not in a prompt.
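
A lookup table is one straightforward way to encode such a matrix. The operation names below are assumptions, and the human column for the first three rows is assumed to be true; the agent column follows the paragraph above.

```typescript
// Hypothetical operation names; the agent column mirrors the text above.
type CollabRole = "human" | "agent";
type CollabOperation =
  | "propose_claim"
  | "challenge_claim"
  | "edit_brief"
  | "confirm_claim"
  | "mark_human_reviewed"
  | "finish_task";

// One row per operation, one column per role.
const PERMISSION_MATRIX: Record<CollabOperation, Record<CollabRole, boolean>> = {
  propose_claim: { human: true, agent: true },
  challenge_claim: { human: true, agent: true },
  edit_brief: { human: true, agent: true },
  confirm_claim: { human: true, agent: false },
  mark_human_reviewed: { human: true, agent: false },
  finish_task: { human: true, agent: false },
};

// The reducer would call this before applying any action.
function isAllowed(role: CollabRole, operation: CollabOperation): boolean {
  return PERMISSION_MATRIX[operation][role];
}

// Operations reserved for the human, derived from the table itself.
const humanOnly = (Object.keys(PERMISSION_MATRIX) as CollabOperation[]).filter(
  (op) => !isAllowed("agent", op)
);
console.log(humanOnly); // ["confirm_claim", "mark_human_reviewed", "finish_task"]
```

Because the table is data rather than prompt text, a test can enumerate every cell, which is what makes exhaustive permission tests practical.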

What I Tested

The test suite has 125 deterministic tests, all passing. They cover the permission matrix exhaustively: the agent cannot mark something as human_confirmed, the human can, and the tests verify that the reducer enforces both. They also cover the full conflict path: human edits a claim while the agent is running → version increments → agent's stale write carries the old expectedVersion → reducer rejects with STALE_OBJECT_VERSION → event logged. Every degradation mode is tested: the agent overstepping its permissions, WAIT after an unanswered request, evidence with a non-existent source, a brief referencing a non-existent claim, and an incomplete evidence chain.
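
The reference-related degradation modes lend themselves to a deterministic check. The sketch below is an assumption about how such a check could be written; the issue codes, type names, and the definition of an incomplete chain (a claim with no evidence that resolves to an existing source) are not taken from the project.

```typescript
// Assumed shapes and issue codes, for illustration only.
type RefSource = { id: string };
type RefEvidence = { id: string; sourceId: string };
type RefClaim = { id: string; evidenceIds: string[] };
type RefWorkspace = {
  sources: RefSource[];
  evidence: RefEvidence[];
  claims: RefClaim[];
  brief: { claimIds: string[] };
};
type RefIssue = { code: "MISSING_SOURCE" | "INCOMPLETE_CHAIN" | "MISSING_CLAIM"; objectId: string };

function findIntegrityIssues(w: RefWorkspace): RefIssue[] {
  const sourceIds = new Set(w.sources.map((s) => s.id));
  const claimIds = new Set(w.claims.map((c) => c.id));
  // Evidence counts as valid only if its source exists.
  const validEvidenceIds = new Set(
    w.evidence.filter((e) => sourceIds.has(e.sourceId)).map((e) => e.id)
  );
  const issues: RefIssue[] = [];
  for (const e of w.evidence) {
    if (!sourceIds.has(e.sourceId)) issues.push({ code: "MISSING_SOURCE", objectId: e.id });
  }
  for (const c of w.claims) {
    // A claim needs at least one evidence item that traces back to a real source.
    if (!c.evidenceIds.some((id) => validEvidenceIds.has(id))) {
      issues.push({ code: "INCOMPLETE_CHAIN", objectId: c.id });
    }
  }
  for (const id of w.brief.claimIds) {
    if (!claimIds.has(id)) issues.push({ code: "MISSING_CLAIM", objectId: id });
  }
  return issues;
}

const degraded: RefWorkspace = {
  sources: [{ id: "s1" }],
  evidence: [
    { id: "e1", sourceId: "s1" },
    { id: "e2", sourceId: "s404" }, // points at a source that does not exist
  ],
  claims: [
    { id: "c1", evidenceIds: ["e1"] },
    { id: "c2", evidenceIds: ["e2"] }, // only backed by broken evidence
    { id: "c3", evidenceIds: [] },
  ],
  brief: { claimIds: ["c1", "c9"] }, // c9 does not exist
};

const issues = findIntegrityIssues(degraded);
console.log(issues.map((i) => i.code + ":" + i.objectId).join(", "));
```

A test for each degradation mode then reduces to building a small broken workspace and asserting on the exact issue list.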

The demo flow is a complete 20-step path: the mock agent searches for and adds sources, extracts evidence, and proposes claims; the human confirms one claim, revises another, and contests a third; the agent detects the changes and adjusts, requests human input, waits, and edits the brief; and the activity log shows the full process end to end. The default research case is EU industrial policy changes and their impact on Chinese companies' European investments, with eight preset sources covering the Net-Zero Industry Act (NZIA), the Critical Raw Materials Act (CRMA), the Foreign Subsidies Regulation (FSR), the Chips Act, and specific company cases.

The real agent path was tested on 3–5 short Markdown documents. It works, but with the expected caveats: model format drift, latency variance, and context window constraints. That's why the mock agent is the demo default — the collaboration protocol is the product, and the model is a variable, not the foundation.

What Failed and Changed

V0.1 demonstrated the full collaboration loop: shared workspace, structured actions, human edits, WAIT, event logging. But it had a critical flaw in its architecture: the API returned a complete new state that replaced the browser state entirely. If the human edited a claim or brief while the agent was running, the old API response — generated from a snapshot taken before the edit — could overwrite the human's changes. Human edits were vulnerable to race conditions baked into the architecture, not the model.

V0.2 fixed this by inverting the ownership model. The API no longer returns state — it returns only proposed actions. The browser reducer applies them against the current latest state, with version checks at the object level. Three mechanisms were added together: object versioning on every writable entity, expectedVersion on every agent write, and rejection of any write where the version has moved. Additionally, each agent run gets a runId, and the pause operation terminates the current run — so stale responses from an old run are discarded before they ever reach the reducer.
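
The runId mechanism can be sketched as a small gate in front of the reducer. The names createRunGate, startRun, pause, and accept are hypothetical, and the real project presumably does this around an asynchronous API call; the sketch only shows the comparison that lets a late response be dropped.

```typescript
// Hypothetical gate; only the idea of a per-run runId comes from the write-up.
type AgentResponse = { runId: number; actions: string[] };

function createRunGate() {
  let nextId = 1;
  let activeRunId: number | null = null;
  return {
    // Each agent run gets a fresh id.
    startRun(): number {
      const id = nextId++;
      activeRunId = id;
      return id;
    },
    // Pausing terminates the current run; no id is active afterwards.
    pause(): void {
      activeRunId = null;
    },
    // Returns the actions to forward to the reducer, or an empty list when
    // the response belongs to a run that is no longer active.
    accept(response: AgentResponse): string[] {
      return response.runId === activeRunId ? response.actions : [];
    },
  };
}

const gate = createRunGate();
const firstRun = gate.startRun();
// A response for the first run is still in flight when the human pauses.
const lateResponse: AgentResponse = { runId: firstRun, actions: ["edit_brief"] };
gate.pause();
const secondRun = gate.startRun();

const dropped = gate.accept(lateResponse); // [] because run 1 was terminated
const forwarded = gate.accept({ runId: secondRun, actions: ["propose_claim"] });
console.log(dropped.length, forwarded);
```

The gate and the version check are complementary: the gate discards whole responses from a dead run, while expectedVersion catches individual writes from a live run that lost a race with a human edit.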

The shift was architectural, not cosmetic. V0.1 built a shared workspace. V0.2 began to share the work.

Current Limits

Zero real users have tested SharedGround. The 125 tests prove the protocol works; they don't prove anyone wants it. There's no production deployment and no server-side persistence: state lives only in the browser's localStorage, so it is tied to one browser profile and is lost if site data is cleared. Single human, single agent, single browser. No multi-user, no WebSocket, no database, no Redis. The real agent path works on small document sets but isn't production-stable. Context selection is simple: all sources go into context, with no chunking or retrieval for large corpora. The evaluation framework validates process observability, not research output quality.

I list these limits because they're real, and because the collaboration protocol they bracket is specific enough to be worth testing. A system that can prove the human's edits are never silently overwritten is more interesting than a system that claims to do everything.

Technical Debt

The local-first architecture is deliberate for V0.2, but localStorage is not a database: the workspace exists only in one browser profile and is lost if site data is cleared, unless it has been explicitly exported. Server-side persistence is the first infrastructure upgrade if the project moves beyond mechanism validation. Human action version checking isn't implemented: currently only agent writes carry expectedVersion, and human operations pass through the reducer without version validation, which means the system trusts the human's last write unconditionally. The test suite is heavily weighted toward agent actions and reducer behavior; human-action edge cases have thinner coverage. Brief version history has skeleton support through derivation tracking but no full diff or rollback.

These debts are acceptable for V0.2 because the project's job right now is to validate the conflict protocol, not to serve production traffic. They're documented so the next step is clear.

Next Step

The architecture is stable and the protocol is validated in code. Three questions are worth answering next.

Can 5–8 real users — researchers, analysts, or strategists — complete the same task in SharedGround vs. a standard chat interface, and does the structured workspace produce measurably better collaboration? Does the real agent path hold up under systematic reliability testing with better retry, fallback, and error recovery? And when the document set grows from five sources to five hundred, does the context selection layer hold, or does it need to be rebuilt?

The Action Protocol itself is designed to generalize beyond research. The same Source → Evidence → Claim → Brief structure maps naturally to product requirements, data analysis, travel planning, and document collaboration. But the most urgent generalization is from zero users to one — getting the protocol in front of someone other than me.