AEP · Agent Evidence Packet open standard · 2026

A better agentic file format.

AEP — Agent Evidence Packet — a structured, hash-verified, byte-roundtrip-preserving companion file format for AI agent workflows. The substrate replaces "throw it in a Markdown file" with a queryable, falsifiable, integrity-checked layer that survives across sessions and across LLMs.

License: Apache-2.0 (spec + reference impl) · CC-BY-4.0 (docs)
Status: v1.5 LTS production-hardened · v1.5.2-RC2 hardening · v1.2 immune-system spec staged
Latency: Doctor cached p95 8.3 ms · Doctor cold p95 5.07 ms · Viewer first-paint p95 80 ms
Hardening evidence: 0/5,000 prompt-injection · 0/500 hook bypass · 0/1,200 sandbox escape · 1.0000 mutation catch (2,700/2,700) · 0/8 fabrication across independent audits

§ I — The Problem

Why "just put it in a Markdown file" breaks at scale

Modern AI agents communicate, learn, and self-improve through files on disk: prompts in .md, system documents in .html, automation in .ps1 or .sh. This works fine at single-author scale. It collapses the moment you try to compound learning across agents, sessions, or organizations.

Five specific failure modes show up by the time a corpus reaches a few hundred files:

  1. No integrity surface. A Markdown file's content is the content. You cannot prove it hasn't been tampered with, drifted from its canonical source, or silently rewritten by a downstream agent.
  2. No structured query layer. "Find every claim tagged as experimental across these 200 lessons" requires either regex archaeology or pulling everything into a database that immediately falls out of sync with the source files.
  3. No hash-chained provenance. When agent A's output becomes agent B's input becomes agent C's verdict, there is no cryptographic chain proving any step happened, let alone in the claimed order.
  4. No claim-level confidence. Every sentence in a Markdown document carries the same epistemic weight to a downstream LLM: zero. There's no signal distinguishing "I verified this against a test" from "I'm guessing."
  5. No combine-decompose discipline. If you want to cluster 50 related lesson files into a single umbrella document, you can, but you cannot reliably round-trip back to the originals byte-identically. The compression is lossy.

"The Markdown file is a great place to write a thought. It is a terrible place to compound a thousand thoughts across a hundred agents over a year." (operator framing, captured during the cascade)

§ II — The Format

What AEP actually is

An AEP — Agent Evidence Packet — is a directory next to your canonical source file. If you have my-doctrine.md, AEP adds my-doctrine.aepkg/ as a sibling. The original file is unchanged. Bit-for-bit identical, on every read.

The companion directory always contains four files:

my-doctrine.md                       ← canonical source (untouched)
my-doctrine.aepkg/
  ├── meta.json                       schema, source SHA-256, file class
  ├── data/
  │    └── claims.jsonl               one queryable claim per line
  ├── views/
  │    └── source.md                  byte-identical projection of canonical
  └── integrity.json                  hash committing to the full tree

The meta.json declares schema version and file class. The claims.jsonl is the queryable substrate — each line is a structured claim with optional truth-tag, evidence pointer, and cluster tag. The views/source.md is a byte-identical projection used to verify the canonical file hasn't drifted. The integrity.json hash-commits to the full directory tree.
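The layout above can be sketched in a few lines of Python. This is a minimal illustration, not the reference converter: the meta.json field names (`schema`, `source_sha256`, `file_class`), the `tree_sha256` key, and the way the tree hash is folded together are all assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def tree_hash(root: Path, skip: str = "integrity.json") -> str:
    """Fold (relative path, file hash) pairs into one digest, in sorted path order.

    The integrity file itself is skipped so the digest can be stored inside it.
    This folding scheme is hypothetical, chosen only to show the idea.
    """
    files = {p.relative_to(root).as_posix(): p for p in root.rglob("*") if p.is_file()}
    h = hashlib.sha256()
    for rel in sorted(files):
        if rel == skip:
            continue
        h.update(rel.encode("utf-8") + b"\0")
        h.update(sha256_hex(files[rel].read_bytes()).encode("ascii") + b"\n")
    return h.hexdigest()


def build_packet(source: Path) -> Path:
    """Create <name>.aepkg/ next to the source; the source is only read, never written."""
    raw = source.read_bytes()
    pkg = source.with_suffix(".aepkg")
    (pkg / "data").mkdir(parents=True, exist_ok=True)
    (pkg / "views").mkdir(exist_ok=True)
    meta = {"schema": "hypothetical-0", "source_sha256": sha256_hex(raw), "file_class": "markdown"}
    (pkg / "meta.json").write_text(json.dumps(meta, sort_keys=True), encoding="utf-8")
    (pkg / "data" / "claims.jsonl").write_bytes(b"")      # no claims extracted in this sketch
    (pkg / "views" / "source.md").write_bytes(raw)        # byte-identical projection
    integrity = {"tree_sha256": tree_hash(pkg)}
    (pkg / "integrity.json").write_text(json.dumps(integrity), encoding="utf-8")
    return pkg


def has_drifted(source: Path, pkg: Path) -> bool:
    """Drift check: compare the canonical file's hash against the stored projection."""
    projected = (pkg / "views" / "source.md").read_bytes()
    return sha256_hex(source.read_bytes()) != sha256_hex(projected)
```

Calling `build_packet(Path("my-doctrine.md"))` would produce `my-doctrine.aepkg/` as a sibling, and `has_drifted` then answers the drift question with a single hash comparison.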

Four file-class handlers

§ III — Evidence

What's been empirically demonstrated

AEP v1.5 LTS shipped through a production-hardening cascade. Numbers below are measured on real corpus operations, not synthetic benchmarks. Every claim carries a truth-tag and chains into a hash-verified receipt ledger.

14 cohorts at 100%
1,749 new conversions
~2,890 effective coverage
100% mass-conversion rate
15 durable lessons
0 fabrication detected
0 canonical mutations
0 OS-level incidents

Performance scoreboard

| Gate | Metric | Target | Measured | Status |
|---|---|---|---|---|
| Prompt-injection resistance | 0 / N weakened | ≥ 99% | 0 / 5,000 weakened | PASS |
| Hook bypass (v1.5.1 RC1 patch) | 0 / N bypasses | 0 / 500 | 0 / 500 | PASS |
| Sandbox escape (post-patch) | 0 / N bypasses | 0 / 1,200 | 0 / 1,200 | PASS |
| Read latency p95 (cached) | milliseconds | ≤ 300 ms | 8.3 ms | PASS · 36× under |
| Read latency p95 (cold) | milliseconds | ≤ 1,500 ms | 5.07 ms | PASS · 295× under |
| Viewer first-paint p95 | milliseconds | ≤ 2,000 ms | 80 ms | PASS · 25× under |
| Validator catch rate (mutation suite) | 0–1.0 | ≥ 0.95 | 1.0000 | PASS · 2,700 / 2,700 |
| False-positive on clean fixtures | per 900 | 0 | 0 / 900 | PASS |
| Cross-runtime byte parity | Python + Node + Perl | 10 / 10 | 10 / 10 | PASS |
| Accessibility (WCAG 2.1 AA viewer) | required + bonus | 10 / 10 | 10 / 10 | PASS |
| Token efficiency vs raw .md | reduction % | ≥ 60% | 88.7% | PASS |
| Independent audit fabrication rate | across 8 audits | 0 | 0 / 8 | PASS |

Combine-decompose bijection at scale

The hardest property — losslessly combining N packets into a cluster and decomposing back to byte-identical originals — was verified at five escalating scales:

| Scale | Topology | Byte-roundtrip | Walltime | Memory |
|---|---|---|---|---|
| N = 3 | linear version history | 3 / 3 | n/a (pilot) | n/a (pilot) |
| N = 5 | sibling derivation chain | 5 / 5 | n/a (pilot) | n/a (pilot) |
| N = 10 | homogeneous cohort | 10 / 10 | 0.45 s | 588 KB |
| N = 20 | cross-cohort heterogeneous | 20 / 20 | 1.13 s | 727 KB |
| N = 100 | 5-class broad mix | 100 / 100 | 4.18 s | 1.78 MB |
| DAG re-anchor | multi-parent claim graph | 15 / 15 across 5 variants | n/a (synthetic) | n/a (synthetic) |

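Walltime grows about 2.5× from N = 10 to N = 20, then about 3.7× from N = 20 to N = 100 (a 5× increase in N), so scaling is sublinear over the largest measured range but not at every doubling. Extrapolating linearly from the N = 100 row gives roughly 42 seconds and 17 to 18 MB at N = 1,000. The projection carries a named falsifier: super-linear growth at N ≥ 2,000 would force a re-design.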
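The growth ratios and the N = 1,000 projection can be checked directly from the three measured rows in the table. A small sketch follows; it assumes 588 KB and 727 KB convert to MB at 1,000 KB per MB, and that the projection is a plain proportional scaling of the N = 100 row.

```python
# (walltime in seconds, memory in MB) copied from the measured rows above.
MEASURED = {10: (0.45, 0.588), 20: (1.13, 0.727), 100: (4.18, 1.78)}


def growth(n_small: int, n_big: int) -> tuple:
    """Return (ratio of N, ratio of walltime) between two measured scales."""
    return n_big / n_small, MEASURED[n_big][0] / MEASURED[n_small][0]


def project_linear(n_target: int, n_base: int = 100) -> tuple:
    """Scale the base row proportionally to n_target (linear extrapolation)."""
    factor = n_target / n_base
    walltime, memory = MEASURED[n_base]
    return walltime * factor, memory * factor


if __name__ == "__main__":
    for small, big in ((10, 20), (20, 100)):
        n_ratio, t_ratio = growth(small, big)
        print(f"N x{n_ratio:.0f} -> walltime x{t_ratio:.2f}")
    seconds, megabytes = project_linear(1000)
    print(f"N = 1,000: about {seconds:.1f} s and {megabytes:.1f} MB")
```

A walltime ratio below the N ratio indicates sublinear growth over that interval; a ratio above it indicates the opposite.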

§ IV — Comparison

Raw .md / .html / shell scripts vs AEP companion

Raw .md / .html / .ps1 files

Content is bytes; no metadata layer.

"Has this been modified?" — unknowable without git context that often isn't loaded.

"Find all experimental claims" — full-text regex; brittle, expensive, false-positive-prone.

"How confident is this paragraph?" — invisible. Every sentence has the same epistemic weight.

"Did agent A's output reach agent C unmodified?" — unanswerable.

"Combine these 50 related lessons" — manual concatenation. No round-trip back to originals.

"Find what cites this lesson" — grep across the whole tree, every time.

Shell scripts (.ps1 / .sh) introduce a third class — executable text — that mixes content with side-effects. Encoding bugs, injection surfaces, OS-specific failure modes.

AEP companion (.aepkg/)

Canonical file untouched + integrity.json hash-commits to the tree.

"Has this drifted?" — single SHA-256 compare against views/source.md.

"Find all experimental claims" — jq over data/claims.jsonl. O(N) once, indexable.

"How confident is this paragraph?" — truth_tag on every non-trivial claim; 6 canonical tiers.

"Did agent A's output reach agent C unmodified?" — hash-chained receipt ledger. Every step has a sha that points to its predecessor.

"Combine these 50 related lessons" — cluster combine + decompose verified byte-roundtrip at N = 100. Lossless.

"Find what cites this lesson" — jq '.evidence_pointers[]' on the claim graph. Pre-indexed.

Executable surfaces stay in their own protected scope (PreToolUse hooks with airlock); content stays declarative. Side-effects require manual gate.

The asymmetry compounds. With raw files, every new document adds linear discovery cost. With AEP, every new document adds queryable claims to a substrate that gets sharper, not heavier, as it grows.
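For readers who prefer Python to jq, the two queries in the comparison (filter by truth tag, find what cites a claim) might look like the sketch below. The `truth_tag` and `evidence_pointers` field names appear in this document; the `id` field and the tag value `EXPERIMENTAL` are assumptions made for illustration.

```python
import json
from collections import defaultdict
from typing import Dict, List


def load_claims(jsonl_text: str) -> List[dict]:
    """Parse one claim per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def claims_with_tag(claims: List[dict], tag: str) -> List[dict]:
    """Single O(N) pass over the claim list."""
    return [c for c in claims if c.get("truth_tag") == tag]


def cited_by(claims: List[dict]) -> Dict[str, List[str]]:
    """Build a reverse index: evidence target -> ids of the claims that point at it."""
    index = defaultdict(list)
    for claim in claims:
        for target in claim.get("evidence_pointers", []):
            index[target].append(claim["id"])
    return dict(index)


SAMPLE = """\
{"id": "c1", "truth_tag": "EXPERIMENTAL", "evidence_pointers": ["lesson-7"]}
{"id": "c2", "truth_tag": "STRONGLY PLAUSIBLE", "evidence_pointers": ["lesson-7", "c1"]}
{"id": "c3", "truth_tag": "EXPERIMENTAL"}
"""

if __name__ == "__main__":
    claims = load_claims(SAMPLE)
    print([c["id"] for c in claims_with_tag(claims, "EXPERIMENTAL")])
    print(cited_by(claims))
```

Building the reverse index once and reusing it is what turns "grep the whole tree, every time" into a dictionary lookup.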

§ V — Discipline

The five hooks that make it actually work

The file format is necessary but not sufficient. AEP ships with five PreToolUse hooks that enforce discipline at the point of writing:

Hook 1 — Defender alert stops burn

Any OS-level security event (Defender / AV) interrupts the autonomous loop. Receipt-logged. No silent retries.

Hook 2 — Secret-pattern airlock (K3)

Mass-read operations cannot exfiltrate secret-shaped content via Bash, language-runtime one-liners, path-traversal, benign-wrapper smuggling, or symlink indirection. 0 / 500 bypass rate at production-N.

Hook 3 — Canonical doctrine write protection (LC-05)

Writes to load-bearing canonical doctrine files require an explicit operator approval token. Implements the "single-writer / append-only / reviewer" discipline that closes the LLM-self-modification attack surface.

Hook 4 — Truth-tag required (LC-09)

Substantive artifacts (> 200 LOC or heading-bearing) must declare a truth-tag — or explicitly tag "unknown." Reflexive enforcement: the hook itself self-tags. 18 / 18 tests pass; FP rate < 5%.

Hook 5 — Codex-first burn law (§45)

Non-trivial drafting fires an external model verification call before the canonical write. Burns operator quota deliberately to keep verification cheap and per-task.

Each hook composes additively; an Edit/Write tool call traverses the full chain before the write lands. Chain regression test: 5 / 5 hooks fire correctly on benign + adversarial test inputs.
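The additive composition described above can be pictured as a list of independent checks that a write request must pass in full. The sketch below is not the shipped hook code: `WriteRequest`, the hook function names, and the secret-shaped regex are hypothetical, and only two of the five hooks are modelled.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class WriteRequest:
    path: str
    content: str
    truth_tag: Optional[str] = None


# A hook returns a denial reason, or None to let the request continue.
Hook = Callable[[WriteRequest], Optional[str]]

# Hypothetical "secret-shaped" patterns: an AWS-style key id or a PEM private-key header.
SECRET_SHAPED = re.compile(r"AKIA[0-9A-Z]{16}|-----BEGIN [A-Z ]*PRIVATE KEY-----")


def secret_airlock(req: WriteRequest) -> Optional[str]:
    return "secret-shaped content" if SECRET_SHAPED.search(req.content) else None


def truth_tag_required(req: WriteRequest) -> Optional[str]:
    lines = req.content.splitlines()
    substantive = len(lines) > 200 or any(l.lstrip().startswith("#") for l in lines)
    if substantive and not req.truth_tag:
        return "missing truth-tag"
    return None


def run_chain(req: WriteRequest, hooks: List[Hook]) -> List[str]:
    """Traverse every hook; the write may land only if the returned list is empty."""
    denials = []
    for hook in hooks:
        reason = hook(req)
        if reason is not None:
            denials.append(reason)
    return denials
```

Running every hook rather than stopping at the first denial is a design choice in this sketch: it lets a receipt record all the reasons a write was refused.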

§ VI — Receipts

The hash-chained receipt ledger (HCRL)

Every agent action that produces an artifact emits a receipt row to a per-agent ledger. Each row carries a SHA-256 that hash-commits to its predecessor row's SHA. The whole chain is a DAG — branches allowed for parallel agent invocations, but every row's parent SHA is verifiable.

{
  "ts":               "2026-05-18T09:32:14Z",
  "agent":            "implementer",
  "action":           "shape-migrator-v1.5.3",
  "artifacts":        ["path/to/migrated-asset.js"],
  "chain_from_sha":   "656300f991786fff…",
  "this_row_sha":     "7d5154fa13b74a4c…",
  "truth_tag":        "STRONGLY PLAUSIBLE",
  "claim":            "803 / 803 packets byte-roundtrip PASS"
}

"Did this run actually happen?" reduces to "is the SHA in the chain?" "Did agent B see the output of agent A unmodified?" reduces to "does B's chain_from_sha match A's this_row_sha?" The receipts survive context wipes, account changes, and surface migrations. The substrate compounds across the discontinuity.

Why this matters for users

Without a hash-chained ledger, "what did the agent actually do last week?" is unanswerable except by trusting the agent's own self-report. With HCRL, every artifact has cryptographic lineage back to genesis. Auditors get mechanical proof of provenance. Multi-agent handoffs become verifiable. Independent re-validation requires zero re-running of expensive workloads — the receipts ARE the proof.
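To make the chaining concrete, here is a minimal sketch of appending and verifying receipt rows. It reuses the `chain_from_sha` and `this_row_sha` field names from the example row above; how the real ledger derives `this_row_sha` is not stated in this document, so hashing the sorted-key JSON of the row (minus its own hash) is an assumption.

```python
import hashlib
import json
from typing import List, Optional


def row_sha(row: dict) -> str:
    """Hash the row's content, including its parent pointer, but not its own hash."""
    body = {k: v for k, v in row.items() if k != "this_row_sha"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_receipt(ledger: List[dict], agent: str, action: str,
                   parent_sha: Optional[str] = None) -> dict:
    """Append a row that commits to parent_sha (None marks a genesis row)."""
    row = {"agent": agent, "action": action, "chain_from_sha": parent_sha}
    row["this_row_sha"] = row_sha(row)
    ledger.append(row)
    return row


def verify_ledger(ledger: List[dict]) -> bool:
    """Every row must hash to its recorded SHA and point at an earlier row (or genesis)."""
    known = set()
    for row in ledger:
        if row_sha(row) != row["this_row_sha"]:
            return False
        parent = row["chain_from_sha"]
        if parent is not None and parent not in known:
            return False
        known.add(row["this_row_sha"])
    return True
```

Because a row's hash covers its parent pointer, editing any earlier row changes that row's SHA and breaks every descendant. Tracking parents in a set rather than "the previous row" is what permits the DAG branches for parallel invocations.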

§ VII — Capabilities

What AEP actually enables

§ VIII — What Ships

Everything that ships with v1.5 LTS — and why each part matters

AEP isn't just a file format. It's a substrate: spec layers, reference implementations, a runtime constitution, five enforcement hooks, a multi-language doctor, a viewer surface, and a test corpus. Each component closes a specific failure mode that raw .md / .html / shell scripts leave open.

The spec ladder — 6 progressive layers

| Layer | What it is | Why it matters to you |
|---|---|---|
| v0.4 | Schema baseline | The minimum bar: a packet that parses, hashes, and validates. Stop here and you already have integrity. |
| v0.5 | JSONL + canonicalization | NFC-normalized, BOM-rejected, line-stable JSON. Two machines produce identical bytes from the same logical content. |
| v0.6 | JSON-LD bridge + signing | Claims become machine-queryable across systems. Optional Ed25519 attests authorship without trust-the-server. |
| v0.8 | 8 frontier-break primitives (F1-F8) | Reproduction + falsifier sandbox + counterexample replay + cross-runtime preflight. The substrate becomes self-verifying. |
| v1.0.3 | Regexical Memory (AEP-native spaced repetition) | Lessons aren't just stored; they're recalled at the right time with measurable decay. |
| v1.1 / v1.2 | F12-F19 + A1-A8 research grade + immune-system layer | Coverage witness + provenance graph + attack registry + four-stage immune system (prevent · detect · repair · translate). |
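The v0.5 promise (identical bytes from the same logical content) is easiest to see in code. The sketch below shows one plausible reading of "NFC-normalized, BOM-rejected, line-stable JSON"; the exact rules live in the spec, and the choices here (sorted keys, compact separators, a trailing newline) are assumptions for illustration.

```python
import json
import unicodedata

BOM = "\ufeff"


def _nfc(value):
    """Recursively apply Unicode NFC normalization to every string, including keys."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, list):
        return [_nfc(v) for v in value]
    if isinstance(value, dict):
        return {_nfc(k): _nfc(v) for k, v in value.items()}
    return value


def canonical_line(text: str) -> bytes:
    """Turn one JSON document into a single canonical UTF-8 line."""
    if text.startswith(BOM):
        raise ValueError("BOM rejected")
    obj = _nfc(json.loads(text))
    line = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return (line + "\n").encode("utf-8")
```

Two inputs that differ only in key order, whitespace, or composed versus decomposed accents then serialize to the same bytes, so they hash identically on any machine.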

v1.5 LTS operational constitution

constitution/aep_constitution_v1_5_lts.json (~12 KB) — the single source of truth for runtime policy. Declares: policy precedence, forbidden actions, secret-airlock rules, 4 trust tiers, safety-floor categories, 4 proof budgets, sandbox requirements, extension ABI rules (kernel-frozen), 30+ performance gates, 7 release-freeze invariants.

Why it matters: the constitution is what makes "v1.5 LTS" a meaningful label rather than a marketing tag. Every claim about the system is testable against this file. If the runtime can't honor the constitution, that's a release-blocking regression, not an unhappy corner case.

5 PreToolUse hooks (the discipline layer)

| Hook | What it does | Why it matters |
|---|---|---|
| defender_guard | Halts the autonomous loop on OS-level security alerts | The day Defender flags one of your scripts is the day you stop and look, never the day you click "Allow" without reading. |
| aep_pre_tool_guard (K3 airlock) | Blocks mass-read operations that would exfiltrate secret-shaped content | 0/500 bypass attempts at production-N. Secrets stay in the user's home, not in agent context. |
| aep_post_tool_ledger (K6 receipts) | Writes a hash-chained receipt on every tool call | You can prove what happened in any session, weeks later, without re-running anything. |
| aep_prompt_contract | Enforces first-turn agent-evidence-packet contracts (≤101 tokens) | 88.7% token reduction vs raw .md. Agents read the structured layer at a fraction of the cost. |
| aep_stop_doctor | Runs the doctor at session-stop; emits a verdict + lesson-capture trigger | Sessions end with a receipt, not an "I think it worked." 8.3 ms cached / 5.07 ms cold: invisible cost. |

The AEP Doctor — instant verdict in three runtimes

Why it matters: cross-runtime byte parity — Python + Node + Perl all compute the same SHA-256 on every packet in the conformance corpus — is the strongest portability statement a file format can make. If three languages agree, the canonicalization is real, not an implementation artifact.

Universal converters — 11 file classes

Why it matters: the converter is the on-ramp. If turning your existing 500-file corpus into AEP packets isn't a single command, the format isn't useful. 100% mass-conversion rate across 1,749 new conversions in the v1.5 LTS hardening cascade.

The Viewer — zero-CDN civilian surface

viewer/index.html — a drag-and-drop browser viewer that renders any AEP packet without external dependencies. Verdict-first design: the user sees PASS/WARN/FAIL before they see the structure. Accessibility: WCAG 2.1 AA (10/10 required + bonus). First-paint p95: 80 ms.

Why it matters: agents read JSONL, humans don't. The viewer is the bridge — drag a .aepkg/ onto it and you see the substrate the way a reviewer does, not the way a parser does.

Independent reference implementations

Why it matters: a spec without independent implementations is a wish. Two languages computing the same hashes from the same bytes is the spec being true.

Test corpus — 41 vectors + 11 attack fixtures

Why it matters: these aren't synthetic micro-benchmarks. Each fixture corresponds to a real-world attack that broke an earlier release. Permanent regression coverage means the same attack can't ship again silently.

Compounding-discipline scaffolding

Why it matters: if you adopt AEP, you inherit the discipline cascade — a validated mutation suite, a frozen extension ABI, an outcome linter, and a release-gate matrix. Compounding isn't a hope, it's a CI step.

Documentation — the prose layer

Why it matters: the substrate isn't useful until adopters can read it. The prose layer documents the why; the code is the what; the test corpus is the proof.

§ IX — Try It

Four ways to try AEP

Path A — Read the spec

The full spec lives at spec/ in this repo. Its versions are the six layers of the spec ladder in § VIII, from v0.4 through v1.1 / v1.2.

Path B — Convert your own files

The universal converter is tools/universal_aepify.py (831 LOC Python; 18 / 18 tests pass; 11 file classes covered).

python tools/universal_aepify.py path/to/your/file.md
# produces  path/to/your/file.aepkg/  alongside the canonical
# verify    python tools/universal_aepify.py --verify-only path/to/your/file.md

For directory-scope aggregate companions (high-volume .jsonl / .gz):

python tools/universal_aepify_v2.py path/to/dir/*.jsonl \
    --aggregate-mode \
    --timestamp-stripped

For lossless cluster combine + decompose (N related packets → one umbrella → byte-identical originals):

python tools/aep_cluster_combine.py path/to/cluster/*.aepkg \
    --out path/to/umbrella.aepkg

python tools/aep_cluster_combine.py --decompose path/to/umbrella.aepkg \
    --out path/to/restored/
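The property these two commands rely on is a bijection: whatever goes into the umbrella must come back out byte-for-byte. The sketch below illustrates that property only. It is not the format used by aep_cluster_combine.py; the JSON envelope, base64 payloads, and field names are assumptions.

```python
import base64
import hashlib
import json
from typing import Dict


def combine(members: Dict[str, bytes]) -> bytes:
    """Pack named byte blobs into one umbrella document, recording each blob's SHA-256."""
    entries = [
        {
            "name": name,
            "sha256": hashlib.sha256(members[name]).hexdigest(),
            "b64": base64.b64encode(members[name]).decode("ascii"),
        }
        for name in sorted(members)
    ]
    return json.dumps({"members": entries}, sort_keys=True).encode("utf-8")


def decompose(umbrella: bytes) -> Dict[str, bytes]:
    """Restore every blob and refuse to return anything whose hash does not match."""
    restored = {}
    for entry in json.loads(umbrella.decode("utf-8"))["members"]:
        data = base64.b64decode(entry["b64"])
        if hashlib.sha256(data).hexdigest() != entry["sha256"]:
            raise ValueError(f"hash mismatch for {entry['name']}")
        restored[entry["name"]] = data
    return restored
```

Storing raw bytes (here via base64) rather than decoded text is what keeps line endings, BOMs, and invalid UTF-8 intact through the round trip.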

Path C — Run the doctor

The doctor produces an instant verdict on any packet's integrity, byte-roundtrip safety, and conformance level. Cached verdicts return in ~8 ms; cold in ~5 ms.

python scripts/aep_doctor_supreme.py path/to/your-file.aepkg

# cross-runtime byte-parity (Python + Node + Perl):
node   scripts/aep_doctor_node.cjs path/to/your-file.aepkg
perl   scripts/aep_doctor_perl.pl  path/to/your-file.aepkg

Path D — Read the receipt ledger

Every agent action's receipt lives in the per-agent HCRL JSONL. Each row chains to its predecessor via SHA-256. Walk the chain backwards from any row to verify provenance back to genesis.

jq -r '.this_row_sha + " ← " + .chain_from_sha' \
    receipts/agent-name.jsonl | tail -10
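Walking the chain backwards from a chosen row can be sketched as follows. The example assumes each ledger line is a JSON object with `this_row_sha` and `chain_from_sha`, and that a genesis row carries a null `chain_from_sha`; neither convention is confirmed by this document.

```python
import json
from typing import List


def load_rows(jsonl_text: str) -> List[dict]:
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def walk_to_genesis(rows: List[dict], start_sha: str) -> List[str]:
    """Follow chain_from_sha pointers from start_sha until a row with no parent."""
    by_sha = {row["this_row_sha"]: row for row in rows}
    path, seen = [], set()
    sha = start_sha
    while sha is not None:
        if sha in seen:
            raise ValueError(f"cycle detected at {sha}")
        if sha not in by_sha:
            raise KeyError(f"chain broken: no row with sha {sha}")
        seen.add(sha)
        path.append(sha)
        sha = by_sha[sha].get("chain_from_sha")
    return path


# Placeholder SHAs, shortened for readability; real rows would carry full digests.
SAMPLE = """\
{"this_row_sha": "aaa", "chain_from_sha": null}
{"this_row_sha": "bbb", "chain_from_sha": "aaa"}
{"this_row_sha": "ccc", "chain_from_sha": "bbb"}
"""

if __name__ == "__main__":
    print(" <- ".join(reversed(walk_to_genesis(load_rows(SAMPLE), "ccc"))))
```

A missing parent raises instead of silently stopping, so a gap in the ledger shows up as an error rather than as a shorter, plausible-looking lineage.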

§ X — Limits

What AEP is not

Honest framing matters. AEP is a substrate, not a magic spell. These limits are named explicitly so adopters know what's on roadmap and what's structural.

§ XI — Ladder

Where you are on the agentic-file-system ladder

Most teams sit on rung 0 or 1 and don't realize there's a ladder. The compounding starts at rung 3.

  Rung 0: Prompts in chat. Nothing persists. Every session restarts from zero.
  Rung 1: Prompts in .md files. Saved on disk; loaded into context. No structure, no integrity, no query.
  Rung 2: Prompts in repo with light convention. Folder hierarchy, naming conventions. Grep-able but unverifiable.
  Rung 3: Structured claim layer (AEP basic). Per-file companions with claim graph. Queryable, hash-verified. The substrate begins to compound.
  Rung 4: Receipt ledger + truth-tag canon. Hash-chained provenance + claim-level confidence. Independent audit becomes mechanical.
  Rung 5: Combine-decompose discipline (current production state). Lossless corpus consolidation. Cross-agent recall. VERIFIED at N = 100, projecting linear to N = 1,000.
  Rung 6: Substrate-as-API. The AEP layer exposed as an MCP server queryable from any compliant agent. FRONTIER, projected 60-90 days.

§ XII — Stakes

Why this matters beyond one repo

Every team building with LLM agents is building, implicitly, an agentic file system. Most are doing it accidentally — Markdown files thrown into folders, prompts kept in Slack, lessons learned that evaporate when the laptop reboots. The compounding never starts.

AEP names the format and ships the discipline. Adopting it means: your team's output gets sharper over time even when the underlying models don't change. Your audits become mechanical instead of social. Your handoffs between sessions, accounts, and surfaces survive context wipes. The substrate accretes value the way good code accretes value: not by being clever, but by being structured and verifiable.

The model providers will keep making models smarter. The teams that win will be the ones whose substrate compounds the smartness across every session.

Capability is what the model gives you. Compounding is what you build on top of it. AEP is the file format for compounding. — captured during the v1.5 LTS hardening cascade

Markdown is a great place to write a thought.
AEP is the format for thousands of thoughts
across hundreds of agents, over years,
surviving every context wipe and every account change.
