A better agentic file format.
AEP — Agent Evidence Packet — a structured, hash-verified, byte-roundtrip-preserving companion file format for AI agent workflows. The substrate replaces "throw it in a Markdown file" with a queryable, falsifiable, integrity-checked layer that survives across sessions and across LLMs.
§ I — The Problem
Why "just put it in a Markdown file" breaks at scale
Modern AI agents communicate, learn, and self-improve through files on disk: prompts in .md, system documents in .html, automation in .ps1 or .sh. This works fine at single-author scale. It collapses the moment you try to compound learning across agents, sessions, or organizations.
Five specific failure modes show up by the time a corpus reaches a few hundred files:
- No integrity surface. A Markdown file's content is the content. You cannot prove it hasn't been tampered with, drifted from its canonical source, or silently rewritten by a downstream agent.
- No structured query layer. "Find every claim tagged as experimental across these 200 lessons" requires either regex archaeology or pulling everything into a database that immediately falls out of sync with the source files.
- No hash-chained provenance. When agent A's output becomes agent B's input becomes agent C's verdict, there is no cryptographic chain proving any step happened, let alone in the claimed order.
- No claim-level confidence. Every sentence in a Markdown document carries the same epistemic weight to a downstream LLM: zero. There's no signal distinguishing "I verified this against a test" from "I'm guessing."
- No combine-decompose discipline. If you want to cluster 50 related lesson files into a single umbrella document, you can — but you cannot reliably round-trip back to the originals byte-identically. The compression is lossy.
The Markdown file is a great place to write a thought. It is a terrible place to compound a thousand thoughts across a hundred agents over a year. — operator framing, captured during the cascade
§ II — The Format
What AEP actually is
An AEP — Agent Evidence Packet — is a directory next to your canonical source file. If you have my-doctrine.md, AEP adds my-doctrine.aepkg/ as a sibling. The original file is unchanged. Bit-for-bit identical, on every read.
For a per-file companion, the directory always contains four files:
my-doctrine.md                ← canonical source (untouched)
my-doctrine.aepkg/
├── meta.json                 schema, source SHA-256, file class
├── data/
│   └── claims.jsonl          one queryable claim per line
├── views/
│   └── source.md             byte-identical projection of canonical
└── integrity.json            hash committing to the full tree
The meta.json declares schema version and file class. The claims.jsonl is the queryable substrate — each line is a structured claim with optional truth-tag, evidence pointer, and cluster tag. The views/source.md is a byte-identical projection used to verify the canonical file hasn't drifted. The integrity.json hash-commits to the full directory tree.
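The drift check that views/source.md enables can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the helper names `sha256_file` and `has_drifted` are hypothetical, and it assumes a Markdown source whose projection lives at views/source.md as in the tree above.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Return the hex SHA-256 of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def has_drifted(canonical: Path) -> bool:
    """True if the canonical file no longer matches its packet's projection.

    Assumes the sibling layout shown above:
    my-doctrine.md  ->  my-doctrine.aepkg/views/source.md
    """
    projection = canonical.with_suffix(".aepkg") / "views" / "source.md"
    return sha256_file(canonical) != sha256_file(projection)
```

Calling `has_drifted(Path("my-doctrine.md"))` returns False while the two files are byte-identical and True after any edit to either one, including whitespace or line-ending changes.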
Four file-class handlers
- Per-file companion: for .md/.html/.py/.js/.json/.yaml. Full 4-file companion per source.
- Hash-attest: for binaries (.png/.pdf/.zip/.ttf). No body-copy; just views/source_hash.txt committing the SHA-256 of the canonical bytes. ~390× space reduction vs body-copy.
- Aggregate companion: one .aepkg/ per parent directory for high-volume telemetry/archive files (.jsonl, .gz). Hot files (runtime-written) are excluded via the aggregate_excludes.json allowlist.
- Cluster combine: N related packets combine into one umbrella .aepkg/ with byte-roundtrip-verified decompose. Empirically tested at N = 3, 5, 10, 20, 100 packets.
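To make the cluster-combine property concrete, here is a small in-memory sketch of a combine step whose decompose refuses anything that is not byte-identical. The umbrella layout (a JSON document with `name`, `sha256`, and `body_b64` fields) is invented for illustration and is not the format written by tools/aep_cluster_combine.py.

```python
import base64
import hashlib
import json


def combine(members: dict[str, bytes]) -> bytes:
    """Pack named byte blobs into one umbrella document (hypothetical layout)."""
    entries = [
        {
            "name": name,
            "sha256": hashlib.sha256(data).hexdigest(),
            "body_b64": base64.b64encode(data).decode("ascii"),
        }
        for name, data in sorted(members.items())
    ]
    return json.dumps({"members": entries}, sort_keys=True).encode("utf-8")


def decompose(umbrella: bytes) -> dict[str, bytes]:
    """Unpack the umbrella and reject any member whose hash does not match."""
    restored = {}
    for entry in json.loads(umbrella)["members"]:
        data = base64.b64decode(entry["body_b64"])
        if hashlib.sha256(data).hexdigest() != entry["sha256"]:
            raise ValueError(f"byte-roundtrip failed for {entry['name']}")
        restored[entry["name"]] = data
    return restored


# Line endings and non-ASCII bytes must survive untouched.
originals = {"a.md": b"alpha\r\n", "b.md": "b\u00e9ta".encode("utf-8")}
restored = decompose(combine(originals))
```

Storing bodies as base64 keeps the bytes opaque, so no text-mode newline or encoding conversion can creep in between combine and decompose.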
§ III — Evidence
What's been empirically demonstrated
AEP v1.5 LTS shipped through a production-hardening cascade. Numbers below are measured on real corpus operations, not synthetic benchmarks. Every claim carries a truth-tag and chains into a hash-verified receipt ledger.
Performance scoreboard
| Gate | Metric | Target | Measured | Status |
|---|---|---|---|---|
| Prompt-injection resistance | 0 / N weakened | ≥ 99% | 0 / 5,000 weakened | PASS |
| Hook bypass (v1.5.1 RC1 patch) | 0 / N bypasses | 0 / 500 | 0 / 500 | PASS |
| Sandbox escape (post-patch) | 0 / N bypasses | 0 / 1,200 | 0 / 1,200 | PASS |
| Read latency p95 (cached) | milliseconds | ≤ 300 ms | 8.3 ms | PASS · 36× under |
| Read latency p95 (cold) | milliseconds | ≤ 1,500 ms | 5.07 ms | PASS · 295× under |
| Viewer first-paint p95 | milliseconds | ≤ 2,000 ms | 80 ms | PASS · 25× under |
| Validator catch rate (mutation suite) | 0–1.0 | ≥ 0.95 | 1.0000 | PASS · 2,700 / 2,700 |
| False-positive on clean fixtures | per 900 | 0 | 0 / 900 | PASS |
| Cross-runtime byte parity | Python + Node + Perl | 10 / 10 | 10 / 10 | PASS |
| Accessibility (WCAG 2.1 AA viewer) | required + bonus | 10 / 10 | 10 / 10 | PASS |
| Token efficiency vs raw .md | reduction % | ≥ 60% | 88.7% | PASS |
| Independent audit fabrication rate | across 8 audits | 0 | 0 / 8 | PASS |
Combine-decompose bijection at scale
The hardest property — losslessly combining N packets into a cluster and decomposing back to byte-identical originals — was verified at five escalating scales:
| Scale | Topology | Byte-roundtrip | Walltime | Memory |
|---|---|---|---|---|
| N = 3 | linear version history | 3 / 3 | — pilot | — pilot |
| N = 5 | sibling derivation chain | 5 / 5 | — pilot | — pilot |
| N = 10 | homogeneous cohort | 10 / 10 | 0.45 s | 588 KB |
| N = 20 | cross-cohort heterogeneous | 20 / 20 | 1.13 s | 727 KB |
| N = 100 | 5-class broad mix | 100 / 100 | 4.18 s | 1.78 MB |
| DAG re-anchor | multi-parent claim graph | 15 / 15 across 5 variants | — synthetic | — synthetic |
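Scaling is roughly linear in N over the measured range: per-packet walltime stays between about 42 ms and 57 ms from N = 10 to N = 100. The 10 → 20 doubling alone grew walltime 2.5×, so the trend is not sublinear at every step, while the 20 → 100 step grew 3.7× for 5× the packets. Linear projection from the N = 100 run gives N = 1,000 ≈ 42 seconds and ~17 MB. Named falsifier: super-linear scaling at N ≥ 2,000 would force a re-design.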
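The projection is plain arithmetic on the measured rows, which the following sketch reproduces. It assumes 1.78 MB is 1,780 KB and simply scales the N = 100 run; the function names are illustrative.

```python
# Measured cluster-combine runs from the table: N -> (walltime in s, memory in KB).
measured = {10: (0.45, 588), 20: (1.13, 727), 100: (4.18, 1780)}


def per_packet_ms(n: int) -> float:
    """Average walltime per packet, in milliseconds, for a measured N."""
    seconds, _ = measured[n]
    return 1000 * seconds / n


def project_linear(n_target: int, n_base: int = 100) -> tuple[float, float]:
    """Scale a measured run linearly; returns (seconds, megabytes)."""
    seconds, kilobytes = measured[n_base]
    scale = n_target / n_base
    return seconds * scale, kilobytes * scale / 1000


for n in measured:
    print(f"N={n}: {per_packet_ms(n):.1f} ms per packet")
print("N=1000 projection:", project_linear(1000))
```

Per-packet cost comes out at roughly 45, 56.5, and 41.8 ms, and the N = 1,000 projection at about 41.8 s. The memory figure from this naive scaling is a little under 18 MB, an upper-end estimate given that measured memory grew more slowly than N.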
§ IV — Comparison
Raw .md / .html / shell scripts vs AEP companion
Raw .md / .html / .ps1 files
Content is bytes; no metadata layer.
"Has this been modified?" — unknowable without git context that often isn't loaded.
"Find all experimental claims" — full-text regex; brittle, expensive, false-positive-prone.
"How confident is this paragraph?" — invisible. Every sentence has the same epistemic weight.
"Did agent A's output reach agent C unmodified?" — unanswerable.
"Combine these 50 related lessons" — manual concatenation. No round-trip back to originals.
"Find what cites this lesson" — grep across the whole tree, every time.
Shell scripts (.ps1 / .sh) introduce a third class — executable text — that mixes content with side-effects. Encoding bugs, injection surfaces, OS-specific failure modes.
AEP companion (.aepkg/)
Canonical file untouched + integrity.json hash-commits to the tree.
"Has this drifted?" — single SHA-256 compare against views/source.md.
"Find all experimental claims" — jq over data/claims.jsonl. O(N) once, indexable.
"How confident is this paragraph?" — truth_tag on every non-trivial claim; 6 canonical tiers.
"Did agent A's output reach agent C unmodified?" — hash-chained receipt ledger. Every step has a sha that points to its predecessor.
"Combine these 50 related lessons" — cluster combine + decompose verified byte-roundtrip at N = 100. Lossless.
"Find what cites this lesson" — jq '.evidence_pointers[]' on the claim graph. Pre-indexed.
Executable surfaces stay in their own protected scope (PreToolUse hooks with airlock); content stays declarative. Side-effects require manual gate.
The asymmetry compounds. With raw files, every new document adds linear discovery cost. With AEP, every new document adds queryable claims to a substrate that gets sharper, not heavier, as it grows.
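The "find all experimental claims" query can also be run without jq. The sketch below walks every .aepkg/ directory under a root and filters data/claims.jsonl on `truth_tag`. The tag value passed in (for example "EXPERIMENTAL") is a placeholder, since the six canonical tier names are not listed here, and the helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_claims(packet_dir: Path) -> Iterator[dict]:
    """Yield one parsed claim per non-empty line of data/claims.jsonl."""
    with (packet_dir / "data" / "claims.jsonl").open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def claims_with_tag(root: Path, tag: str) -> list[dict]:
    """Collect claims carrying a given truth_tag across every packet under root."""
    hits = []
    for packet in sorted(root.rglob("*.aepkg")):
        # Hash-attest packets carry no claims file, so skip them quietly.
        if not (packet / "data" / "claims.jsonl").is_file():
            continue
        hits.extend(c for c in iter_claims(packet) if c.get("truth_tag") == tag)
    return hits
```

A call such as `claims_with_tag(Path("."), "EXPERIMENTAL")` is a single linear pass over the claim files, with no need to parse the prose in the canonical sources.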
§ V — Discipline
The five hooks that make it actually work
The file format is necessary but not sufficient. AEP ships with five PreToolUse hooks that enforce discipline at the point of writing:
Hook 1 — Defender alert stops burn
Any OS-level security event (Defender / AV) interrupts the autonomous loop. Receipt-logged. No silent retries.
Hook 2 — Secret-pattern airlock (K3)
Mass-read operations cannot exfiltrate secret-shaped content via Bash, language-runtime one-liners, path-traversal, benign-wrapper smuggling, or symlink indirection. 0 / 500 bypass rate at production-N.
Hook 3 — Canonical doctrine write protection (LC-05)
Writes to load-bearing canonical doctrine files require an explicit operator approval token. Implements the "single-writer / append-only / reviewer" discipline that closes the LLM-self-modification attack surface.
Hook 4 — Truth-tag required (LC-09)
Substantive artifacts (> 200 LOC or heading-bearing) must declare a truth-tag — or explicitly tag "unknown." Reflexive enforcement: the hook itself self-tags. 18 / 18 tests pass; FP rate < 5%.
Hook 5 — Codex-first burn law (§45)
Non-trivial drafting fires an external model verification call before the canonical write. Burns operator quota deliberately to keep verification cheap and per-task.
Each hook composes additively; an Edit/Write tool call traverses the full chain before the write lands. Chain regression test: 5 / 5 hooks fire correctly on benign + adversarial test inputs.
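The additive composition can be pictured as a list of small checks that each return a block reason or nothing. The sketch below models two of the hooks (doctrine write protection and truth-tag required) in simplified form. Every name in it, including `WriteRequest`, `approval_token`, and the `doctrine/` prefix, is a hypothetical stand-in rather than the shipped hook code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class WriteRequest:
    path: str
    content: str
    approval_token: Optional[str] = None  # hypothetical field


# A hook returns a human-readable block reason, or None to let the write pass.
Hook = Callable[[WriteRequest], Optional[str]]

PROTECTED_PREFIX = "doctrine/"  # hypothetical location of canonical doctrine


def doctrine_write_protection(req: WriteRequest) -> Optional[str]:
    if req.path.startswith(PROTECTED_PREFIX) and not req.approval_token:
        return "canonical doctrine write needs an operator approval token"
    return None


def truth_tag_required(req: WriteRequest) -> Optional[str]:
    lines = req.content.splitlines()
    substantive = len(lines) > 200 or any(line.startswith("#") for line in lines)
    if substantive and "truth_tag" not in req.content:
        return "substantive artifact has no truth-tag"
    return None


HOOKS: list[Hook] = [doctrine_write_protection, truth_tag_required]


def run_chain(req: WriteRequest, hooks: list[Hook]) -> list[str]:
    """Run every hook; the write may land only if the returned list is empty."""
    return [reason for hook in hooks if (reason := hook(req)) is not None]
```

Collecting every reason instead of stopping at the first one is a design choice in this sketch: the operator sees all the blocks for a write in one pass.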
§ VI — Receipts
The hash-chained receipt ledger (HCRL)
Every agent action that produces an artifact emits a receipt row to a per-agent ledger. Each row carries a SHA-256 that hash-commits to its predecessor row's SHA. The whole chain is a DAG — branches allowed for parallel agent invocations, but every row's parent SHA is verifiable.
{
  "ts": "2026-05-18T09:32:14Z",
  "agent": "implementer",
  "action": "shape-migrator-v1.5.3",
  "artifacts": ["path/to/migrated-asset.js"],
  "chain_from_sha": "656300f991786fff…",
  "this_row_sha": "7d5154fa13b74a4c…",
  "truth_tag": "STRONGLY PLAUSIBLE",
  "claim": "803 / 803 packets byte-roundtrip PASS"
}
"Did this run actually happen?" reduces to "is the SHA in the chain?" "Did agent B see the output of agent A unmodified?" reduces to "does B's chain_from_sha match A's this_row_sha?" The receipts survive context wipes, account changes, and surface migrations. The substrate compounds across the discontinuity.
Why this matters for users
Without a hash-chained ledger, "what did the agent actually do last week?" is unanswerable except by trusting the agent's own self-report. With HCRL, every artifact has cryptographic lineage back to genesis. Auditors get mechanical proof of provenance. Multi-agent handoffs become verifiable. Independent re-validation requires zero re-running of expensive workloads — the receipts ARE the proof.
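The chain checks described in this section can be sketched as follows. One assumption matters here: the way `this_row_sha` is derived is not stated above, so the sketch hashes the row's sorted-key JSON with the `this_row_sha` field removed. Treat that scheme, and the helper names, as illustrative only.

```python
import hashlib
import json
from typing import Optional


def row_sha(row: dict) -> str:
    """Hash a receipt row without its own this_row_sha (assumed scheme)."""
    body = {k: v for k, v in row.items() if k != "this_row_sha"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_row(ledger: list[dict], agent: str, claim: str, parent: Optional[str]) -> dict:
    """Append a receipt that commits to its parent row's SHA (None for genesis)."""
    row = {"agent": agent, "claim": claim, "chain_from_sha": parent}
    row["this_row_sha"] = row_sha(row)
    ledger.append(row)
    return row


def verify_ledger(ledger: list[dict]) -> bool:
    """Every row must hash to its own SHA and point at a row that exists."""
    known = {row["this_row_sha"] for row in ledger}
    for row in ledger:
        if row_sha(row) != row["this_row_sha"]:
            return False  # row content was altered after it was written
        parent = row["chain_from_sha"]
        if parent is not None and parent not in known:
            return False  # dangling parent: provenance cannot be traced
    return True
```

Because a row's hash covers its `chain_from_sha`, editing any earlier row changes that row's SHA and breaks every descendant that points at it. Checking that a parent exists, rather than that it sits directly above, is what lets the ledger be a DAG instead of a single line.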
§ VII — Capabilities
What AEP actually enables
- Cross-agent claim recall. Query "every claim any agent ever made about creativity benchmarks across 12 months of sessions" with jq. Returns in milliseconds against a 2,890-file substrate.
- Lossless corpus consolidation. Combine 50 related sibling lessons into one umbrella packet. Decompose back to 50 byte-identical originals on demand. Verified at N = 100.
- Substrate-as-handoff. Push the repo. A different LLM, different account, different surface clones it and inherits the full claim graph + receipt ledger + truth-tag confidence layer. No retraining; no context replay required.
- Independent audit. A second agent — same or different model family — can re-derive any claim's evidence chain from the receipts alone. Across 8 independent audits to date: 0 fabrication detected.
- Mechanical falsifier surface. Each claim's truth-tag carries an explicit falsifier predicate. "Promote this to PROVEN/RELIABLE if the falsifier doesn't fire within 30 days" is a queryable, automatable rule.
- Storage-efficient archival. 215 MB of canonical binary content (PNGs, PDFs, archives) hash-attested in 541 KB of companion metadata. That is a ~390× space reduction at the index layer (attestation, not compression of the binaries) with 0 binary mutation.
- Operator-machine portability. Cross-cutting mirror to operator-owned configuration spaces (e.g., agent installation directories outside the repo). 366 files mirrored to a parallel staging path; 0 canonical mutations.
- Token-efficient agent reads. The companion claims.jsonl is 88.7% smaller than the raw .md equivalent for the same information. Agents query the structured layer; humans read the prose layer.
§ VIII — What Ships
Everything that ships with v1.5 LTS — and why each part matters
AEP isn't just a file format. It's a substrate: spec layers, reference implementations, a runtime constitution, five enforcement hooks, a multi-language doctor, a viewer surface, and a test corpus. Each component closes a specific failure mode that raw .md / .html / shell scripts leave open.
The spec ladder — 6 progressive layers
| Layer | What it is | Why it matters to you |
|---|---|---|
| v0.4 | Schema baseline | The minimum bar: a packet that parses, hashes, and validates. Stop here and you already have integrity. |
| v0.5 | JSONL + canonicalization | NFC-normalized, BOM-rejected, line-stable JSON. Two machines produce identical bytes from the same logical content. |
| v0.6 | JSON-LD bridge + signing | Claims become machine-queryable across systems. Optional Ed25519 attests authorship without trust-the-server. |
| v0.8 | 8 frontier-break primitives (F1-F8) | Reproduction + falsifier sandbox + counterexample replay + cross-runtime preflight. The substrate becomes self-verifying. |
| v1.0.3 | Regexical Memory (AEP-native spaced repetition) | Lessons aren't just stored; they're recalled at the right time with measurable decay. |
| v1.1 / v1.2 | F12-F19 + A1-A8 research grade + immune-system layer | Coverage witness + provenance graph + attack registry + four-stage immune system (prevent · detect · repair · translate). |
v1.5 LTS operational constitution
constitution/aep_constitution_v1_5_lts.json (~12 KB) — the single source of truth for runtime policy. Declares: policy precedence, forbidden actions, secret-airlock rules, 4 trust tiers, safety-floor categories, 4 proof budgets, sandbox requirements, extension ABI rules (kernel-frozen), 30+ performance gates, 7 release-freeze invariants.
Why it matters: the constitution is what makes "v1.5 LTS" a meaningful label rather than a marketing tag. Every claim about the system is testable against this file. If the runtime can't honor the constitution, that's a release-blocking regression, not an unhappy corner case.
5 enforcement hooks (the discipline layer)
| Hook | What it does | Why it matters |
|---|---|---|
| defender_guard | Halts the autonomous loop on OS-level security alerts | The day Defender flags one of your scripts is the day you stop and look — never the day you click "Allow" without reading. |
| aep_pre_tool_guard (K3 airlock) | Blocks mass-read operations that would exfiltrate secret-shaped content | 0/500 bypass attempts at production-N. Secrets stay in the user's home, not in agent context. |
| aep_post_tool_ledger (K6 receipts) | Writes a hash-chained receipt on every tool call | You can prove what happened in any session, weeks later, without re-running anything. |
| aep_prompt_contract | Enforces first-turn agent-evidence-packet contracts (≤101 tokens) | 88.7% token reduction vs raw .md. Agents read the structured layer at a fraction of the cost. |
| aep_stop_doctor | Runs the doctor at session-stop; emits a verdict + lesson-capture trigger | Sessions end with a receipt, not an "I think it worked." 8.3 ms cached / 5.07 ms cold — invisible cost. |
The AEP Doctor — instant verdict in three runtimes
- scripts/aep_doctor_supreme.py: Python reference (7 verdict states: PASS / WARN / FAIL / UNKNOWN / EXPIRED / CONTESTED / QUARANTINED)
- scripts/aep_doctor_node.cjs: Node.js port (independent re-derivation of every hash)
- scripts/aep_doctor_perl.pl: Perl port (third independent runtime for byte-parity quorum)
Why it matters: cross-runtime byte parity — Python + Node + Perl all compute the same SHA-256 on every packet in the conformance corpus — is the strongest portability statement a file format can make. If three languages agree, the canonicalization is real, not an implementation artifact.
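Byte parity depends on every runtime reducing the same logical content to the same bytes before hashing. The sketch below shows the idea with Python's standard library: reject a BOM, NFC-normalize strings, sort keys, and refuse NaN/Inf. It is a simplified stand-in for the AEP canonicalization rules, not a port of them. In particular, Python sorts keys by Unicode code point, which can differ from UTF-16 code-unit order for characters outside the Basic Multilingual Plane.

```python
import hashlib
import json
import unicodedata


def nfc(value):
    """Recursively NFC-normalize every string (keys and values) in a JSON value."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, list):
        return [nfc(v) for v in value]
    if isinstance(value, dict):
        return {nfc(k): nfc(v) for k, v in value.items()}
    return value


def canonical_bytes(raw: bytes) -> bytes:
    """Reduce a JSON document to one stable byte sequence (simplified rules)."""
    if raw.startswith(b"\xef\xbb\xbf"):
        raise ValueError("BOM rejected")
    obj = nfc(json.loads(raw.decode("utf-8")))
    text = json.dumps(
        obj,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
        allow_nan=False,  # NaN / Infinity raise ValueError instead of serializing
    )
    return text.encode("utf-8")


def packet_hash(raw: bytes) -> str:
    """SHA-256 over the canonical bytes, so formatting differences vanish."""
    return hashlib.sha256(canonical_bytes(raw)).hexdigest()
```

Two inputs that differ only in key order, whitespace, or NFC versus NFD spelling of the same character produce the same `packet_hash`, which is the property a second or third runtime has to reproduce independently.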
Universal converters — 11 file classes
- tools/universal_aepify.py (831 LOC): per-file companion converter; auto-detects file class, emits .aepkg/ alongside the canonical.
- tools/universal_aepify_v2.py: adds aggregate-mode (one .aepkg/ per parent directory for high-volume telemetry).
- tools/aep_cluster_combine.py: combines N packets into one umbrella; decomposes back byte-identically. Verified at N = 100.
- tools/aep_shape_migrator.py: schema-shape evolution with backwards-compat preservation.
Why it matters: the converter is the on-ramp. If turning your existing 500-file corpus into AEP packets isn't a single command, the format isn't useful. 100% mass-conversion rate across 1,749 new conversions in the v1.5 LTS hardening cascade.
The Viewer — zero-CDN civilian surface
viewer/index.html — a drag-and-drop browser viewer that renders any AEP packet without external dependencies. Verdict-first design: the user sees PASS/WARN/FAIL before they see the structure. Accessibility: WCAG 2.1 AA (10/10 required + bonus). First-paint p95: 80 ms.
Why it matters: agents read JSONL, humans don't. The viewer is the bridge — drag a .aepkg/ onto it and you see the substrate the way a reviewer does, not the way a parser does.
Independent reference implementations
- src/aep/: ~15,000 LOC Python reference (validate, sign, derive views, canonicalize, JSONL-compact, build index, falsifier sandbox, counterexample replay).
- verifiers/node/verify.cjs: Node.js verifier. Byte-parity proven on the 13-packet conformance corpus.
- verifiers/rust/: Rust verifier scaffolding. Frontier; not yet feature-complete.
Why it matters: a spec without independent implementations is a wish. Two languages computing the same hashes from the same bytes is the spec being true.
Test corpus — 41 vectors + 11 attack fixtures
- test_vectors/v0_5/A.10-numeric-canonicalization/: 41 vectors covering NaN/Inf rejection, integer precision boundaries, normalization edge cases.
- test_vectors/v0_7/A.11-canonical-surface/: duplicate-keys, UTF-16 sort order, NFC/NFD normalization, Unicode lookalikes, BOM rejection, escape canonicalization, JSON5 comment rejection.
- 11 Lane B attack fixtures: context hijack, dual-manifest divergence, reviewer-collapse, supersession self-loop, body/envelope leak, content-hash mismatch, and five more, each rejected with its specific reason code.
Why it matters: these aren't synthetic micro-benchmarks. Each fixture corresponds to a real-world attack that broke an earlier release. Permanent regression coverage means the same attack can't ship again silently.
Compounding-discipline scaffolding
- scripts/v15_lts_25_test_matrix.py: 25-test release-gate matrix; the doctor against itself.
- scripts/build_v15_independent_mutation_suite.py: 30 mutation classes × 10 seeds = 300 mutations × 9 validators = 2,700 evaluations. Final mean catch: 1.0000.
- scripts/v15_validators_common.py: shared validator core that closed the F23 mutation finding (9 validators repaired to 1.0000 catch rate, 0/900 clean-fixture false positives).
- scripts/build_v15_falsifier_dsl.py: falsifier DSL with 8 forbidden tokens (subprocess / socket / os.environ / eval / exec / __import__ / popen / shell=true) blocked at compile.
- scripts/build_v15_lts_extension_abi.py: extension ABI for backwards-compat; 20 synthetic extensions installed and uninstalled with zero core schema changes.
- scripts/build_v15_human_outcome.py: outcome linter that catches "missing safe_next_action" + "jargon in block_reason" before the receipt ships.
Why it matters: if you adopt AEP, you inherit the discipline cascade — a validated mutation suite, a frozen extension ABI, an outcome linter, and a release-gate matrix. Compounding isn't a hope, it's a CI step.
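As one concrete piece of that cascade, the forbidden-token rule in the falsifier DSL can be approximated with a plain substring scan at compile time. This sketch is an assumption about how such a gate could look, not the code in scripts/build_v15_falsifier_dsl.py; `compile_falsifier` is a hypothetical name.

```python
# The eight forbidden tokens listed for the falsifier DSL.
FORBIDDEN_TOKENS = (
    "subprocess", "socket", "os.environ", "eval",
    "exec", "__import__", "popen", "shell=true",
)


def compile_falsifier(source: str) -> str:
    """Reject a falsifier expression that mentions any forbidden token."""
    # Lowercase and drop whitespace so "shell = True" still matches "shell=true".
    squeezed = "".join(source.lower().split())
    hits = [tok for tok in FORBIDDEN_TOKENS if tok in squeezed]
    if hits:
        raise ValueError(f"forbidden token(s): {', '.join(hits)}")
    return source
```

A substring scan is deliberately blunt: it also rejects harmless identifiers such as `retrieval_ok` (which contains "eval"). For a safety gate that errs toward blocking, that false-positive cost may be acceptable; a tokenizer-based check would be the stricter alternative.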
Documentation — the prose layer
- spec/AEP_v0_8_SPEC.md through v1_2_SPEC.md: the canonical specs (4,000+ lines total).
- CHANGELOG.md: every release documents what shipped, what was verified, and what trade-offs were named.
- docs/index.html: this showcase, the public face.
- reports/v15_lts_final_release_report.md: the v1.5 LTS PASS verdict with all 31 gate measurements.
Why it matters: the substrate isn't useful until adopters can read it. The prose layer documents the why; the code is the what; the test corpus is the proof.
§ IX — Try It
Four ways to try AEP
Path A — Read the spec
The full spec lives at spec/ in this repo. Versions:
- AEP_v0_8_SPEC.md: STABLE baseline (8 frontier-break primitives F1-F8)
- AEP_v1_0_3_SPEC.md: Regexical Memory as AEP-native spaced repetition
- AEP_v1_1_SPEC.md: LANDED research-grade primitives (F12-F19 + A1-A8)
- AEP_v1_2_SPEC.md: PROPOSED immune-system layer (prevent · detect · repair · translate)
- constitution/aep_constitution_v1_5_lts.json: v1.5 LTS operational constitution (policy precedence + airlock rules + trust tiers + performance gates)
Path B — Convert your own files
The universal converter is tools/universal_aepify.py (831 LOC Python; 18 / 18 tests pass; 11 file classes covered).
python tools/universal_aepify.py path/to/your/file.md
# produces path/to/your/file.aepkg/ alongside the canonical
# verify
python tools/universal_aepify.py --verify-only path/to/your/file.md
For directory-scope aggregate companions (high-volume .jsonl / .gz):
python tools/universal_aepify_v2.py path/to/dir/*.jsonl \
--aggregate-mode \
--timestamp-stripped
For lossless cluster combine + decompose (N related packets → one umbrella → byte-identical originals):
python tools/aep_cluster_combine.py path/to/cluster/*.aepkg \
--out path/to/umbrella.aepkg
python tools/aep_cluster_combine.py --decompose path/to/umbrella.aepkg \
--out path/to/restored/
Path C — Run the doctor
The doctor produces an instant verdict on any packet's integrity, byte-roundtrip safety, and conformance level. Cached verdicts return in ~8 ms; cold in ~5 ms.
python scripts/aep_doctor_supreme.py path/to/your-file.aepkg
# cross-runtime byte-parity (Python + Node + Perl):
node scripts/aep_doctor_node.cjs path/to/your-file.aepkg
perl scripts/aep_doctor_perl.pl path/to/your-file.aepkg
Path D — Read the receipt ledger
Every agent action's receipt lives in the per-agent HCRL JSONL. Each row chains to its predecessor via SHA-256. Walk the chain backwards from any row to verify provenance back to genesis.
jq -r '.this_row_sha + " ← " + .chain_from_sha' \
  receipts/agent-name.jsonl | tail -10
§ X — Limits
What AEP is not
Honest framing matters. AEP is a substrate, not a magic spell. These limits are named explicitly so adopters know what's on roadmap and what's structural.
- Not a model. AEP doesn't make a 7B model think like a 1T model. It makes whatever model you have produce verifiable, queryable, compounding output instead of one-shot prose.
- Not a database. The claims.jsonl layer is queryable but file-native; you'll out-scale jq somewhere between 10K and 1M packets. FRONTIER: MCP-server projection is staged.
- Not free-lunch idempotency. The default state_hash embeds timestamps; identical re-conversions produce identical content but a different state_hash. The --timestamp-stripped flag closes this for deterministic-build use cases.
- Not a substitute for tests. Truth-tags are claims about confidence; they don't run your code. Tests still need to exist. AEP is the layer that records that they ran and what they returned.
- Not yet at 1,000+ packet combine scale. N = 100 cluster combine verified; N = 1,000 is linear-projection. Falsifier named: super-linear scaling at N ≥ 2,000 forces re-design.
- Not an external-validator substitute. Self-audit (the substrate's own agents auditing the substrate's own output) is circular at the limit. External independent validators (different model family, different operator) remain required for full PROVEN/RELIABLE promotion.
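The idempotency limit above is easy to see in miniature. In the sketch below, two conversions of the same source differ only in a timestamp, so their default hashes differ, while the timestamp-stripped hashes agree. The field names `source_sha256` and `converted_at` are invented for the example and are not the real state layout.

```python
import hashlib
import json


def state_hash(state: dict, timestamp_stripped: bool = False) -> str:
    """Hash packet state; optionally drop the timestamp first (hypothetical fields)."""
    if timestamp_stripped:
        state = {k: v for k, v in state.items() if k != "converted_at"}
    blob = json.dumps(state, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# Same source converted twice, a few minutes apart.
run_1 = {"source_sha256": "ab12", "converted_at": "2026-05-18T09:32:14Z"}
run_2 = {"source_sha256": "ab12", "converted_at": "2026-05-18T09:40:02Z"}

same_by_default = state_hash(run_1) == state_hash(run_2)
same_when_stripped = state_hash(run_1, True) == state_hash(run_2, True)
```

Here `same_by_default` is False and `same_when_stripped` is True, which is the behavior a deterministic build needs from the stripped mode.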
§ XI — Ladder
Where you are on the agentic-file-system ladder
Most teams sit on rung 0 or 1 and don't realize there's a ladder. The compounding starts at rung 3.
- Rung 0 — Prompts in chat. Nothing persists. Every session restarts from zero.
- Rung 1 — Prompts in .md files. Saved on disk; loaded into context. No structure, no integrity, no query.
- Rung 2 — Prompts in repo with light convention. Folder hierarchy, naming conventions. Grep-able but unverifiable.
- Rung 3 — Structured claim layer (AEP basic). Per-file companions with claim graph. Queryable, hash-verified. The substrate begins to compound.
- Rung 4 — Receipt ledger + truth-tag canon. Hash-chained provenance + claim-level confidence. Independent audit becomes mechanical.
- Rung 5 — Combine-decompose discipline (current production state). Lossless corpus consolidation. Cross-agent recall. VERIFIED at N = 100, projecting linear to N = 1,000.
- Rung 6 — Substrate-as-API. The AEP layer exposed as an MCP server queryable from any compliant agent. FRONTIER — projected 60-90 days.
§ XII — Stakes
Why this matters beyond one repo
Every team building with LLM agents is building, implicitly, an agentic file system. Most are doing it accidentally — Markdown files thrown into folders, prompts kept in Slack, lessons learned that evaporate when the laptop reboots. The compounding never starts.
AEP names the format and ships the discipline. Adopting it means: your team's output gets sharper over time even when the underlying models don't change. Your audits become mechanical instead of social. Your handoffs between sessions, accounts, and surfaces survive context wipes. The substrate accretes value the way good code accretes value: not by being clever, but by being structured and verifiable.
The model providers will keep making models smarter. The teams that win will be the ones whose substrate compounds the smartness across every session.
Capability is what the model gives you. Compounding is what you build on top of it. AEP is the file format for compounding. — captured during the v1.5 LTS hardening cascade
Markdown is a great place to write a thought.
AEP is the format for thousands of thoughts
across hundreds of agents, over years,
surviving every context wipe and every account change.
— aep · agent evidence packet · open standard · 2026 —