Ciki Zeng
Case Study · CikiBrain

The AI-agent operating system I architected to run a one-person, four-product company. Verification enforced in code, not prompts.

AI made one person dangerously fast. CikiBrain is the counterweight: a four-layer memory architecture, ~19–20 enforcement hooks across five lifecycle events, and a self-evolving ledger that turns live work into evidence. I designed it, directed AI to build it, and I'm the verification gate.

Designed & architected · Directed the AI build · Operated daily
The hero mechanism — a self-evolving capability ledger

A system that learns its own operator's capability — from live signals, storing no prompt text.

Most of CikiBrain governs work. This loop governs evidence: what happened, what it proves, and whether the human approves moving the baseline.

Capture — silent, PII-safe
A hook logs capability signals as they happen — challenge, refuse-fake-done, demand-proof, architect, catch — as allow-listed keys plus a non-reversible hash. No prompt text is ever stored.
Enrich into evidence
A skill distills the high-signal events into structured capability evidence — raw signal becomes reviewable proof.
Propose a baseline delta
It proposes a concrete update to the operator's demonstrated-capability record.
The operator approves — the human is the decision gate
The baseline evolves
Only approved evidence moves the record — it grows from live signals, never a manual rewrite. The evolved baseline then informs what's worth capturing next, and the loop closes.
Raw → distilled → evolve — the same loop the rest of the system runs on. The system learns its operator's capability from live signals, by consent, storing no prompt text.
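
The capture step can be sketched in a few lines of Python. This is a minimal illustration, not the CikiBrain hook itself: the function name `capture_signal`, the record shape, and the choice of a keyed HMAC-SHA256 as the non-reversible hash are all assumptions. Only the five signal names come from the loop above.

```python
import hashlib
import hmac
from typing import Optional

# Signal names are taken from the case study; everything else is hypothetical.
ALLOWED_SIGNALS = {"challenge", "refuse-fake-done", "demand-proof", "architect", "catch"}


def capture_signal(signal: str, prompt_text: str, key: bytes) -> Optional[dict]:
    """Return a ledger record for an allow-listed signal, or None otherwise.

    The prompt text is only fed into a keyed hash; it never appears in the record.
    """
    if signal not in ALLOWED_SIGNALS:
        return None
    digest = hmac.new(key, prompt_text.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"signal": signal, "hash": digest}


record = capture_signal("demand-proof", "show me the build log first", b"local-secret")
```

A keyed hash is used here rather than a bare SHA-256 so that short, guessable prompts cannot be recovered by hashing candidates without the key. Whether the real ledger does this is not stated in the source.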

There's no app to screen-record here: CikiBrain is the operating system that runs the company, not a product with a UI. Everything shown is the architecture and the mechanisms; every on-screen signal is synthetic or generic. No vault contents, client data, or private data appear anywhere on this page. It's a solo build, dogfooded daily; this shows capability and judgment, not traction.

The problem

The keyboard got faster. The judgment did not.

In an AI-leveraged company, output is cheap: code, plans, fixes, copy. The expensive part is deciding when output is wrong, when “done” is not done, and when not to trust the model.

CikiBrain is built around one idea: systematize that judgment into rules the system runs on its own, so AI scales the work without scaling the recklessness.

Diagram: AI speeds up output (code, copy, plans, fixes) → bottleneck: judgment. What deserves trust? → ship, stop, or verify.
Architecture

Four layers, one rule: the things that matter run in code.

Each layer owns one job. The important shift is layer three: when a recurring failure matters, it stops being advice and starts running as code.

Asset · Obsidian vault
Knowledge, decisions, case studies, sellable assets — the long-term memory.
Runtime · CLAUDE.md
Active behavioral rules and routing the agent must follow this session.
Enforcement · hooks that run in code · Prompts are suggestions, hooks are law
~19–20 mandatory checks across five lifecycle events. Not reminders the model may skip — code that runs whether the AI remembers to or not. This is the layer that makes the difference.
Training · memory buffer
Feedback that graduates into a rule on recurrence — then retires when it's obsolete.
Storyboard · 6 frames

The system, told as six small pictures.

Each frame is one design call: what broke, what moved into the system, and what evidence proves the mechanism exists.

frame 01

The bottleneck moved.

AI made the keyboard faster. It did not make judgment cheaper. Once one person can generate code, plans, copy, and fixes faster than she can review them, the scarce thing becomes the decision: what deserves trust?

The work scaled first. The verification system had to catch up.
  • The system runs a one-person, four-product software company.
  • Its job is not to write more output; its job is to keep output honest.
  • That is why the case study starts with governance, not features.
Diagram: a prompt rule ("Please verify before saying done." Best effort, easy to skip) is replaced by a hook gate that runs whether the model remembers or not. Gate labels: scope · proof · security.
frame 02

A prompt rule failed.

The turning point was small and ugly: a rule already existed, and the AI still skipped it. That is when the methodology stopped being a document and became enforcement.

Prompts are suggestions. Hooks are law.
  • Important rules moved from prose into mandatory checks.
  • ~19–20 hooks run across five lifecycle events.
  • The AI does not get to remember discipline; the system runs it.
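
The difference between a prompt rule and a hook can be sketched as a small registry that runs checks on lifecycle events. This is an illustrative sketch under stated assumptions: the class `HookRegistry`, the event name `before_write`, the `scope_hook` check, and the `src/` scope prefix are all hypothetical. The source does not name its five lifecycle events or describe its hook API.

```python
from typing import Callable, Dict, List, Optional

# A hook inspects an event payload and returns a blocking reason, or None to allow.
Hook = Callable[[dict], Optional[str]]


class HookRegistry:
    """Runs every registered check for an event, whether or not the model remembers."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Hook]] = {}

    def register(self, event: str, hook: Hook) -> None:
        self._hooks.setdefault(event, []).append(hook)

    def run(self, event: str, payload: dict) -> List[str]:
        """Return the blocking reasons for this event; an empty list means allowed."""
        reasons = []
        for hook in self._hooks.get(event, []):
            reason = hook(payload)
            if reason is not None:
                reasons.append(reason)
        return reasons


def scope_hook(payload: dict) -> Optional[str]:
    # Hypothetical scope check: only writes under src/ are in scope.
    path = payload.get("path", "")
    if not path.startswith("src/"):
        return f"write outside agreed scope: {path}"
    return None


registry = HookRegistry()
registry.register("before_write", scope_hook)  # event name is an assumption
blocked = registry.run("before_write", {"path": "infra/deploy.sh"})
```

The point of the design is that the caller, not the model, decides whether the checks run: an action proceeds only when `run` returns an empty list.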
Diagram: 1. catch (cheap observation) → 2. promote (recurs into a rule) → 3. enforce (runs in code) → 4. retire (expires when obsolete)
frame 03

Rules got a lifecycle.

A rule should not live forever just because one bad session hurt. CikiBrain catches a pattern, promotes it when it recurs, enforces it in code, then gives it an exit condition.

No rule without a retire-if.
  • Cheap observations stay cheap until they repeat.
  • Recurring failures graduate into executable guardrails.
  • Obsolete guardrails are designed to retire instead of piling up.
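
The lifecycle above can be sketched as a small state machine. This is a simplified illustration, not the actual rule format: the `Rule` class, the stage names, and the promotion threshold of two occurrences are assumptions, and promotion and enforcement are collapsed into a single step for brevity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    retire_if: Callable[[], bool]  # exit condition, required at construction
    occurrences: int = 0
    stage: str = "observation"     # observation -> enforced -> retired

    def record_occurrence(self, promote_at: int = 2) -> None:
        """Count a recurrence; promote once the (assumed) threshold is reached."""
        self.occurrences += 1
        if self.stage == "observation" and self.occurrences >= promote_at:
            self.stage = "enforced"

    def review(self) -> None:
        """Retire an enforced rule once its exit condition holds."""
        if self.stage == "enforced" and self.retire_if():
            self.stage = "retired"


# Hypothetical example: the rule retires when the tool it guards is removed.
state = {"legacy_tool_removed": False}
rule = Rule("verify-before-done", retire_if=lambda: state["legacy_tool_removed"])

rule.record_occurrence()   # first sighting: stays a cheap observation
rule.record_occurrence()   # recurrence: becomes enforced
```

Making `retire_if` a required field is one way to express "no rule without a retire-if": a rule cannot be constructed without its exit condition.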
Diagram: 1. debug log → 2. cloud sync → 3. AI index. Not one bug, a chain. Source-write quarantine: stop the leak before it becomes a cloud artifact.
frame 04

The security problem was a chain.

The scary leak was not just a committed secret. It was a debug transcript that could sync, version, and become searchable somewhere else. So the defense had to model the whole pipeline.

Stop unsafe source writes before they become durable artifacts.
  • The system treats cloud sync and AI indexing as part of the threat model.
  • Security hooks watch the source-writing path, not only the final repo.
  • The public page shows the mechanism without exposing private contents.
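
A source-write quarantine can be sketched as a scan that runs before anything is written to disk, so a sensitive transcript never reaches the sync or indexing stages. This is a minimal sketch under assumptions: the function names `looks_sensitive` and `guarded_write` and the two patterns are hypothetical, and a real pattern set would need to be far broader and tested.

```python
import re
from pathlib import Path

# Illustrative patterns only; not the patterns CikiBrain actually uses.
SECRET_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"(?i)\b(api[_-]?key|secret|token)\b\s*[:=]\s*\S+"),
]


def looks_sensitive(text: str) -> bool:
    """Return True if any pattern matches the text about to be written."""
    return any(pattern.search(text) for pattern in SECRET_PATTERNS)


def guarded_write(path: Path, text: str) -> bool:
    """Write text only if it passes the scan; return True when written."""
    if looks_sensitive(text):
        return False  # quarantined: nothing reaches disk, so nothing can sync
    path.write_text(text, encoding="utf-8")
    return True
```

Blocking at the write is the design choice the frame describes: once a file exists, sync and versioning can copy it to places a later cleanup will not reach.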
Diagram: claim ("Done.") → gate asks for evidence (build, log, screenshot, commit, or live behavior) → only then: done.
frame 05

It catches me too.

All-green tests can still be false confidence. A long session can drift. A founder can want to call something done too early. So the gates are aimed at the operator as much as the model.

Done means evidence, not confidence.
  • Build output, live behavior, logs, screenshots, or commits become proof.
  • False completion is treated as a system failure, not a personality flaw.
  • The verification gate is part of the architecture, not a closing ritual.
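
The "done means evidence" gate can be sketched as a check that rejects a completion claim unless a recognized piece of evidence is attached. This is an illustration, not the actual gate: the `Claim` class, the `gate_done` function, and the `kind`/`ref` record shape are assumptions. The evidence kinds mirror the list above.

```python
from dataclasses import dataclass, field
from typing import List

# Evidence kinds named in the case study.
EVIDENCE_KINDS = {"build", "log", "screenshot", "commit", "live-behavior"}


@dataclass
class Claim:
    summary: str
    evidence: List[dict] = field(default_factory=list)


def gate_done(claim: Claim) -> bool:
    """Accept 'done' only when at least one recognized, non-empty evidence item exists."""
    return any(
        item.get("kind") in EVIDENCE_KINDS and bool(item.get("ref"))
        for item in claim.evidence
    )


confident = Claim("Fixed the login bug")  # no evidence attached
proven = Claim("Fixed the login bug", [{"kind": "commit", "ref": "abc123"}])
```

The gate does not care who makes the claim, which is how the same check applies to the operator as much as to the model.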
Diagram: live signal → allow-listed key + hash → distill → operator approval. Evidence moves the capability baseline. The system learns from behavior, not from storing prompt text.
frame 06

The ledger closes the loop.

The newest layer observes capability signals from real work, stores allow-listed keys and a non-reversible hash, then proposes evidence for operator approval. No prompt text is stored.

Capability evolves from behavior, by consent.
  • Raw signals become reviewable capability evidence.
  • The human remains the decision gate before the baseline moves.
  • This is the same loop the rest of the operating system runs on.
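
The approval gate on the baseline can be sketched as a pure function that refuses to change anything without explicit consent. This is a sketch under assumptions: the `Delta` class, the `apply_delta` function, and the idea of representing capability as an integer count per key are hypothetical. The source does not describe the baseline's data model.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Delta:
    capability: str      # allow-listed key, e.g. "demand-proof"
    change: int          # proposed adjustment (assumed numeric for illustration)
    evidence_hash: str   # links the proposal to captured evidence, never prompt text


def apply_delta(baseline: Dict[str, int], delta: Delta, approved: bool) -> Dict[str, int]:
    """Return a new baseline; it equals the old one unless the operator approved."""
    updated = dict(baseline)
    if approved:
        updated[delta.capability] = updated.get(delta.capability, 0) + delta.change
    return updated


baseline = {"catch": 3}
proposal = Delta("demand-proof", 1, "9f2c...")  # hash value is a placeholder
rejected = apply_delta(baseline, proposal, approved=False)
accepted = apply_delta(baseline, proposal, approved=True)
```

Returning a new dictionary instead of mutating the old one keeps every approved state reproducible from the sequence of approved deltas.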
Architecture & judgment

The four calls that define the system.

The design is not a pile of rules. Each call turns a recurring failure mode into a visible system behavior.

failure → rule → hook becomes system behavior

Judgment becomes executable

If a mistake repeats, it stops being advice.

Verification, scope, and grounding checks run as mechanical gates.

signal → hash → baseline becomes system behavior

Evidence updates the operator

The system learns from behavior, not from prompt text.

Live signals become reviewable capability evidence by consent.

claim → gate → evidence becomes system behavior

Done requires proof

Confidence is not a closing condition.

Build output, live behavior, logs, screenshots, or commits close the loop.

catch → enforce → retire becomes system behavior

Rules are allowed to die

A system that only adds rules eventually becomes noise.

Every L2+ rule carries a retire-if condition.

Outcomes
Evidence receipt
Products governed: 4 shipped products
Enforcement surface: ~19–20 hooks across five lifecycle events
Method library: 45 real incidents turned into reusable rules
Evidence shape: files and config, not a black-box app

One system governs four shipped products and catches false completion before it becomes a public claim.

The honest evidence is not commit count. It is the system surface: memory architecture, lifecycle hooks, and a methodology library where real incidents become reusable rules.

What I owned

The role was not "AI user." It was system operator.

1. Designed the governance
Four-layer memory, enforcement hooks, capability ledger, and auto-derived registry.

2. Directed the AI build
Set rules, wrote specs, decided what graduated into code and what retired.

3. Owned verification
Validated behavior on the running system, then encoded that judgment so it survives drift.

This is how I work: design the system, encode the judgment, own the verification.

If you're evaluating someone to design or operate AI-augmented systems, this is the clearest example of how I think: the system I run everything else on. There's more in the collection.