I Vibe-Coded a Code Competition Lab

I recently rebuilt CogArch.

The current version is not a vague multi-agent ecosystem. It is a local-first code competition lab.

The question behind it is simple: can two models make each other better the way two humans get sharper through competition? So the repo takes real coding problems, runs two independent agents against them, ranks the results by hidden tests, and turns that ranking into training data for the next cycle.

Two agents, four specialists each

Each agent has four specialists: logical, creative, skeptical, and empathetic.

They all solve the same problem independently. Each specialist can make up to 10 attempts. After a failed attempt, it sees partial pass counts, failed assertions, and traceback output, but never the hidden tests themselves.
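A minimal sketch of that attempt loop, to make the feedback boundary concrete. Every name here (Feedback, AttemptResult, run_specialist, and the two callables) is hypothetical and not CogArch's actual API. Stopping early on a full pass and keeping the best attempt so far are also assumptions. Only the cap of 10 attempts comes from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MAX_ATTEMPTS = 10  # the cap described in the post


@dataclass
class Feedback:
    """What a specialist may see after an attempt (never the test source)."""
    passed: int
    total: int
    failures: List[str]  # failed assertions and traceback text

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


@dataclass
class AttemptResult:
    code: str
    pass_rate: float
    attempts: int  # attempt number at which this pass rate was reached


def run_specialist(
    generate: Callable[[Optional[Feedback]], str],
    run_hidden_tests: Callable[[str], Feedback],
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[AttemptResult]:
    """Retry with feedback until the tests pass or attempts run out."""
    best: Optional[AttemptResult] = None
    feedback: Optional[Feedback] = None
    for attempt in range(1, max_attempts + 1):
        code = generate(feedback)          # first call gets no feedback
        feedback = run_hidden_tests(code)  # returns counts and errors only
        if best is None or feedback.pass_rate > best.pass_rate:
            best = AttemptResult(code, feedback.pass_rate, attempt)
        if feedback.passed == feedback.total:
            break  # assumption: stop as soon as everything passes
    return best
```

The generator only ever receives a Feedback object, so the hidden tests stay hidden by construction.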

That detail is the whole point. CogArch is not trying to fake depth with a long conversation. It is trying to create different solution paths and score them with something harder than vibes: pass rate.

Inside each agent, the coordinator is blunt. It picks the specialist that reached the highest pass rate, and if two specialists tie, it prefers the one that got there in fewer attempts.
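That rule fits in one line. A minimal sketch, assuming each specialist reports a pass rate and the number of attempts it took to reach it. SpecialistResult and pick_best are hypothetical names, not the repo's.

```python
from typing import List, NamedTuple


class SpecialistResult(NamedTuple):
    name: str
    pass_rate: float
    attempts: int


def pick_best(results: List[SpecialistResult]) -> SpecialistResult:
    """Highest pass rate wins; ties go to whoever needed fewer attempts."""
    return max(results, key=lambda r: (r.pass_rate, -r.attempts))


results = [
    SpecialistResult("logical", 1.0, 4),
    SpecialistResult("creative", 1.0, 2),
    SpecialistResult("skeptical", 0.5, 1),
    SpecialistResult("empathetic", 0.75, 3),
]
winner = pick_best(results)
print(winner.name)  # creative
```

Negating the attempt count inside the sort key lets a single max call handle both criteria.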

Competition becomes the learning signal

Once one agent can do that, the useful move is to duplicate it.

Agent A and Agent B solve the exact same MBPP problem with no shared state. Then the system compares their best outputs and decides who won.

What I like here is that the repo does not waste failure. If both sides miss the problem, CogArch still ranks them by partial credit, then labels the better attempt as the "chosen" sample and the worse one as the "rejected" sample. Unless both sides produce effectively identical code, the round can still become DPO data.
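A rough sketch of how one round could become a preference pair. The prompt/chosen/rejected layout is the common shape for DPO datasets. Everything else is an assumption: the Candidate type, the whitespace-only notion of "effectively identical", and the choice to skip rounds where both pass rates are equal (the description above does not say how such ties are handled).

```python
from typing import Dict, NamedTuple, Optional


class Candidate(NamedTuple):
    code: str
    pass_rate: float  # fraction of hidden tests passed, i.e. partial credit


def normalize(code: str) -> str:
    """Crude identity check: ignore blank lines and trailing whitespace."""
    lines = (line.rstrip() for line in code.strip().splitlines())
    return "\n".join(line for line in lines if line)


def build_dpo_pair(prompt: str, a: Candidate, b: Candidate) -> Optional[Dict[str, str]]:
    """Return a preference record, or None if the round carries no signal."""
    if normalize(a.code) == normalize(b.code):
        return None  # effectively identical code: nothing to prefer
    if a.pass_rate == b.pass_rate:
        return None  # assumption: equal scores are skipped in this sketch
    chosen, rejected = (a, b) if a.pass_rate > b.pass_rate else (b, a)
    return {"prompt": prompt, "chosen": chosen.code, "rejected": rejected.code}
```

Note that neither candidate has to pass everything. A 0.6 versus 0.2 round still produces a usable pair.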

That feels closer to real engineering work than a clean win-or-nothing setup. A lot of progress comes from near misses.

The benchmark split is disciplined

The part I respect most is that the repo draws a real line between training and evaluation.

MBPP is the training arena. HumanEval is held out. The system never trains on HumanEval. It only measures HumanEval Pass@1, with a single attempt and no feedback, to see whether a cycle actually improved anything.
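With exactly one attempt per problem, Pass@1 reduces to the fraction of problems solved on the first try. A tiny sketch, where pass_at_1 is a hypothetical helper and the input is one boolean per held-out problem:

```python
from typing import Sequence


def pass_at_1(solved_first_try: Sequence[bool]) -> float:
    """Fraction of problems whose single attempt passed every test."""
    if not solved_first_try:
        raise ValueError("need at least one problem to score")
    return sum(solved_first_try) / len(solved_first_try)


print(pass_at_1([True, False, True, True]))  # 0.75
```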

That keeps the project honest. If the score moves, it has to move on problems the system did not learn from directly.

Memory is more than chat history

CogArch also treats memory as more than a long prompt.

It keeps working memory for the current session, episodic memory for past problem attempts, semantic memory for patterns extracted during consolidation, and procedural memory in the fine-tuned weights themselves. Retrieval uses nomic-embed-text through Ollama, so a specialist can pull in similar past episodes before attempt one.
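The retrieval step can be pictured as nearest-neighbour search over stored episode embeddings. The sketch below assumes the vectors have already been computed (in the repo they would come from nomic-embed-text through Ollama) and uses cosine similarity, which is a common choice but an assumption on my part. The function names and the toy two-dimensional vectors are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_episodes(
    query: Sequence[float],
    episodes: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[str]:
    """Return the k episode summaries most similar to the query vector."""
    ranked = sorted(episodes, key=lambda e: cosine(query, e[1]), reverse=True)
    return [summary for summary, _ in ranked[:k]]


# Toy vectors standing in for real embeddings.
episodes = [
    ("reverse a linked list", [1.0, 0.0]),
    ("parse a date string", [0.0, 1.0]),
    ("rotate a linked list", [0.9, 0.1]),
]
print(top_k_episodes([1.0, 0.0], episodes, k=2))
# ['reverse a linked list', 'rotate a linked list']
```

The returned summaries would then be added to the specialist's prompt before its first attempt.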

That is where the wake/sleep framing still earns its keep. During the wake phase, the system solves problems and records outcomes. During sleep, it consolidates episodes into reusable patterns, prunes weak history, builds per-specialist datasets, fine-tunes, and updates the model registry.

I like the rollback more than the fine-tune

The flashy part is QLoRA fine-tuning through unsloth, exporting GGUFs, and registering versioned specialist models in Ollama.

The sober part is better: rollback.

If a cycle regresses past the configured threshold, CogArch can revert the specialist models instead of pretending every learning step was progress. That is the kind of behavior I want from experimental systems. Keep the loop ambitious, but keep the evaluation honest.
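The gate itself can be very small. A minimal sketch, assuming the comparison is between the previous and the new held-out score and that the threshold is an absolute drop. The names should_rollback and select_active_version, and the version strings, are hypothetical.

```python
def should_rollback(baseline: float, candidate: float, threshold: float) -> bool:
    """True when the new score fell below the baseline by more than threshold."""
    return (baseline - candidate) > threshold


def select_active_version(
    current: str,
    candidate_version: str,
    baseline: float,
    candidate: float,
    threshold: float,
) -> str:
    """Keep the current model if the candidate regressed too far."""
    if should_rollback(baseline, candidate, threshold):
        return current
    return candidate_version


print(select_active_version("logical-v3", "logical-v4", 0.40, 0.35, 0.02))
# logical-v3
```

A small tolerance keeps ordinary run-to-run noise from triggering a revert, while a real regression still does.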

Why this version feels real

What makes the current repo interesting to me is how concrete it is.

It runs locally through Ollama.
It uses hidden tests instead of self-scoring.
It turns both wins and near-misses into preference data.
It separates MBPP training from HumanEval evaluation.
It keeps memory and model versions across cycles.

That is a much sharper claim than "many agents talk to each other." It says structure, memory, competition, and disciplined evaluation might help coding systems improve over time.

CogArch is still experimental, still CLI-first, and still rough around the edges. But now it reads less like a metaphor and more like a real research rig.

Written with Vox.