From 4 Rules to 50+ Automated Phases: What Happened When I Took CLAUDE.md Seriously

Four months ago I copied Andrej Karpathy's CLAUDE.md into my repo. 4 rules. 65 lines. It felt like putting a post-it note on a nuclear reactor.

I'm not a developer. I'm a business owner who runs AI implementation projects for other businesses. Claude Code is my daily operating tool. I spend 8-10 hours a day inside it. When Karpathy published his guidelines, I figured: this guy led the AI behind Tesla's autopilot. He probably knows something about directing AI systems that I don't.

He was right. But what surprised me is where the file went next. That 65-line config file is now the core of a system that runs a 12-phase automated pipeline, manages context across 10+ repos, and improves itself after every project. The file didn't just grow. It evolved into something Karpathy's original post never imagined.

TL;DR: Karpathy's CLAUDE.md has 4 rules and 65 lines. I took it through 5 stages: basic rules, organized context, skills and hooks, cross-repo intelligence, and a self-improving pipeline. Each stage closed a failure mode the previous one couldn't. The starter template at the bottom gets you from Stage 1 to Stage 2 in 10 minutes.

This is what that evolution looked like. Five stages. Each one unlocked something the previous stage couldn't do. If you're using Claude Code right now, you're somewhere on this ladder. The question is: how far are you willing to take it?

The failure modes Karpathy named are real. Silent assumptions. Over-engineering. Unauthorized changes. Every stage on this ladder exists because one of those failures burned me first.

Stage 1 · Level 1The Starting Point

Karpathy's CLAUDE.md has 4 rules: Think Before Coding, Simplicity First, Surgical Changes, Goal-Driven Execution. The whole thing fits in 65 lines. It's a masterclass in constraint design. Each rule targets a specific AI failure mode:

Think Before Coding stops Claude from guessing when it should ask.
Simplicity First prevents the 200-line solution to a 50-line problem.
Surgical Changes keeps Claude from "helpfully" refactoring code you didn't ask about.
Goal-Driven Execution forces verifiable outcomes instead of vague claims that something "works."

For a developer working on one project at a time, this is solid. I ran with it for about two weeks. Then the cracks showed up.

The file had no project context. Claude didn't know my stack, my deploy target, my naming conventions. Every session started with me re-explaining the same things. The rules said "match existing style" but Claude had no way to know what the existing style was. I was spending 15 minutes at the start of every session just getting Claude back to baseline.

CLAUDE.md · Stage 1 (Karpathy's original)

## 1. Think Before Coding
## Don't assume. Surface tradeoffs.
# - State assumptions explicitly
# - If multiple interpretations, present them
# - If simpler approach exists, say so

## 2. Simplicity First
## Minimum code that solves the problem.
# - No features beyond what was asked
# - No abstractions for single-use code

## 3. Surgical Changes
## Touch only what you must.
# - Don't "improve" adjacent code
# - Match existing style

## 4. Goal-Driven Execution
## Define success. Loop until verified.
# - Transform tasks into verifiable goals
# - State a brief plan for multi-step tasks

The lesson: rules without context are suggestions. Claude followed the rules. It just didn't know enough about my world to follow them well.

Stage 2 · Level 2Getting Organized

Stage 2 was about giving Claude the context it needed to apply the rules properly. I organized the file into sections: Project Context, Coding Standards, Workflow Rules, Communication Style, Common Pitfalls. Each section answered a question Claude kept asking me.

"What framework are we using?" became a line in Project Context. "How should I name files?" became a rule in Coding Standards. "Should I commit after every change?" became a workflow rule. The 66-line file became 150+ lines. It felt like a lot. But the 15-minute warmup at the start of each session dropped to zero.

Before · Stage 1 After · Stage 2

## 1. Think Before Coding
# State assumptions explicitly
# Surface tradeoffs

## 2. Simplicity First
# No features beyond asked

## 3. Surgical Changes
# Match existing style

## 4. Goal-Driven
# Verify before claiming done

## Project Context
# Name: Brightly CRM
# Stack: Node.js + Railway
# Deploy: railway up

## Coding Standards
# Files: kebab-case
# Functions: camelCase
# Imports: stdlib, packages, local

## Workflow
# Verify before claiming done
# One commit per task
# Never push to main
# Run tests after every change

## Pitfalls
# Never modify .env files
# Never skip API tests

The trust shift was immediate. Claude stopped guessing my stack. It stopped asking "should I use TypeScript or JavaScript?" because the answer was already in the file. The rules from Stage 1 still applied, but now they had something to work with.

Where are you on the ladder?

3 questions. 15 seconds. See which stage matches your current setup.

1. How many lines is your CLAUDE.md?

2. Does it include project-specific rules?

3. Does Claude remember decisions across sessions?

Level 1

Keep reading. The next 4 stages show you what's possible.

CLAUDE.md Maturity Ladder

Basic

Project context and a few ground rules. Where everyone starts.

Organized

Sections by concern. 50-100 rules. The file starts working harder than you.

Automated

Custom skills, hooks, automated workflows. An operating system, not a config.

Cross-repo

Knowledge base, decision log, shared context across projects.

Self-improving Built for teams

Automated phases, measurement loops, learning from outcomes.

Preview: CLAUDE.md Starter Kit

# CLAUDE.md Starter Kit This is a ready-to-use template for configuring Claude Code on your project. Drop it in your repo root as CLAUDE.md and customize each section. --- ## Project Context # Project: [your project name] # One-liner: [what this project does in one sentence] # Stack: [framework] + [language] + [database] # Deploy target: [where this runs] # Production URL: [https://your-app.com] ## Coding Standards # Language: [TypeScript, Python, PHP, etc.] # Style: [your preferences]

Get the full 60-rule template below ↓

Stage 3 · Level 3Skills and Hooks

Stage 2 solved the context problem. But a new one emerged: repetition. I kept writing the same instructions across different projects. "Always invoke the frontend-design skill before writing UI code." "Run tests after every change." "Never push to main." These weren't project-specific. They were universal operating principles.

That's when CLAUDE.md stopped being a config file and became an operating system.

Claude Code supports two mechanisms that made this jump possible: skills (reusable instruction sets that Claude loads on demand) and hooks (shell commands that fire automatically before or after specific events). Skills handle the "how to do X" problem. Hooks handle the "never let me forget Y" problem.

CLAUDE.md · Stage 3 (skills + hooks)

## Skills (invoke before specific work)
# frontend-design: MANDATORY before any UI code
# ship-it: full commit + push + PR + merge cycle
# investigate: systematic debugging with root cause

## Hooks (automatic guardrails)
# PreToolUse: block writes to .env files
# PostToolUse: scan for leaked secrets
# SessionStart: check git status + open PRs

## Role Definition
# Brady = decision maker + UI reviewer
# Claude = CTO. Owns all technical decisions.
# Never ask Brady to run commands.

The role definition was the biggest unlock. Karpathy's original rules treat Claude as an assistant that needs permission for everything. When I defined Claude as the CTO who owns technical decisions, the entire dynamic shifted. Claude stopped asking me which testing framework to use and started making that call itself. My job narrowed to reviewing UI and making strategic decisions. Everything else was Claude's problem.

When I defined Claude as the CTO, the dynamic shifted. It stopped asking permission for technical decisions and started owning them. My job narrowed to the two things only I can do: reviewing UI and making strategic calls.

The trust gap Karpathy identified (84% of developers use AI tools but only 29% trust them) closes here. You don't trust an assistant. You trust a system with guardrails. The hooks prevent the bad outcomes. The skills standardize the good ones. Trust isn't a feeling at this stage. It's architecture.

Get the CLAUDE.md Starter Kit

60 rules. 7 sections. Drop it in your repo and start customizing.
One email. No spam.

Unsubscribe anytime. Your email stays with us.

You're in.

Fork the Gist to start. Star it to get updates when the template evolves.

Open on GitHub →

Know someone building with Claude Code?

Forward this to a colleague

Stage 4 · Level 4Cross-Project Intelligence

Here's where the single-file model breaks down. I run 14 repos. Some are client projects. Some are internal tools. Some are content engines. Each has its own CLAUDE.md. But the lessons I learn in one project apply everywhere.

When a CRM build taught me that "newsletter forms convert without traffic, so fix traffic first," that insight applied to every project with a form. When a deploy failure taught me that "infrastructure changes should be their own phase, not bundled into features," that applied to every repo.

Stage 4 adds three files that sit alongside CLAUDE.md:

KNOWLEDGE.md holds lessons that earned their way in through evidence. Not opinions. Not best practices from a blog post. Rules that were proven by specific outcomes in specific phases. Each rule tracks its citations so you can trace why it exists.
DECISIONS.md logs every significant choice with date, phase, and rationale. When you come back to a project 3 weeks later and wonder "why did we use webhooks instead of polling?", the answer is there. No re-research needed.
A memory system that captures feedback, project context, and user preferences across sessions. Not just "remember X." Typed memories: what the user prefers, what the project needs, what was decided, where to find things.

Before · Stage 3 After · Stage 4

# One CLAUDE.md per repo
# Context resets every session
# Lessons stay in your head
# Decisions: "I think we chose X?"

# CLAUDE.md: project rules
# KNOWLEDGE.md: earned lessons
# DECISIONS.md: dated rationale
# memory/: typed persistence

# Global rules (all repos):
#   ~/.claude/CLAUDE.md
# Repo rules (this project):
#   ./CLAUDE.md
# Phase rules (current work):
#   .run/KNOWLEDGE.md

The critical shift: knowledge flows up, not sideways. A lesson learned in one repo gets promoted to global KNOWLEDGE.md only after it's been proven independently in 2+ projects. Until then it stays local. This prevents one project's quirk from becoming a universal rule.

Here's a real example. I shipped a 3-tier pricing page on my own site with zero click tracking. Couldn't tell which tier got the most interest. Lost the baseline window entirely. That failure got flagged by the measure phase and became a KNOWLEDGE rule: "any page with pricing CTAs must have per-CTA click events wired before the phase closes." The next feature that touched a revenue page, the rule fired during the plan phase. Tracking got added before ship. The system caught what I would have forgotten.

Decisions work the same way. When you're 3 weeks into a project and someone asks "why did we use webhooks instead of polling?" the answer is in DECISIONS.md with the date, the phase, and the rationale. No digging through Slack threads. No "I think we chose that because..." It's there. Timestamped. Linked to the research that informed it.

This is where the trust gap closes for real. You're not trusting Claude's memory. You're trusting a structured, searchable, evidence-backed system that Claude reads at the start of every session. The AI doesn't need to be reliable across sessions if the system it reads is reliable across sessions.

Stage 5 · Level 5The Self-Improving Pipeline

This is the part I can't fully open-source. Not because it's secret, but because it's deeply specific to how I run my business. The philosophy, though, is transferable.

Stage 5 wraps every piece of work in a pipeline: classify the task, research it, discuss scope, write a plan, build it, review it, verify it works, ship it, deploy it, announce it, sell it, measure the outcome. One command triggers the whole thing. The system knows where you are and routes you to what's next.

The measurement phase is where it gets interesting. After a feature ships, the pipeline checks: did the hypothesis hold? Did signups go up? Did the page actually convert? If yes, the lesson gets queued for promotion to KNOWLEDGE.md. If no, it gets flagged as a missed hypothesis so the next plan can adjust.

Stage 5 · Pipeline concept (sanitized)

# One command. System knows what's next.
/run

# Pipeline for each task:
# classify → research → discuss → plan
# → build → review → verify → ship
# → deploy → announce → sell → measure

# Measure closes the loop:
# - Did the hypothesis hold?
# - Promote lesson to KNOWLEDGE.md
# - Archive the phase with evidence
# - Route to next roadmap item

The system doesn't just remember what you built. It remembers whether what you built actually worked. And it uses that evidence to make the next plan better.

A concrete example: I shipped a newsletter signup form on a content page. The pipeline's measure phase checked it 7 days later. Two signups. The system flagged the hypothesis as "missed" and surfaced the diagnosis: traffic to the page was near zero. The form worked fine. The problem was upstream. That observation got queued as a KNOWLEDGE candidate: "when newsletter signups are low, diagnose traffic volume before iterating on the form." It's still a candidate today, waiting for a second independent citation before it earns promotion to an active rule. That's the system working as designed. No rule gets promoted on one data point.

Without the measurement loop, I would have spent time rewriting form copy. The system caught the real problem. That's the difference between Stage 2 (organized rules) and Stage 5 (a system that learns from its own outcomes and rewrites its own rules).

The system doesn't just remember what you built. It remembers whether it worked. And it uses that evidence to make the next plan better. That's the gap between a config file and an operating system.

The real question

Karpathy's 4 rules are the right starting point. He nailed the failure modes. But rules alone don't scale to a daily operating tool across 14 repos. What scales is a system that encodes the rules, adds context, automates the guardrails, remembers what it learned, and improves from its own outcomes.

That's the progression: rules → context → automation → intelligence → self-improvement.

Most people stop at Stage 1 or 2. The articles and tutorials online stop there too. If you're using Claude Code every day and your CLAUDE.md is still a flat list of rules, you're leaving power on the table.

I built a starter template based on Stage 2. It's 60 rules organized into 7 sections. You drop it in your repo and start customizing. It won't get you to Stage 5 by itself. But it'll get you out of Stage 1 in about 10 minutes.

Didn't grab the template? Get the CLAUDE.md Starter Kit

Go further: Make Claude Code stop asking you the same questions — the full setup (USER.md, an auto-generated ACCESS.md, permissions, hooks) that turns the template above into a system.

Want this built for your business?

I build Level 5 systems for teams. 90-minute strategy session. We look at your stack, find the highest-ROI automations, and you walk away with a written plan.

Book a Strategy Session →