Karpathy Autoresearch Explained

Introduction

This lesson introduces autoresearch as a practical workflow for letting an AI coding agent run experiments without waiting for a human to choose every next step. The basic pattern is simple: define the goal, freeze the evaluator, let the agent propose code changes, run the experiment, keep the change only if the metric improves, and repeat. The public examples make the idea concrete: single-GPU overnight runs improved val_bpb from 0.997900 to 0.969686 in 126 experiments on an H100, and the depth-12 findings from those smaller runs later transferred to larger depth-24 nanochat runs, reducing the “time to GPT-2” leaderboard entry from 2.02 hours to 1.80 hours, with a later entry at 1.65 hours. The rest of this section turns that workflow into a tutorial: first the naming and intuition, then the loop, comparisons, implementations, strengths, limitations, and a practical recipe for building a similar system.

Terminology and intuition

Use the name autoresearch for this method, not “autoregressive search.” The phrase “autoregressive” still matters, but it describes the language model used to propose edits; the overall workflow is the autoresearch loop.

The core intuition is simple. Instead of asking a human researcher to manually try one code change at a time, the human writes a research charter in program.md, freezes the evaluator, and lets an agent repeatedly: read the current best code, propose a patch, run a real experiment, observe the metric, and keep only improvements. Karpathy’s repo explicitly structures the system around three files: a read-only evaluator/data file (prepare.py), a mutable research target (train.py), and an instruction file (program.md). The loop is therefore not just “generate text”; it is generate action proposals, execute them in the world, score them, and ratchet the best state forward.
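
As a concrete picture, the minimal layout looks roughly like this (file roles as described above and in the implementation section below; the one-line summaries are simplified):

autoresearch/
├── program.md      # human-written research charter: goal, rules, loop instructions
├── prepare.py      # read-only: data pipeline, evaluator, fixed budgets
├── train.py        # mutable: the only file the agent may edit
├── results.tsv     # machine-readable log, one row appended per experiment
└── analysis.ipynb  # plots keep rate and the running best frontier from results.tsv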

Karpathy’s own motivation in the March 20 interview was to “remove yourself as the bottleneck.” He explained autoresearch as an example of refactoring the workflow so the human is not the next-step trigger between experiments. That is the conceptual heart of the method: the human specifies the objective and constraints once, then the agent runs the edit → execute → evaluate → keep/revert loop autonomously for as long as the budget allows.

flowchart TD
    A[Read current best code and instructions] --> B[Agent proposes code edit]
    B --> C[Apply patch on experiment branch]
    C --> D[Run fixed-budget experiment]
    D --> E[Read metric and diagnostics]
    E --> F{Improved and passes checks?}
    F -->|Yes| G[Commit and advance incumbent]
    F -->|No| H[Revert / reset to previous best]
    G --> A
    H --> A

A useful mental model is therefore:

  • Search space: code edits, hyperparameters, optimizer settings, architecture choices.
  • Proposal distribution: an autoregressive coding agent.
  • Environment: the actual training/evaluation harness.
  • Fitness function: a scalar metric such as val_bpb.
  • Selection rule: elitist keep-or-revert.

That is why the method feels partly like automated ML experimentation, partly like evolutionary search, and partly like agentic software engineering.

Original sources and timeline

The easiest way to understand autoresearch is to follow how the workflow grows. It starts from Karpathy’s goal to “remove yourself as the bottleneck”: the human should define the objective and constraints, but should not have to trigger every experiment by hand. In the original GitHub setup, that idea becomes a compact loop: the human writes the Markdown instructions, the AI agent edits the training code, the evaluator runs a short experiment, and the result decides whether the edit is kept or reverted. The March 8–9 result reports show why this loop is useful in practice: depth-12 experiments found additive improvements, those changes transferred to larger depth-24 nanochat runs, and “time to GPT-2” dropped from 2.02 hours to 1.80 hours, with session logs showing runs such as 0.9979 → 0.9773 in 89 experiments and 0.9979 → 0.969686 in 126 experiments. The later March interview expands the same idea from one local agent into a possible collaborative system where many agents propose changes and humans or evaluators verify the useful ones. Read this section as a tutorial for that pattern: prompt the agent, let it edit, run a fixed evaluator, keep the improvement, revert the failure, and then scale the loop only after the basic version works.

Formal algorithm and mathematical formulation

According to the original setup, there is a read-only evaluator and data pipeline in prepare.py; a single editable file, train.py; a fixed wall-clock training budget of 300 seconds; a fixed context length MAX_SEQ_LEN = 2048; a validation budget EVAL_TOKENS = 40 * 524288; and a scalar objective val_bpb, where lower is better. The program.md instructions say to run a baseline first, then loop forever: edit train.py, commit, run the experiment, parse the results, record them, and keep the commit only if val_bpb improved.
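
Written out as code, that experiment contract is just a few frozen constants. This is a sketch: the name TRAIN_SECONDS is illustrative, while MAX_SEQ_LEN and EVAL_TOKENS match the settings described above.

# Frozen experiment contract; in the original setup these live in read-only prepare.py.
TRAIN_SECONDS = 300            # fixed wall-clock training budget per experiment
MAX_SEQ_LEN = 2048             # fixed context length
EVAL_TOKENS = 40 * 524288      # fixed validation token budget (~21M tokens)
# Objective: val_bpb on the validation set, lower is better.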

Definition

Let:

  • \(c_t\) be the current incumbent code state at iteration \(t\).
  • \(h_t\) be the experiment history up to \(t\), including descriptions, metrics, crashes, and commits.
  • \(p\) be the human-written research program or instruction context.
  • \(q_\phi(a \mid c_t, h_t, p)\) be the coding agent’s proposal distribution over edits \(a\). In practice, this is implemented by an autoregressive LLM acting over code and shell actions.
  • \(A(c_t, a_t)\) be the application of edit \(a_t\) to code \(c_t\), yielding candidate code \(c'_t\).
  • \(J(c)\) be the scalar evaluator returned after running the fixed-budget experiment. In the original setup, \(J(c)=\text{val\_bpb}(c)\), with lower being better.

A minimal formalization is:

\[ a_t \sim q_\phi(\cdot \mid c_t, h_t, p) \]
\[ c'_t = A(c_t, a_t) \]
\[ y_t = J(c'_t) \]
\[ c_{t+1} = \begin{cases} c'_t & \text{if } y_t < J(c_t) - \varepsilon \text{ and } \text{checks}(c'_t) = 1 \\ c_t & \text{otherwise} \end{cases} \]

Here \(\varepsilon\) is an optional acceptance margin. In the original implementation, the operational rule is effectively \(\varepsilon = 0\): if the metric is lower, keep; otherwise revert. Crashes are treated as failures and reverted.

Pseudocode

The following sketch abstracts the public loop as runnable-shaped Python; helpers such as run, Agent.propose, apply_patch, safe_run, commit, revert, and summarize stand in for the repo’s actual mechanics:

def autoresearch(E, c0, p, T, eps=0.0):
    """Elitist keep-or-revert loop.

    E = immutable evaluator, c0 = initial code, p = instructions,
    T = max iterations, eps = acceptance margin (0 in the original setup).
    """
    baseline = run(E, c0)
    log = [(c0, baseline.metric, "keep", "baseline")]
    c_best, m_best = c0, baseline.metric

    for t in range(1, T + 1):
        proposal = Agent.propose(current_code=c_best, history=log, instructions=p)
        c_try = apply_patch(c_best, proposal)

        result = safe_run(E, c_try)  # returns metric, crash flag, aux stats

        if result.crash:
            revert(c_try)
            log.append((hash(c_try), None, "crash", summarize(proposal)))
            continue

        if result.metric < m_best - eps and passes_policy_checks(c_try, result):
            commit(c_try)
            c_best, m_best = c_try, result.metric
            status = "keep"
        else:
            revert(c_try)
            status = "discard"

        log.append((hash(c_try), result.metric, status, summarize(proposal)))

    return c_best, log

This is not beam search, because the basic design keeps only one incumbent branch. It is not MCTS, because there is no explicit search tree, value backup, or visit-count policy. It is best described as LM-proposed stochastic local search with elitist acceptance and rollback.

Public-repo objective function

prepare.py defines val_bpb as a vocab-size-independent metric: it sums cross-entropy in nats over target tokens, sums their byte lengths, and converts nats-per-byte to bits-per-byte. It also uses a fixed sequence length and fixed evaluation token count to keep runs comparable across configuration changes.

Formally, if token losses are \(\ell_i\) in nats and target byte lengths are \(b_i\), then:

\[ \text{val\_bpb} = \frac{\sum_i \ell_i}{\log 2 \cdot \sum_i b_i} \]

with zero-byte special tokens excluded from both sums.
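
A minimal sketch of that computation, assuming per-token losses in nats and per-token target byte lengths have already been collected:

import math

def val_bpb(losses_nats, byte_lengths):
    """Bits per byte over target tokens; zero-byte special tokens drop out of both sums."""
    pairs = [(l, b) for l, b in zip(losses_nats, byte_lengths) if b > 0]
    total_nats = sum(l for l, _ in pairs)
    total_bytes = sum(b for _, b in pairs)
    return total_nats / (math.log(2) * total_bytes)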

Assumptions

This loop works well only under a fairly specific set of assumptions. The evaluator must be stable enough that a measured improvement is meaningful; the objective must be cheap enough to run many times; the editable surface must be small or well-scoped so patches are reviewable; and the metric must be machine-checkable so keep/revert can be automated. Karpathy’s public traces also show that some improvements are fragile: for example, seed changes and a 5% warmup looked promising in one session but did not reproduce in a later one, and nanochat’s leaderboard notes mild training nondeterminism across repeated runs.

If the evaluator is noisy, a more statistically careful version should replace \(J(c)\) by an average over repeated runs,

\[ \hat J_K(c) = \frac{1}{K}\sum_{k=1}^{K} J(c; \xi_k), \]

and accept only when the estimated gain exceeds noise by a chosen threshold or confidence bound. That is not part of the minimal loop, but it is often the right production modification. The need for such care is supported by the repeated-run spread observed on the nanochat leaderboard.
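
As a sketch of that rule, with J the single-run evaluator and independent reruns (the values of K and the z threshold are placeholder choices):

import statistics

def accept_noisy(J, c_new, c_best, K=5, z=2.0):
    """Keep c_new only if its mean metric beats c_best's by more than z standard errors."""
    new = [J(c_new) for _ in range(K)]
    best = [J(c_best) for _ in range(K)]
    gain = statistics.mean(best) - statistics.mean(new)  # positive means improvement (lower is better)
    se = (statistics.variance(new) / K + statistics.variance(best) / K) ** 0.5
    return gain > z * se

In practice you would cache the incumbent’s measurements instead of rerunning them for every candidate, which halves the cost of each comparison.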

Use the comparison below as a map, not as a claim that these methods do the same thing. Beam search, sampling, reranking, and contrastive decoding usually operate on token sequences; autoresearch operates on program edits that are evaluated by actually running code. The shared idea is that each method spends limited compute exploring possible next states.

| Method | Determinism | Diversity | Compute | Latency | Memory | Typical use cases | Pros | Cons |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Autoresearch | Medium in practice; proposal and training noise matter | Medium by default; higher with multi-agent / parallel variants | High per step because each node requires execution | High sequential latency; much better with parallel workers | Low for single-branch search; higher with parallel experiments | ML tuning, build/CI optimization, kernel optimization, any loop with executable metric | Real-world feedback, can discover non-obvious interactions, accumulates improvements over time | Expensive evaluations, metric hacking, local optima, reproducibility/safety concerns |
| Beam search | High given fixed model and tie-breaks | Low | Moderate | Low to moderate | Grows with beam width | MAP-style sequence decoding, constrained generation | Strong likelihood search, simple, widely available | Low diversity, beam-search pathologies, still tied to model score |
| Sampling | Low | High | Low | Low | Low | Open-ended text generation, exploration | Cheap, diverse, easy to tune | Noisy, unstable quality, weaker guarantees |
| MCTS | Low to medium | Medium to high | High | High | High | Planning with delayed rewards, game-like search, tool-use planning | Handles long-horizon decisions and sparse rewards better than greedy local search | Heavy orchestration cost, requires good rollout/value heuristics |
| Reranking / MBR-style selection | Medium | Depends on candidate generator | Moderate to high | Two-stage | Moderate | Translation, structured generation, best-of-N selection | Lets you use an external quality metric instead of raw model likelihood | Needs a good candidate set; quality ceiling is limited by proposal stage |
| Contrastive decoding | High to medium | Low to medium | Moderate | Low to moderate | Moderate | Open-ended text generation with fewer degeneracies | Better quality than plain greedy/beam in some open-ended settings, no retraining | Still token decoding; not an execution-driven search loop |

The table gives a quick way to place autoresearch beside more familiar search and decoding methods.

Two comparisons are especially important.

First, beam search versus autoresearch. Beam search keeps the top \(K\) partial token prefixes according to model score. Autoresearch keeps one current best code state and accepts only candidates that improve after real execution. So beam search is a breadth-limited decoder over symbolic prefixes; autoresearch is a ratcheting search over executable states. With many parallel workers, autoresearch can test more branches at once, but the basic single-worker version is closer to greedy hill-climbing.

Second, contrastive decoding and reranking versus autoresearch. Contrastive decoding is still a token-level inference objective: it prefers tokens that score well under a large model while penalizing those favored by a smaller “amateur” model. Reranking/MBR, similarly, selects among a candidate set using an external utility. Autoresearch is different because the candidate’s score is produced by executing the modified system in an environment. That makes it far more general, but also much more expensive.

The cleanest one-sentence summary is: beam search, sampling, reranking, and contrastive decoding mostly search over text continuations; autoresearch searches over executable research actions. That is why it is powerful when you have a trusted evaluator, and why it is overkill when simple token decoding is enough.

Implementations and code repositories

Start with the original Python implementation. It has three key components (prepare.py, train.py, and program.md), plus analysis.ipynb, which loads results.tsv and generates progress plots such as the running best frontier. The upstream benchmark context is nanochat, whose leaderboard shows how discovered changes transfer to larger training runs.

| Repository | What it is | Language | Key files | Why it matters |
| --- | --- | --- | --- | --- |
| karpathy/autoresearch | Original minimal implementation | Python, Jupyter | program.md, train.py, prepare.py, analysis.ipynb | Canonical source for the loop and defaults |
| karpathy/nanochat | Upstream training harness / benchmark target | Python | runs/speedrun.sh, dev/LEADERBOARD.md | Where Karpathy measures transfer to “time to GPT-2” |
| miolini/autoresearch-macos | Mac-oriented fork preserving upstream structure | Python, Jupyter | program.md, train.py, prepare.py | Useful if you want a close-to-upstream Mac path |
| trevin-creator/autoresearch-mlx | MLX port for Apple Silicon Macs | Python | program.md, train.py, prepare.py | Native MLX version; explicitly keeps the same loop semantics |
| jsegov/autoresearch-win-rtx | Windows RTX fork for consumer NVIDIA GPUs | Python, Jupyter | program.md, train.py, prepare.py | Native Windows path with consumer-GPU focus |
| mutable-state-inc/autoresearch-at-home | Collaborative SETI@home-style swarm fork | Python, Jupyter | collab.md, coordinator.py, plus upstream core files | Adds coordinated multi-agent experiment claiming and result sharing |
| gensyn-ai/collaborative-autoresearch-demo | P2P collaborative demo | Python | program.md, train.py, prepare.py | Shows real-time result sharing across agents |
| RightNow-AI/autokernel | Domain transfer of the pattern to GPU kernel optimization | Python | program.md, kernel.py, bench.py, profile.py, extract.py, verify.py | Demonstrates that the pattern generalizes beyond model training |

Use the table as a guide to the main implementations and related adaptations.

A few implementation notes are especially useful.

The original implementation is purposely tiny and opinionated: one mutable file, one metric, one five-minute budget, one incumbent branch. That simplicity matters because it reduces context bloat, makes diffs reviewable, and keeps acceptance decisions easy to automate. The program.md file even includes an explicit “simplicity criterion” saying that small metric gains are not worth keeping if they introduce ugly complexity.

For Apple Silicon, the MLX rewrite is the most useful path to study. It keeps the fixed-time loop, the same program.md idea, and the same keep-or-revert pattern, while changing the runtime substrate from PyTorch/CUDA to MLX. It also reports hardware-specific winners and notes that some findings on a Mac Mini did not transfer cleanly to Max-class hardware, which is exactly the sort of platform-specific effect autoresearch can expose.

To learn the general pattern rather than the exact training setup, compare it with autokernel, which applies the same structure to GPU kernel optimization. Instead of editing train.py, the agent edits kernel.py; instead of val_bpb, it uses a fixed correctness-and-performance harness in bench.py; and instead of model training, it profiles, extracts, optimizes, and verifies kernels. This shows that the autoresearch pattern is not limited to LLM pretraining.

For JAX, treat google-deepmind/simply as a useful substrate rather than a drop-in autoresearch port. It is a minimal JAX codebase for rapid LLM research iteration, so it can support the same style of workflow, but it is not the same implementation as Karpathy’s original setup.

Evidence, strengths, limitations, and evaluation

Public results and performance characteristics

The clearest results come from the session reports and the nanochat leaderboard. In Discussion #32, a single H100 session improved val_bpb from 0.9979 to 0.9773 in 89 experiments, with early wins from halving batch size, longer warmdown, warmup, and a depth-9 reparameterization at roughly constant width. In Discussion #43, another H100 session improved 0.997900 → 0.969686 in 126 experiments, with 23 kept changes, 102 discarded changes, and 1 crash over about 10.5 hours. The largest gains in that second report came from halving the batch, moving to depth 9 at about the same dimensionality, raising embedding LR, changing RoPE base frequency, and adding small targeted weight decay to embeddings/value embeddings.

Those gains mattered because the changes were then reported to transfer to nanochat’s larger depth-24 runs. The leaderboard documents the progression from 2.02 hours to 1.80 hours for “autoresearch round 1,” then to 1.65 hours for “autoresearch round 2.” The leaderboard notes that the first autoresearch-derived entry came from a private autoresearch run on a depth-12 model whose improvements translated to the depth-24 benchmark.

With the original five-minute experiment budget, the default throughput is about 12 experiments per hour and around 100 overnight. Community implementations show the same pattern scaling in both directions: the MLX port reports roughly 6–7 minutes per experiment on Apple Silicon setups, while autokernel reports about 90 seconds per kernel experiment and roughly 320 overnight across all kernels. A 16-GPU cluster experiment reported about 910 experiments in ~8 hours and reached the same best validation loss about 9× faster than a simulated sequential baseline.

Typical use cases and strengths

This pattern is best when the task has a trusted scalar metric, a cheap or moderate-cost evaluator, and a limited editable surface. That is why it works well for small-to-medium training harnesses, build-time reduction, inference-kernel optimization, and similar tasks where the agent can cheaply try many patches and get immediate pass/fail or better/worse feedback.

A later engineering write-up from Shopify is especially informative because it generalizes the same pattern beyond ML training. Their write-up describes adapting the loop to a CI/build-time metric: measure the baseline, let the agent propose hypotheses, keep faster changes, and discard slower or crashing ones. The important point is not the specific codebase; it is that the same ratcheting loop transferred cleanly from val_bpb minimization to build-time minimization.

The biggest strengths are therefore straightforward. The method uses real execution feedback rather than pure model probability, so it can discover interaction effects that humans or one-shot prompting miss. It accumulates improvements compositionally when the metric is sufficiently informative. It also converts “background optimization work” into an always-on process: exactly the sort of boring-but-valuable work that humans tend not to prioritize manually.

Limitations and failure modes

The most obvious failure mode is metric hacking. Shopify’s write-up gives a crisp example: the agent sometimes found “ugly hacks,” such as deleting or bypassing things that technically made the build faster but were not acceptable engineering outcomes. That is the canonical autoresearch problem in one sentence: the optimizer is only as good as the metric and guardrails.

A second failure mode is adaptive overfitting to the evaluator. The basic setup pins a validation shard and repeatedly compares candidates against that fixed target. That makes the loop simple and fast, but repeated adaptive selection on a fixed validation signal always raises the risk of overfitting to measurement noise or to idiosyncrasies of the validation slice. A stronger tutorial version should add held-out checks or repeated measurements before trusting late-stage improvements.

A third issue is noise and reproducibility. nanochat’s leaderboard explicitly notes mild nondeterminism and shows a spread in repeated CORE scores across nominally identical runs. Session reports also show fragile findings: a seed change helped in one run and not another; warmup helped once and failed to reproduce later. This means false positives are possible unless you add repeated measurement, significance thresholds, or external confirmation.

A fourth issue is local-optimum behavior. The public single-agent design keeps one incumbent and reverts worse candidates, which is efficient but inherently local. It can get stuck. Parallel and collaborative variants partly address this by exploring more combinations in parallel, but the default solo loop does not maintain a principled frontier of diverse hypotheses the way a beam or tree search would. The public cluster-scaling write-up makes exactly this point by contrasting one-at-a-time hill-climbing with parallel grids.

A fifth issue is code complexity creep. Karpathy’s prompt explicitly tries to defend against this with a simplicity prior, but the danger is real: agents can often buy tiny metric improvements with brittle or ugly changes, especially late in the run. The simplicity criterion in program.md is therefore not cosmetic; it is a required regularizer on the search objective.

Finally, there is a security and trust problem for collaborative variants. Karpathy’s interview discussion of large, untrusted pools of workers makes clear that distributed autoresearch only works if candidate solutions are easy to verify and isolated enough to run safely. That is much easier for metrics than for arbitrary code execution in a shared system.

How to implement it from scratch

What follows is the shortest path to building an autoresearch-like loop yourself. The original version is PyTorch-based; the JAX notes below show how to translate the same workflow into a more functional setup.

Minimal recipe

Choose one machine-checkable metric. The metric should be scalar, cheap, and hard to game accidentally. In the original setup, this is val_bpb, computed by a read-only evaluator over a fixed sequence length and fixed validation budget. If your metric is noisy, define the repeated-run version up front.

Freeze the evaluator. Put data loading, eval logic, time budget, and pass/fail conditions in an immutable file or module. Karpathy’s repo makes prepare.py read-only for exactly this reason. If the agent can change the evaluator, the loop stops being research and becomes reward hacking.
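
One cheap hardening step (not part of the original repo, but easy to bolt on) is to fingerprint the evaluator at baseline time and refuse to run if it ever changes:

import hashlib
from pathlib import Path

def fingerprint(path="prepare.py"):
    """SHA-256 of the evaluator file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

BASELINE_FP = fingerprint()  # record once, before the loop starts

def assert_evaluator_frozen():
    if fingerprint() != BASELINE_FP:
        raise RuntimeError("prepare.py was modified; the evaluator must stay frozen")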

Constrain the editable surface. Start with one mutable file or one mutable config block. Karpathy’s official design lets the agent touch only train.py. This dramatically improves debuggability and keeps diffs reviewable.
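
A matching guard, assuming the loop runs inside a git checkout, rejects any patch that touches files outside the allowed surface:

import subprocess

ALLOWED_FILES = {"train.py"}

def patch_in_scope(repo_dir="."):
    """True iff uncommitted changes relative to HEAD touch only allowed files."""
    out = subprocess.run(
        ["git", "diff", "--name-only", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    changed = {line.strip() for line in out.splitlines() if line.strip()}
    return changed <= ALLOWED_FILES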

Establish a baseline first. The public prompt says the first run should always be the unmodified baseline. Log it, store the metric, and treat it as your incumbent.

Implement the ratchet loop. The loop is: propose patch, apply patch, run evaluator, parse metric, keep if improved, otherwise revert, then log the result to a machine-readable history such as results.tsv. The official notebook then computes keep rate, running frontier, and “top hits” from that log.
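
The history file can be as simple as one tab-separated row per experiment; the columns below are illustrative rather than the repo’s exact schema:

import csv
import time
from pathlib import Path

def log_result(path, exp_id, description, metric, status):
    """Append one experiment row; the analysis notebook rebuilds the frontier from this file."""
    is_new = not Path(path).exists()
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        if is_new:
            writer.writerow(["timestamp", "id", "description", "metric", "status"])
        writer.writerow([int(time.time()), exp_id, description, metric, status])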

Add safety and anti-gaming checks. Timeouts, lint/tests, memory ceilings, output-shape checks, and a simplicity prior are cheap and matter a lot. If you can, make the acceptance policy multi-objective—for example, improve metric while staying within a memory envelope. The official prompt already treats VRAM as a soft constraint and simplicity as a decision criterion.
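
A sketch of what an acceptance policy like passes_policy_checks from the pseudocode above might test; the signature here differs from the abstract one, and every threshold is a placeholder to tune for your setup:

def passes_policy_checks(result, diff_line_count, peak_vram_gb, vram_budget_gb=38.0):
    """Illustrative multi-objective gate: a better metric alone is not enough to keep a change."""
    return (
        result.status == "ok"                # ran to completion and the metric parsed
        and peak_vram_gb <= vram_budget_gb   # soft memory envelope
        and diff_line_count <= 150           # simplicity prior: reject sprawling patches
    )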

Only then add parallelism. Start with the sequential loop first. If experiment runtime dominates planning time, parallel workers usually help a lot, but they also force you to manage experiment deduplication, candidate claiming, and merging. Collaborative versions need explicit coordination layers.
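
If you later parallelize, the minimum viable scheme is wave-based: evaluate a batch of distinct candidates concurrently, then promote at most one winner per wave. A thread-pool sketch under that assumption:

from concurrent.futures import ThreadPoolExecutor

def parallel_wave(candidates, evaluate, m_best):
    """Evaluate one wave of candidate patches; return the best improving (candidate, result), or None."""
    with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
        results = list(pool.map(evaluate, candidates))
    improving = [(c, r) for c, r in zip(candidates, results) if r.ok and r.metric < m_best]
    return min(improving, key=lambda cr: cr[1].metric) if improving else None

Promoting one winner per wave sidesteps the merge problem at the cost of discarding other improving candidates; the collaborative forks handle this trade-off with explicit claiming and coordination layers instead.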

These defaults are a good starting point for an implementation patterned on the official repo:

| Knob | Recommended start | Why |
| --- | --- | --- |
| Fixed training/eval budget | 300 s | Karpathy’s public default; forces comparable experiments |
| Editable surface | 1 file / 1 module | Keeps context small and diffs reviewable |
| Acceptance threshold \(\varepsilon\) | 0 for stable metrics; positive margin for noisy metrics | Avoids keeping noise |
| Crash retries | 1–2 | More than that usually wastes budget |
| Timeout | 2× nominal budget | Catches hangs without punishing small overheads |
| Proposal temperature | Low to medium | Enough diversity without chaotic patches |
| Patch size limit | Small-to-medium | Encourages local search and easier debugging |
| Repeated evaluations \(K\) | 1 if stable, 3–5 if noisy | Reduces false positives |
| Parallel workers | 1 initially | Add more only after the sequential loop is trustworthy |
| Complexity regularizer | Explicit | Prevents microscopic gains from bloating code |

The first three defaults come from the original setup; the rest are natural production hardening for the same loop.

PyTorch sketch

For a PyTorch implementation, the official repo is already the model to follow: keep the mutable training target as a normal Python script and invoke it as a subprocess. That is a good fit because it makes each candidate naturally sandboxable and lets you parse metrics from stdout or structured logs.

# evaluator.py
from dataclasses import dataclass
import subprocess
import re
from pathlib import Path

@dataclass
class EvalResult:
    ok: bool
    metric: float | None
    status: str
    log_path: Path

def run_candidate(repo_dir: str, timeout_s: int = 600) -> EvalResult:
    log_path = Path(repo_dir) / "run.log"
    cmd = f"cd {repo_dir} && uv run train.py > run.log 2>&1"
    try:
        subprocess.run(cmd, shell=True, check=True, timeout=timeout_s)
        text = log_path.read_text()
        m = re.search(r"^val_bpb:\s*([0-9.]+)", text, flags=re.M)
        return EvalResult(ok=bool(m), metric=float(m.group(1)) if m else None,
                          status="ok" if m else "parse_fail", log_path=log_path)
    except subprocess.TimeoutExpired:
        return EvalResult(ok=False, metric=None, status="timeout", log_path=log_path)
    except subprocess.CalledProcessError:
        return EvalResult(ok=False, metric=None, status="crash", log_path=log_path)

This subprocess-centered design is one reason the approach is so practical for PyTorch and general Python codebases. It is not elegant in a functional-programming sense, but it lines up perfectly with how coding agents already operate on repositories.

JAX sketch

For JAX, avoid making the agent rewrite a huge monolithic script. A better JAX translation is: freeze a pure run_experiment(config) wrapper, let the agent modify a small config/module surface, and make compilation behavior explicit. Treat this as a tutorial translation of the workflow, not as a claim about an official JAX port.

# jax_runner.py
from dataclasses import dataclass
import jax
import jax.numpy as jnp

@dataclass
class EvalResult:
    ok: bool
    metric: float
    status: str

def run_experiment(cfg) -> EvalResult:
    # compile/warmup should be separated from measured budget if possible
    params, state = init_model_and_state(cfg)
    params, state = train_for_fixed_steps_or_time(params, state, cfg)
    metric = evaluate_bpb_or_task_metric(params, state, cfg)
    return EvalResult(ok=True, metric=float(metric), status="ok")

In JAX, the main extra engineering issue is compilation. Karpathy’s PyTorch repo explicitly excludes startup/compilation from the five-minute training budget, and you should preserve that idea in JAX even more carefully, because otherwise you risk optimizing for compilation artifacts rather than for steady-state training quality.
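
A sketch of that separation, assuming a jitted train_step: run one step and block on it before starting the clock, so compilation never counts against the measured budget. The placeholder update inside train_step is illustrative only.

import time
import jax
import jax.numpy as jnp

@jax.jit
def train_step(params, batch):
    # Placeholder update; substitute the real loss/gradient step.
    return jax.tree_util.tree_map(lambda p: p - 1e-3 * jnp.mean(batch), params)

def run_with_budget(params, batches, budget_s):
    params = train_step(params, batches[0])   # first call triggers compilation
    jax.block_until_ready(params)             # finish compile/warmup outside the clock
    start = time.monotonic()
    for batch in batches[1:]:
        if time.monotonic() - start > budget_s:
            break
        params = train_step(params, batch)
    jax.block_until_ready(params)             # make the final step's time count
    return params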

Complexity analysis

Let:

  • \(N\) = number of experiments,
  • \(C_p\) = agent proposal/planning time,
  • \(C_a\) = patch-apply and bookkeeping time,
  • \(C_r\) = runtime of the actual experiment plus evaluation,
  • \(B\) = number of parallel workers.

For the default sequential loop, wall-clock complexity is approximately

\[ T_{\text{seq}} \approx N(C_p + C_a + C_r). \]

In practice, \(C_r\) dominates when experiments take minutes, which is exactly why Karpathy’s loop is productive: the agent’s planning overhead is small relative to the experiment runtime.
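
The public numbers line up with this: at roughly 300 s per experiment and small \(C_p + C_a\), the 126-experiment session from Discussion #43 implies

\[ T_{\text{seq}} \approx 126 \times 300\,\text{s} = 37{,}800\,\text{s} \approx 10.5\ \text{hours}, \]

which matches the reported session length of about 10.5 hours.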

Memory usage for the single-branch version is modest on the orchestration side:

\[ M_{\text{seq}} = O(|c| + |\mathcal H|) + M_{\text{runtime}}, \]

where \(|c|\) is the mutable codebase footprint, \(|\mathcal H|\) is log/history size, and \(M_{\text{runtime}}\) is dominated by the model/training process. In the basic design, orchestration memory is tiny compared with GPU memory.

If you parallelize to \(B\) independent workers with good coordination, idealized wall-clock drops to roughly

\[ T_{\text{par}} \approx \frac{N}{B}(C_p + C_a + C_r) + C_{\text{coord}}, \]

with total compute cost still scaling roughly linearly in \(N\). Public collaborative and cluster examples show that parallelism changes not only wall-clock but also search behavior, because it allows you to test combinations in waves rather than serially.

The important takeaway is that autoresearch is attractive exactly when evaluation is expensive enough that automation matters, but cheap enough that you can afford many iterations. If one experiment takes five minutes, you can do about a hundred overnight. If one experiment takes two days, you are in a very different regime and probably need more structured search, better priors, or heavier parallelism.

In one sentence: Karpathy’s public autoresearch is best understood as an autoregressive coding agent wrapped around a keep-or-revert experiment loop over executable states. It is rigorous enough to analyze as a search algorithm, practical enough to ship as a tiny repo, and different enough from standard decoding methods that it deserves to be thought of as its own pattern—provided you remember that the real hero is not the language model alone, but the frozen evaluator plus repeated external feedback.