What Micrograd Taught Me That PyTorch Hides

Every PyTorch tutorial starts the same way: define a model, compute a loss, call loss.backward(), step the optimizer. It works. You get gradients. The model learns. But for months I had no idea what backward() actually did — it was a magic incantation that made the numbers go down.

Then I built micrograd, a tiny autograd engine, from scratch. It's maybe 100 lines of Python. And those 100 lines rewired how I think about every neural network I've touched since.

The Thing Nobody Tells You About Gradients

Here's the core of micrograd — a Value class that wraps a number:

class Value:
    def __init__(self, data, children=()):
        self.data = data               # the scalar this node holds
        self.grad = 0.0                # d(output)/d(this node), filled in by backward()
        self._backward = lambda: None  # how to push this node's grad to its parents
        self._prev = set(children)     # the Values this one was computed from

Four attributes. That's it. But the design is doing something subtle: every Value stores not just its number, but how it was created. When you write c = a + b, the resulting Value remembers that a and b are its parents and wires in a _backward function that knows how to push gradients back through addition.
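As a sketch of what that wiring looks like (simplified from micrograd's real `__add__`, which also accepts plain numbers), addition creates the child node, records both parents, and closes over all three to define the gradient step:

```python
class Value:
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(children)

    def __add__(self, other):
        # the new node remembers where it came from
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a+b)/da = d(a+b)/db = 1, so the output's gradient
            # flows to each parent unchanged
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out
```

So `c = a + b` gives you a `c` whose `_prev` is `{a, b}` and whose `_backward` knows the local derivative of addition. Every other operation follows the same template with a different derivative.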

This is the part PyTorch hides from you. When you call loss.backward() in a real framework, it's doing exactly this — walking a graph of operations in reverse and accumulating gradients at each node. But the graph is invisible. You never see it. You just trust that it works.

Building micrograd made the graph visible. Every multiplication, every addition, every tanh — each one creates a node, connects to its parents, and carries a tiny function that says "here's how my gradient flows backward." The computation graph isn't an abstract concept you read about in a textbook. It's a real data structure you can print, traverse, and break.

What Backward Actually Does

Here's the backward pass:

def backward(self):
    # topologically sort the graph: every node appears after its parents
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    # seed d(loss)/d(loss) = 1, then run each node's chain-rule step in reverse
    self.grad = 1.0
    for v in reversed(topo):
        v._backward()

Two things surprised me here:

Topological sort is the entire algorithm. Backpropagation isn't some exotic technique — it's "visit every node in reverse creation order and run its gradient function." That's it. The chain rule from calculus, which sounds intimidating, is just: each node multiplies the incoming gradient by its local derivative and passes it to its parents. The topological sort makes sure you process nodes in the right order so gradients are ready when you need them.

Gradients accumulate, they don't replace. This tripped me up for a while. When a value is used in multiple operations — say x appears in both x * w1 and x * w2 — gradients from both paths add together at x. In micrograd you see this directly: self.grad += local_gradient. In PyTorch, this is why you need optimizer.zero_grad() before each step. Without it, gradients from the previous batch pile up on top of the current ones. I never understood why until I built the system that makes it necessary.
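Putting the pieces together (a minimal sketch in the spirit of the snippets above, not the full micrograd source), you can watch both surprises happen on the `x * w1 + x * w2` example:

```python
class Value:
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(children)

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a*b)/da = b: accumulate with +=, never overwrite
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# x feeds two multiplications; its gradient is the sum of both paths
x, w1, w2 = Value(3.0), Value(2.0), Value(5.0)
y = x * w1 + x * w2
y.backward()
# dy/dx = w1 + w2 = 7.0 -- contributions from each path add at x
```

Run `backward()` a second time without resetting the grads to zero and `x.grad` doubles to 14.0. That is exactly the bug `optimizer.zero_grad()` exists to prevent.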

The Spreadsheet Analogy That Actually Works

The best mental model I found: think of each Value as a spreadsheet cell. The cell stores a number (forward pass) and a formula showing where that number came from. When you change an input cell, the spreadsheet knows which downstream cells to recalculate — that's the forward pass. Backpropagation is the reverse: starting from the output cell, ask "how much would the output change if I tweaked each input cell slightly?" and propagate that sensitivity backward through the formulas.

PyTorch is a spreadsheet with a billion cells where you can't see the formulas. Micrograd is a spreadsheet with twelve cells where every formula is visible. The mechanics are identical.

Three Things I Didn't Understand Before Building It

1. The graph is built during the forward pass, not before it. I assumed the computation graph was something you defined upfront, like a blueprint. It's not. The graph gets built as you run your code. Every +, *, and activation function adds nodes and edges in real time. This is what "dynamic computation graph" means in PyTorch, and why you can use normal Python control flow (if/else, loops) inside your model — the graph just records whatever actually executes.
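To make that concrete, here is a sketch (reusing a stripped-down `Value`, an illustrative assumption rather than micrograd's exact code) where the recorded graph, and therefore the gradient, depends on which branch actually ran:

```python
class Value:
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(children)

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

def f(x):
    # plain Python control flow: the graph records whichever branch executes
    if x.data > 0:
        return x * x   # this run records a multiply node
    return x + x       # this run records an add node

pos = Value(3.0)
f(pos).backward()      # recorded x*x, so dy/dx = 2x = 6

neg = Value(-3.0)
f(neg).backward()      # recorded x+x, so dy/dx = 2
```

There is no blueprint anywhere: `f` is just a Python function, and each call builds a fresh graph from whatever operations it happened to execute.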

2. Autograd has nothing to do with neural networks specifically. Micrograd doesn't know what a neuron is. It doesn't know what a layer is. It just tracks operations on numbers and computes derivatives. Neural networks are one application of automatic differentiation, not the reason it exists. Once I understood this, I stopped thinking of backpropagation as "a neural network thing" and started seeing it as "calculus done by a computer."

3. The framework is doing less magic than I thought. Before micrograd, loss.backward() felt like it was solving a hard problem. After micrograd, I realized the hard part is just bookkeeping — recording what happened during the forward pass so you can replay it in reverse. The math at each node is usually trivial (the derivative of multiplication is just the other operand). The engineering is in building the graph correctly and traversing it efficiently. PyTorch's complexity comes from doing this fast on GPUs with batched tensors, not from the core algorithm being complex.

Why This Matters Beyond the Exercise

When I moved on to building GPT-2, I hit bugs that would have been completely opaque without this foundation. Shapes going wrong in multi-head attention, gradient explosions during training, loss curves that plateaued for no obvious reason — in every case, being able to think about what the computation graph looked like helped me reason about where the problem was.

If you're using PyTorch and backward() is still a black box, build micrograd. Not because the exercise is impressive — it's only 100 lines. Because those 100 lines will change what you see when you read framework code. The gradient isn't magic. It's bookkeeping. And once you've done the bookkeeping yourself, you never forget how it works.


The micrograd implementation I built follows Andrej Karpathy's approach. The code above is simplified for clarity.