Building ChatGPT and Adding My Own Twist

When I started this project, I wanted to challenge myself by rebuilding GPT-2 piece by piece. The plan escalated quickly from writing micrograd, a tiny autograd engine, to coding a GPT-2 style transformer, training it on slices of WikiText and C4, and finally deploying everything into a Streamlit app that could answer questions privately. I even added RAG (retrieval augmented generation) so the chatbot could use my uploaded notes. None of it was easy, but each step taught me something crucial about how modern AI systems are actually built.

Micrograd Was Supposed to Be Small

Micrograd looks simple—just a few dozen lines—but it forced me to think differently about math and computation graphs.

class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""

    def __init__(self, data, children=()):
        self.data = data                # the number itself
        self.grad = 0                   # d(loss)/d(this value), filled in by backprop
        self._backward = lambda: None   # set by each op to push grad into its parents
        self._prev = set(children)      # the Values this one was computed from

  • Graph thinking: every computation is a node in a graph that remembers how it was created.
  • Backpropagation: calling .backward() walks the graph in reverse topological order and accumulates gradients.
  • System over calculator: a few careful abstractions beat a pile of ad-hoc math.

You can think of each Value as a spreadsheet cell that stores both its number and the formula that produced it. Each operation wires in a tiny _backward function so gradients can be pushed into the parents when you run backprop.
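
Here's a minimal, self-contained sketch of how that wiring could look. It follows the spirit of micrograd rather than reproducing it exactly: __init__ is repeated from above, only multiplication is implemented, and the real version also records which op produced each node and supports many more operators.

class Value:
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0
        self._backward = lambda: None
        self._prev = set(children)

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, children=(self, other))

        def _backward():
            # chain rule: d(out)/d(self) = other.data, d(out)/d(other) = self.data
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # run each node's _backward in reverse topological order
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1
        for node in reversed(topo):
            node._backward()

a, b = Value(2.0), Value(3.0)
c = a * b
c.backward()
print(a.grad, b.grad)  # 3.0 2.0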

GPT-2 Was Another Level

Micrograd set the stage, but GPT-2 dropped me into deep water. The architecture seems tidy on paper, yet the implementation details are brutal.

Attention Was the Hardest Part

import torch
import torch.nn as nn

class Head(nn.Module):
    def __init__(self, n_embd, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                   # (B, T, head_size)
        q = self.query(x)                                 # (B, T, head_size)
        # scale by sqrt(head_size) so the softmax doesn't saturate
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        # causal mask: each token may only attend to itself and earlier positions
        mask = torch.tril(torch.ones(T, T, device=x.device, dtype=torch.bool))
        wei = wei.masked_fill(~mask, float("-inf"))
        wei = torch.softmax(wei, dim=-1)
        v = self.value(x)                                 # (B, T, head_size)
        return wei @ v                                    # (B, T, head_size)

  • Queries, keys, values: queries are what a token wants, keys are what it offers, values are the payload.
  • Shapes matter: get batch, time, or channels wrong once and the model crashes.
  • Multi-head attention: multiple heads learn different relationships in parallel (a sketch follows below).

The head splits the input three ways, compares queries with keys to score relevance, masks out future positions so each token can only look backward, and then mixes the values using those weights. It rewrites each token with context from the tokens that came before it in the sequence.
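
Multi-head attention then just runs several of these heads side by side and merges the results. Here's a hedged sketch of that wrapper, following the standard pattern rather than my exact module:

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Run several attention heads in parallel and merge their outputs."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList([Head(n_embd, head_size) for _ in range(n_head)])
        self.proj = nn.Linear(n_embd, n_embd)  # mix the concatenated head outputs

    def forward(self, x):
        # each head returns (B, T, head_size); concatenating restores (B, T, n_embd)
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.proj(out)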

Positional Embeddings

Transformers are permutation-invariant unless you give them a sense of order. Positional embeddings are how tokens learn the difference between “the dog bit the man” and “the man bit the dog.” They’re just learned vectors that get added to token embeddings, but without them the model can’t reason about sequence.
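
In code, that's just a second embedding table indexed by position. This is a simplified sketch of the pattern (learned absolute positions, as GPT-2 uses), not my exact module:

import torch
import torch.nn as nn

class TokenAndPositionEmbedding(nn.Module):
    def __init__(self, vocab_size, block_size, n_embd):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, n_embd)  # what the token is
        self.pos_emb = nn.Embedding(block_size, n_embd)    # where it sits in the window

    def forward(self, idx):
        T = idx.shape[1]
        positions = torch.arange(T, device=idx.device)      # 0, 1, ..., T-1
        # broadcasting adds the same positional vectors to every sequence in the batch
        return self.token_emb(idx) + self.pos_emb(positions)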

Training Was a Grind

for step in range(max_iters):
    xb, yb = get_batch("train")
    logits, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % eval_interval == 0:
        val_loss = estimate_loss("val")
        print(step, loss.item(), val_loss)

  • Watching loss fall feels magical, even when it’s just math.
  • Hyperparameters (LR, batch size, dropout) make or break training.
  • Sometimes a single GPU and scrappy tooling beat wrestling with TPU clusters.

Each iteration grabs a batch, runs the model, backprops, and applies an optimizer step. Every eval_interval, I check validation loss to make sure the model learns something real instead of overfitting.
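
For reference, estimate_loss can be as simple as averaging the loss over a handful of held-out batches. This is a sketch under the same assumptions as the loop above (a get_batch helper and a model that returns logits and loss); the batch count is arbitrary:

import torch

@torch.no_grad()
def estimate_loss(split, eval_batches=50):
    model.eval()                       # turn off dropout for a cleaner estimate
    losses = torch.zeros(eval_batches)
    for i in range(eval_batches):
        xb, yb = get_batch(split)
        _, loss = model(xb, yb)
        losses[i] = loss.item()
    model.train()                      # back to training mode
    return losses.mean().item()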

From Model to App: LittleGPT

The end result is a Streamlit app that loads a small HF model locally (I default to Qwen 0.6B) with device and precision controls. It supports chatting, grounding answers in your uploaded notes via embeddings + FAISS RAG, quick LoRA fine-tuning, and basic evaluation. Streamlit’s @st.cache_resource keeps model loading snappy.
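
The caching piece looks roughly like this; the model id handling, dtype choice, and function name are placeholders rather than LittleGPT's exact code:

import streamlit as st
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@st.cache_resource
def load_model(model_id: str, device: str):
    # cached across Streamlit reruns, so switching pages doesn't reload the weights
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    dtype = torch.float16 if device in ("cuda", "mps") else torch.float32
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)
    return tokenizer, model.to(device).eval()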

The Heartbeat: Generation

import torch


def generate(
    model,
    tokenizer,
    prompt: str,
    max_new_tokens: int = 128,
    temperature: float = 0.0,
    top_p: float = 0.9,
) -> str:
    """Generate text from a prompt using lightweight decoding defaults."""
    device = next(model.parameters()).device
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        temp = max(float(temperature), 0.0)
        do_sample = temp > 0
        sampling_temp = max(temp, 1e-5) if do_sample else 1.0
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=sampling_temp,
            top_p=min(max(float(top_p), 0.1), 1.0) if do_sample else 1.0,
            do_sample=do_sample,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    generated = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    if generated.startswith(prompt):
        return generated[len(prompt) :].strip()
    return generated.strip()

It tokenizes the prompt, moves it to the right device, runs model.generate, and decodes the new tokens. Temperature toggles between deterministic and sampled decoding, while top_p keeps sampling inside the most likely nucleus. When the decoded string starts with the original prompt, I strip that prefix before returning the continuation.

The LittleGPT CLI

Running python -m littlegpt.cli --help shows the whole interface:

usage: littlegpt.cli [-h] --prompt PROMPT [--model M] [--device cpu|mps|cuda]
                     [--max_new_tokens N] [--temperature T] [--top_p P]
                     [--rag_index PATH] [--top_k K] [--show_sources]
                     [--seed 42]

Generate text locally with LittleGPT.

optional arguments:
  -h, --help            show this help message and exit
  --prompt PROMPT       Text to continue
  --model M             HF model id or local path (default: Qwen-0.5B)
  --device DEV          cpu|mps|cuda (auto-detects)
  --max_new_tokens N    Max tokens to generate (default: 128)
  --temperature T       0 = deterministic
  --top_p P             nucleus sampling (default: 0.9)
  --rag_index PATH      Optional FAISS index dir
  --top_k K             Retrieval chunks to include (default: 3)
  --show_sources        Print sources under the answer
  --seed S              RNG seed for reproducibility

Example commands:

python -m littlegpt.cli --prompt "Explain positional embeddings in two sentences." --temperature 0
python -m littlegpt.cli --prompt "Summarize my notes on ecosystems." \
  --rag_index ~/.littlegpt/index --top_k 4 --show_sources
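
Under the hood, --rag_index points at a directory holding a FAISS index built from embedded note chunks. Retrieval then looks something like this sketch; the file names, embedding model, and function signature are illustrative assumptions, not the app's actual layout:

import json
from pathlib import Path

import faiss
from sentence_transformers import SentenceTransformer

def retrieve(query: str, index_dir: str, top_k: int = 3):
    """Embed the query and return the top_k most similar note chunks."""
    base = Path(index_dir).expanduser()
    index = faiss.read_index(str(base / "notes.faiss"))
    chunks = json.loads((base / "chunks.json").read_text())
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    q = embedder.encode([query], normalize_embeddings=True)
    # inner-product search on normalized vectors is equivalent to cosine similarity
    _, ids = index.search(q, top_k)
    return [chunks[i] for i in ids[0]]

The retrieved chunks get stitched into the prompt as context before generation; --top_k controls how many chunks are pulled in and --show_sources prints where they came from.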

I spent real time on UX details: clear error messages (“CUDA not found, falling back to CPU”), stripping small talk from responses, showing latency and token counts, and pre-warming caches so the app feels instant.

How to Run It

python -m venv .venv && source .venv/bin/activate
pip install -r littlegpt/requirements.txt
streamlit run littlegpt/app.py

Run pytest -q for a quick smoke check.

To expose the app remotely while testing:

streamlit run app.py --server.address 0.0.0.0 --server.port 7860

Deploying to Hugging Face Spaces is straightforward:

huggingface-cli repo create <org>/<space-name> --type space --space_sdk streamlit

Create the Space, push the repo contents (requirements.txt, app.py, and pages/) to it, and Spaces will build and run the app online.

What I Learned

  • Hardware choices matter. Switching between CPU, MPS, CUDA, and different quantization levels completely changes latency and UX.
  • Small models plus retrieval punch above their weight. A 600M-parameter checkpoint becomes shockingly capable when you give it relevant context.
  • Caching makes or breaks polish. Streamlit’s @st.cache_resource plus a pre-warm hook kept the app responsive.
  • Sweat the UX details. Prompt templates, concise responses, latency counters, and friendly error messages made the app usable.

The biggest lesson was to slow down and master each layer before stacking the next. Micrograd taught me the math, GPT-2 hammered in architecture details, and shipping LittleGPT taught me to care about user experience as much as model accuracy.


Originally published on ConnorK.