A neural network the size of a small country, trained to do one thing: predict the next word. From that humble objective, you get GPT, Claude, Gemini — and most of the AI you use every day.
An LLM is a neural network trained on a huge pile of text. Its job is dead simple: given some words, predict the next word. Repeat that prediction over and over and you get sentences, paragraphs, code, poems, and answers to questions.
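That loop can be sketched with a toy stand-in model. The corpus and the bigram "model" below are invented for illustration; a real LLM replaces the lookup table with a deep neural network over tokens.

```python
# Toy illustration of the core LLM loop: predict the next word, append it,
# repeat. The "model" here is a hand-built bigram table, not a neural network.
from collections import Counter

corpus = "the cat sat on the mat because the cat was tired".split()

# Count which word follows which (a bigram "model" of the corpus).
bigrams = {}
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams.setdefault(prev, Counter())[nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    return bigrams[word].most_common(1)[0][0]

# Generate by repeatedly feeding the prediction back in.
text = ["the"]
for _ in range(4):
    text.append(predict_next(text[-1]))

print(" ".join(text))  # → the cat sat on the
```

Same idea at billion-parameter scale: predict, append, repeat.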
LLMs don't actually see words. They see tokens: chunks that might be a whole word, a piece of a word, or even punctuation. "tokenization", for example, might split into "token", "iz", and "ation". (Real tokenizers like tiktoken use learned vocabularies of roughly 100k tokens.)
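A minimal sketch of subword tokenization, using greedy longest-match against a vocabulary. The tiny vocabulary here is invented for illustration; real BPE-style tokenizers learn theirs from data.

```python
# Greedy longest-match subword tokenizer sketch. The vocabulary is hand-picked
# for this example; real tokenizers learn ~100k entries from a text corpus.
VOCAB = {"un", "believ", "able", "token", "iz", "ation", "s", "!", " "}

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: fall back to single chars
            i += 1
    return tokens

print(tokenize("unbelievable tokenization!"))
# → ['un', 'believ', 'able', ' ', 'token', 'iz', 'ation', '!']
```

Note how rare words split into several tokens while common fragments stay whole; that's why token counts rarely match word counts.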
Once you have tokens, you turn each one into a vector — a list of hundreds or thousands of numbers. This vector is the token's embedding: its position in a high-dimensional "meaning space".
"king" and "queen" land near each other. "king" and "banana" don't. The model learned this purely from context — words that appear near similar words get similar embeddings.
Famously: king − man + woman ≈ queen. Embeddings turn meaning into math, and that's what makes the rest of the model possible.
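The geometry can be shown with toy vectors. The three dimensions (royalty, maleness, fruitness) and every value below are invented for illustration; real embeddings have hundreds of learned dimensions with no human-readable labels.

```python
# Toy embeddings illustrating "meaning as geometry". Dimensions and values
# are invented; real embeddings are learned and much higher-dimensional.
import math

emb = {
    "king":   [0.9, 0.9, 0.0],
    "queen":  [0.9, 0.1, 0.0],
    "man":    [0.1, 0.9, 0.0],
    "woman":  [0.1, 0.1, 0.0],
    "banana": [0.0, 0.4, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 for same direction, near 0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(f"king vs queen:  {cosine(emb['king'], emb['queen']):.2f}")   # 0.78
print(f"king vs banana: {cosine(emb['king'], emb['banana']):.2f}")  # 0.29

# king - man + woman lands nearest to queen.
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
nearest = max((w for w in emb if w != "king"), key=lambda w: cosine(target, emb[w]))
print("king - man + woman ≈", nearest)  # queen
```

Subtracting "man" removes the maleness component; adding "woman" leaves royalty intact, so the result lines up with "queen".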
Every modern LLM is built on the Transformer architecture, introduced by Google in 2017. The magic ingredient: self-attention.
For every token in the input, the model asks: "which other tokens should I pay attention to right now?" When predicting the next word in "The cat sat on the ___", the model needs to look at "cat" and "sat" much more than "the". Attention learns these weights automatically.
Transformers stack dozens of attention layers on top of each other. Each layer refines the representation. By the top, the model has built up a rich understanding of the input — enough to produce the next token with frightening accuracy.
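Self-attention is compact enough to sketch directly. The shapes and the softmax-over-scores mechanism are the real thing; the 4-token, 8-dimensional random values are stand-ins, and the causal mask used by decoder-style LLMs is omitted for brevity.

```python
# Scaled dot-product self-attention in miniature.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (tokens, d_model). Returns one context-mixed vector per token."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how much each token attends to each other
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                       # weighted mix of value vectors

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(4, d))                  # 4 tokens, e.g. "the cat sat on"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): each token now carries context from the others
```

Real LLMs also add a causal mask (so tokens can't attend to the future) and run many such attention "heads" in parallel per layer.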
The context window is the maximum number of tokens an LLM can consider at once: your prompt, the conversation history, and the response all count against it. Past that limit, the oldest tokens get cut, like text scrolling out of a sliding window.
Bigger context = the model can read longer documents, remember longer conversations, and use more tools. Modern models range from a few thousand tokens (small open models) to hundreds of thousands and beyond (Claude, Gemini).
Raw LLMs are weird. They'll happily continue any text — including offensive, useless, or wrong stuff. Three steps turn a raw model into a helpful chatbot.
1. **Pretraining.** Show the model trillions of tokens of internet text. It learns grammar, facts, style, code, all by predicting the next token, over and over.
2. **Supervised fine-tuning.** Show it high-quality examples of helpful conversations. It learns the format and tone you actually want.
3. **Reinforcement learning from human feedback (RLHF).** Humans rate the model's responses. The model gets a reward signal and learns to prefer responses humans like. ChatGPT was the breakout demo of this technique.
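The pretraining objective boils down to cross-entropy loss on the next token. The vocabulary and the "model's" probabilities below are invented for illustration.

```python
# Pretraining in one line: cross-entropy loss on the true next token.
# Vocabulary and predicted probabilities are made up for this sketch.
import math

context = "the cat"   # model sees this...
true_next = "sat"     # ...and should predict this

# Pretend the model output these next-token probabilities for `context`.
probs = {"the": 0.1, "cat": 0.1, "sat": 0.7, "mat": 0.1}

loss = -math.log(probs[true_next])  # low when the model puts mass on the right token
print(f"{loss:.3f}")  # 0.357

# Had the model put only 0.1 on "sat", the loss would be -ln(0.1) ≈ 2.303.
```

Training nudges the weights to shrink this loss across trillions of positions; everything the model "knows" is a side effect of that.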
| Model | Maker | Strengths | Open? |
|---|---|---|---|
| Claude (Opus / Sonnet / Haiku) | Anthropic | Long context, careful reasoning, agents, code | No |
| GPT-4 / GPT-4o | OpenAI | General reasoning, vision, broad ecosystem | No |
| Gemini (Pro / Ultra) | Google | Massive context, multimodal, search integration | No |
| Llama 3 | Meta | Strong open weights, easy to fine-tune | Yes |
| Mistral / Mixtral | Mistral AI | Efficient, sparse mixture-of-experts | Yes |
| DeepSeek | DeepSeek | Strong reasoning, very efficient training | Yes |
Loving LLMs means knowing where they break.
- **Hallucination.** Confidently making things up: citations, statistics, code APIs that don't exist. Always verify.
- **Arithmetic.** They stumble on math, especially multi-step arithmetic. Give them a calculator tool instead.
- **Stale knowledge.** Their knowledge is frozen at the training cutoff. Need fresh info? Use retrieval or search tools.
An LLM by itself just predicts text. Hook it up to tools and a loop, and suddenly it can browse the web, write code, send emails, and finish your TODOs while you sleep.
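The loop itself is small. Everything below is a sketch: `fake_llm` is a hard-coded stand-in for a real model API, and the `TOOL:`/`ANSWER:` protocol and tool set are invented for illustration.

```python
# Minimal agent loop: the model either calls a tool or gives a final answer.
# `fake_llm` is a scripted stand-in for a real LLM; the TOOL:/ANSWER: protocol
# is made up for this sketch.
def calculator(expression):
    return str(eval(expression, {"__builtins__": {}}))  # demo only; never eval untrusted input

TOOLS = {"calculator": calculator}

def fake_llm(transcript):
    # A real LLM decides this from the conversation; here we script it.
    if "result:" not in transcript:
        return "TOOL: calculator 19 * 23"
    return "ANSWER: 19 * 23 = " + transcript.rsplit("result:", 1)[1].strip()

def run_agent(question, max_steps=5):
    transcript = question
    for _ in range(max_steps):
        reply = fake_llm(transcript)
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        _, name, args = reply.split(" ", 2)   # parse "TOOL: calculator 19 * 23"
        transcript += f"\nresult: {TOOLS[name](args)}"
    return "gave up"

print(run_agent("What is 19 * 23?"))  # → 19 * 23 = 437
```

Swap `fake_llm` for a real model API and grow the tool dictionary, and this loop is the skeleton of every agent framework.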