A neural network the size of a small country, trained to do one thing: predict the next word. From that humble objective, you get GPT, Claude, Gemini — and most of the AI you use every day.
An LLM is a neural network trained on a huge pile of text. Its job is dead simple: given some words, predict the next word. Repeat that prediction over and over and you get sentences, paragraphs, code, poems, and answers to questions.
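That loop can be sketched with a toy stand-in model. The corpus and the bigram "model" below are invented for illustration; a real LLM replaces the lookup table with a deep neural network over tokens.

```python
# Toy illustration of the core LLM loop: predict the next word, append it,
# repeat. The "model" here is a hand-built bigram table, not a neural network.
from collections import Counter

corpus = "the cat sat on the mat because the cat was tired".split()

# Count which word follows which (a bigram "model" of the corpus).
bigrams = {}
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams.setdefault(prev, Counter())[nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    return bigrams[word].most_common(1)[0][0]

# Generate by repeatedly feeding the prediction back in.
text = ["the"]
for _ in range(4):
    text.append(predict_next(text[-1]))

print(" ".join(text))  # → the cat sat on the
```

Same idea at billion-parameter scale: predict, append, repeat.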
LLMs don't actually see words. They see tokens: chunks that might be a whole word, a piece of a word, or even punctuation. "tokenization", for example, might split into "token", "iz", and "ation". (Real tokenizers like tiktoken use learned vocabularies of roughly 100k tokens.)
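A minimal sketch of subword tokenization, using greedy longest-match against a vocabulary. The tiny vocabulary here is invented for illustration; real BPE-style tokenizers learn theirs from data.

```python
# Greedy longest-match subword tokenizer sketch. The vocabulary is hand-picked
# for this example; real tokenizers learn ~100k entries from a text corpus.
VOCAB = {"un", "believ", "able", "token", "iz", "ation", "s", "!", " "}

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: fall back to single chars
            i += 1
    return tokens

print(tokenize("unbelievable tokenization!"))
# → ['un', 'believ', 'able', ' ', 'token', 'iz', 'ation', '!']
```

Note how rare words split into several tokens while common fragments stay whole; that's why token counts rarely match word counts.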
Once you have tokens, you turn each one into a vector — a list of hundreds or thousands of numbers. This vector is the token's embedding: its position in a high-dimensional "meaning space".
"king" and "queen" land near each other. "king" and "banana" don't. The model learned this purely from context — words that appear near similar words get similar embeddings.
Famously: king − man + woman ≈ queen. Embeddings turn meaning into math, and that's what makes the rest of the model possible.
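The geometry can be shown with toy vectors. The three dimensions (royalty, maleness, fruitness) and every value below are invented for illustration; real embeddings have hundreds of learned dimensions with no human-readable labels.

```python
# Toy embeddings illustrating "meaning as geometry". Dimensions and values
# are invented; real embeddings are learned and much higher-dimensional.
import math

emb = {
    "king":   [0.9, 0.9, 0.0],
    "queen":  [0.9, 0.1, 0.0],
    "man":    [0.1, 0.9, 0.0],
    "woman":  [0.1, 0.1, 0.0],
    "banana": [0.0, 0.4, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 for same direction, near 0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(f"king vs queen:  {cosine(emb['king'], emb['queen']):.2f}")   # 0.78
print(f"king vs banana: {cosine(emb['king'], emb['banana']):.2f}")  # 0.29

# king - man + woman lands nearest to queen.
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
nearest = max((w for w in emb if w != "king"), key=lambda w: cosine(target, emb[w]))
print("king - man + woman ≈", nearest)  # queen
```

Subtracting "man" removes the maleness component; adding "woman" leaves royalty intact, so the result lines up with "queen".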
Every modern LLM is built on the Transformer architecture, introduced by Google in 2017. The magic ingredient: self-attention.
For every token in the input, the model asks: "which other tokens should I pay attention to right now?" When predicting the next word in "The cat sat on the ___", the model needs to look at "cat" and "sat" much more than "the". Attention learns these weights automatically.
Transformers stack dozens of attention layers on top of each other. Each layer refines the representation. By the top, the model has built up a rich understanding of the input — enough to produce the next token with frightening accuracy.
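Self-attention is compact enough to sketch directly. The shapes and the softmax-over-scores mechanism are the real thing; the 4-token, 8-dimensional random values are stand-ins, and the causal mask used by decoder-style LLMs is omitted for brevity.

```python
# Scaled dot-product self-attention in miniature.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (tokens, d_model). Returns one context-mixed vector per token."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how much each token attends to each other
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                       # weighted mix of value vectors

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(4, d))                  # 4 tokens, e.g. "the cat sat on"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): each token now carries context from the others
```

Real LLMs also add a causal mask (so tokens can't attend to the future) and run many such attention "heads" in parallel per layer.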
The context window is the maximum number of tokens an LLM can consider at once: your prompt, the conversation history, and the response all count against it. Past that limit, the oldest tokens get cut, like text scrolling out of a sliding window.
Bigger context = the model can read longer documents, remember longer conversations, and use more tools. Modern models range from a few thousand tokens (small open models) to hundreds of thousands and beyond (Claude, Gemini).
Raw LLMs are weird. They'll happily continue any text — including offensive, useless, or wrong stuff. Three steps turn a raw model into a helpful chatbot.
1. **Pretraining.** Show the model trillions of tokens of internet text. It learns grammar, facts, style, code, all by predicting the next token, over and over.
2. **Supervised fine-tuning.** Show it high-quality examples of helpful conversations. It learns the format and tone you actually want.
3. **Reinforcement learning from human feedback (RLHF).** Humans rate the model's responses. The model gets a reward signal and learns to prefer responses humans like. ChatGPT was the breakout demo of this technique.
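The pretraining objective boils down to cross-entropy loss on the next token. The vocabulary and the "model's" probabilities below are invented for illustration.

```python
# Pretraining in one line: cross-entropy loss on the true next token.
# Vocabulary and predicted probabilities are made up for this sketch.
import math

context = "the cat"   # model sees this...
true_next = "sat"     # ...and should predict this

# Pretend the model output these next-token probabilities for `context`.
probs = {"the": 0.1, "cat": 0.1, "sat": 0.7, "mat": 0.1}

loss = -math.log(probs[true_next])  # low when the model puts mass on the right token
print(f"{loss:.3f}")  # 0.357

# Had the model put only 0.1 on "sat", the loss would be -ln(0.1) ≈ 2.303.
```

Training nudges the weights to shrink this loss across trillions of positions; everything the model "knows" is a side effect of that.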
| Model | Maker | Strengths | Open? |
|---|---|---|---|
| Claude (Opus / Sonnet / Haiku) | Anthropic | Long context, careful reasoning, agents, code | No |
| GPT-4 / GPT-4o | OpenAI | General reasoning, vision, broad ecosystem | No |
| Gemini (Pro / Ultra) | Google | Massive context, multimodal, search integration | No |
| Llama 3 | Meta | Strong open weights, easy to fine-tune | Yes |
| Mistral / Mixtral | Mistral AI | Efficient, sparse mixture-of-experts | Yes |
| DeepSeek | DeepSeek | Strong reasoning, very efficient training | Yes |
Loving LLMs means knowing where they break.
- **Hallucination.** Confidently making things up: citations, statistics, code APIs that don't exist. Always verify.
- **Arithmetic.** They stumble on math, especially multi-step arithmetic. Give them a calculator tool instead.
- **Stale knowledge.** Their knowledge is frozen at the training cutoff. Need fresh info? Use retrieval or search tools.
An LLM by itself just predicts text. Hook it up to tools and a loop, and suddenly it can browse the web, write code, send emails, and finish your TODOs while you sleep.
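The loop itself is small. Everything below is a sketch: `fake_llm` is a hard-coded stand-in for a real model API, and the `TOOL:`/`ANSWER:` protocol and tool set are invented for illustration.

```python
# Minimal agent loop: the model either calls a tool or gives a final answer.
# `fake_llm` is a scripted stand-in for a real LLM; the TOOL:/ANSWER: protocol
# is made up for this sketch.
def calculator(expression):
    return str(eval(expression, {"__builtins__": {}}))  # demo only; never eval untrusted input

TOOLS = {"calculator": calculator}

def fake_llm(transcript):
    # A real LLM decides this from the conversation; here we script it.
    if "result:" not in transcript:
        return "TOOL: calculator 19 * 23"
    return "ANSWER: 19 * 23 = " + transcript.rsplit("result:", 1)[1].strip()

def run_agent(question, max_steps=5):
    transcript = question
    for _ in range(max_steps):
        reply = fake_llm(transcript)
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        _, name, args = reply.split(" ", 2)   # parse "TOOL: calculator 19 * 23"
        transcript += f"\nresult: {TOOLS[name](args)}"
    return "gave up"

print(run_agent("What is 19 * 23?"))  # → 19 * 23 = 437
```

Swap `fake_llm` for a real model API and grow the tool dictionary, and this loop is the skeleton of every agent framework.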