RES FUTURAS

A very high level introduction to LLMs

I’ve tried to write a complete article about LLMs from beginning to end, but after attempting it like 4 times, I kept finding myself constantly digressing and getting lost between the high-level flow and the technical details. It just kept turning into a massive blob of information that’s really hard to follow.

So I’ve decided to make this like a small (or big?) series of posts. Let’s go.

Big Picture

Watch this video; it explains the key concepts better than I’ll ever be able to explain:

While the video is specifically about transformers, that’s the heart of the matter anyway. If you understand transformers plus tokenization and the training logic, you are golden.

Inference in LLMs

LLMs work by predicting the next token. For the sake of simplicity I’ll use “word” instead of token¹, because it’s simpler to follow.

They are called glorified auto-complete machines by AI naysayers, which is technically correct, just like calling music a bunch of vibrations is technically correct. Having said that, that is exactly how they work: predicting the next word.

Demystifying Embeddings and High-Dimensional Vectors

Embedding is a fancy word for a simple concept.

Think of 2D vectors that represent a point (x, y) on a grid, like a chessboard. You’d say the Knight is on b8, which as a 2D vector is (2, 8): b is the 2nd column, 8 is the row. You know what that means: the Knight piece is on column 2 of row 8. A 2D vector gives you the information about where something is.

If you were to calculate how close the Knight at (2, 8) is to the Pawn at (3, 8) (c is column 3), that’s easy to do: they are right next to each other, same row, different column.

Think of 3D vectors that represent an object in 3D space by adding another dimension, z: (x, y, z). With z we can now also represent the depth of an object.

Now think of 16,384 of these dimensions, and what you can represent with them!!!

This is it: in LLMs, words are represented with these “high-dimensional” vectors, so we can relate them to things like meaning, category, sentiment, etc. An extremely simple example of this is the Word2Vec approach. Here is a demo of Word2Vec; go play with it to see how embeddings let us associate words with each other.

To summarize: embeddings are just learned high-dimensional vectors that represent an object. At the code level, they are just arrays.
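Since embeddings are just arrays, you can play with the idea in a few lines. A toy sketch (the vectors here are hand-picked for illustration; real embeddings have thousands of learned dimensions):

```python
import math

# Toy 4-dimensional "embeddings". Real models learn these values during
# training; these are made up by hand just to show the mechanics.
embeddings = {
    "king":  [0.9, 0.8, 0.1, 0.2],
    "queen": [0.9, 0.7, 0.2, 0.9],
    "apple": [0.1, 0.2, 0.9, 0.3],
}

def cosine_similarity(a, b):
    """How aligned two vectors are: ~1.0 = very related, ~0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low
```

This distance-between-vectors idea is exactly what the Word2Vec demo above lets you explore interactively.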

I’m using the word “learned”. We’ll explain how they are learned in the training section.

What’s truly confusing about high-dimensional vectors is that they are impossible to truly visualize, which makes them very hard to grasp.

How LLMs predict the next word

Again, I’m simplifying things for the sake of clarity, but this is the essence:

  1. Convert the user’s input into embeddings (high-dimensional vectors)
  2. Pass the embeddings through many transformer layers (attention + neural networks)
  3. The model outputs probability scores for each candidate next token
  4. Pick one, convert it back to a word
  5. Append it to the sentence and repeat.

Example: I'm going home to might predict the next word as watch. We take that as input and repeat the process: I'm going home to watch predicts TV. Repeat: and. Repeat: eat. Repeat: dinner.
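That loop can be sketched with the transformer replaced by a hand-written lookup table. The table is obviously fake; the predict-pick-append-repeat structure is the real part:

```python
# A toy stand-in for the whole transformer: here "the model" is just a
# hand-written lookup table from context to next-word scores. The loop
# structure, though, is the real one: predict, pick, append, repeat.
toy_model = {
    "I'm going home to":          {"watch": 0.6, "sleep": 0.4},
    "I'm going home to watch":    {"TV": 0.7, "movies": 0.3},
    "I'm going home to watch TV": {"and": 0.9, ".": 0.1},
}

def generate(context, steps):
    for _ in range(steps):
        scores = toy_model.get(context)
        if scores is None:                       # ran off the end of our toy "model"
            break
        next_word = max(scores, key=scores.get)  # greedy: take the top score
        context = context + " " + next_word      # append it and feed it back in
    return context

print(generate("I'm going home to", 3))  # I'm going home to watch TV and
```

Real models don’t always take the top score; sampling settings like temperature let lower-scored words win sometimes, which is why you get different answers on different runs.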

This is kind of a draw the rest of the fucking owl explanation, I know. Let’s talk about how these transformers know what to do: how do we learn these “learned weights”?

Training in LLMs

Again, conceptually just assume word == token for the sake of simplicity.

This training loop is theoretically super simple:

  1. Feed the transformer some data and let it predict the next token. The transformer does its magic: lots of relevance operations (dot products / matrix multiplications) using its current weights.
  2. Check how well it predicted the next token.
  3. Score the quality of that prediction (loss). We give the model a sentence where we already know the next token. The model only sees the training tokens; it predicts, we check it against the next token we already know, and score how close the model’s probabilities are to the actual answer.
  4. Adjust all the weights² up and down (optimizer) until the quality of the prediction gets better.
  5. Rinse and repeat until your loss score improves.

Training == inference + accuracy detection (loss) + optimization
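Here is a deliberately tiny sketch of that loop in plain Python: a one-layer “model” (just a weight matrix) learning to predict the next word of a five-word corpus, with softmax, cross-entropy loss, and plain gradient descent. Everything is scaled down to nothing, but the predict / score / adjust / repeat shape is the same:

```python
import math
import random

random.seed(0)
vocab = ["I'm", "going", "home", "to", "watch"]
idx = {w: i for i, w in enumerate(vocab)}
# Training data: (current word -> next word) pairs from one tiny "corpus".
pairs = [("I'm", "going"), ("going", "home"), ("home", "to"), ("to", "watch")]

V = len(vocab)
# The weights start out completely random -- the model knows nothing yet.
W = [[random.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(V)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def avg_loss():
    # Cross-entropy: how surprised is the model by the true next word?
    total = 0.0
    for cur, nxt in pairs:
        probs = softmax(W[idx[cur]])
        total += -math.log(probs[idx[nxt]])
    return total / len(pairs)

loss_before = avg_loss()
for _ in range(200):                      # 5. rinse and repeat
    for cur, nxt in pairs:
        probs = softmax(W[idx[cur]])      # 1-2. predict the next token
        grad = probs[:]                   # 3. gradient of loss w.r.t. logits
        grad[idx[nxt]] -= 1.0             #    (softmax probs minus one-hot target)
        for j in range(V):                # 4. optimizer: plain gradient descent
            W[idx[cur]][j] -= 0.5 * grad[j]
loss_after = avg_loss()
print(loss_before, "->", loss_after)      # loss goes down as weights improve
```

Swap the lookup row `W[idx[cur]]` for 80 transformer layers, the four training pairs for trillions of tokens, and gradient descent for a fancier optimizer, and you have the real thing in spirit.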

This is one of those things that shouldn’t have worked this well, but it works like fucking black magic. It works because a simple training loop is repeated at an insane scale. Let’s take the LLaMA 3 70B model as an example:

This model has 70 billion parameters, all of which can be updated on each training step and all of which are used for each token. It has 80 transformer layers (generally Attention and MLP blocks repeated 80 times) and uses 8192 dimensions for its token embeddings.

Even the smallest production models (for general LLM purposes) are trained on something like FineWeb: ~45TB of raw text, on the order of ~10-15 trillion tokens.

What’s been discovered in the LLM space is that it’s all about scaling: bigger everything works better.

You might ask where the initial weights come from when you first start training. Surprisingly, nowhere: they are just random, and you let the model find them. As the model goes through the training process, it gets better and better.

Base Models vs. Instructs

Now, when you have a base model, it is indeed a very good predictor of the next token. However, it doesn’t show assistant-like behaviour. You can ask a base model something and it will keep autocompleting without a clear objective, because it’s not answering your question; it’s just predicting the next token until it stops.

Download a base model in Ollama, LM Studio or whatever you are using and talk to it. You’ll see.

Now we know that models learn by detecting patterns, so how can a base model learn to help a user and follow the user’s instructions? Just teach it those instructions.

That means we now further train the base model (this is called fine-tuning) with actual data that looks like a User/Assistant conversation, for example:

User: What's the capital of France?
Assistant: Paris

but, in the most LLM fashion, we give it hundreds of thousands to millions of examples so it learns how to follow the user’s instructions. The model already knows a lot of information from its original training; this is just adding a new behaviour to it. An example of this kind of dataset is Topical-Chat. While fine-tuning still requires data and time, it’s insignificant compared to the base model’s training and data.
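At the data level, this just means flattening conversations into plain text the model can be trained on, same as any other text. A sketch, with made-up <|user|> / <|assistant|> markers (every model family uses its own chat template; these tags are mine, for illustration only):

```python
# The exact tags differ per model family (each has its own "chat template");
# these <|user|> / <|assistant|> / <|end|> markers are made up for illustration.
def format_chat(turns):
    """Flatten a User/Assistant conversation into one training string."""
    out = []
    for role, text in turns:
        out.append(f"<|{role}|>\n{text}")
    out.append("<|end|>")
    return "\n".join(out)

sample = format_chat([
    ("user", "What's the capital of France?"),
    ("assistant", "Paris"),
])
print(sample)
```

The model is then trained to predict the next token on millions of strings shaped like this, which is how “assistant behaviour” becomes just another pattern it has seen.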

Initially these chat examples were written by humans (SFT), and the prediction results were scored by humans for optimization (RLHF). Now we have a mix of human data and synthetic data created by already-trained LLMs.

Now you know how a model can be specialized in coding or some other task: base model + the behaviour we want to see.

Tool Calling is another great example: train the model to output a specific syntax (i.e. <T_C>cat code.py</T_C>). You just need to train with data that shows the behaviour.
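On the receiving side, the runtime simply scans the model’s output for that trained syntax. A sketch using the made-up <T_C> tags from the example above (real models each have their own tool-call format and tokens):

```python
import re

# The <T_C>...</T_C> syntax mirrors the made-up example in the text; real
# model runtimes each define their own tool-calling tokens and format.
TOOL_CALL = re.compile(r"<T_C>(.*?)</T_C>", re.DOTALL)

def extract_tool_calls(model_output):
    """Pull tool-call commands out of raw model output text."""
    return TOOL_CALL.findall(model_output)

output = "Let me check the file first. <T_C>cat code.py</T_C>"
print(extract_tool_calls(output))  # ['cat code.py']
```

The runtime would then execute the command, append the result to the context, and call the model again, which is the whole trick behind “agents”.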

Yet another, more complicated problem that uses a similar approach is the ability to reply I don't know to questions the model isn’t confident enough to answer, instead of hallucinating. Part of solving that problem is teaching the model to say it by training on that kind of data.

Do you want your model to reject things like Help me make a bomb? Teach the model to refuse such requests.

There is a whole science of fine-tuning, but this is the gist of it: optimize the model toward the behaviour you want to see with a shitload of training. Again, all of these are simplifications, and there are devils and jinn³ that dwell in the details.

You can watch the post-training section of this video to understand more about what’s commonly done after training the base model.

How the fuck does it reason?

This is arguably the most confusing part of LLMs. You can see how they predict the next token; we’ve been using simpler versions of this for a long time. Think of the “auto-complete” suggestions on your phone: some of them have used small NNs (neural networks) for years now. OCR has used NNs for many years as well.

However, reasoning is very confusing: how can an “auto-complete machine” reason? There are several reasons why reasoning emerges from these models.

One reason is what we’ve discussed in the previous section: being trained on data that shows logic and reasoning. Data such as:

  • Step-by-step solutions
  • Proofs
  • Arguments
  • Explanations

The reasoning is structured prediction: they predict the reasoning. This is why they can be absolute idiots at deterministic things like “When’s next Wednesday?”; it’s a very confusing thing for an LLM to reason about. Give a small model the current date and day, then ask this question, and it’ll fail to answer consistently.

Transformers

The biggest breakthrough in recent years was the Attention is all you need paper. This paper introduced scalable contextual attention, which is used by pretty much all models.

This allows models to make sense of the positional context tokens are in, e.g. “dog bites man” vs. “man bites dog”. Without positional context the model cannot tell those two sentences apart.

This is called the attention block in a transformer.
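The core computation of an attention block, scaled dot-product attention, is small enough to write out in plain Python. This is a single head with no learned projection matrices, so a bare-bones sketch rather than a full transformer layer:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention, the formula from "Attention is all you
    need". Each output vector is a weighted mix of the value vectors, where
    the weights say how relevant every other token is to this one."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # How strongly does this token "attend" to each token in the sequence?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)          # relevance weights, summing to 1
        # Blend the value vectors by those attention weights.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three tokens, each a 4-dim vector. In a real model these come from the
# embeddings plus positional information, which is what lets the model tell
# "dog bites man" from "man bites dog".
x = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0, 0.0]]
out = attention(x, x, x)  # self-attention: the sequence attends to itself
```

In a real transformer the queries, keys and values are learned linear projections of the input, this runs with many heads in parallel, and the result feeds into the MLP block; but the dot-product-then-softmax-then-blend core is exactly this.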

Layers upon layers and billions of neurons

Because there are so many layers that learn, we believe different layers and neurons develop logic-like behaviour.

Earlier layers are about word meaning, middle layers are about context and relevance between tokens, and late layers are about conceptual patterns and logic. Some neurons act like logic gates (true/false), which helps build reasoning-like behaviour.

Because models look at the context and predict the next token, a longer (but still coherent) context leads to better reasoning results. Things like Chain of Thought (i.e. asking it to “think step by step”) work because we keep introducing intermediary steps that help the model predict better, which again looks like reasoning.

Bottom line, pattern recognition at extreme scale looks like reasoning. Though it’s one of those things: if it looks like a duck, walks like a duck and quacks like a duck, then it’s a fucking AI generated video. I rest my case on LLM reasoning, your honour.

Obvious, not so Obvious Stuff

Models are stateless

Every time a model is called, it knows nothing other than the given context. Therefore it doesn’t know what it is unless you keep adding that into its context. Load a model and ask what it is: unless it’s specifically trained on that knowledge, it won’t be able to answer and will probably hallucinate.

The model doesn’t know the date, doesn’t know its name, and doesn’t have any facts after its knowledge cut-off date (the end of its training data).
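This is why every chat client resends the whole conversation on every call; the “memory” lives entirely on the client side. A sketch, with a hypothetical stand-in for the actual model call:

```python
# A model call is a pure function of its input: no memory between calls.
# "Chat" is faked by the client resending the whole history every time.
def fake_model(prompt):
    """Hypothetical stand-in for an LLM call; a real one would hit an API."""
    if "name is Ada" in prompt:
        return "Your name is Ada."
    return "I don't know your name."

history = []

def chat(user_message):
    history.append("User: " + user_message)
    prompt = "\n".join(history)            # the model sees ONLY this text
    reply = fake_model(prompt)
    history.append("Assistant: " + reply)
    return reply

chat("My name is Ada.")
print(chat("What's my name?"))             # works: history was resent
print(fake_model("What's my name?"))       # bare call: the "memory" is gone
```

The first call works only because the client stuffed the earlier turn back into the prompt; called bare, the model has no idea who you are.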

Models are stochastic

Models are pretty bad at deterministic calculation. This is why a lot of LLMs use tool calling for these operations. Ask an LLM to calculate a math operation and it’ll call a tool (like a Python interpreter), literally write the code, execute it, and respond with the result. If it didn’t do that, the answer might include a lot of hallucinations. Remember, it’s just predicting the next token, and deterministic calculations require deterministic reasoning.

Date calculations, math problems, even things like the infamous how many "r"s in strawberry question.
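For comparison, here is what the tool side looks like: a couple of deterministic helpers (my own toy versions, not any particular runtime’s) that answer these questions exactly, every time:

```python
import datetime

# Why tool calling helps: these questions have exact answers that code gets
# right every single time, while a next-token predictor can stumble on them.
def count_letter(word, letter):
    """How many times does a letter appear in a word? e.g. "r" in strawberry."""
    return word.count(letter)

def next_weekday(today, weekday):
    """Next occurrence of a weekday strictly after today (0=Monday...6=Sunday)."""
    days_ahead = (weekday - today.weekday() - 1) % 7 + 1
    return today + datetime.timedelta(days=days_ahead)

print(count_letter("strawberry", "r"))             # 3, every time
print(next_weekday(datetime.date(2024, 5, 6), 2))  # the next Wednesday
```

An LLM that writes and runs this code gets the right answer; an LLM that predicts the answer token by token sometimes doesn’t.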

Where to go from here

I’ve personally gone through the theory, implemented a toy LLM, implemented a more production-grade LLM (one that can be trained on a GPU and can actually predict and answer things), trained LLMs close to real-life training, played around with tokenization, and spent hours on every single line of toy LLMs. I learned a lot, am expert at none, and am still trying to understand the details and find the little devils in them. After all of that, I think this is the best structure to share it:

  1. High level introduction to LLMs (this post)
  2. Resources to Learn
  3. Inner Workings of LLM
  4. Analyzing and implementing a toy LLM in Python
  5. A more production grade LLM written with PyTorch
  6. Basic math behind LLMs

Or… maybe you should just give this article to ChatGPT and ask it to come up with the rest of the series; the only downside of that would be reading an article written by an ANAL prick.


  1. Please do understand that tokens can be anything: pixels, 3D objects, pieces of sound. A token is just something that’s been chunked into pieces and converted into a numerical representation, which is stored as a high-dimensional vector. ↩︎

  2. “All the weights” carries a lot here: optimization might update billions of parameters each run (embeddings, attention weights, FFN/MLP weights, the final lm_head mapping matrix), and it does this for all layers. ↩︎

  3. WTF: Why is “jinn” plural in English??!!! And why am I learning this now? ↩︎