Interacting with an AI feels like straight out of a science-fiction movie: a computer responds to my questions like a human, generates code, draws pictures, and so much more. It becomes even more fascinating when you realize that all of it is "just" math. There is no human-like understanding of language, no thinking, but it sure looks like it. So how does it work?
### 1. Tokenization: Breaking Down Language
First, the Large Language Model (LLM) breaks my question down into smaller units called tokens. These are small chunks of text: words, parts of words, or even individual characters. For example:
`"How does AI understand language?" → ["How", "does", "AI", "understand", "language", "?"]`
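A toy version of this step can be sketched in a few lines of Python. Real tokenizers use learned subword vocabularies (e.g., byte-pair encoding), but a simple regex split already shows the idea:

```python
import re

def tokenize(text):
    # Toy tokenizer: split into runs of word characters,
    # keeping each punctuation mark as its own token.
    # Real LLM tokenizers use learned subword vocabularies instead.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("How does AI understand language?")
# → ['How', 'does', 'AI', 'understand', 'language', '?']
```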
### 2. Embedding: Turning Words into Numbers
After that, each token is translated into a vector, which is basically a list of numbers. Those vectors can become very complex, containing hundreds of dimensions. It's the internal "language" of AIs. A vector is often called an embedding - it embeds a symbolic entity (like a word) into a mathematical space. The AI can perform mathematical operations on these vectors, which is the foundation for the machine learning algorithms used in LLMs.
A simplified visualization could look like this:
`"neural" → [0.6, -0.2, 0.4, -0.1, ...]`
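Conceptually, the embedding step is just a lookup: each token in the vocabulary maps to its vector. A minimal sketch, with hypothetical 4-dimensional values (real models learn hundreds or thousands of dimensions during training):

```python
# Hypothetical embedding table with made-up values for illustration.
embeddings = {
    "neural": [0.6, -0.2, 0.4, -0.1],
    "network": [0.5, -0.1, 0.3, 0.0],
}

def embed(token):
    # Look up the vector for a token; real models store these
    # as rows of a large learned matrix.
    return embeddings[token]
```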
**Understanding the Meaning of Language: A Simple Example**
The following is an extremely simplified example of tokens being mapped to vectors. Modern AIs don't use static vectors, but the example still conveys a basic understanding of how LLMs work.
- “King” → `[0.8, 0.1, 0.7]`
- “Queen” → `[0.78, 0.12, 0.72]`
- “Apple” → `[0.2, 0.9, 0.3]`
Notice that the words "King" and "Queen" have similar vectors, whereas "Apple" is quite different. This is how LLMs derive meaning from our language - it's all based on (large) vectors.
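The similarity between two vectors can be measured with cosine similarity, one of the standard metrics for comparing embeddings. Using the toy vectors above:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors:
    # 1.0 means identical direction, 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

king = [0.8, 0.1, 0.7]
queen = [0.78, 0.12, 0.72]
apple = [0.2, 0.9, 0.3]

# "King" and "Queen" point in nearly the same direction;
# "Apple" clearly does not.
```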
Modern LLMs don't use static vectors though - the same word might have multiple different vectors based on its meaning in the specific context.
### 3. Self-Attention: Understanding Context
With the simple example in mind, it's time to look at the heart of modern AI: The "Multi-Head Self-Attention" mechanism. It determines how words relate to each other by generating three different vectors for each word:
- Query vector – represents what a word is looking for in its context ("what information is relevant to this word").
- Key vector – represents what a word has to offer in terms of meaning ("how this word might be relevant to others").
- Value vector – contains the actual information that is passed forward in the network.
Rather than assigning a fixed Query to a specific Key, the model computes attention scores dynamically for all words in the sequence: by mathematically comparing the Query vector of each word against the Key vectors of every other word, the model calculates how closely the words relate to each other. This results in an attention score matrix, which determines how much influence words have on one another.
For example, consider the sentence: "The cat chased the mouse."
- When processing "chased," the model calculates its Query vector and checks how strongly it relates to the Keys of other words.
- The word "cat" has a Key vector that suggests it could be a subject.
- Their similarity score is high, so the model assigns greater attention weight to "cat" when interpreting "chased."
- The Value vector from "cat" (containing subject-related information) is then incorporated into the representation of "chased."
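The steps above can be sketched as scaled dot-product attention for a single head. This is a minimal pure-Python version with toy vectors; real implementations operate on large matrices with learned projection weights:

```python
import math

def softmax(xs):
    # Turn raw scores into weights that sum to 1.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention for a tiny toy sequence.
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Score: how relevant is each word's Key to this word's Query?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Output: weighted sum of the Value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# A query aligned with the first key attends mostly to the first value.
rows = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]],
                 [[1.0, 0.0], [0.0, 1.0]])
```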
The "multi-head" part of the mechanism refers to the fact that several independent attention mechanisms run in parallel, each potentially focusing on different aspects of language (e.g., syntactic relationships between subject and verb, or what a pronoun like "he" or "she" refers to).
As humans, we process language in a very similar way. We just don't consciously think about the process of understanding different parts of a sentence anymore.
### 4. Processing Layers: Deepening Understanding
When I'm talking to an AI, my entire question needs to be understood in mathematical terms, not just each single word. You can think of it as layers upon layers of meaning, represented through vectors. It's actually not that different from how humans learn languages: First, we need to learn individual letters, then we recognize common letter combinations, entire words, phrases and sentences.
Each layer takes the patterns identified by previous layers and looks for more complex combinations and relationships. For example, one layer might identify that "bark" is being used, while a higher layer might use surrounding context to determine if it refers to a tree's bark or a dog's bark.
The LLM might use more than a hundred layers to develop a full understanding of my language in its mathematical representation of it. Each of those layers makes use of the self-attention mechanism described above until there is a full hierarchical understanding from the bottom (words/characters) up to the top (long-range relationships and meaning).
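The layer stacking itself is structurally simple: each layer takes the previous layer's vectors and produces refined ones. A rough sketch (the toy `feed_forward` stands in for a real layer's attention and feed-forward sublayers):

```python
def feed_forward(vec):
    # Toy stand-in for a layer's transformation; real layers apply
    # self-attention plus a learned feed-forward network.
    return [max(0.0, x) for x in vec]  # ReLU-like nonlinearity

def run_layers(token_vectors, num_layers=4):
    # Pass the representations through the stack, layer by layer.
    hidden = token_vectors
    for _ in range(num_layers):
        hidden = [feed_forward(v) for v in hidden]
    return hidden
```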
### 5. Response Generation: One Token at a Time
The LLM finally starts to generate the response by creating one token after another. It picks the most probable token and repeats the process until the answer is fully generated. This is where the massive dataset that was used during model training comes in. All the patterns that were identified during training are now being used to create the output. Some models can also involve external data sources that are connected through a RAG system (see [AI Glossary](AI%20Glossary.md)) or a more productized functionality like the web search in ChatGPT.
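This token-by-token loop can be illustrated with greedy decoding over a hypothetical next-token distribution (the probabilities below are made up; a real model computes them from the entire context, and production systems often sample instead of always taking the maximum):

```python
# Hypothetical next-token "model": maps the last token to a
# probability distribution over a tiny vocabulary.
NEXT_TOKEN_PROBS = {
    "<start>": {"The": 0.6, "A": 0.4},
    "The": {"cat": 0.7, "dog": 0.3},
    "cat": {"sleeps": 0.5, "chased": 0.4, "<end>": 0.1},
    "sleeps": {"<end>": 1.0},
}

def generate(max_tokens=10):
    token = "<start>"
    output = []
    for _ in range(max_tokens):
        probs = NEXT_TOKEN_PROBS[token]
        # Greedy decoding: always pick the most probable next token.
        token = max(probs, key=probs.get)
        if token == "<end>":
            break
        output.append(token)
    return output

# → ['The', 'cat', 'sleeps']
```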
### Summary
1. Tokenization: The question is broken down into smaller units (tokens).
2. Embedding: Each token is translated into a vector, a mathematical representation of each word.
3. Self-Attention: The relationship between single words is analyzed.
4. Processing Layers: The self-attention mechanism is repeated many times on previously identified relationships, creating a more complete understanding of the entire query.
5. Response Generation: The answer is generated one token at a time, selecting the most probable option at each step until completion.