## Inference
Inference is the process by which an AI model makes predictions on new data that wasn’t included in its training.
During inference, the model:
1. Takes in new data (like an image, text, or numbers)
2. Processes it through its learned parameters and architecture
3. Produces an output (like a classification, prediction, or generated content)
For example, when you ask an AI a question, it’s performing inference: using its trained parameters to process your input and generate an appropriate response.
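As a concrete (and heavily simplified) sketch, here is what steps 2 and 3 can look like for a tiny classifier. The weights, labels, and input below are made up for illustration; a real model has millions or billions of learned parameters.

```python
import numpy as np

# Hypothetical "learned" parameters (a real model learns these during training).
weights = np.array([[0.9, -0.4],
                    [-0.2, 0.8],
                    [0.1, 0.3]])          # 3 classes x 2 input features
bias = np.array([0.05, -0.1, 0.0])
classes = ["cat", "dog", "bird"]          # made-up labels for illustration

def infer(features):
    """One forward pass: new data in, classification out."""
    logits = weights @ features + bias            # process input through parameters
    probs = np.exp(logits) / np.exp(logits).sum() # softmax -> probabilities
    return classes[int(np.argmax(probs))], probs

label, probs = infer(np.array([0.7, 0.2]))  # step 1: take in new data
print(label, probs)                         # step 3: produce an output
```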
## Tokens and Vectors
Large Language Models (LLMs) don’t read text like humans—they process it as tokens and vectors. **Tokens** are the building blocks of text, which can be words, subwords, or characters. Each token is converted into a **vector**—a list of numbers that represents its meaning in a high-dimensional space. Similar words have similar vectors, helping AI understand relationships between concepts.
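A toy sketch of this pipeline, assuming an invented four-word vocabulary and 3-dimensional vectors (real models use tens of thousands of tokens and hundreds or thousands of dimensions):

```python
# Toy tokenizer + embedding table; all values are made up for illustration.
vocab = {"the": 0, "king": 1, "queen": 2, "apple": 3}

embeddings = [            # one vector (list of numbers) per token ID
    [0.1, 0.2, 0.1],      # "the"
    [0.8, 0.1, 0.7],      # "king"
    [0.78, 0.12, 0.72],   # "queen"
    [0.2, 0.9, 0.3],      # "apple"
]

def to_vectors(text):
    """Split text into tokens, map tokens to IDs, then IDs to vectors."""
    token_ids = [vocab[word] for word in text.lower().split()]
    return [embeddings[i] for i in token_ids]

print(to_vectors("the king"))  # [[0.1, 0.2, 0.1], [0.8, 0.1, 0.7]]
```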
LLMs are trained on large amounts of text (books, articles, websites). The model starts with **random vectors** for each token and learns their meaning by predicting missing words in sentences.
During training, the model **adjusts the vectors** so that words with similar meanings are mathematically closer. For example:
- “King” → `[0.8, 0.1, 0.7]`
- “Queen” → `[0.78, 0.12, 0.72]`
- “Apple” → `[0.2, 0.9, 0.3]`
Since “King” and “Queen” have similar vectors, the AI understands they are related, while “Apple” is in a different area of the space.
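“Closer” is typically measured with cosine similarity, the cosine of the angle between two vectors. Using the illustrative vectors above:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

king = [0.8, 0.1, 0.7]
queen = [0.78, 0.12, 0.72]
apple = [0.2, 0.9, 0.3]

print(cosine_similarity(king, queen))  # ~0.9995 -> nearly identical direction
print(cosine_similarity(king, apple))  # ~0.44   -> much less related
```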
When you ask an AI a question, it converts your words into vectors, searches for similar meanings, and generates a response based on patterns it has learned—without storing exact text. This is how LLMs can answer questions, complete sentences, or summarize content intelligently. See [How does AI answer my question?](How%20does%20AI%20answer%20my%20question?.md) for more details.
## RAG
RAG stands for Retrieval-Augmented Generation. It is a technique that enhances the performance of generative AI models by incorporating external knowledge retrieval during the response generation process.
The RAG process has three steps (a minimal code sketch follows the list):
1. Retrieval: When the model receives a query, it searches an external knowledge base (e.g., a document database, vector store, or the web) to find relevant information.
2. Augmentation: The retrieved information is then fed into the generative AI model as additional context.
3. Generation: The AI model generates a response based on both its internal knowledge and the retrieved data.
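Here is a self-contained sketch of those three steps. The knowledge base, the `score()` function, and `generate()` are toy placeholders; a real system would use a vector store for retrieval and a real LLM for generation.

```python
# Minimal RAG sketch; everything here is a simplified stand-in.
knowledge_base = [
    "RAG retrieves relevant documents before generating an answer.",
    "Vectors place words with similar meanings close together.",
]

def score(query, doc):
    """Toy relevance score: word overlap (real systems compare vectors)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def generate(prompt):
    """Stand-in for a call to a real generative model."""
    return f"(model response based on prompt: {prompt!r})"

def rag_answer(query, top_k=1):
    # 1. Retrieval: search the knowledge base for relevant information.
    ranked = sorted(knowledge_base, key=lambda d: score(query, d), reverse=True)
    context = "\n".join(ranked[:top_k])
    # 2. Augmentation: feed the retrieved text to the model as extra context.
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    # 3. Generation: the model answers using its knowledge plus the context.
    return generate(prompt)

print(rag_answer("How does RAG generate an answer?"))
```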