Large Language Models (LLMs) are everywhere now — chatbots, copilots, search, coding, writing, and reasoning.
This is my personal LLM cheatsheet: concise notes I use to refresh core concepts, training ideas, and practical techniques without diving back into papers or long courses.
## 1. Core Concepts
### What is an LLM?
An LLM (Large Language Model) is a neural network trained on massive text corpora to predict the next token.
With enough scale, this simple objective unlocks:
- Text generation
- Reasoning
- Translation
- Summarization
- Q&A
By learning to predict the next word across billions of examples, LLMs implicitly learn grammar, facts, reasoning patterns, and even some world knowledge.
### Tokenization
LLMs don’t read text — they read tokens.
Tokenization is how text becomes numeric input. Smaller tokens give flexibility but increase sequence length, while larger tokens are faster but less flexible.
- Text → tokens → numbers
- Tokens can be words, subwords, or characters
- Example: cryptocurrency → crypto + currency
Why it matters:
- Handles rare / new words
- Controls vocabulary size
- Impacts cost (more tokens = more compute)
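A quick way to see tokenization in practice (a minimal sketch, assuming the tiktoken library is installed; any BPE tokenizer behaves similarly):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # a BPE vocabulary used by recent OpenAI models

token_ids = enc.encode("cryptocurrency")
print(token_ids)                             # a short list of integer IDs
print([enc.decode([t]) for t in token_ids])  # the subword pieces those IDs map to
```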
### Context Window
The context window is the model’s working memory. LLMs can only “see” a limited number of tokens at a time. Anything beyond the window is ignored, so long documents may need special handling.
- Measured in tokens (4k, 8k, 32k, 128k…)
- Bigger window = more context, better coherence
- Trade-off: cost & latency
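A minimal sketch of what this means in code; the window size and helper below are illustrative, not from any specific API:

```python
def fit_to_window(token_ids, max_tokens=8192, reserve_for_output=512):
    """Keep only the most recent tokens that fit in the context window."""
    budget = max_tokens - reserve_for_output   # leave room for the reply
    return token_ids[-budget:]                 # oldest tokens fall out of "memory"

prompt_ids = list(range(20_000))               # pretend this is a long document
print(len(fit_to_window(prompt_ids)))          # 7680: everything else is simply not seen
```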
## 2. Transformer Fundamentals
### Attention (The Secret Sauce)
Attention lets the model decide what matters most in a sequence. It allows the model to focus on relevant words regardless of their position, enabling it to capture complex relationships in language.
Instead of reading text left-to-right only, the model:
- Looks at all tokens
- Assigns importance weights
- Builds context dynamically
This is why transformers scale so well.
### Self-Attention Formula (High Level)
For each token, self-attention computes a weighted sum over all tokens in the sequence, where the weights measure how relevant each token is to the current one.
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
- Q: What I’m looking for
- K: What I have
- V: The actual information
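A minimal NumPy sketch of this formula (toy sizes, no masking or batching):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # pairwise relevance scores
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V                              # weighted sum of the values

x = np.random.randn(4, 8)        # 4 tokens, d_k = 8
print(attention(x, x, x).shape)  # self-attention: Q, K, V from the same tokens -> (4, 8)
```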
### Multi-Head Attention
Multi-head attention allows the model to capture multiple perspectives simultaneously, improving understanding of complex sentences.
Each head focuses on different patterns:
- Syntax
- Semantics
- Long-range dependencies
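A rough sketch of the idea, reusing the attention function (and NumPy import) from the snippet above; real implementations also apply learned Q/K/V and output projections per head:

```python
def multi_head_attention(x, n_heads=4):
    """Split the feature dimension into heads, attend per head, concatenate."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    heads = []
    for h in range(n_heads):
        chunk = x[:, h * d_head:(h + 1) * d_head]   # each head sees its own subspace
        heads.append(attention(chunk, chunk, chunk))
    return np.concatenate(heads, axis=-1)           # back to (seq_len, d_model)
```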
### Positional Encoding
Attention alone has no sense of order. Positional encodings give the model information about word order.
Positional encodings inject:
- Token position
- Sequence structure
Without them, word order wouldn’t matter.
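A sketch of the classic sinusoidal encoding from the original Transformer paper (assumes an even d_model):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))  # each dimension gets its own frequency
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe                                    # added to token embeddings before layer 1
```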
## 3. Training Paradigms
### Autoregressive vs Masked Models
**Autoregressive (GPT-style)**
- Predict next token
- Best for generation

**Masked (BERT-style)**
- Predict hidden tokens
- Best for understanding & classification
Autoregressive models excel at producing coherent text, while masked models excel at understanding context for tasks like classification or QA.
### Pretraining Objectives
- Next-token prediction
- Masked Language Modeling (MLM)
- Next Sentence Prediction (NSP) (less common now)
These objectives define what the model learns during pretraining, shaping whether it is better at generating or understanding text.
### Loss Function
Cross-entropy loss measures how well the predicted probability distribution matches the true tokens. Lower loss = better predictions.
Most LLMs optimize cross-entropy loss by:
- Penalizing wrong token predictions
- Encouraging probability mass on correct tokens
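A minimal PyTorch illustration with toy logits (not a real model):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(3, 10)              # scores for 3 positions over a 10-token vocab
targets = torch.tensor([4, 1, 7])        # the "true" next tokens

loss = F.cross_entropy(logits, targets)  # average of -log p(correct token)
print(loss.item())                       # lower = more mass on the right tokens
```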
## 4. Generation Controls (Very Practical)
### Temperature
Controls randomness:
- 0.2 → deterministic
- 0.7–0.9 → balanced
- >1.0 → creative / risky
Temperature scales the probability distribution of the next token. Low temperature = safe predictions, high = more diverse output.
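A minimal sketch of how temperature is applied (plain NumPy, illustrative only):

```python
import numpy as np

def sample_with_temperature(logits, temperature=0.7):
    """Divide logits by T before softmax: T < 1 sharpens, T > 1 flattens."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())        # numerically stable softmax
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

print(sample_with_temperature([2.0, 1.0, 0.1], temperature=0.2))  # almost always 0
```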
### Top-K Sampling
- Sample only from the top K tokens
- Prevents weird low-probability outputs
- Limits the choice to the most likely tokens, reducing nonsensical text while keeping some randomness.
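A minimal NumPy sketch (real decoders apply this per generation step to the model's logits):

```python
import numpy as np

def top_k_sample(logits, k=50):
    """Keep only the k highest-scoring tokens, renormalize, then sample."""
    logits = np.asarray(logits, dtype=float)
    cutoff = np.sort(logits)[-min(k, len(logits))]     # k-th largest score
    masked = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(masked - masked.max())              # masked tokens get probability 0
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)
```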
### Top-P (Nucleus) Sampling
- Sample from the smallest set of tokens whose cumulative probability ≥ p
- More adaptive than Top-K
- Dynamically chooses a subset of probable tokens so that rare but important words can still be selected.
- Best choice for creative tasks
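A minimal NumPy sketch of nucleus sampling:

```python
import numpy as np

def top_p_sample(logits, p=0.9):
    """Sample from the smallest set of tokens whose probabilities sum to >= p."""
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # tokens, most probable first
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, p) + 1]  # the "nucleus"
    nucleus = probs[keep] / probs[keep].sum()    # renormalize inside the nucleus
    return np.random.choice(keep, p=nucleus)
```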
### Beam Search
Beam search explores several paths at once, picking the most probable sequence, which improves fluency but reduces diversity.
- Keeps multiple candidate sequences
- Improves coherence
- Less creative, more “safe”
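A toy sketch of the search loop; next_logprobs is a hypothetical stand-in for a model call returning log-probabilities for the next token:

```python
import math

def beam_search(next_logprobs, start, beam_width=3, steps=5):
    beams = [(0.0, [start])]                   # (cumulative log-prob, token sequence)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for tok, lp in next_logprobs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        # keep only the highest-scoring sequences at each step
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams[0][1]                         # most probable sequence found

fake_model = lambda seq: {"a": math.log(0.6), "b": math.log(0.4)}  # toy distribution
print(beam_search(fake_model, "<s>", beam_width=2, steps=3))
```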
## 5. Fine-Tuning & Efficiency
### Catastrophic Forgetting
When fine-tuning overwrites prior knowledge.
Mitigations:
- Mix old + new data
- Freeze most weights
- Use adapters
Without precautions, fine-tuning can erase the general knowledge learned during pretraining.
### PEFT (Parameter-Efficient Fine-Tuning)
PEFT methods let you adapt huge models to new tasks without retraining everything, saving resources.
Popular techniques:
- LoRA
- QLoRA (LoRA + quantization)
Benefits:
- Lower memory usage
- Cheaper training
- Works on large models with limited hardware
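A minimal LoRA sketch using the Hugging Face peft library (assumes transformers and peft are installed; target_modules names vary per architecture, and "c_attn" is the attention projection in GPT-2):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    target_modules=["c_attn"],  # which layers get adapters (model-specific)
    lora_dropout=0.05,
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```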
### Model Distillation
Train a small model to mimic a large one.
Distillation transfers knowledge from a big model to a smaller one, keeping most capabilities but reducing compute needs.
- Faster
- Cheaper
- Great for edge devices
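A minimal sketch of the standard soft-target distillation loss (Hinton-style; in practice it is mixed with the ordinary cross-entropy on true labels):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Match the teacher's softened output distribution."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence, scaled by T^2 to keep gradient magnitudes comparable
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * T * T
```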
## 6. Retrieval & Reasoning
### RAG (Retrieval-Augmented Generation)
LLMs don’t “know” facts — they predict text.
RAG pipeline:
1. Retrieve documents
2. Rank relevance
3. Generate using retrieved context
This:
- Reduces hallucinations
- Improves factual accuracy
By combining LLMs with external knowledge, RAG grounds outputs in real data rather than relying on model memorization alone.
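A minimal sketch of the pipeline; retriever.search and llm.generate are hypothetical stand-ins for a vector store and a model client:

```python
def rag_answer(question, retriever, llm, k=4):
    docs = retriever.search(question, top_k=k)                 # 1. retrieve
    docs = sorted(docs, key=lambda d: d.score, reverse=True)   # 2. rank
    context = "\n\n".join(d.text for d in docs)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate(prompt)                                # 3. grounded generation
```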
### Chain-of-Thought (CoT)
Encourages step-by-step reasoning.
Useful for:
- Math
- Logic
- Multi-step questions
CoT prompts make the model “think out loud,” improving multi-step reasoning performance.
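In practice this can be as simple as one extra line in the prompt; the trigger sentence below is the classic zero-shot CoT phrasing:

```python
prompt = (
    "Q: A train travels 60 km in 1.5 hours. What is its average speed?\n"
    "A: Let's think step by step."  # nudges the model to write out intermediate steps
)
```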
### Zero-Shot & Few-Shot Learning
- Zero-shot: Just instructions
- Few-shot: Add 2–5 examples
Prompt quality often matters more than model size.
LLMs can perform tasks without explicit training, but giving examples usually boosts accuracy.
## 7. Scaling Tricks
### Mixture of Experts (MoE)
- Many expert sub-models
- Only a few activate per input
Result:
- Massive models
- Lower inference cost
MoE allows enormous models to exist without making every computation expensive by activating only relevant “experts” for each input.
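A toy sketch of top-k routing (illustrative; real MoE layers use full feed-forward experts, not single linear layers, and add load-balancing losses):

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts)   # router scores experts per token
        self.k = k

    def forward(self, x):                           # x: (tokens, d_model)
        weights, idx = self.gate(x).softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                  # only k experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(ToyMoE()(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```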
### Adaptive Softmax
Adaptive softmax speeds up prediction for frequent words while using fewer resources for rare ones.
- Optimizes large vocabularies
- Faster training
- Lower memory usage
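PyTorch ships an implementation as nn.AdaptiveLogSoftmaxWithLoss; a minimal sketch (the cutoff values are illustrative and assume the vocabulary is sorted from most to least frequent):

```python
import torch
import torch.nn as nn

vocab_size, hidden = 50_000, 512
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    hidden, vocab_size, cutoffs=[2_000, 10_000], div_value=4.0
)

hidden_states = torch.randn(32, hidden)         # one vector per target position
targets = torch.randint(0, vocab_size, (32,))
out = adaptive(hidden_states, targets)
print(out.loss)  # rare words go through smaller, cheaper projections
```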
## 8. Challenges & Limitations
LLMs are powerful tools but come with technical, ethical, and operational challenges that must be managed.
- High compute cost
- Bias from training data
- Hallucinations
- Limited interpretability
- Privacy & data leakage risks
## Final Thoughts
This cheatsheet is not exhaustive, but it covers:
- How LLMs work
- How they’re trained
- How to control them
- How to deploy them wisely
Mastering these concepts is often enough to reason effectively about LLM behavior, limitations, and trade-offs in real systems.
