AI Backbone — Complete Study Guide
10 articles covering the fundamental concepts of modern AI
Reading Order
Article Index
| # | Topic | Core Formula | Key Concept |
|---|-------|--------------|-------------|
| 01 | Forward Pass | $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$ | Data flows left → right through layers |
| 02 | Loss Functions | $L = (y - \hat{y})^2$ | Measuring how wrong predictions are |
| 03 | Backpropagation | $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}\cdot\frac{\partial z}{\partial w}$ | Chain rule distributes blame |
| 04 | Gradient Descent | $w \leftarrow w - \alpha\nabla L$ | Walk downhill on the loss landscape |
| 05 | Activation Functions | $\text{ReLU}(z) = \max(0,z)$ | Non-linearity enables complex learning |
| 06 | Embeddings | $\mathbf{e}_i = \mathbf{E}[i]$ | Discrete tokens → dense meaning vectors |
| 07 | Attention & Transformers | $\text{softmax}(\mathbf{QK}^T/\sqrt{d_k})\mathbf{V}$ | Every token attends to every token |
| 08 | RLHF | $r - \beta D_{KL}(\pi_\theta \| \pi_{SFT})$ | Align model with human values |
| 09 | Regularization | $L + \lambda\sum w^2$ | Generalize, don't memorize |
| 10 | Tokenization | $1\text{ token} \approx 0.75\text{ words}$ | Text → numbers the model can process |
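As a quick taste of two formulas from the index (articles 01 and 06), here is a minimal NumPy sketch; all sizes and the token id are made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Article 06: embedding lookup e_i = E[i] turns a discrete token id
# into a dense vector (a row of the embedding matrix E).
vocab_size, d_model = 1000, 16          # hypothetical sizes
E = rng.normal(size=(vocab_size, d_model))
token_id = 42
x = E[token_id]                          # dense vector for token 42

# Article 01: forward pass z = W x + b through one linear layer.
W = rng.normal(size=(4, d_model))
b = np.zeros(4)
z = W @ x + b
assert z.shape == (4,)
```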
Quick-Glance Cheat Sheet
The Training Loop
$$\underbrace{x \to \hat{y}}_{\text{forward pass}} \;\to\; \underbrace{L(\hat{y}, y)}_{\text{loss}} \;\to\; \underbrace{\nabla_W L}_{\text{backprop}} \;\to\; \underbrace{W \leftarrow W - \alpha\nabla_W L}_{\text{gradient descent}}$$
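The four-step loop above can be sketched end to end with toy linear regression; the data, learning rate, and iteration count are illustrative choices, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # toy inputs
true_w = np.array([1.5, -2.0, 0.5])      # weights we hope to recover
y = X @ true_w                           # targets

w = np.zeros(3)
alpha = 0.1                              # learning rate

for _ in range(200):
    y_hat = X @ w                                   # forward pass: x -> y_hat
    loss = np.mean((y_hat - y) ** 2)                # loss: L(y_hat, y)
    grad = 2 * X.T @ (y_hat - y) / len(y)           # backprop: grad_w L
    w -= alpha * grad                               # descent: w <- w - alpha * grad

# after training, w closely approximates true_w
```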
Key Activations
| Function | Formula | Use |
|----------|---------|-----|
| ReLU | $\max(0,z)$ | Hidden layers |
| Sigmoid | $\frac{1}{1+e^{-z}}$ | Binary output |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | Multi-class output |
| GELU | $z\cdot\Phi(z)$ | Transformers |
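These four activations are a few lines each in NumPy. Note that GELU is shown here via the common tanh approximation of $z\cdot\Phi(z)$ rather than the exact normal CDF:

```python
import numpy as np

def relu(z):
    # max(0, z), elementwise
    return np.maximum(0.0, z)

def sigmoid(z):
    # 1 / (1 + e^{-z}), squashes to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # e^{z_i} / sum_j e^{z_j}; subtract the max for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

def gelu(z):
    # tanh approximation of z * Phi(z), widely used in Transformer code
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))
```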
Attention Formula
$$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
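Scaled dot-product attention is a direct transcription of this formula. The sketch below is single-head and unbatched, with made-up matrix sizes:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)    # each query's weights sum to 1
    return weights @ V                    # weighted mix of value vectors

rng = np.random.default_rng(0)
n, d_k = 4, 8                             # hypothetical sequence length and head dim
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out = attention(Q, K, V)
assert out.shape == (n, d_k)
```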
Regularization at a Glance
| Method | Prevents | How |
|--------|----------|-----|
| L2 | Large weights | $+\lambda\sum w^2$ added to loss |
| Dropout | Co-dependency | Random zeroing of activations |
| Early Stopping | Over-training | Monitor validation loss |
| Batch Norm | Covariate shift | Normalize activations |
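The first two rows can be sketched in a few lines. The dropout version here is "inverted" dropout (survivors scaled by $1/(1-p)$ so the expected activation is unchanged); the default `lam` and `p` are illustrative values:

```python
import numpy as np

def l2_penalty(weights, lam=1e-3):
    # lambda * sum(w^2), added to the loss during training
    return lam * sum(np.sum(w ** 2) for w in weights)

def dropout(a, p=0.5, training=True, rng=None):
    # Randomly zero activations with probability p; identity at eval time.
    if not training:
        return a
    rng = rng or np.random.default_rng()
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)   # rescale survivors by 1/(1-p)
```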
Generated from a full conversational deep-dive into AI foundations.
Each article contains: intuition, real-life analogy, math derivations, examples, and diagrams.