B = batch size | k = class index
ReLU → He init · tanh/sigmoid → Xavier init
- η = learning rate (hyperparameter)
- Propagate gradients backward layer by layer
- Gradient at layer ℓ depends on layer ℓ+1 (chain rule)
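A minimal sketch of the update rule above, assuming a toy linear model with MSE loss (all names here are illustrative, not from the notes):

```python
import numpy as np

# One gradient-descent step for y_hat = X @ w with MSE loss;
# eta is the learning-rate hyperparameter from the notes.
def gd_step(X, y, w, eta=0.1):
    y_hat = X @ w
    grad = 2 * X.T @ (y_hat - y) / len(y)   # dL/dw via the chain rule
    return w - eta * grad                    # w <- w - eta * grad

X = np.array([[1.0], [2.0]])
y = np.array([2.0, 4.0])        # generated by the true weight w = 2
w = np.zeros(1)
for _ in range(200):
    w = gd_step(X, y, w)
print(w)  # converges toward [2.]
```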
μB, σB² = batch mean & variance | ε = numerical stability
Applied to intermediate layer outputs; stabilizes training.
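A forward-pass sketch of the normalization above (function and argument names are assumptions):

```python
import numpy as np

# Batch norm forward: normalize each feature over the batch,
# then scale/shift by learnable gamma, beta.
def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                  # mu_B: batch mean per feature
    var = x.var(axis=0)                  # sigma_B^2: batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # eps for numerical stability
    return gamma * x_hat + beta

x = np.array([[1.0, 10.0], [3.0, 30.0]])
out = batchnorm_forward(x, gamma=1.0, beta=0.0)
print(out)  # each column now has mean ~0 and unit variance
```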
- L2 → drives weights toward 0 (ridge / weight decay)
- L1 → encourages sparsity (feature selection)
- Bias terms are not regularized
- Dropout: randomly zero out neurons during training; rescale so expected activations match at test time (inverted dropout)
- Early stopping: regularization in time
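Two of the regularizers above, sketched in numpy (all names are assumptions; note the L2 gradient is applied to weights only, not biases):

```python
import numpy as np

def l2_grad(w, grad, lam):
    # L2 / weight decay adds lam * w to the gradient (biases excluded)
    return grad + lam * w

def dropout(x, p=0.5, training=True, rng=np.random.default_rng(0)):
    if not training:
        return x                                  # identity at test time
    mask = (rng.random(x.shape) >= p) / (1 - p)   # inverted-dropout scaling
    return x * mask

h = np.ones(1000)
print(dropout(h).mean())  # ~1.0 in expectation despite zeroed units
```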
- Zero init → all neurons learn the same thing (symmetry problem)
- Large random init → exploding/vanishing gradients
- Bias → always initialize to 0
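The init rules above as a sketch (function names are assumptions):

```python
import numpy as np

def he_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    # He/Kaiming: variance 2/fan_in, suited to ReLU
    return rng.normal(0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def xavier_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    # Xavier/Glorot: variance 2/(fan_in + fan_out), suited to tanh/sigmoid
    return rng.normal(0, np.sqrt(2.0 / (fan_in + fan_out)),
                      size=(fan_in, fan_out))

W = he_init(512, 256)
b = np.zeros(256)      # biases always start at 0
print(W.std())         # ~ sqrt(2/512) = 0.0625
```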
W = input size | K = kernel size | P = padding | S = stride
Parameter count per Conv layer = K × K × Cin × Cout + Cout (bias)
- MaxPool: takes max in each window (no params)
- 1×1 Conv: channel-wise linear projection
- DepthwiseConv: one filter per input channel (MobileNet)
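The parameter counts above, checked with plain arithmetic (helper names are assumptions):

```python
# Standard conv: K x K x Cin x Cout weights + Cout biases.
def conv_params(k, c_in, c_out, bias=True):
    return k * k * c_in * c_out + (c_out if bias else 0)

# Depthwise conv: one K x K filter per input channel (MobileNet-style).
def depthwise_params(k, c_in, bias=True):
    return k * k * c_in + (c_in if bias else 0)

print(conv_params(3, 64, 128))   # 3*3*64*128 + 128 = 73856
print(depthwise_params(3, 64))   # 3*3*64 + 64 = 640
```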
- Condition number: κ = λmax/λmin of Hessian
Word analogy: vec(king) − vec(man) + vec(woman) ≈ vec(queen)
- Negative sampling: draw noise words from P(w) ∝ U(w)^α, α = ¾ works well
- Learn separate W (target) and C (context) matrices
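A toy check of the analogy above with made-up 2-D vectors (all values are assumptions, not trained embeddings):

```python
import numpy as np

# vec(king) - vec(man) + vec(woman) should land nearest vec(queen).
vecs = {
    "king":   np.array([0.9, 0.8]),
    "queen":  np.array([0.9, 0.2]),
    "prince": np.array([0.9, 0.6]),
    "man":    np.array([0.1, 0.8]),
    "woman":  np.array([0.1, 0.2]),
}
target = vecs["king"] - vecs["man"] + vecs["woman"]

def nearest(v, skip):
    # cosine similarity against every vector not in the query
    sims = {w: v @ u / (np.linalg.norm(v) * np.linalg.norm(u))
            for w, u in vecs.items() if w not in skip}
    return max(sims, key=sims.get)

print(nearest(target, skip={"king", "man", "woman"}))  # queen
```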
- BPTT: backpropagation through time
- Long sequences → vanishing/exploding gradients
- Gradient clipping: cap ‖∇‖ to prevent explosion
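A sketch of clipping by global norm, one common variant of the rule above (names are assumptions):

```python
import numpy as np

def clip_grad_norm(grad, max_norm=1.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)   # rescale so ||grad|| == max_norm
    return grad

g = np.array([3.0, 4.0])                  # norm 5
print(clip_grad_norm(g, max_norm=1.0))    # [0.6 0.8], norm 1
```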
f = forget | i = input | o = output | C = cell state | ⊙ = element-wise multiply | GRU simplifies to 2 gates (reset + update)
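One LSTM step using the gates above, as a sketch (weight packing and shapes are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, Wf, Wi, Wo, Wc):
    z = np.concatenate([h, x])            # previous hidden state + input
    f = sigmoid(Wf @ z)                   # forget gate
    i = sigmoid(Wi @ z)                   # input gate
    o = sigmoid(Wo @ z)                   # output gate
    c_new = f * c + i * np.tanh(Wc @ z)   # cell state, element-wise (⊙)
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
W = [rng.normal(size=(4, 7)) for _ in range(4)]   # hidden 4, input 3
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), *W)
print(h.shape, c.shape)  # (4,) (4,)
```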
Keys = encoder hidden states | Query = decoder state | Values = encoder hidden states
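A dot-product attention sketch matching the roles above (shapes and names are assumptions):

```python
import numpy as np

# query = decoder state; keys = values = encoder hidden states
def attention(query, keys, values):
    scores = keys @ query                 # one score per encoder step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over time steps
    return weights @ values               # weighted sum of the values

H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 encoder states
q = np.array([10.0, 0.0])                            # decoder query
ctx = attention(q, H, H)
print(ctx)  # dominated by the states aligned with the query
```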
Prior: z ~ N(0, I) | Reparameterization: z = μ + σ⊙ε, ε ~ N(0,I)
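The reparameterization above as a sketch (log-variance parameterization is an assumption, though common):

```python
import numpy as np

# z = mu + sigma ⊙ eps, eps ~ N(0, I): only eps is stochastic,
# so gradients flow through mu and sigma.
def sample_z(mu, log_var, rng=np.random.default_rng(0)):
    sigma = np.exp(0.5 * log_var)         # log-variance keeps sigma > 0
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

mu = np.zeros(10000)
z = sample_z(mu, log_var=np.zeros(10000))   # sigma = 1
print(z.mean(), z.std())                    # ~0, ~1
```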
Non-max suppression: discard boxes with IoU > 0.5 against best box
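A greedy NMS sketch for the rule above (box format `[x1, y1, x2, y2]` and helper names are assumptions):

```python
import numpy as np

def iou(a, b):
    # intersection-over-union of two axis-aligned boxes
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.5):
    order = np.argsort(scores)[::-1]      # best score first
    keep = []
    while len(order):
        best, order = order[0], order[1:]
        keep.append(best)
        # discard remaining boxes overlapping the best one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
print(nms(boxes, scores=[0.9, 0.8, 0.7]))  # keeps boxes 0 and 2
```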
CNN Output Size
- Output = ⌊(W − K + 2P)/S⌋ + 1
- Same padding: P = (K−1)/2 (odd K, S = 1)
- Valid padding: P = 0
- Params = K² × C_in × C_out + C_out (bias)
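A quick check of the output-size formula, Output = ⌊(W − K + 2P)/S⌋ + 1 (helper name is an assumption):

```python
def conv_out(w, k, p, s):
    return (w - k + 2 * p) // s + 1

print(conv_out(224, 7, 3, 2))  # 112, e.g. ResNet's 7x7 stride-2 stem
print(conv_out(32, 3, 1, 1))   # 32: "same" padding with P = (K-1)/2
```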
Init Summary
- Xavier/Glorot: tanh, sigmoid (symmetric)
- He/Kaiming: ReLU
- Bias: always → 0
- Zero weights: symmetry problem
- Too large: explode/vanish
Regularization
- L2: weight decay (ridge)
- L1: sparsity (lasso)
- Dropout: random neuron zeroing
- Early stopping: halt when validation loss stops improving (regularization in time)
- Data augmentation: implicit
ResNet Skip Connection
- Solves vanishing gradients in deep nets
- Output = residual + identity
- Related to ODE solvers (Euler method)
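A sketch of the block above: output = F(x) + x, so gradients get a direct identity path (weights and shapes are assumptions):

```python
import numpy as np

def residual_block(x, W1, W2):
    h = np.maximum(0, W1 @ x)   # ReLU(W1 x): the residual branch F(x)
    return W2 @ h + x           # add the identity skip connection

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))
x = rng.normal(size=4)
# with the residual branch zeroed out, the block is exactly the identity
print(residual_block(x, W1, np.zeros((4, 4))))
```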
Condition Number
- Large κ → poor conditioning
- H = U·Diag(λ)·Uᵀ (eigendecomp)
- Leads to zigzag gradient descent
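The definitions above on a toy Hessian (values chosen here for illustration):

```python
import numpy as np

# kappa = lambda_max / lambda_min of the Hessian
H = np.array([[100.0, 0.0],
              [0.0,    1.0]])      # ill-conditioned quadratic bowl
lams = np.linalg.eigvalsh(H)       # eigenvalues of the symmetric Hessian
kappa = lams.max() / lams.min()
print(kappa)  # 100.0 -> gradient descent zigzags along the steep axis
```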
GAN Objective
- D = discriminator, G = generator
- z ~ N(0,1) for generator input
- Use label smoothing (real=0.9, fake=0.1)
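The smoothed targets above applied to a binary cross-entropy sketch (`bce` is an assumed helper, not a library call):

```python
import numpy as np

def bce(p, t, eps=1e-7):
    # binary cross-entropy between predictions p and targets t
    p = np.clip(p, eps, 1 - eps)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p)).mean()

d_real = np.array([0.95])   # discriminator output on real samples
d_fake = np.array([0.05])   # discriminator output on fakes
# smoothed targets from the note: real = 0.9, fake = 0.1
loss = bce(d_real, np.array([0.9])) + bce(d_fake, np.array([0.1]))
print(loss)  # nonzero even on confident outputs: smoothing penalizes overconfidence
```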
Attention (image captioning)
MobileNet Bottleneck
- Inverted residual block
- Fewer params than standard conv
- ReLU6 = min(max(0,x), 6)
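ReLU6 as defined above, in one line:

```python
import numpy as np

def relu6(x):
    # min(max(0, x), 6): ReLU clipped at 6 for low-precision friendliness
    return np.minimum(np.maximum(0.0, x), 6.0)

print(relu6(np.array([-2.0, 3.0, 9.0])))  # [0. 3. 6.]
```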