The Big Idea
When we train a neural network, we usually just minimise the loss L(y, p). But a Bayesian perspective says: we have a PRIOR BELIEF about what good weights look like. The training data gives us a likelihood. We want the most likely weights given BOTH.
The Key Insight
Different priors on weights give you DIFFERENT regularisation methods automatically. This isn't a coincidence — regularisation IS a prior, encoded as math. The prof's slide: "many regularization techniques correspond to imposing certain prior distributions on model parameters."
L2 / Ridge / Tikhonov
Prior: P(w) = N(0, 1/λ), a zero-mean Gaussian with variance 1/λ (precision λ)
Penalty: λ/2 · ||w||²
Effect: weights shrink toward 0 (never exactly 0)
Also called: weight decay
L1 / Lasso
Prior: P(w) = Laplace(0, 1/λ)
Penalty: λ · ||w||₁
Effect: many weights go exactly to 0 (sparse!)
Acts as: automatic feature selection
Bias is NOT regularised
Why? Bias shifts the output — regularising it would constrain the model to pass near the origin. Weights control shape/curvature. Only weights get the prior.
Different layers can have different λ
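A toy pure-Python sketch of the two penalties (layer names, weights, and per-layer λ values are made up for illustration); note the biases never enter the sums:

```python
# Weights get the prior/penalty; biases are left unregularised.
layers = {
    "layer1": {"w": [0.5, -1.2, 0.0, 2.0], "b": [0.1], "lam": 1e-2},
    "layer2": {"w": [0.3, -0.3],           "b": [0.0], "lam": 1e-3},
}

def l2_penalty(layers):
    # Gaussian prior -> lambda/2 * ||w||^2, summed over layers
    return sum(p["lam"] / 2 * sum(w * w for w in p["w"])
               for p in layers.values())

def l1_penalty(layers):
    # Laplace prior -> lambda * ||w||_1, summed over layers
    return sum(p["lam"] * sum(abs(w) for w in p["w"])
               for p in layers.values())
```

Either penalty is simply added to the data loss L(y, p) before backprop.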
Cheat Sheet — Bayesian Regularization
Jensen's Inequality
For a concave function f:
f(E[X]) ≥ E[f(X)]
log is concave (curves downward), so:
log(E[X]) ≥ E[log(X)]
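The inequality can be checked empirically in a few lines of Python (a toy sketch; the sample range is arbitrary):

```python
import math
import random

# Empirical Jensen check for f = log (concave): log(E[X]) >= E[log X]
random.seed(0)
xs = [random.uniform(0.5, 5.0) for _ in range(10_000)]

lhs = math.log(sum(xs) / len(xs))              # log of the mean
rhs = sum(math.log(x) for x in xs) / len(xs)   # mean of the logs
assert lhs >= rhs                              # Jensen's inequality
```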
KL Divergence Definition
KL(Q||P) = E_Q[log Q(z|x) / P(z)]
= E_Q[log Q(z|x) − log P(z)]
KL ≥ 0 always. KL = 0 iff Q = P exactly.
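For discrete distributions both properties can be verified numerically (a minimal sketch; `kl` is my own helper, and the distributions are arbitrary):

```python
import math

def kl(q, p):
    # KL(Q||P) = sum_z Q(z) * log(Q(z) / P(z)), skipping zero-mass terms
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]

d_qp = kl(q, p)   # > 0: Q and P differ
d_qq = kl(q, q)   # = 0: identical distributions
```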
[Diagram: the ELBO's two terms, the reconstruction loss and the KL divergence loss]
Cheat Sheet — ELBO Proof in 6 Lines
1. log P(x) = log ∫ P(x|z) P(z) dz
2. Multiply-divide by Q(z|x): = log E_Q[P(x|z) P(z) / Q(z|x)]
3. Jensen (log concave): ≥ E_Q[log(P(x|z) P(z) / Q(z|x))]
4. Expand the log: = E_Q[log P(x|z)] + E_Q[log P(z)] − E_Q[log Q(z|x)]
5. Rearrange: = E_Q[log P(x|z)] − E_Q[log Q(z|x) / P(z)]
6. Recognise KL: = E_Q[log P(x|z)] − KL(Q||P) ← ELBO ✓
Why Attention = Soft Dictionary
Hard dictionary: you either find the key or you don't. Binary. Not differentiable.
Soft dictionary (Attention): every key matches a little. The similarity score = how much that value matters. Weighted sum = soft retrieval. Fully differentiable → trainable end-to-end.
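A minimal pure-Python sketch of this soft retrieval (single query, no learned projections or scaling; all function names are mine):

```python
import math

def softmax(scores):
    # numerically stable softmax over similarity scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    # soft retrieval: EVERY key contributes, weighted by similarity
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# query strongly matches key 0 -> output dominated by value 0
out = attend([10.0, 0.0],
             keys=[[1.0, 0.0], [0.0, 1.0]],
             values=[[1.0, 0.0], [0.0, 1.0]])
```

Because softmax and the weighted sum are smooth, gradients flow to the query, keys, and values alike.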
Why Gates = Differentiable if-then-else
Normal if-then-else: the gradient is zero everywhere (and undefined at the switching point) → nothing flows back, so you can't backprop through the branch.
Sigmoid gate: smooth everywhere → gradient flows through. The network LEARNS what conditions should open/close each gate.
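The gate can be written as a smooth blend of the two branches; a toy scalar sketch (in a real network the gate score would come from learned weights):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def soft_if(score, a, b):
    # differentiable if-then-else: g near 1 picks a, g near 0 picks b,
    # and intermediate g blends the two branches smoothly
    g = sigmoid(score)
    return g * a + (1 - g) * b

open_gate   = soft_if( 10.0, 5.0, -5.0)  # ≈ 5  (gate open)
closed_gate = soft_if(-10.0, 5.0, -5.0)  # ≈ -5 (gate closed)
halfway     = soft_if(  0.0, 5.0, -5.0)  # 0   (even blend)
```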
Mixture of Experts (MoE) — the gating extension
Instead of one big network, have N "expert" sub-networks and a gating network that decides which experts handle which input. Gate output = probabilities over experts. The input is routed to the top-k experts (differentiable gating via softmax).
output = Σᵢ gate(x)ᵢ · expertᵢ(x)
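The formula above in toy Python (scalar experts and a hand-set gate score; in a real MoE the gate scores come from a small learned network applied to x):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# three toy "experts" (scalar functions of the input)
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x]

def moe(x, gate_scores):
    # output = sum_i gate(x)_i * expert_i(x)
    weights = softmax(gate_scores)  # probabilities over experts
    return sum(w * e(x) for w, e in zip(weights, experts))

# gate strongly favours expert 0 -> output ≈ expert_0(3) = 4
y = moe(3.0, [8.0, 0.0, 0.0])
```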
Cheat Sheet — The Three Analogies
| Dimension | Single Channel | Multi-Channel | Notes |
|---|---|---|---|
| 1D | Audio waveform | Skeleton animation data | Conv1D, time series |
| 2D | Fourier transform of audio | Color image data (RGB) | Conv2D, most common |
| 3D | Volumetric data (CT scans) | Color video data | Conv3D, medical imaging |
| 4D | Heart scans (3D + time) | Multi-modal MRI | Rare, very expensive |
What "channel" means in each
1D audio: 1 channel = mono, 2 = stereo. 2D image: 1 = grayscale, 3 = RGB. 3D CT scan: 1 = Hounsfield units. Color video: 3 channels × time × height × width.
What changes in the Conv layer
For 1D: kernel slides along time axis (1 direction). For 2D: kernel slides along height × width (2 directions). For 3D: kernel slides along depth × height × width (3 directions). The math is the same — just more axes.
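The "same math, more axes" point can be checked with a toy valid convolution in pure Python (no padding, stride 1; `conv1d`/`conv2d` are my own minimal helpers, not a library API):

```python
def conv1d(x, k):
    # kernel slides along ONE axis (time)
    n, m = len(x), len(k)
    return [sum(x[i + j] * k[j] for j in range(m)) for i in range(n - m + 1)]

def conv2d(x, k):
    # same operation, kernel slides along TWO axes (height, width)
    H, W = len(x), len(x[0])
    h, w = len(k), len(k[0])
    return [[sum(x[i + a][j + b] * k[a][b]
                 for a in range(h) for b in range(w))
             for j in range(W - w + 1)]
            for i in range(H - h + 1)]

out1 = conv1d([1, 2, 3, 4], [1, 1])                        # [3, 5, 7]
out2 = conv2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[1, 1], [1, 1]])                            # [[12, 16], [24, 28]]
```

A 3D version would simply add one more nested loop over the depth axis.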
Cheat Sheet — Data Types
1D: audio waveform (1ch), skeleton (multi-ch) → Conv1D
2D: Fourier of audio (1ch), color image (3ch) → Conv2D
3D: CT scan (1ch), color video (3ch) → Conv3D
4D: heart scan, multi-modal MRI → 4D Conv
ResNet skip connection (what you know)
y = F(x) + x
Output = residual function + original input. The network learns the CHANGE (residual) rather than the full mapping. The identity path lets gradients flow straight through deep networks → mitigates vanishing gradients.
ODE view (the connection)
dy/dt = F(y, t)
An ordinary differential equation says: the rate of change of y at time t is some function F. The solution y(t) is found by stepping forward in time.
Why this matters conceptually
The ODE view explains WHY ResNet works better than plain deep networks — it's solving a well-posed mathematical problem (integrating a differential equation) rather than stacking arbitrary transformations. It also gives theoretical tools: stability analysis from ODE theory applies to ResNet depth and training.
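The correspondence can be made concrete in a few lines (toy scalar state; the residual function F is chosen arbitrarily for the example):

```python
def F(y):
    # toy "residual function"; in a ResNet this would be a small conv/MLP block
    return -0.5 * y

def euler_steps(y0, n, dt=1.0):
    # y_{l+1} = y_l + dt * F(y_l); with dt = 1 each step is exactly
    # a ResNet block x_{l+1} = x_l + F(x_l), and depth = number of steps
    y = y0
    for _ in range(n):
        y = y + dt * F(y)
    return y

y_final = euler_steps(8.0, 3)  # 8 → 4 → 2 → 1
```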
Cheat Sheet — ResNet ↔ ODE
ODE: dy/dt = F(y,t) → Euler step: y(t+1) = y(t) + F(y(t))
ResNet: x_{l+1} = x_l + F(x_l) ← SAME FORMULA
Each block = 1 Euler step. Depth = number of time steps.
mHC: W doubly stochastic (rows+cols sum to 1). Via Sinkhorn-Knopp.
Classical view (what you learned in Lec 4)
Bias-variance tradeoff: as model capacity increases, training error decreases monotonically, but test error follows a U-shape. Too simple = high bias (underfitting). Too complex = high variance (overfitting). Sweet spot in the middle.
Modern finding: Double Descent
For large neural networks: after the classical "overfitting peak," if you keep increasing model size PAST the data-fitting threshold, test error starts DECREASING again.
Small model → underfit → test error high (bias)
Medium model → overfit → test error peaks (variance)
Large model → over-parameterised → test error DROPS again!
Why does this happen?
When the model has WAY more parameters than data points, there are infinitely many solutions that perfectly fit the training data. Gradient descent finds the one with the MINIMUM NORM (smallest weights). This solution turns out to be smooth and generalises well — it avoids the extreme oscillations of just-right-sized overfitting.
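A tiny pure-Python demonstration of the min-norm claim (one data point, two parameters; the numbers are mine): gradient descent on squared loss from zero init never leaves the span of x, so it lands on the minimum-norm interpolant y·x/||x||².

```python
# One equation, two unknowns: infinitely many w satisfy w·x = y
x, y = (3.0, 4.0), 10.0

w = [0.0, 0.0]   # zero init is what makes GD pick the min-norm solution
lr = 0.01
for _ in range(2000):
    err = w[0] * x[0] + w[1] * x[1] - y      # residual of w·x = y
    w = [w[0] - lr * err * x[0],
         w[1] - lr * err * x[1]]             # gradient step on 0.5*err^2

# analytic minimum-norm interpolant: y * x / ||x||^2 = (1.2, 1.6)
```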
What the prof's slide showed
At the interpolation threshold (model capacity ≈ data size), test error peaks. Beyond this point, even larger models behave as if an implicit smoothing kicks in: more like regularisation than overfitting.
Implication: bigger is often better in deep learning — don't stop at the classical sweet spot!
Cheat Sheet — Double Descent
Classical: test error = U-shape (bias → sweet spot → variance)
Modern: test error = double-U (goes DOWN again past interpolation threshold)
Why: gradient descent finds min-norm solution → smooth → generalises
Implication: very large models are often BETTER than medium-sized ones
Input image (H×W×C)
↓
Conv Block 1 → 64 filters ─────────────────────────────────┐
↓ MaxPool (÷2)
Conv Block 2 → 128 filters ─────────────────────────────┐ │
↓ MaxPool (÷2)
Conv Block 3 → 256 filters ─────────────────────────┐ │ │
↓ MaxPool (÷2)
Bottleneck → 512 filters (smallest feature map)
↓ UpSample (Conv2DTranspose ×2)
UpConv Block 3 → 256 ◄─ CONCAT encoder Block 3 ──────┘
↓ UpSample (×2)
UpConv Block 2 → 128 ◄─ CONCAT encoder Block 2 ─────────┘
↓ UpSample (×2)
UpConv Block 1 → 64 ◄─ CONCAT encoder Block 1 ────────────┘
↓
Output → 1×1 Conv → num_classes (pixel-wise softmax)
What makes it a U-Net
1. The "U" shape: downsample (left side) then upsample (right side)
2. Skip connections from EACH encoder level to the MATCHING decoder level (same spatial resolution). These skip connections carry fine spatial detail that gets lost in the bottleneck.
How it's different from a plain Autoencoder
Plain Autoencoder: bottleneck → decoder reconstructs from only the bottleneck code. No direct path from encoder to decoder at fine scales.
U-Net: skip connections concatenate encoder features at every scale. The decoder has BOTH the upsampled signal from below AND the fine details from the matching encoder layer.
DeconvNet vs U-Net
DeconvNet (Zeiler 2014): also encoder-decoder, also uses unpooling + deconvolution to upsample. Difference from U-Net: DeconvNet does NOT have skip connections — it relies entirely on the bottleneck code. U-Net's skip connections give it much better fine-grained localisation, which is why it became the standard for medical imaging segmentation.
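The skip-connection bookkeeping can be sketched at the shape level (channels, height, width); the sizes below assume a 256×256 input and are purely illustrative:

```python
def up_and_concat(dec, skip):
    # dec, skip: (channels, height, width) shape triples
    c, h, w = dec
    sc, sh, sw = skip
    h, w = 2 * h, 2 * w                      # Conv2DTranspose upsample ×2
    assert (h, w) == (sh, sw), "skip needs matching spatial resolution"
    return (c + sc, h, w)                    # channel-wise concatenation

enc3 = (256, 64, 64)        # encoder Block 3 output (before its pool)
bottleneck = (512, 32, 32)  # smallest feature map

merged = up_and_concat(bottleneck, enc3)     # (768, 64, 64)
# a following conv block would reduce the 768 channels back to 256
```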
Cheat Sheet — U-Net
Encoder: Conv+MaxPool at each level (H↓ W↓ C↑)
Decoder: Conv2DTranspose at each level (H↑ W↑ C↓)
Skip: Concatenate encoder_i with decoder_i at SAME resolution
Output: 1×1 Conv → num_classes (pixel-wise, same H×W as input)
vs AutoEncoder: AE has no skip connections. U-Net does. → better localisation
Problem with Word2Vec (what ELMo fixes)
Word2Vec gives ONE fixed vector per word regardless of context.
"I went to the fair last Saturday."
"I do not believe your attitude is fair."
"fair" gets the same vector in both sentences — but the meaning is completely different. Word2Vec can't handle this.
ELMo's solution
Train a deep bidirectional LSTM as a language model. For each word position, take the hidden states from MULTIPLE layers (not just the final output) and combine them. The result: a different vector for "fair" in each context, because the LSTM's hidden state encodes the surrounding words.
ELMo vs Word2Vec
Word2Vec: 1 vector per word. Context-free.
ELMo: different vector per usage. Context-sensitive via BiLSTM hidden states.
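A toy contrast (NOT the real Word2Vec or ELMo algorithms, just the static-vs-contextual idea; all vectors and names are invented): a static table gives "fair" one vector everywhere, while a context-mixing encoder gives it a different vector per sentence.

```python
# static lookup table: one 2-d vector per word, Word2Vec-style
static = {"fair": [1.0, 0.0], "carnival": [0.0, 1.0], "just": [0.5, 0.5]}

def contextual(sentence, i, alpha=0.5):
    # blend the word's static vector with the average of its neighbours,
    # so the same word gets different vectors in different contexts
    v = static[sentence[i]]
    neighbours = [static[w] for j, w in enumerate(sentence) if j != i]
    ctx = [sum(n[d] for n in neighbours) / len(neighbours) for d in range(2)]
    return [(1 - alpha) * v[d] + alpha * ctx[d] for d in range(2)]

s1 = ["fair", "carnival"]   # "fair" as in a funfair
s2 = ["fair", "just"]       # "fair" as in justice

v1 = contextual(s1, 0)      # [0.5, 0.5]
v2 = contextual(s2, 0)      # [0.75, 0.25] — context changed the vector
```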
ELMo vs BERT
ELMo: BiLSTM-based. Sequential. Two separate LMs concatenated.
BERT: Transformer-based. Truly bidirectional (not two one-directional LMs). Attention over full context at once.
The progression
Word2Vec (static)
↓
ELMo (contextual, LSTM)
↓
BERT (contextual, Transformer)
↓
GPT (generative, Transformer)