
ELEN 521 · Deep Learning — Master Formula Sheet

Lectures 3–9  ·  Midterm Winter 2026  ·  Final Winter 2025  ·  ELEN 521
★ FINAL / ★ MIDTERM — tags mark topics flagged for the corresponding exam.

Loss Functions

MSE
\[\mathcal{L}=\tfrac{1}{B}\textstyle\sum_i(y^{(i)}-\hat y^{(i)})^2\]
Binary Cross-Entropy
\[\mathcal{L}=-\tfrac{1}{B}\sum_i\bigl[y\log\hat p+(1-y)\log(1-\hat p)\bigr]\]
Categorical CE
\[\mathcal{L}=-\tfrac{1}{B}\sum_i\sum_k y_k^{(i)}\log\hat p_k^{(i)}\]
Neural Style Loss ★ FINAL
\[\mathcal{L}=\alpha\,\mathcal{L}_{content}+\beta\,\mathcal{L}_{style}\]
VAE Loss (negative ELBO)
\[\mathcal{L}_{VAE}=-\underbrace{\mathbb{E}[\log p(x|z)]}_{\text{recon}}+\underbrace{D_{KL}(q\|p)}_{\text{reg}}\]

B = batch size  |  k = class index
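The three batch losses above can be sketched in plain Python (toy version over lists — scalar targets for MSE/BCE, one-hot rows for categorical CE; no framework assumed):

```python
import math

def mse(y, p):
    """Mean squared error over a batch of scalar targets/predictions."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

def binary_ce(y, p):
    """Binary cross-entropy; y in {0,1}, p = predicted probability."""
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / len(y)

def categorical_ce(Y, P):
    """Categorical CE; rows of Y are one-hot targets, rows of P distributions."""
    return -sum(sum(yk * math.log(pk) for yk, pk in zip(yrow, prow))
                for yrow, prow in zip(Y, P)) / len(Y)
```

For a confident wrong prediction BCE blows up (log of a small probability), which is why it punishes miscalibration harder than MSE.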

Activation Functions

Sigmoid
\[\sigma(z)=\frac{1}{1+e^{-z}},\quad \sigma'=\sigma(1-\sigma)\]
Softmax
\[\text{sm}(z_k)=\frac{e^{z_k}}{\sum_j e^{z_j}}\]
ReLU
relu(z) = max(0, z)
Sign (RNN — Final Q7)
sign(z) = +1 if z≥0, -1 if z<0
ReLU6 (MobileNet)
relu6(z) = min(max(0,z), 6)
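The activations above, as scalar Python functions (softmax takes a list; the max-subtraction is the standard numerical-stability trick):

```python
import math

def sigmoid(z):
    """σ(z) = 1/(1+e^(−z))."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Stable softmax: subtract the max before exponentiating."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def relu(z):
    return max(0.0, z)

def relu6(z):
    """ReLU capped at 6 (MobileNet)."""
    return min(max(0.0, z), 6.0)

def sign(z):
    """Sign activation as defined above: +1 for z ≥ 0, −1 otherwise."""
    return 1.0 if z >= 0 else -1.0
```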

Gradient Descent & Backprop

Weight Update (SGD)
\[w \leftarrow w - \eta\,\frac{\partial\mathcal{L}}{\partial w}\]
Chain Rule
\[\frac{\partial g}{\partial x}=\frac{\partial g}{\partial f}\cdot\frac{\partial f}{\partial x}\]
Newton Step (Midterm Q4)
\[w \leftarrow w - H^{-1}\nabla\mathcal{L}\]
  • η = learning rate  |  smaller batch → more updates per epoch
  • Gradient clipping: cap ‖∇‖ to prevent explosion
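The SGD update and norm clipping, sketched over a flat parameter list (toy version; frameworks do this per-tensor):

```python
def sgd_step(w, grad, lr=0.1):
    """w ← w − η·∂L/∂w, element-wise."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]

def clip_grad(grad, max_norm=1.0):
    """Rescale grad so its L2 norm is at most max_norm (gradient clipping)."""
    norm = sum(g * g for g in grad) ** 0.5
    if norm > max_norm:
        grad = [g * max_norm / norm for g in grad]
    return grad
```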

Batch Normalization

Normalize
\[\hat x = \frac{x-\mu_B}{\sqrt{\sigma_B^2+\varepsilon}}\]
Scale & Shift (learnable)
\[y = \gamma\hat x + \beta\]
  • If γ=1, β=0 → E[x̂]=0, Var[x̂]=1
  • Applied after linear, before activation
  • Bias is not regularized; μ, σ are computed per batch during training and tracked as running averages for inference
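The normalize-then-scale/shift steps above, for one batch of scalars (training-mode statistics; a real layer would also update running averages for inference):

```python
def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """x̂ = (x − μ_B)/sqrt(σ_B² + ε), then y = γ·x̂ + β."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    xhat = [(x - mu) / (var + eps) ** 0.5 for x in xs]
    return [gamma * x + beta for x in xhat]
```

With γ=1, β=0 the output has (approximately, up to ε) zero mean and unit variance, matching the bullet above.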

Regularization

L2 / Weight Decay (Ridge)
\[\tilde{\mathcal{L}}=\mathcal{L}+\tfrac{\lambda}{2}\|w\|^2\]
L2 Update (shrink)
\[w\leftarrow w(1-\eta\lambda)-\eta\frac{\partial\mathcal{L}}{\partial w}\]
L1 (Sparsity/Lasso)
\[\tilde{\mathcal{L}}=\mathcal{L}+\lambda\|w\|_1\]
  • Dropout: randomly zero units during training; keep all units at inference (a frozen mask at test time is wrong)
  • Early stopping: regularization in time

Weight Initialization

Xavier/Glorot (tanh)
\[W\sim\mathcal{U}\!\left(\!\pm\sqrt{\tfrac{6}{n_{in}+n_{out}}}\right)\]
He/Kaiming (ReLU)
\[W\sim\mathcal{N}\!\left(0,\ \sigma^2=\tfrac{2}{n_{in}}\right)\]
  • Bias → always init to 0
  • Zero init → symmetry problem
  • Too large → exploding; too small → vanishing
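Both schemes above, sketched with the stdlib `random` module (toy nested-list weights; frameworks provide these as built-in initializers):

```python
import math
import random

def xavier_uniform(n_in, n_out):
    """Glorot/Xavier: U(−limit, +limit) with limit = sqrt(6/(n_in + n_out))."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[random.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]

def he_normal(n_in, n_out):
    """He/Kaiming: N(0, 2/n_in), i.e. std = sqrt(2/n_in), suited to ReLU."""
    std = math.sqrt(2.0 / n_in)
    return [[random.gauss(0.0, std) for _ in range(n_out)]
            for _ in range(n_in)]
```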

Convolutional Networks

Output Spatial Size
\[O=\left\lfloor\frac{W-K+2P}{S}\right\rfloor+1\]
Conv Operation
\[\text{out}[f_o,r,c]=\sum_{f_i}\sum_{i,j}w[f_o,f_i,i,j]\cdot x[f_i,Sr+i,Sc+j]\]
Condition Number (Hessian)
\[\kappa=\frac{\lambda_{max}}{\lambda_{min}},\quad H=U\,\text{Diag}(\lambda)\,U^T\]

W=input size, K=kernel, P=padding, S=stride

Params per Conv = K²×C_in×C_out + C_out (bias)

  • ResNet skip: y = F(x,W) + x
  • 1×1 Conv: channel projection, no spatial
  • DepthwiseConv: 1 filter/channel (MobileNet)
  • Same padding (odd K): P=(K−1)/2
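The output-size, same-padding, and parameter-count formulas above as one-liners:

```python
def conv_out(W, K, P, S):
    """Spatial output size: O = floor((W − K + 2P)/S) + 1."""
    return (W - K + 2 * P) // S + 1

def same_pad(K):
    """'Same' padding for odd kernels: P = (K − 1)/2."""
    return (K - 1) // 2

def conv_params(K, c_in, c_out):
    """K²·C_in·C_out weights plus one bias per output filter."""
    return K * K * c_in * c_out + c_out
```

Sanity check: a 3×3 conv with same padding and stride 1 preserves spatial size, e.g. `conv_out(32, 3, 1, 1) == 32`.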

Parameter & Activation Counting ★ FINAL  ★ MIDTERM

Dense Layer Params
params = n_in × n_out + n_out (bias)
e.g. Dense(300) on input(400): 400×300+300 = 120,300
Activation Size (Final Q5)
size = batch × layer_output_dim
Total = sum over all layers
Conv Params
K² × C_in × C_out + C_out
No params for MaxPool/AvgPool
Final Q5 example: Input(400), Dense(300,relu), Dense(100,relu), Dense(10,softmax)
Params: (400×300+300)+(300×100+100)+(100×10+10) = 120,300+30,100+1,010 = 151,410
Activations: 400 + 300 + 100 + 10 = 810 (per sample)
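The Final Q5 arithmetic above, reproduced programmatically (layer dims hard-coded from the example):

```python
def dense_params(n_in, n_out):
    """Weights (n_in × n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

# Final Q5 network: Input(400) → Dense(300) → Dense(100) → Dense(10)
dims = [400, 300, 100, 10]
total_params = sum(dense_params(a, b) for a, b in zip(dims, dims[1:]))
total_acts = sum(dims)  # activation count per sample, input included
```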

Pooling Gradients ★ FINAL Q1

P-Pooling   y = (Σ xᵢᵖ)^(1/p)
\[\frac{\partial y}{\partial x_i} = \delta_y\cdot\frac{x_i^{p-1}}{(\textstyle\sum_j x_j^p)^{(p-1)/p}}\] → p=1: avg pool  |  p=∞: max pool
Log-Average   y = log(1/n Σ exp(xᵢ))
\[\frac{\partial y}{\partial x_i} = \delta_y\cdot\frac{\exp(x_i)}{\sum_j\exp(x_j)} = \delta_y\cdot\text{softmax}(x_i)\] gradient is softmax of inputs!
AvgPool1D Backprop (Midterm Q3) — stride=window=k
\[\frac{\partial x_4}{\partial x_3}=\sigma'(x_3)\cdot I,\quad \frac{\partial x_3}{\partial x_2}=\frac{1}{k}\cdot\mathbf{1},\quad \frac{\partial x_2}{\partial w}=x_1^T\]
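The "gradient is softmax" claim for log-average pooling can be verified numerically with a central-difference check (the 1/n inside the log is constant, so it drops out of the derivative):

```python
import math

def log_avg_pool(xs):
    """y = log( (1/n) Σ exp(x_i) )."""
    return math.log(sum(math.exp(x) for x in xs) / len(xs))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def numeric_grad(f, xs, i, h=1e-6):
    """Central-difference ∂f/∂x_i."""
    up = xs[:]; up[i] += h
    dn = xs[:]; dn[i] -= h
    return (f(up) - f(dn)) / (2 * h)

xs = [0.5, -1.0, 2.0]
analytic = softmax(xs)  # claimed gradient
numeric = [numeric_grad(log_avg_pool, xs, i) for i in range(len(xs))]
```

The two lists agree to numerical precision, confirming ∂y/∂x_i = softmax(x)_i.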

TF-IDF & Cosine Similarity ★ FINAL Q6

Term Frequency
\[\text{TF}(w,d)=\frac{\text{count}(w,d)}{|d|}\]
Inverse Doc Frequency
\[\text{IDF}(w)=\log\frac{N}{df(w)}\]
TF-IDF
\[\text{TF-IDF}(w,d)=\text{TF}(w,d)\times\text{IDF}(w)\]
Cosine Similarity
\[\cos(\mathbf{v},\mathbf{w})=\frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|}\]
Final Q6 — IDF: word appears in k of N=4 docs → IDF = log(4/k)
Melanoma: appears in all 4 docs → IDF = log(4/4) = 0  |  Dermatitis: appears in 3 docs → IDF = log(4/3) ≈ 0.288
TF-IDF = 0 for any word appearing in ALL documents → useless for discrimination
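The Final Q6 computation, sketched over documents represented as token lists (natural log, matching the ≈0.288 above; `tf` here is count normalized by document length, as defined):

```python
import math

def tf(word, doc):
    """Term frequency: count(w, d) / |d| for doc as a token list."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Inverse document frequency: log(N / df(w))."""
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

def cosine(v, w):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    nv = sum(a * a for a in v) ** 0.5
    nw = sum(b * b for b in w) ** 0.5
    return dot / (nv * nw)
```

A word in all N documents gets IDF = log(1) = 0, so its TF-IDF is 0 in every document — exactly the "melanoma" case.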

Word2Vec Skip-Gram

P(+ | target t, context c)
\[P(+|t,c)=\sigma(\mathbf{t}\cdot\mathbf{c})\]
Objective (+ k negatives)
\[\log\sigma(\mathbf{t}\cdot\mathbf{c})+\sum_{i=1}^k\log\sigma(-\mathbf{t}\cdot\mathbf{n}_i)\]
  • Noise sampling: P(w)^{3/4} — flattens the unigram distribution, boosting rare words
  • Analogy: v(king)−v(man)+v(woman)≈v(queen)
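The negative-sampling objective above, as a function over embedding vectors given as lists (this is the quantity skip-gram *maximizes*; training ascends its gradient):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def skipgram_objective(t, c, negatives):
    """log σ(t·c) + Σ_i log σ(−t·n_i) over k negative-sample vectors."""
    obj = math.log(sigmoid(dot(t, c)))
    obj += sum(math.log(sigmoid(-dot(t, n))) for n in negatives)
    return obj
```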

Classification Metrics ★ FINAL Q2e

Accuracy
(TP+TN) / (TP+TN+FP+FN)
Recall (Sensitivity) ← use for melanoma!
TP / (TP + FN) — minimizes missed cases
Precision
TP / (TP + FP)

Melanoma → use Recall: FN (missed cancer) is far more dangerous than FP.
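All three metrics from confusion-matrix counts, in one helper:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall
```

Note how FN only hurts recall: with 2 missed melanomas out of 10 positives, recall drops to 0.8 even if accuracy stays high.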

Simple RNN ★ FINAL Q7

Hidden State
\[s_t = \tanh(W\,s_{t-1}+U\,x_t)\]
Final Q7 Variant (sign activation)
s_t = sign(W·s_{t-1} + U·x_t)  |  o_t = ReLU(V·s_t)
  • Shapes: x(B,T), W(h,h), U(h,x), V(out,h)
  • BPTT: backprop through time (chain rule over t)
  • Long seqs → vanishing gradient → use LSTM/GRU
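A scalar sketch of the Final Q7 variant (W, U, V are scalars here purely for illustration; the real network uses the matrix shapes listed above):

```python
def sign(z):
    return 1.0 if z >= 0 else -1.0

def rnn_step(s_prev, x, W=1.0, U=1.0, V=1.0):
    """One step of s_t = sign(W·s_{t-1} + U·x_t), o_t = ReLU(V·s_t)."""
    s = sign(W * s_prev + U * x)
    o = max(0.0, V * s)  # ReLU on the output
    return s, o
```

Because the state is ±1, the output is either V·s (when positive) or 0 — the ReLU silences every negative-state step.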

LSTM Gates

Forget Gate
\[f_t=\sigma(W_f[h_{t-1},x_t]+b_f)\]
Input Gate
\[i_t=\sigma(W_i[h_{t-1},x_t]+b_i)\]
Cell State
\[C_t=f_t\odot C_{t-1}+i_t\odot\tanh(W_c[h_{t-1},x_t]+b_c)\]
Output & Hidden
\[o_t=\sigma(W_o[h_{t-1},x_t]+b_o),\quad h_t=o_t\odot\tanh(C_t)\]

⊙ = element-wise multiply  |  GRU = 2 gates (reset + update, no separate C)
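The four gate equations, in a 1-D toy step (the dict `p` holding per-gate `(w_h, w_x, b)` triples is an illustrative layout; real layers use weight matrices over the concatenation [h_{t-1}, x_t]):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(h_prev, c_prev, x, p):
    """One scalar LSTM step; p['f'|'i'|'c'|'o'] = (w_h, w_x, b) per gate."""
    f = sigmoid(p["f"][0] * h_prev + p["f"][1] * x + p["f"][2])    # forget gate
    i = sigmoid(p["i"][0] * h_prev + p["i"][1] * x + p["i"][2])    # input gate
    g = math.tanh(p["c"][0] * h_prev + p["c"][1] * x + p["c"][2])  # candidate
    o = sigmoid(p["o"][0] * h_prev + p["o"][1] * x + p["o"][2])    # output gate
    c = f * c_prev + i * g          # C_t = f⊙C_{t-1} + i⊙tanh(...)
    h = o * math.tanh(c)            # h_t = o⊙tanh(C_t)
    return h, c
```

With all weights zero, every sigmoid gate sits at 0.5 and the candidate is 0, so the cell state simply halves each step — a quick way to see the gates acting as soft switches.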

Attention Mechanism

Score
\[e_{ti}=f_{att}(a_i,h_{t-1})\]
Weight (softmax)
\[\alpha_{ti}=\frac{\exp(e_{ti})}{\sum_k \exp(e_{tk})}\]
Context Vector
\[\hat z_t=\sum_i\alpha_{ti}\,a_i\]
Scaled Dot-Product (Transformer)
\[\text{Attn}(Q,K,V)=\text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

K=keys (encoder)  |  Q=query (decoder)  |  V=values (encoder)
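Scaled dot-product attention over lists of row vectors (loop version for clarity; real implementations batch this as matrix multiplies):

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(QKᵀ/√d_k)·V, with Q, K, V as lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)                      # α over keys
        ctx = [sum(w * v[j] for w, v in zip(weights, V))
               for j in range(len(V[0]))]              # Σ α_i · value_i
        out.append(ctx)
    return out
```

When all keys score equally, the weights are uniform and the output is just the mean of the values — attention degenerates to average pooling.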

Neural Style Transfer ★ FINAL Q3

Gram Matrix (Style)
\[G^{[l]}=M^{[l]}\,(M^{[l]})^T,\quad M^{[l]}\in\mathbb{R}^{C_o\times H_oW_o}\]
Total Loss
\[\mathcal{L}=\alpha\,\mathcal{L}_{content}+\beta\,\mathcal{L}_{style}\]
  • 3 images: Content image (C), Style image (S), Generated image (G)
  • Content loss: MSE between activations of C and G at chosen layer
  • Style loss: MSE between Gram matrices of S and G across multiple layers
  • Optimize over G (not weights) using gradient descent on the image pixels
  • Early layers → textures/style  |  Later layers → semantic content
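The Gram matrix above, for M given as C rows of flattened channel maps (nested-list sketch of G = M·Mᵀ):

```python
def gram_matrix(M):
    """G = M·Mᵀ for M of shape (C, H·W); rows are flattened channel maps."""
    C = len(M)
    return [[sum(M[i][k] * M[j][k] for k in range(len(M[0])))
             for j in range(C)] for i in range(C)]
```

Entry G[i][j] is the correlation between channels i and j, which is why the Gram matrix captures texture while discarding spatial layout.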

VAE

KL Divergence (Gaussian)
\[D_{KL}=-\tfrac{1}{2}\sum_j(1+\log\sigma_j^2-\mu_j^2-\sigma_j^2)\]
Reparameterization Trick
z = μ + σ ⊙ ε,  ε ~ N(0, I)  |  Prior: z ~ N(0, I)
  • Loss = reconstruction CE + KL divergence
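The Gaussian KL term and the reparameterization trick, parameterized by log σ² per latent dimension (the usual numerically convenient encoding):

```python
import math
import random

def kl_gaussian(mu, log_var):
    """D_KL(N(μ, σ²) || N(0, I)) = −½ Σ_j (1 + log σ_j² − μ_j² − σ_j²)."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var):
    """z = μ + σ·ε with ε ~ N(0, I); keeps z differentiable in μ and σ."""
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]
```

KL is exactly 0 when the posterior matches the prior (μ=0, σ=1) and grows as the encoder drifts from it — the "reg" term in the VAE loss.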

GAN

Min-Max Objective
\[\min_G\max_D\;\mathbb{E}[\log D(x)]+\mathbb{E}[\log(1-D(G(z)))]\]
  • z ~ N(0,I) → generator input
  • Use label smoothing: real=0.9, fake=0.1
  • Mode collapse / vanishing gradient = key failure modes
  • Use LeakyReLU + AvgPool (not ReLU + MaxPool) in D

Transfer Learning ★ FINAL Q2d

VGG19 Strategy (skin condition classifier)
  • Branch at: block5_pool (last spatial features)
  • Freeze: block1–block3 (general low-level features)
  • Train: block4, block5 + new Dense head
  • New head: Flatten → Dense(256,relu) → Dense(5,softmax)
  • More data → unfreeze more layers
  • Domain shift (book→phone): use augmentation, fine-tune
  • Different aspect ratios → use YOLO or crop-resize per class

Hessian & 2nd-Order Optimization ★ MIDTERM Q4

2D Quadratic Hessian
\[H=\begin{pmatrix}2a&b\\b&2c\end{pmatrix},\quad H^{-1}=\frac{1}{4ac-b^2}\begin{pmatrix}2c&-b\\-b&2a\end{pmatrix}\]
Block-Diagonal Inverse
\[H'=\text{diag}(A,B,\ldots)\Rightarrow H'^{-1}=\text{diag}(A^{-1},B^{-1},\ldots)\]
Complexity
\[H\in\mathbb{R}^{n\times n}:\;\mathcal{O}(n^2)\text{ entries};\;\text{block-diagonal: }n/2\text{ blocks of size }2\times 2\]
  • Block-diag H': only O(n) non-zero blocks → cheap inversion
  • H'⁻¹∇L costs O(n) ops (each 2×2 block × 2-vec = 4 ops)
  • Force block structure during training: constrain cross-pair weights to 0
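The O(n) Newton step for a block-diagonal Hessian, using the 2×2 inverse formula above (blocks given row-major as illustrative tuples):

```python
def newton_step_blockdiag(w, grad, blocks):
    """w ← w − H⁻¹∇L, H block-diagonal; blocks[k] = (a, b, c, d) row-major 2×2."""
    w_new = []
    for k, (a, b, c, d) in enumerate(blocks):
        g1, g2 = grad[2 * k], grad[2 * k + 1]
        det = a * d - b * c
        # 2×2 inverse: (1/det) [[d, −b], [−c, a]] applied to the gradient pair
        s1 = (d * g1 - b * g2) / det
        s2 = (-c * g1 + a * g2) / det
        w_new += [w[2 * k] - s1, w[2 * k + 1] - s2]
    return w_new
```

On the quadratic L = w₁² + 2w₂² (H = diag(2, 4)), a single Newton step from (1, 1) lands exactly at the minimum (0, 0) — the hallmark of second-order updates on quadratics.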

DNNs as Data Structures ★ FINAL Q4

Index a list (Q4c)
Embedding layer: index i → lookup E[i] (learned row of weight matrix)
Hash table / Dict (Q4d)
Attention mechanism: query Q → softmax(Q·Kᵀ)·V (soft lookup by similarity)
If-then-else (Q4e)
LSTM gates (sigmoid = soft switch)  |  Mixture-of-Experts: gating network routes inputs

Batch size ↓ → more epochs needed (Q4a): fewer samples per update → more noisy steps → need more passes to converge

Dropout at inference (Q4b): recent work (Monte Carlo dropout) shows keeping dropout active at test time acts as a Bayesian ensemble → better uncertainty estimates (don't freeze)

YOLO & Object Detection

IoU
\[\text{IoU}=\frac{\text{Intersection}}{\text{Union}}\]
Non-Max Suppression Algorithm
1. Discard boxes with p_c < 0.6
2. Sort remaining boxes by p_c descending
3. Keep the top box; discard any remaining box with IoU > 0.5 against it
4. Repeat with the next remaining box
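IoU and the greedy NMS steps above, with boxes as (x1, y1, x2, y2) tuples (thresholds 0.6 / 0.5 taken from the algorithm):

```python
def iou(a, b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, p_thresh=0.6, iou_thresh=0.5):
    """Greedy non-max suppression: filter by confidence, sort, suppress overlaps."""
    cand = sorted((sb for sb in zip(scores, boxes) if sb[0] >= p_thresh),
                  reverse=True)
    kept = []
    while cand:
        score, best = cand.pop(0)
        kept.append(best)
        cand = [(s, b) for s, b in cand if iou(best, b) <= iou_thresh]
    return kept
```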

Quick Reference Index

CNN Output Size

O = ⌊(W−K+2P)/S⌋+1  |  Same: P=(K−1)/2  |  Valid: P=0

ResNet

y = F(x,W) + x (residual + identity)  |  Mitigates vanishing gradients in deep networks

Dropout

Train: zero p% of neurons  |  Infer: keep all (scale by 1−p)  |  Recent: keep dropout active (MC dropout)

GRU (simplified LSTM)

r_t = σ(W_r[h_{t-1},x_t])  |  z_t = σ(W_z[h_{t-1},x_t])
h̃_t = tanh(W[r_t⊙h_{t-1}, x_t])  |  h_t = z_t⊙h_{t-1} + (1−z_t)⊙h̃_t

VAE Reparameterization

z = μ + σ ⊙ ε,  ε ~ N(0, I)  |  Enables backprop through z

Embedding Analogy

v(king)−v(man)+v(woman) ≈ v(queen)  |  cos sim range: [−1, +1]