Loss Functions
MSE
\[\mathcal{L}=\tfrac{1}{B}\textstyle\sum_i(y^{(i)}-\hat y^{(i)})^2\]
Binary Cross-Entropy
\[\mathcal{L}=-\tfrac{1}{B}\sum_i\bigl[y^{(i)}\log\hat p^{(i)}+(1-y^{(i)})\log(1-\hat p^{(i)})\bigr]\]
Categorical CE
\[\mathcal{L}=-\tfrac{1}{B}\sum_i\sum_k y_k^{(i)}\log\hat p_k^{(i)}\]
Neural Style Loss ★ FINAL
\[\mathcal{L}=\alpha\,\mathcal{L}_{content}+\beta\,\mathcal{L}_{style}\]
VAE ELBO Loss
\[\mathcal{L}_{VAE}=\underbrace{\mathbb{E}[\log p(x|z)]}_{\text{recon}}-\underbrace{D_{KL}(q\|p)}_{\text{reg}}\]
B = batch size | k = class index
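The cross-entropy losses above can be checked numerically; a minimal plain-Python sketch (function names are illustrative):

```python
import math

def bce(y, p):
    # Binary cross-entropy averaged over a batch of B samples
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / len(y)

def categorical_ce(Y, P):
    # Categorical CE: one-hot targets Y, predicted distributions P
    return -sum(sum(yk * math.log(pk) for yk, pk in zip(yi, pi))
                for yi, pi in zip(Y, P)) / len(Y)
```

For a confident correct prediction (p = 0.9 on a positive), BCE is −ln 0.9 ≈ 0.105; a uniform guess over two classes gives CE = ln 2 ≈ 0.693.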
Activation Functions
Sigmoid
\[\sigma(z)=\frac{1}{1+e^{-z}},\quad \sigma'=\sigma(1-\sigma)\]
Softmax
\[\text{sm}(z_k)=\frac{e^{z_k}}{\sum_j e^{z_j}}\]
ReLU
relu(z) = max(0, z)
Sign (RNN — Final Q7)
sign(z) = +1 if z≥0, −1 if z<0
ReLU6 (MobileNet)
relu6(z) = min(max(0,z), 6)
Gradient Descent & Backprop
Weight Update (SGD)
\[w \leftarrow w - \eta\,\frac{\partial\mathcal{L}}{\partial w}\]
Chain Rule
\[\frac{\partial g}{\partial x}=\frac{\partial g}{\partial f}\cdot\frac{\partial f}{\partial x}\]
Newton Step (Midterm Q4) ★
\[w \leftarrow w - H^{-1}\nabla\mathcal{L}\]
- η = learning rate | smaller batch → more updates per epoch
- Gradient clipping: cap ‖∇‖ to prevent explosion
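The SGD update and norm clipping above, sketched in plain Python (names are illustrative):

```python
def sgd_step(w, grad, eta=0.1):
    # w ← w − η ∂L/∂w, applied elementwise
    return [wi - eta * gi for wi, gi in zip(w, grad)]

def clip_grad(grad, max_norm):
    # Rescale the gradient if its L2 norm exceeds max_norm
    norm = sum(g * g for g in grad) ** 0.5
    if norm > max_norm:
        return [g * max_norm / norm for g in grad]
    return grad
```

Clipping [3, 4] (norm 5) to max norm 1 gives [0.6, 0.8]: direction preserved, magnitude capped.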
Batch Normalization ★
Normalize
\[\hat x = \frac{x-\mu_B}{\sqrt{\sigma_B^2+\varepsilon}}\]
Scale & Shift (learnable)
\[y = \gamma\hat x + \beta\]
- If γ=1, β=0 → E[x̂]=0, Var[x̂]=1
- Applied after linear, before activation
- Bias is not regularized; μ,σ tracked per batch
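The normalize / scale-and-shift steps for one feature over a batch, as a minimal sketch (function name is illustrative):

```python
def batchnorm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize a batch to zero mean / unit variance, then scale & shift
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    xhat = [(xi - mu) / (var + eps) ** 0.5 for xi in x]
    return [gamma * xh + beta for xh in xhat]
```

With γ=1, β=0 the output of any batch has mean 0 and (up to ε) variance 1.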
Regularization
L2 / Weight Decay (Ridge)
\[\tilde{\mathcal{L}}=\mathcal{L}+\tfrac{\lambda}{2}\|w\|^2\]
L2 Update (shrink)
\[w\leftarrow w(1-\eta\lambda)-\eta\frac{\partial\mathcal{L}}{\partial w}\]
L1 (Sparsity/Lasso)
\[\tilde{\mathcal{L}}=\mathcal{L}+\lambda\|w\|_1\]
- Dropout: randomly zero units during training; keep dropout sampling active at inference (a frozen random mask is wrong)
- Early stopping: regularization in time
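The L2 "shrink" update above, as a plain-Python sketch (function name is illustrative):

```python
def l2_sgd_step(w, grad, eta=0.1, lam=0.01):
    # Weight decay: shrink w by (1 − ηλ), then take the usual gradient step
    return [wi * (1 - eta * lam) - eta * gi for wi, gi in zip(w, grad)]
```

With zero gradient, the weight still decays toward 0 each step (1.0 → 0.999 for η=0.1, λ=0.01), which is exactly the "shrink" effect.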
Weight Initialization
Xavier/Glorot (tanh)
\[W\sim\mathcal{U}\!\left(\!\pm\sqrt{\tfrac{6}{n_{in}+n_{out}}}\right)\]
He/Kaiming (ReLU)
\[W\sim\mathcal{N}\!\left(0,\tfrac{2}{n_{in}}\right)\]
- Bias → always init to 0
- Zero weight init → symmetry problem (all units compute the same gradient)
- Too small → vanishing signals; too large → exploding
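Both initializers can be sketched with the stdlib (function names are illustrative; the Normal's second argument here is the std, √(2/n_in)):

```python
import math, random

def xavier_uniform(n_in, n_out, rng=random.Random(0)):
    # Glorot/Xavier: U(−a, a) with a = sqrt(6 / (n_in + n_out))
    a = math.sqrt(6 / (n_in + n_out))
    return [[rng.uniform(-a, a) for _ in range(n_out)] for _ in range(n_in)]

def he_normal(n_in, n_out, rng=random.Random(0)):
    # He/Kaiming for ReLU: variance 2/n_in, i.e. std = sqrt(2/n_in)
    std = math.sqrt(2 / n_in)
    return [[rng.gauss(0, std) for _ in range(n_out)] for _ in range(n_in)]
```

For n_in = 100, n_out = 50, the Xavier bound is √(6/150) = 0.2, so every weight lies in [−0.2, 0.2].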
Convolutional Networks
Output Spatial Size
\[O=\left\lfloor\frac{W-K+2P}{S}\right\rfloor+1\]
Conv Operation
\[\text{out}[f_o,r,c]=\sum_{f_i}\sum_{i,j}w[f_o,f_i,i,j]\cdot x[f_i,Sr+i,Sc+j]\]
Condition Number (Hessian) ★
\[\kappa=\frac{\lambda_{max}}{\lambda_{min}},\quad H=U\,\text{Diag}(\lambda)\,U^T\]
W = input size, K = kernel, P = padding, S = stride
Params per Conv = K²×C_in×C_out + C_out (bias)
- ResNet skip: y = F(x,W) + x
- 1×1 Conv: channel projection, no spatial
- DepthwiseConv: 1 filter/channel (MobileNet)
- Same padding: P=(K−1)/2
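The output-size and parameter-count formulas above, as a plain-Python sketch (function names are illustrative):

```python
def conv_out_size(W, K, P, S):
    # O = floor((W − K + 2P)/S) + 1
    return (W - K + 2 * P) // S + 1

def conv_params(K, c_in, c_out):
    # K×K kernel per input channel per filter, plus one bias per output channel
    return K * K * c_in * c_out + c_out
```

A 3×3 conv with same padding (P=1, S=1) on a 32-wide input keeps size 32; a 3×3 conv from 3 to 64 channels has 3·3·3·64 + 64 = 1,792 parameters.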
Parameter & Activation Counting ★ FINAL★ MIDTERM
Dense Layer Params
params = n_in × n_out + n_out (bias)
e.g. Dense(300) on input(400): 400×300+300 = 120,300
Activation Size (Final Q5)
size = batch × layer_output_dim
Total = sum over all layers
Conv Params
K² × C_in × C_out + C_out
No params for MaxPool/AvgPool
Final Q5 example: Input(400) → Dense(300, relu) → Dense(100, relu) → Dense(10, softmax)
Params: (400×300+300)+(300×100+100)+(100×10+10) = 120,300+30,100+1,010 = 151,410
Activations: 400 + 300 + 100 + 10 = 810 (per sample)
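The Final Q5 arithmetic above checks out mechanically; a sketch of the counting rule:

```python
def dense_params(n_in, n_out):
    # weight matrix plus one bias per output unit
    return n_in * n_out + n_out

layers = [(400, 300), (300, 100), (100, 10)]  # Final Q5 network
total_params = sum(dense_params(i, o) for i, o in layers)
# activations per sample: the input plus each layer's output
activations = 400 + sum(o for _, o in layers)
```

This reproduces 151,410 parameters and 810 activations per sample.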
Pooling Gradients ★ FINAL Q1
P-Pooling y = (Σ xᵢᵖ)^(1/p)
\[\frac{\partial\mathcal{L}}{\partial x_i} = \delta_y\cdot\frac{x_i^{p-1}}{(\textstyle\sum_j x_j^p)^{(p-1)/p}},\quad \delta_y=\frac{\partial\mathcal{L}}{\partial y}\]
→ p=1: sum pooling (average up to a constant) | p→∞: max pool
Log-Average y = log(1/n Σ exp(xᵢ))
\[\frac{\partial\mathcal{L}}{\partial x_i} = \delta_y\cdot\frac{\exp(x_i)}{\sum_j\exp(x_j)} = \delta_y\cdot\text{softmax}(x)_i\]
gradient is the softmax of the inputs!
AvgPool1D Backprop (Midterm Q3) ★ — stride=window=k
\[\frac{\partial x_4}{\partial x_3}=\sigma'(x_3)\cdot I,\quad \frac{\partial x_3}{\partial x_2}=\frac{1}{k}\cdot\mathbf{1},\quad \frac{\partial x_2}{\partial w}=x_1^T\]
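The log-average "gradient is softmax" fact is easy to verify numerically; a minimal sketch (function name is illustrative):

```python
import math

def log_avg_pool_grad(x):
    # d/dx_i of log((1/n) Σ exp(x_j)) — the 1/n cancels, leaving softmax(x)_i
    exps = [math.exp(xi) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]
```

Equal inputs give uniform gradients; the entries always sum to 1, as any softmax must.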
TF-IDF & Cosine Similarity ★ FINAL Q6
Term Frequency
\[\text{TF}(w,d)=\frac{\text{count}(w,d)}{|d|}\]
Inverse Doc Frequency
\[\text{IDF}(w)=\log\frac{N}{df(w)}\]
TF-IDF
\[\text{TF\text{-}IDF}(w,d)=\text{TF}\times\text{IDF}\]
Cosine Similarity
\[\cos(\mathbf{v},\mathbf{w})=\frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\,\|\mathbf{w}\|}\]
Final Q6 — IDF: word appears in k of N=4 docs → IDF = log(4/k)
Melanoma: appears in all 4 docs → IDF = log(4/4) = 0 | Dermatitis: appears in 3 docs → IDF = log(4/3) ≈ 0.288
TF-IDF = 0 for any word appearing in ALL documents → useless for discrimination
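The Final Q6 numbers follow directly from the definitions; a sketch with docs as token lists (function names are illustrative):

```python
import math

def tf(word, doc):
    # term frequency: count of word in doc, normalized by doc length
    return doc.count(word) / len(doc)

def idf(word, docs):
    # inverse document frequency: log(N / number of docs containing word)
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)
```

A word in all 4 docs gets IDF = ln(4/4) = 0 (so TF-IDF = 0 everywhere); one in 3 of 4 gets ln(4/3) ≈ 0.288.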
Word2Vec Skip-Gram
P(+ | target t, context c)
\[P(+|t,c)=\sigma(\mathbf{t}\cdot\mathbf{c})\]
Objective (+ k negatives)
\[\log\sigma(\mathbf{t}\cdot\mathbf{c})+\sum_{i=1}^k\log\sigma(-\mathbf{t}\cdot\mathbf{n}_i)\]
- Negative sampling from P(w)^{3/4} — boosts rare words relative to raw frequency
- Analogy: v(king)−v(man)+v(woman)≈v(queen)
Classification Metrics ★ FINAL Q2e
Accuracy
(TP+TN) / (TP+TN+FP+FN)
Recall (Sensitivity) ← use for melanoma!
TP / (TP + FN) — minimizes missed cases
Precision
TP / (TP + FP)
Melanoma → use Recall: a FN (missed cancer) is far more dangerous than a FP.
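The three metrics from a confusion matrix, as a minimal sketch (function name is illustrative):

```python
def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)       # of predicted positives, how many are real
    recall = tp / (tp + fn)          # sensitivity: of real positives, how many caught
    return accuracy, precision, recall
```

With TP=8, FP=2, FN=2, TN=88: accuracy 0.96, precision 0.8, recall 0.8. For melanoma, optimizing recall directly penalizes the dangerous FN case.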
Simple RNN ★ FINAL Q7
Hidden State
\[s_t = \tanh(W\,s_{t-1}+U\,x_t)\]
Final Q7 Variant (sign activation)
s_t = sign(W·s_{t-1} + U·x_t)
o_t = ReLU(V·s_t)
- Shapes: x(B,T), W(h,h), U(h,x), V(out,h)
- BPTT: backprop through time (chain rule over t)
- Long seqs → vanishing gradient → use LSTM/GRU
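One hidden-state update of the simple RNN above, sketched with nested lists for a single sample (function name is illustrative):

```python
import math

def rnn_step(s_prev, x, W, U):
    # s_t = tanh(W s_{t−1} + U x_t)
    h = len(s_prev)
    z = [sum(W[i][j] * s_prev[j] for j in range(h)) +
         sum(U[i][k] * x[k] for k in range(len(x)))
         for i in range(h)]
    return [math.tanh(zi) for zi in z]
```

From a zero state with U = [[1]] and input 1, the new state is tanh(1) ≈ 0.762; BPTT just chains this step's Jacobians over t.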
LSTM Gates
Forget Gate
\[f_t=\sigma(W_f[h_{t-1},x_t]+b_f)\]
Input Gate
\[i_t=\sigma(W_i[h_{t-1},x_t]+b_i)\]
Cell State
\[C_t=f_t\odot C_{t-1}+i_t\odot\tanh(W_c[h_{t-1},x_t]+b_c)\]
Output & Hidden
\[o_t=\sigma(W_o[h_{t-1},x_t]+b_o),\quad h_t=o_t\odot\tanh(C_t)\]
⊙ = element-wise multiply | GRU = 2 gates (reset + update, no separate C)
Attention Mechanism
Score
\[e_{ti}=f_{att}(a_i,h_{t-1})\]
Weight (softmax)
\[\alpha_{ti}=\frac{e^{e_{ti}}}{\sum_k e^{e_{tk}}}\]
Context Vector
\[\hat z_t=\sum_i\alpha_{ti}\,a_i\]
Scaled Dot-Product (Transformer)
\[\text{Attn}(Q,K,V)=\text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
K = keys (encoder) | Q = query (decoder) | V = values (encoder)
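Scaled dot-product attention for small nested-list matrices, as a minimal sketch (function name is illustrative):

```python
import math

def attention(Q, K, V):
    # softmax(QKᵀ/√d_k)V, computed row by row
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        m = max(scores)                       # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]    # attention weights
        out.append([sum(a * v[j] for a, v in zip(alphas, V))
                    for j in range(len(V[0]))])
    return out
```

When all keys score equally, the weights are uniform and the output is just the average of the values.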
Neural Style Transfer ★ FINAL Q3
Gram Matrix (Style)
\[G^{[l]}=M^{[l]}\,(M^{[l]})^T,\quad M^{[l]}\in\mathbb{R}^{C_o\times H_oW_o}\]
Total Loss
\[\mathcal{L}=\alpha\,\mathcal{L}_{content}+\beta\,\mathcal{L}_{style}\]
- 3 images: Content image (C), Style image (S), Generated image (G)
- Content loss: MSE between activations of C and G at chosen layer
- Style loss: MSE between Gram matrices of S and G across multiple layers
- Optimize over G (not the network weights) using gradient descent on the image pixels
- Early layers → textures/style | Later layers → semantic content
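The Gram matrix G = M Mᵀ, with M flattened to (C_o, H_o·W_o), is just all pairwise channel dot products; a minimal sketch:

```python
def gram(M):
    # G[i][j] = dot product of channel i and channel j of the flattened
    # activation map M (shape C_o × H_o·W_o)
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in M]
            for row_i in M]
```

G is symmetric and captures which channels co-activate, which is why matching Gram matrices matches texture/style rather than spatial layout.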
VAE
KL Divergence (Gaussian)
\[D_{KL}=-\tfrac{1}{2}\sum_j(1+\log\sigma_j^2-\mu_j^2-\sigma_j^2)\]
Reparameterization Trick
z = μ + σ ⊙ ε, ε ~ N(0, I)
Prior: z ~ N(0, I)
- Loss = reconstruction CE + KL divergence
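The reparameterization trick as code: the randomness lives in ε, so μ and σ stay deterministic and differentiable (function name is illustrative):

```python
import random

def reparameterize(mu, sigma, rng=random.Random(0)):
    # z = μ + σ ⊙ ε with ε ~ N(0, I); gradients flow through μ and σ
    return [m + s * rng.gauss(0, 1) for m, s in zip(mu, sigma)]
```

With σ = 0 the sample collapses to μ exactly, which is a quick sanity check of the formula.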
GAN
Min-Max Objective
\[\min_G\max_D\;\mathbb{E}[\log D(x)]+\mathbb{E}[\log(1-D(G(z)))]\]
- z ~ N(0,I) → generator input
- Use label smoothing: real=0.9, fake=0.1
- Mode collapse / vanishing gradient = key failure modes
- Use LeakyReLU + AvgPool (not ReLU + MaxPool) in D
Transfer Learning ★ FINAL Q2d
VGG19 Strategy (skin condition classifier)
Branch at: block5_pool (last spatial features)
Freeze: block1–block3 (general low-level features)
Train: block4, block5 + new Dense head
New head: Flatten → Dense(256, relu) → Dense(5, softmax)
- More data → unfreeze more layers
- Domain shift (book→phone): use augmentation, fine-tune
- Different aspect ratios → use YOLO or crop-resize per class
Hessian & 2nd-Order Optimization ★ MIDTERM Q4
2D Quadratic Hessian
\[H=\begin{pmatrix}2a&b\\b&2c\end{pmatrix},\quad H^{-1}=\frac{1}{4ac-b^2}\begin{pmatrix}2c&-b\\-b&2a\end{pmatrix}\]
Block-Diagonal Inverse
\[H'=\text{diag}(A,B,\ldots)\Rightarrow H'^{-1}=\text{diag}(A^{-1},B^{-1},\ldots)\]
Complexity
\[H\in\mathbb{R}^{n\times n}:\;\mathcal{O}(n^2)\text{ entries};\;2\times2\text{ blocks}\Rightarrow n/2\text{ blocks}\]
- Block-diagonal H': only O(n) non-zero entries → cheap inversion
- H'⁻¹∇L costs O(n) ops (each 2×2 block times its 2-vector ≈ 4 multiplies)
- Force block structure during training: constrain cross-pair weights to 0
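The cheap block-diagonal Newton step: invert each 2×2 block independently and apply it to its own slice of the gradient, a sketch (function names are illustrative):

```python
def inv2x2(a, b, c, d):
    # Inverse of [[a, b], [c, d]]
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def block_diag_newton_step(blocks, grad):
    # H'⁻¹∇L for 2×2-block-diagonal H': O(n) total work,
    # since each block only touches its own two gradient entries
    step = []
    for k, (a, b, c, d) in enumerate(blocks):
        inv = inv2x2(a, b, c, d)
        g = grad[2 * k:2 * k + 2]
        step.extend([inv[0][0] * g[0] + inv[0][1] * g[1],
                     inv[1][0] * g[0] + inv[1][1] * g[1]])
    return step
```

For H' = diag(2I), the step is simply half the gradient, matching the closed-form inverse.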
DNNs as Data Structures ★ FINAL Q4
Index a list (Q4c)
Embedding layer: index i
→ lookup E[i] (learned row of the weight matrix)
Hash table / Dict (Q4d)
Attention mechanism: query Q
→ softmax(Q·Kᵀ)·V (soft lookup by similarity)
If-then-else (Q4e)
LSTM gates (sigmoid = soft switch)
Mixture-of-Experts: gating network routes input
Batch size ↓ → more epochs needed (Q4a): fewer samples per update → noisier steps → more passes to converge
Dropout at inference (Q4b): recent work shows keeping dropout active at test time acts as Bayesian ensemble → better uncertainty estimates (don't freeze)
YOLO & Object Detection
IoU
\[\text{IoU}=\frac{\text{Intersection}}{\text{Union}}\]
Non-Max Suppression Algorithm
1. Discard boxes with p_c < 0.6
2. Sort the rest by p_c descending
3. Keep the top box; discard any box with IoU > 0.5 against it
4. Repeat with the next remaining box
Quick Reference Index
CNN Output Size
O = ⌊(W−K+2P)/S⌋+1
Same: P=(K−1)/2
Valid: P=0
ResNet
y = F(x,W) + x
(residual + identity)
Solves vanishing gradient
Dropout
Train: zero p% of neurons
Infer: keep all (scale by 1-p)
Recent: keep dropout active
GRU (simplified LSTM)
r_t = σ(W_r[h_{t-1},x_t])
z_t = σ(W_z[h_{t-1},x_t])
h̃_t = tanh(W_h[r_t⊙h_{t-1},x_t])
h_t = z_t⊙h_{t-1} + (1-z_t)⊙h̃_t
VAE Reparameterization
z = μ + σ ⊙ ε
ε ~ N(0, I)
Enables backprop through z
Embedding Analogy
v(king)−v(man)+v(woman)
≈ v(queen)
cos sim range: [−1, +1]