Loss Functions
MSE
\[\mathcal{L}=\tfrac{1}{B}\textstyle\sum_i(y^{(i)}-\hat y^{(i)})^2\]
Binary Cross-Entropy
\[\mathcal{L}=-\tfrac{1}{B}\sum_i\bigl[y^{(i)}\log\hat p^{(i)}+(1-y^{(i)})\log(1-\hat p^{(i)})\bigr]\]
Categorical CE
\[\mathcal{L}=-\tfrac{1}{B}\sum_i\sum_k y_k^{(i)}\log\hat p_k^{(i)}\]
Neural Style Loss ★ FINAL
\[\mathcal{L}=\alpha\,\mathcal{L}_{content}+\beta\,\mathcal{L}_{style}\]
VAE ELBO Loss
\[\mathcal{L}_{VAE}=\underbrace{\mathbb{E}[\log p(x|z)]}_{\text{recon}}-\underbrace{D_{KL}(q\|p)}_{\text{reg}}\]
B = batch size | k = class index
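The cross-entropy losses above can be checked numerically; a minimal plain-Python sketch (function names are illustrative):

```python
import math

def bce(y, p):
    # Binary cross-entropy averaged over a batch of B samples
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / len(y)

def categorical_ce(Y, P):
    # Categorical CE: one-hot targets Y, predicted distributions P
    return -sum(sum(yk * math.log(pk) for yk, pk in zip(yi, pi))
                for yi, pi in zip(Y, P)) / len(Y)
```

For a confident correct prediction (p = 0.9 on a positive), BCE is −ln 0.9 ≈ 0.105; a uniform guess over two classes gives CE = ln 2 ≈ 0.693.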
Activation Functions
Sigmoid
\[\sigma(z)=\frac{1}{1+e^{-z}},\quad \sigma'=\sigma(1-\sigma)\]
Softmax
\[\text{sm}(z_k)=\frac{e^{z_k}}{\sum_j e^{z_j}}\]
ReLU
relu(z) = max(0, z)
Sign (RNN — Final Q7)
sign(z) = +1 if z≥0, −1 if z<0
ReLU6 (MobileNet)
relu6(z) = min(max(0,z), 6)
Gradient Descent & Backprop
Weight Update (SGD)
\[w \leftarrow w - \eta\,\frac{\partial\mathcal{L}}{\partial w}\]
Chain Rule
\[\frac{\partial g}{\partial x}=\frac{\partial g}{\partial f}\cdot\frac{\partial f}{\partial x}\]
Newton Step (Midterm Q4) ★
\[w \leftarrow w - H^{-1}\nabla\mathcal{L}\]
- η = learning rate | smaller batch → more updates per epoch
- Gradient clipping: cap ‖∇‖ to prevent explosion
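The SGD update and norm clipping above, sketched in plain Python (names are illustrative):

```python
def sgd_step(w, grad, eta=0.1):
    # w ← w − η ∂L/∂w, applied elementwise
    return [wi - eta * gi for wi, gi in zip(w, grad)]

def clip_grad(grad, max_norm):
    # Rescale the gradient if its L2 norm exceeds max_norm
    norm = sum(g * g for g in grad) ** 0.5
    if norm > max_norm:
        return [g * max_norm / norm for g in grad]
    return grad
```

Clipping [3, 4] (norm 5) to max norm 1 gives [0.6, 0.8]: direction preserved, magnitude capped.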
Batch Normalization ★
Normalize
\[\hat x = \frac{x-\mu_B}{\sqrt{\sigma_B^2+\varepsilon}}\]
Scale & Shift (learnable)
\[y = \gamma\hat x + \beta\]
- If γ=1, β=0 → E[x̂]=0, Var[x̂]=1
- Applied after linear, before activation
- Bias is not regularized; μ,σ tracked per batch
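The normalize / scale-and-shift steps for one feature over a batch, as a minimal sketch (function name is illustrative):

```python
def batchnorm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize a batch to zero mean / unit variance, then scale & shift
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    xhat = [(xi - mu) / (var + eps) ** 0.5 for xi in x]
    return [gamma * xh + beta for xh in xhat]
```

With γ=1, β=0 the output of any batch has mean 0 and (up to ε) variance 1.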
Regularization
L2 / Weight Decay (Ridge)
\[\tilde{\mathcal{L}}=\mathcal{L}+\tfrac{\lambda}{2}\|w\|^2\]
L2 Update (shrink)
\[w\leftarrow w(1-\eta\lambda)-\eta\frac{\partial\mathcal{L}}{\partial w}\]
L1 (Sparsity/Lasso)
\[\tilde{\mathcal{L}}=\mathcal{L}+\lambda\|w\|_1\]
- Dropout: randomly zero units during training; keep dropout sampling active at inference (a frozen random mask is wrong)
- Early stopping: regularization in time
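The L2 "shrink" update above, as a plain-Python sketch (function name is illustrative):

```python
def l2_sgd_step(w, grad, eta=0.1, lam=0.01):
    # Weight decay: shrink w by (1 − ηλ), then take the usual gradient step
    return [wi * (1 - eta * lam) - eta * gi for wi, gi in zip(w, grad)]
```

With zero gradient, the weight still decays toward 0 each step (1.0 → 0.999 for η=0.1, λ=0.01), which is exactly the "shrink" effect.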
Weight Initialization
Xavier/Glorot (tanh)
\[W\sim\mathcal{U}\!\left(\!\pm\sqrt{\tfrac{6}{n_{in}+n_{out}}}\right)\]
He/Kaiming (ReLU)
\[W\sim\mathcal{N}\!\left(0,\tfrac{2}{n_{in}}\right)\]
- Bias → always init to 0
- Zero weight init → symmetry problem (all units compute the same gradient)
- Too small → vanishing signals; too large → exploding
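Both initializers can be sketched with the stdlib (function names are illustrative; the Normal's second argument here is the std, √(2/n_in)):

```python
import math, random

def xavier_uniform(n_in, n_out, rng=random.Random(0)):
    # Glorot/Xavier: U(−a, a) with a = sqrt(6 / (n_in + n_out))
    a = math.sqrt(6 / (n_in + n_out))
    return [[rng.uniform(-a, a) for _ in range(n_out)] for _ in range(n_in)]

def he_normal(n_in, n_out, rng=random.Random(0)):
    # He/Kaiming for ReLU: variance 2/n_in, i.e. std = sqrt(2/n_in)
    std = math.sqrt(2 / n_in)
    return [[rng.gauss(0, std) for _ in range(n_out)] for _ in range(n_in)]
```

For n_in = 100, n_out = 50, the Xavier bound is √(6/150) = 0.2, so every weight lies in [−0.2, 0.2].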
Convolutional Networks
Output Spatial Size
\[O=\left\lfloor\frac{W-K+2P}{S}\right\rfloor+1\]
Conv Operation
\[\text{out}[f_o,r,c]=\sum_{f_i}\sum_{i,j}w[f_o,f_i,i,j]\cdot x[f_i,Sr+i,Sc+j]\]
Condition Number (Hessian) ★
\[\kappa=\frac{\lambda_{max}}{\lambda_{min}},\quad H=U\,\text{Diag}(\lambda)\,U^T\]
W = input size, K = kernel, P = padding, S = stride
Params per Conv = K²×C_in×C_out + C_out (bias)
- ResNet skip: y = F(x,W) + x
- 1×1 Conv: channel projection, no spatial
- DepthwiseConv: 1 filter/channel (MobileNet)
- Same padding: P=(K−1)/2
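The output-size and parameter-count formulas above, as a plain-Python sketch (function names are illustrative):

```python
def conv_out_size(W, K, P, S):
    # O = floor((W − K + 2P)/S) + 1
    return (W - K + 2 * P) // S + 1

def conv_params(K, c_in, c_out):
    # K×K kernel per input channel per filter, plus one bias per output channel
    return K * K * c_in * c_out + c_out
```

A 3×3 conv with same padding (P=1, S=1) on a 32-wide input keeps size 32; a 3×3 conv from 3 to 64 channels has 3·3·3·64 + 64 = 1,792 parameters.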
Parameter & Activation Counting ★ FINAL★ MIDTERM
Dense Layer Params
params = n_in × n_out + n_out (bias)
e.g. Dense(300) on input(400): 400×300+300 = 120,300
Activation Size (Final Q5)
size = batch × layer_output_dim
Total = sum over all layers
Conv Params
K² × C_in × C_out + C_out
No params for MaxPool/AvgPool
Final Q5 example: Input(400) → Dense(300, relu) → Dense(100, relu) → Dense(10, softmax)
Params: (400×300+300)+(300×100+100)+(100×10+10) = 120,300+30,100+1,010 = 151,410
Activations: 400 + 300 + 100 + 10 = 810 (per sample)
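The Final Q5 arithmetic above checks out mechanically; a sketch of the counting rule:

```python
def dense_params(n_in, n_out):
    # weight matrix plus one bias per output unit
    return n_in * n_out + n_out

layers = [(400, 300), (300, 100), (100, 10)]  # Final Q5 network
total_params = sum(dense_params(i, o) for i, o in layers)
# activations per sample: the input plus each layer's output
activations = 400 + sum(o for _, o in layers)
```

This reproduces 151,410 parameters and 810 activations per sample.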
Pooling Gradients ★ FINAL Q1
P-Pooling y = (Σ xᵢᵖ)^(1/p)
\[\frac{\partial\mathcal{L}}{\partial x_i} = \delta_y\cdot\frac{x_i^{p-1}}{(\textstyle\sum_j x_j^p)^{(p-1)/p}},\quad \delta_y=\frac{\partial\mathcal{L}}{\partial y}\]
→ p=1: sum pooling (average up to a constant) | p→∞: max pool
Log-Average y = log(1/n Σ exp(xᵢ))
\[\frac{\partial\mathcal{L}}{\partial x_i} = \delta_y\cdot\frac{\exp(x_i)}{\sum_j\exp(x_j)} = \delta_y\cdot\text{softmax}(x)_i\]
gradient is the softmax of the inputs!
AvgPool1D Backprop (Midterm Q3) ★ — stride=window=k
\[\frac{\partial x_4}{\partial x_3}=\sigma'(x_3)\cdot I,\quad \frac{\partial x_3}{\partial x_2}=\frac{1}{k}\cdot\mathbf{1},\quad \frac{\partial x_2}{\partial w}=x_1^T\]
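The log-average "gradient is softmax" fact is easy to verify numerically; a minimal sketch (function name is illustrative):

```python
import math

def log_avg_pool_grad(x):
    # d/dx_i of log((1/n) Σ exp(x_j)) — the 1/n cancels, leaving softmax(x)_i
    exps = [math.exp(xi) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]
```

Equal inputs give uniform gradients; the entries always sum to 1, as any softmax must.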
TF-IDF & Cosine Similarity ★ FINAL Q6
Term Frequency
\[\text{TF}(w,d)=\frac{\text{count}(w,d)}{|d|}\]
Inverse Doc Frequency
\[\text{IDF}(w)=\log\frac{N}{df(w)}\]
TF-IDF
\[\text{TF\text{-}IDF}(w,d)=\text{TF}\times\text{IDF}\]
Cosine Similarity
\[\cos(\mathbf{v},\mathbf{w})=\frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\,\|\mathbf{w}\|}\]
Final Q6 — IDF: word appears in k of N=4 docs → IDF = log(4/k)
Melanoma: appears in all 4 docs → IDF = log(4/4) = 0 | Dermatitis: appears in 3 docs → IDF = log(4/3) ≈ 0.288
TF-IDF = 0 for any word appearing in ALL documents → useless for discrimination
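The Final Q6 numbers follow directly from the definitions; a sketch with docs as token lists (function names are illustrative):

```python
import math

def tf(word, doc):
    # term frequency: count of word in doc, normalized by doc length
    return doc.count(word) / len(doc)

def idf(word, docs):
    # inverse document frequency: log(N / number of docs containing word)
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)
```

A word in all 4 docs gets IDF = ln(4/4) = 0 (so TF-IDF = 0 everywhere); one in 3 of 4 gets ln(4/3) ≈ 0.288.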
Word2Vec Skip-Gram
P(+ | target t, context c)
\[P(+|t,c)=\sigma(\mathbf{t}\cdot\mathbf{c})\]
Objective (+ k negatives)
\[\log\sigma(\mathbf{t}\cdot\mathbf{c})+\sum_{i=1}^k\log\sigma(-\mathbf{t}\cdot\mathbf{n}_i)\]
- Negative sampling from P(w)^{3/4} — boosts rare words relative to raw frequency
- Analogy: v(king)−v(man)+v(woman)≈v(queen)
Classification Metrics ★ FINAL Q2e
Accuracy
(TP+TN) / (TP+TN+FP+FN)
Recall (Sensitivity) ← use for melanoma!
TP / (TP + FN) — minimizes missed cases
Precision
TP / (TP + FP)
Melanoma → use Recall: a FN (missed cancer) is far more dangerous than a FP.
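The three metrics from a confusion matrix, as a minimal sketch (function name is illustrative):

```python
def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)       # of predicted positives, how many are real
    recall = tp / (tp + fn)          # sensitivity: of real positives, how many caught
    return accuracy, precision, recall
```

With TP=8, FP=2, FN=2, TN=88: accuracy 0.96, precision 0.8, recall 0.8. For melanoma, optimizing recall directly penalizes the dangerous FN case.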
Simple RNN ★ FINAL Q7
Hidden State
\[s_t = \tanh(W\,s_{t-1}+U\,x_t)\]
Final Q7 Variant (sign activation)
s_t = sign(W·s_{t-1} + U·x_t)
o_t = ReLU(V·s_t)
- Shapes: x(B,T), W(h,h), U(h,x), V(out,h)
- BPTT: backprop through time (chain rule over t)
- Long seqs → vanishing gradient → use LSTM/GRU
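One hidden-state update of the simple RNN above, sketched with nested lists for a single sample (function name is illustrative):

```python
import math

def rnn_step(s_prev, x, W, U):
    # s_t = tanh(W s_{t−1} + U x_t)
    h = len(s_prev)
    z = [sum(W[i][j] * s_prev[j] for j in range(h)) +
         sum(U[i][k] * x[k] for k in range(len(x)))
         for i in range(h)]
    return [math.tanh(zi) for zi in z]
```

From a zero state with U = [[1]] and input 1, the new state is tanh(1) ≈ 0.762; BPTT just chains this step's Jacobians over t.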
LSTM Gates
Forget Gate
\[f_t=\sigma(W_f[h_{t-1},x_t]+b_f)\]
Input Gate
\[i_t=\sigma(W_i[h_{t-1},x_t]+b_i)\]
Cell State
\[C_t=f_t\odot C_{t-1}+i_t\odot\tanh(W_c[h_{t-1},x_t]+b_c)\]
Output & Hidden
\[o_t=\sigma(W_o[h_{t-1},x_t]+b_o),\quad h_t=o_t\odot\tanh(C_t)\]
⊙ = element-wise multiply | GRU = 2 gates (reset + update, no separate C)
Attention Mechanism
Score
\[e_{ti}=f_{att}(a_i,h_{t-1})\]
Weight (softmax)
\[\alpha_{ti}=\frac{e^{e_{ti}}}{\sum_k e^{e_{tk}}}\]
Context Vector
\[\hat z_t=\sum_i\alpha_{ti}\,a_i\]
Scaled Dot-Product (Transformer)
\[\text{Attn}(Q,K,V)=\text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
K = keys (encoder) | Q = query (decoder) | V = values (encoder)
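Scaled dot-product attention for small nested-list matrices, as a minimal sketch (function name is illustrative):

```python
import math

def attention(Q, K, V):
    # softmax(QKᵀ/√d_k)V, computed row by row
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        m = max(scores)                       # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]    # attention weights
        out.append([sum(a * v[j] for a, v in zip(alphas, V))
                    for j in range(len(V[0]))])
    return out
```

When all keys score equally, the weights are uniform and the output is just the average of the values.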
Neural Style Transfer ★ FINAL Q3
Gram Matrix (Style)
\[G^{[l]}=M^{[l]}\,(M^{[l]})^T,\quad M^{[l]}\in\mathbb{R}^{C_o\times H_oW_o}\]
Total Loss
\[\mathcal{L}=\alpha\,\mathcal{L}_{content}+\beta\,\mathcal{L}_{style}\]
- 3 images: Content image (C), Style image (S), Generated image (G)
- Content loss: MSE between activations of C and G at chosen layer
- Style loss: MSE between Gram matrices of S and G across multiple layers
- Optimize over G (not the network weights) using gradient descent on the image pixels
- Early layers → textures/style | Later layers → semantic content
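The Gram matrix G = M Mᵀ, with M flattened to (C_o, H_o·W_o), is just all pairwise channel dot products; a minimal sketch:

```python
def gram(M):
    # G[i][j] = dot product of channel i and channel j of the flattened
    # activation map M (shape C_o × H_o·W_o)
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in M]
            for row_i in M]
```

G is symmetric and captures which channels co-activate, which is why matching Gram matrices matches texture/style rather than spatial layout.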
VAE
KL Divergence (Gaussian)
\[D_{KL}=-\tfrac{1}{2}\sum_j(1+\log\sigma_j^2-\mu_j^2-\sigma_j^2)\]
Reparameterization Trick
z = μ + σ ⊙ ε, ε ~ N(0, I)
Prior: z ~ N(0, I)
- Loss = reconstruction CE + KL divergence
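The reparameterization trick as code: the randomness lives in ε, so μ and σ stay deterministic and differentiable (function name is illustrative):

```python
import random

def reparameterize(mu, sigma, rng=random.Random(0)):
    # z = μ + σ ⊙ ε with ε ~ N(0, I); gradients flow through μ and σ
    return [m + s * rng.gauss(0, 1) for m, s in zip(mu, sigma)]
```

With σ = 0 the sample collapses to μ exactly, which is a quick sanity check of the formula.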
GAN
Min-Max Objective
\[\min_G\max_D\;\mathbb{E}[\log D(x)]+\mathbb{E}[\log(1-D(G(z)))]\]
- z ~ N(0,I) → generator input
- Use label smoothing: real=0.9, fake=0.1
- Mode collapse / vanishing gradient = key failure modes
- Use LeakyReLU + AvgPool (not ReLU + MaxPool) in D
Transfer Learning ★ FINAL Q2d
VGG19 Strategy (skin condition classifier)
Branch at: block5_pool (last spatial features)
Freeze: block1–block3 (general low-level features)
Train: block4, block5 + new Dense head
New head: Flatten → Dense(256, relu) → Dense(5, softmax)
- More data → unfreeze more layers
- Domain shift (book→phone): use augmentation, fine-tune
- Different aspect ratios → use YOLO or crop-resize per class
Hessian & 2nd-Order Optimization ★ MIDTERM Q4
2D Quadratic Hessian
\[H=\begin{pmatrix}2a&b\\b&2c\end{pmatrix},\quad H^{-1}=\frac{1}{4ac-b^2}\begin{pmatrix}2c&-b\\-b&2a\end{pmatrix}\]
Block-Diagonal Inverse
\[H'=\text{diag}(A,B,\ldots)\Rightarrow H'^{-1}=\text{diag}(A^{-1},B^{-1},\ldots)\]
Complexity
\[H\in\mathbb{R}^{n\times n}:\;\mathcal{O}(n^2)\text{ entries};\;2\times2\text{ blocks}\Rightarrow n/2\text{ blocks}\]
- Block-diagonal H': only O(n) non-zero entries → cheap inversion
- H'⁻¹∇L costs O(n) ops (each 2×2 block times its 2-vector ≈ 4 multiplies)
- Force block structure during training: constrain cross-pair weights to 0
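The cheap block-diagonal Newton step: invert each 2×2 block independently and apply it to its own slice of the gradient, a sketch (function names are illustrative):

```python
def inv2x2(a, b, c, d):
    # Inverse of [[a, b], [c, d]]
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def block_diag_newton_step(blocks, grad):
    # H'⁻¹∇L for 2×2-block-diagonal H': O(n) total work,
    # since each block only touches its own two gradient entries
    step = []
    for k, (a, b, c, d) in enumerate(blocks):
        inv = inv2x2(a, b, c, d)
        g = grad[2 * k:2 * k + 2]
        step.extend([inv[0][0] * g[0] + inv[0][1] * g[1],
                     inv[1][0] * g[0] + inv[1][1] * g[1]])
    return step
```

For H' = diag(2I), the step is simply half the gradient, matching the closed-form inverse.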
DNNs as Data Structures ★ FINAL Q4
Index a list (Q4c)
Embedding layer: index i
→ lookup E[i] (learned row of the weight matrix)
Hash table / Dict (Q4d)
Attention mechanism: query Q
→ softmax(Q·Kᵀ)·V (soft lookup by similarity)
If-then-else (Q4e)
LSTM gates (sigmoid = soft switch)
Mixture-of-Experts: gating network routes input
Batch size ↓ → more epochs needed (Q4a): fewer samples per update → noisier steps → more passes to converge
Dropout at inference (Q4b): recent work shows keeping dropout active at test time acts as Bayesian ensemble → better uncertainty estimates (don't freeze)
YOLO & Object Detection
IoU
\[\text{IoU}=\frac{\text{Intersection}}{\text{Union}}\]
Non-Max Suppression Algorithm
1. Discard boxes with p_c < 0.6
2. Sort the rest by p_c descending
3. Keep the top box; discard any box with IoU > 0.5 against it
4. Repeat with the next remaining box
Quick Reference Index
CNN Output Size
O = ⌊(W−K+2P)/S⌋+1
Same: P=(K−1)/2
Valid: P=0
ResNet
y = F(x,W) + x
(residual + identity)
Solves vanishing gradient
Dropout
Train: zero p% of neurons
Infer: keep all (scale by 1-p)
Recent: keep dropout active
GRU (simplified LSTM)
r_t = σ(W_r[h_{t-1},x_t])
z_t = σ(W_z[h_{t-1},x_t])
h̃_t = tanh(W_h[r_t⊙h_{t-1},x_t])
h_t = z_t⊙h_{t-1} + (1-z_t)⊙h̃_t
VAE Reparameterization
z = μ + σ ⊙ ε
ε ~ N(0, I)
Enables backprop through z
Embedding Analogy
v(king)−v(man)+v(woman)
≈ v(queen)
cos sim range: [−1, +1]