
ELEN 521 — Deep Learning Study Guide

Lectures 3 through 9 · Final Exam Prep · Prof. Claudionor N. Coelho Jr
Lec 3: CNNs
Lec 4: Architectures
Lec 6: Transfer & Apps
Lec 7: Embeddings & RNN
Lec 8: RNN Apps
Lec 9: Autoencoders & GANs
03
CNNs & Deep Learning Fundamentals
ML vs DL · DNN Structure · Activation Functions · Loss Functions · Backpropagation · Conv Layer · Pooling · BatchNorm · Dropout · Softmax
⚠ EXAM FOCUS (from your midterm): BatchNorm statistics (E[X], V[X]), ReLU output counting, backprop derivatives with chain rule, operations counting in networks.
💡
Core Concepts — Simple Explanations

ML vs Deep Learning

Traditional ML: YOU manually engineer features (histograms, gradients, shape context), then feed to a model.

Deep Learning: The network learns features automatically from raw data. No manual feature engineering needed.

Analogy: ML = teaching a child what to look for. DL = letting the child figure it out themselves from examples.

Universal Approximation Theorem

A neural network with even ONE hidden layer can approximate ANY continuous function — if you give it enough neurons. In practice, we use multiple layers (deeper = more efficient).

y0·w1 + b1 ≥ 0 defines a half-space (one side of a hyperplane) — relu keeps that side and zeroes the other. Stack enough half-spaces → cut out any shape.

Activation Functions

Non-linearities that let networks learn complex patterns. Without them, stacking linear layers = still one linear layer.

ReLU: max(0,x) — most common. Sigmoid: 1/(1+e⁻ˣ) — squashes to [0,1]. Tanh: squashes to [-1,1]. Softmax: converts scores to probabilities summing to 1.
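The four activations above can be sketched in a few lines of NumPy (my own illustration, not slide code; the function names are mine):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)        # zero out negatives

def sigmoid(x):
    return 1 / (1 + np.exp(-x))    # squash to (0, 1)

def tanh(x):
    return np.tanh(x)              # squash to (-1, 1)

def softmax(x):
    e = np.exp(x - np.max(x))      # subtract max for numerical stability
    return e / e.sum()             # probabilities summing to 1

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))           # [0. 0. 3.]
print(softmax(x).sum())  # 1.0
```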

Loss Functions

Measure how wrong the network is. We minimize these during training.

MSE → regression. Binary Cross-Entropy → 2-class classification. Cross-Entropy → multi-class classification.

Backpropagation

Algorithm to compute gradients of the loss w.r.t. ALL weights. Uses the chain rule to propagate error backwards through layers.

Like tracing blame backwards: "the final error happened because this layer was off... because that earlier layer was off..."

Convolutional Layer

Instead of fully-connected (every input → every neuron), we slide a small filter over the image. Same filter weights used everywhere → translation invariance.

Like a spotlight (filter) scanning a stage (image). It fires when it sees the pattern it was trained to detect — wherever that pattern appears.

Pooling Layer

Downsamples feature maps. MaxPool takes the biggest value in a region. AvgPool takes the average. Reduces spatial size and provides some noise immunity.

Taking the "best" or "average" from a neighborhood — like summarizing a paragraph into its key point.

Batch Normalization

Normalizes the outputs of intermediate layers so they have mean≈0 and variance≈1. Learned scale (γ) and shift (β) let the network un-normalize if needed. Speeds up training dramatically.

Like standardizing test scores to a curve — keeps all students (neurons) on the same scale so nobody dominates.

Dropout

During training, randomly "turn off" neurons with probability p. This forces the network to not rely on any single neuron — acts as an ensemble of smaller networks. Turned OFF at test time.

Like studying for an exam without relying on one specific note — forces your brain to learn redundant pathways.

Softmax (Output Layer)

For multi-class classification, converts raw scores (logits) into probabilities. Sum of all outputs = 1. Pick class with highest probability.

Voting: each class gets votes proportional to its score. Softmax just normalizes so votes add up to 100%.
All Math & Derivations
Every formula from Lec 3
Loss Functions
MSE: L(y,p) = (1/B) Σᵢ (y⁽ⁱ⁾ - p⁽ⁱ⁾)²
BCE: L(y,p) = -(1/B) Σᵢ [y⁽ⁱ⁾ log p⁽ⁱ⁾ + (1-y⁽ⁱ⁾) log(1-p⁽ⁱ⁾)]
CE: L(y,p) = -(1/B) Σᵢ Σₖ yₖ⁽ⁱ⁾ log pₖ⁽ⁱ⁾
B = batch size, y = true label, p = predicted probability, k = class index
Gradient Descent (Weight Update)
w = w - η · ∂L(y,p)/∂w
η (eta) = learning rate. This is how we adjust weights to reduce loss. Stochastic = uses mini-batches, not full dataset.
Backpropagation — Chain Rule
∂g/∂x = (∂g/∂f) · (∂f/∂x)
∂g/∂y = (∂g/∂f) · (∂f/∂y)
If g depends on f, and f depends on x: the gradient "flows backward" through each function. Multiply the upstream gradient by the local gradient at each step.
  • Forward pass: compute and save all intermediate values
  • Backward pass: start from loss, apply chain rule layer by layer backwards
  • Each layer: grad_input = grad_output × local_derivative
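The three steps above can be traced on a tiny two-function example (my own illustration, not slide code): let f = x·y and g = f², so ∂g/∂x = (∂g/∂f)·(∂f/∂x).

```python
# Forward pass: compute and SAVE intermediates
x, y = 3.0, 4.0
f = x * y        # f = 12
g = f ** 2       # g = 144 (plays the role of the loss)

# Backward pass: start from g, apply chain rule backwards
dg_df = 2 * f            # upstream gradient at f
df_dx, df_dy = y, x      # local gradients of f
dg_dx = dg_df * df_dx    # grad_input = grad_output × local_derivative
dg_dy = dg_df * df_dy

print(dg_dx, dg_dy)  # 96.0 72.0
```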
Forward Pass Through a DNN (Layer by Layer)
y0 = x (input)
y1 = relu(y0·w1 + b1) (hidden layer 1)
y2 = relu(y1·w2 + b2) (hidden layer 2)
y3 = softmax(y2·w3 + b3) (output layer)
At the last layer: use linear output for regression, sigmoid for binary class., softmax for multi-class.
Convolutional Layer — Output Size Formula
R_out = floor((R_in - K + P*(K-1)) / S) + 1
R_in=input size, K=kernel size, P=padding (0 for 'valid', 1 for 'same'), S=stride
Example: 32×32 input, 3×3 kernel, stride=1, padding='same' → R_out = floor((32-3+2)/1)+1 = 32
Conv Layer — Number of Operations (MACs)
Ops = R_out × C_out × N_filters × M_filters × K × K
For each output pixel (R×C), for each output filter (N), for each input filter (M), apply K×K kernel. This is the inner loop in the code.
Softmax Formula
P(y=j|x) = e^(x·wⱼ) / Σₖ e^(x·wₖ)
Numerically stable version: subtract max before exponentiating. All outputs sum to 1. Used for multi-class classification output layer.
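The "subtract max" trick can be checked directly — a naive `exp` of these logits would overflow, while the shifted version stays finite (my sketch, not slide code):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)    # shift so the largest logit is 0
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0])  # naive np.exp(logits) overflows
p = softmax(logits)
print(p.sum())     # 1.0
print(p.argmax())  # 2 — shifting never changes the ranking
```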
Dropout — Expected Value Proof
x1 ~ Bernoulli(p), x2 ~ Gaussian(μ, σ)
Output = (x1 · x2) / p
E[Output] = E[x1]·E[x2] / p = p·μ / p = μ
We divide by p to keep the expected value the same at test time (when dropout is off). This is why Keras/TF handle dropout automatically at inference.
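The expected-value argument can be checked by simulation (my sketch; note that p here is the KEEP probability, matching E[x1] = p in the proof — Keras' `Dropout(rate)` argument is the drop probability):

```python
import numpy as np

rng = np.random.default_rng(0)
p, mu, sigma = 0.8, 5.0, 2.0
n = 1_000_000

keep = rng.binomial(1, p, n)      # x1 ~ Bernoulli(p): 1 = neuron kept
x = rng.normal(mu, sigma, n)      # x2 ~ Gaussian(mu, sigma)
out = keep * x / p                # divide by p to preserve the expectation

print(out.mean())                 # ≈ mu = 5.0
```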
Convolutional Layer — 6-Nested Loop (Code Pseudocode)
for row in R:
  for col in C:
    for to in N:        # output channels
      for ti in M:      # input channels
        for i in K:
          for j in K:
            output[to][row][col] += weights[to][ti][i][j] * input[ti][S*row+i][S*col+j]
S = stride. This shows why convolutions are so parallelizable — all output pixels are independent.
A
All Abbreviations — Lec 3
Abbr | Full Name | What It Does
DNN | Deep Neural Network | Network with many layers of neurons
CNN | Convolutional Neural Network | Uses conv filters for spatial data (images)
RNN | Recurrent Neural Network | Has memory, processes sequences
RBM | Restricted Boltzmann Machine | Generative model (energy-based)
ReLU | Rectified Linear Unit | Activation: max(0,x)
ELU | Exponential Linear Unit | Smoother version of ReLU
PReLU | Parametric ReLU | Leaky ReLU with learnable slope
SGD | Stochastic Gradient Descent | Optimizer using mini-batches
MSE | Mean Squared Error | Loss for regression
BCE | Binary Cross Entropy | Loss for 2-class classification
CE | Cross Entropy | Loss for multi-class classification
BN | Batch Normalization | Normalizes layer activations
FC | Fully Connected | Every neuron connected to every input
MaxPool | Max Pooling | Takes max in spatial region
AvgPool | Average Pooling | Takes average in spatial region
η (eta) | Learning Rate | Step size in gradient descent
B | Batch Size | Number of samples per gradient update
K | Kernel Size | Size of conv filter (e.g., 3×3)
S | Stride | Step size when sliding filter
M,N | Input/Output Feature Maps | Number of channels in/out of conv layer
</>
Code Covered — Lec 3
Conv Layer 6-Loop (Pseudocode from Slides)
for row in range(R):
  for col in range(C):
    for to in range(N):   # output channels
      for ti in range(M):   # input channels
        for i in range(K):
          for j in range(K):
            output_fm[to][row][col] += \
              weights[to][ti][i][j] * input_fm[ti][S*row+i][S*col+j]
Building a CNN in Keras (AlexNet-style)
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, BatchNormalization,
                                     Flatten, Dense, Dropout)

x_in = Input(shape=(227, 227, 3))
x = Conv2D(96, (11,11), strides=4, activation='relu')(x_in)
x = MaxPooling2D((3,3), strides=2)(x)
x = BatchNormalization()(x)
x = Conv2D(256, (5,5), padding='same', activation='relu')(x)
x = MaxPooling2D((3,3), strides=2)(x)
x = Flatten()(x)
x = Dense(4096, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(1000, activation='softmax')(x)
model = Model(x_in, x)
Output Size Calculation
import math
def conv_output_size(R_in, K, S, padding):
    P = 0 if padding == 'valid' else 1
    return math.floor((R_in - K + P*(K-1)) / S) + 1

# Example: 32x32 input, 3x3 filter, stride=1, 'same' padding
print(conv_output_size(32, 3, 1, 'same'))  # → 32
print(conv_output_size(32, 3, 1, 'valid')) # → 30
📋
CHEAT SHEET — Lec 3 (Print This!)

ELEN 521 — Lec 3 Cheat Sheet: CNNs & DL Basics

Loss Functions

MSE = (1/B)Σ(y-p)²
Regression
BCE = -(1/B)Σ[y·log(p) + (1-y)·log(1-p)]
Binary classification
CE = -(1/B)ΣΣ yₖ·log(pₖ)
Multi-class classification

Weight Update (SGD)

w = w - η · ∂L/∂w
η = learning rate

Chain Rule (Backprop)

∂g/∂x = (∂g/∂f)·(∂f/∂x)
Multiply upstream grad by local grad

Softmax

P(y=j|x) = e^(x·wⱼ) / Σₖ e^(x·wₖ)
Multi-class output layer

Conv Output Size

R_out = floor((R_in - K + P(K-1))/S) + 1
P=0 valid, P=1 same. S=stride, K=kernel

Dropout (Expected Value)

Output = x1·x2/p, E[Output] = μ
x1~Bernoulli(p), x2~Gaussian(μ,σ). Divide by p to preserve expectation!

Activation Functions

ReLU: max(0,x) | Sigmoid: 1/(1+e⁻ˣ)
Tanh: (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ)

Key Layer Types

Dense/FC Conv2D MaxPool2D BatchNorm Dropout Flatten Softmax

CNN = Conv Layers + FC Layers

Front: Conv layers learn visual features (edges→textures→objects)
Back: FC layers do classification
04
CNN Architectures, Initialization & Regularization
AlexNet/VGG · GoogleNet/Inception · ResNet · MobileNet · Xavier/He Init · L1/L2 Reg · Bias-Variance · Early Stopping · Train/Val/Test
⚠ EXAM FOCUS: Network operations complexity O(N), number of layers L(N), minimizing data requirements, initialization choices (Xavier vs He), regularization effects on weights.
💡
Core Concepts

AlexNet / CaffeNet

Input: 3×227×227. Stack: Conv → ReLU → MaxPool → Norm → ... → FC → Softmax. First deep net to win ImageNet (2012). Used GPU training + Dropout.

VGG

Very deep network using only 3×3 conv filters. The insight: two 3×3 convs = one 5×5 conv's receptive field, but fewer parameters. Goes up to 19 layers.

Key insight: depth beats width. Small filters, stacked deep.

GoogleNet / Inception Module

Instead of choosing 1×1, 3×3, or 5×5 filters — use ALL in parallel and concatenate. Uses 1×1 convolutions as "bottlenecks" to reduce computation before expensive 3×3 and 5×5 convs.

Multiple scales of feature detection happening simultaneously, then combined.

ResNet — Skip Connections

Problem: as networks get deeper, gradients vanish. Solution: add the input directly to the output of a block (skip connection): y = F(x) + x. The block only has to learn the residual F(x), not the whole mapping.

ResNet ≈ solving ODEs with Euler method: each block is a tiny step forward. Connection to differential equations: dy/dt = F(y).

MobileNet — Depthwise Separable Conv

Split standard conv into: (1) Depthwise conv (filter each channel independently) + (2) Pointwise 1×1 conv (combine channels). Dramatically reduces computation.

Instead of one job doing everything, split into two specialized mini-jobs.

MobileNet v2 — Inverted Residuals

Bottleneck blocks that EXPAND first (using 1×1), then depthwise, then COMPRESS (using 1×1). Add skip connection on the compressed (narrow) ends. Opposite of regular bottleneck.

Weight Initialization

Zeros → all neurons learn the same thing (symmetry problem). Random small → works better. Xavier/Glorot → good for symmetric activations (tanh). He → good for ReLU/asymmetric.

Wrong init = students starting with the same wrong answer. Can't unlearn it together.

Vanishing / Exploding Gradients

In deep nets, gradients can shrink to ≈0 (vanishing) or blow up to ∞ (exploding) as they travel backwards. Cause: repeated multiplication of small/large numbers. Fix: proper init, BatchNorm, skip connections.

L2 Regularization (Weight Decay)

Add λ·||w||² to the loss. Effect: weights are pulled towards zero (shrunk) at every update. Also called Tikhonov regularization or ridge regression.

L1 Regularization (Lasso)

Add λ·||w||₁ to the loss. Effect: many weights become exactly 0 (sparsity). Acts as feature selection.

Bias-Variance Trade-off

High bias = underfitting (model too simple). High variance = overfitting (model memorizes training data). Goal: find the sweet spot.

High bias: a rule that says "all dogs are beige". High variance: memorized every specific dog you've ever seen.

Train / Validation / Test Sets

Train: learn weights. Validation: tune hyperparameters (architecture, lr, reg). Test: final evaluation ONCE. Rule: Train > 1M → val/test can be 1% (≈10k samples).

Early Stopping

Stop training when validation loss stops improving. Prevents overfitting. Simple and very effective regularization.

Bagging / Ensemble Methods

Train multiple models on different bootstrap samples of data. Average their predictions. Different models make different mistakes → averaging reduces errors.

All Math & Derivations — Lec 4
Xavier / Glorot Initialization
W ~ Uniform[-√(6/(nᵢₙ+nₒᵤₜ)), √(6/(nᵢₙ+nₒᵤₜ))]
OR: W ~ Normal(0, √(2/(nᵢₙ+nₒᵤₜ)))
nᵢₙ = number of inputs to layer, nₒᵤₜ = number of outputs. Keeps variance constant across layers for symmetric activations (tanh).
He Initialization
W ~ Normal(0, √(2/nᵢₙ))
Use for ReLU/asymmetric activations. The √2 factor accounts for ReLU zeroing out half the values. This maintains variance after ReLU.
L2 Regularization — Weight Update
Ω(w) = (1/2)·λ·wᵀw
∂Ω/∂w = λ·w
w ← w - η·(∂L_train/∂w + λ·w)
w ← (1 - η·λ)·w - η·∂L_train/∂w ← SHRINK then update
The (1-η·λ) factor shrinks weights at every step → "weight decay". This is why it's called that!
L1 Regularization
Ω(w) = λ·||w||₁ = λ·Σᵢ|wᵢ|
Gradient: sign(w). This pushes weights to exactly 0 → sparsity. Feature selection behavior.
Condition Number (Poor Conditioning)
κ = λ_max / λ_min
For Hessian H = U·Diag(λ₁,...,λₙ)·Uᵀ: condition number = ratio of largest to smallest eigenvalue. Large κ → highly elongated loss surface → slow convergence and oscillation in SGD.
Inception Module — Parameter Count Reduction
Standard 5×5 conv: 5×5×M×N params
With 1×1 bottleneck (to B channels): 1×1×M×B + 5×5×B×N params
If B << M: massive savings! E.g., M=128, B=16, N=256: standard=819K, bottleneck=105K → 8× reduction!
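The worked example can be reproduced in a few lines (my sketch; `conv_params` is a made-up helper, and biases are ignored):

```python
def conv_params(k, m, n):
    return k * k * m * n   # kernel² × in-channels × out-channels (no bias)

M, B, N = 128, 16, 256
standard   = conv_params(5, M, N)                         # 5×5 conv directly
bottleneck = conv_params(1, M, B) + conv_params(5, B, N)  # 1×1 reduce, then 5×5

print(standard)              # 819200  (≈ 819K)
print(bottleneck)            # 104448  (≈ 105K)
print(standard / bottleneck) # ≈ 7.8× reduction
```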
MobileNet — Depthwise Separable Conv Cost
Standard conv: K²·M·N (K=kernel, M=in-channels, N=out-channels)
Depthwise: K²·M + Pointwise: M·N
Ratio = 1/N + 1/K² ≈ 1/9 for K=3 (nearly 9× cheaper!)
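The cost ratio formula can be verified numerically (my sketch; per-output-pixel MAC counts, bias terms ignored):

```python
def standard_cost(K, M, N):
    return K * K * M * N        # full conv: every filter sees every channel

def dw_separable_cost(K, M, N):
    return K * K * M + M * N    # depthwise (K²·M) + pointwise 1×1 (M·N)

K, M, N = 3, 128, 128
ratio = dw_separable_cost(K, M, N) / standard_cost(K, M, N)
print(ratio)                    # ≈ 1/N + 1/K² ≈ 0.119 for K=3
```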
A
Abbreviations — Lec 4
Abbr | Full Name | Key Point
VGG | Visual Geometry Group | Deep net, only 3×3 filters (Oxford)
ResNet | Residual Network | Skip connections to fight vanishing gradients
mHC | mHC (doubly stochastic) | Sinkhorn-Knopp normalization for ResNet in 2026
ReLU6 | ReLU capped at 6 | min(max(0,x), 6) — used in MobileNet
GELU | Gaussian Error Linear Unit | Smoother activation, used in Transformers
L1 | L1 norm penalty (Lasso) | Encourages sparsity (exact zeros)
L2 | L2 norm penalty (Ridge/Tikhonov) | Weight decay, drives weights toward 0
i.i.d. | Independent and Identically Distributed | Assumption that train/test come from same distribution
κ | Condition Number | λ_max/λ_min — how badly conditioned the loss surface is
DW | Depthwise Conv | Filter each input channel separately
PW | Pointwise Conv (1×1 Conv) | Combine channels linearly
</>
Code — Lec 4
GoogleNet Inception Module
# Parallel branches, then concatenate
# (assumes: from tensorflow.keras.layers import Conv2D, MaxPool2D, Concatenate)
conv_1x1 = Conv2D(filters_1x1, (1,1), padding='same', activation='relu')(x)
conv_3x3 = Conv2D(filters_3x3_reduce, (1,1), padding='same', activation='relu')(x)
conv_3x3 = Conv2D(filters_3x3, (3,3), padding='same', activation='relu')(conv_3x3)
conv_5x5 = Conv2D(filters_5x5_reduce, (1,1), padding='same', activation='relu')(x)
conv_5x5 = Conv2D(filters_5x5, (5,5), padding='same', activation='relu')(conv_5x5)
pool_proj = MaxPool2D((3,3), strides=(1,1), padding='same')(x)
pool_proj = Conv2D(filters_pool_proj, (1,1), padding='same', activation='relu')(pool_proj)
output = Concatenate(axis=3)([conv_1x1, conv_3x3, conv_5x5, pool_proj])
MobileNet v2 Inverted Residual Block
# assumes: from tensorflow.keras.layers import Conv2D, DepthwiseConv2D, Add
def inverted_residual_block(x, expand=64, squeeze=16):
    # x must already have `squeeze` channels so the skip Add matches shapes
    m = Conv2D(expand, (1,1), activation='relu')(x)                   # EXPAND
    m = DepthwiseConv2D((3,3), padding='same', activation='relu')(m)  # DEPTHWISE
    m = Conv2D(squeeze, (1,1), activation='relu')(m)                  # SQUEEZE
    return Add()([m, x])  # Skip connection on NARROW end
Weight Initialization (NumPy)
import numpy as np

# BAD - zeros initialization (symmetry problem)
w_bad = np.zeros((n_in, n_out))

# Xavier/Glorot - for tanh activation
limit = np.sqrt(6.0 / (n_in + n_out))
w_xavier = np.random.uniform(-limit, limit, (n_in, n_out))

# He initialization - for ReLU activation
w_he = np.random.normal(0, np.sqrt(2.0 / n_in), (n_in, n_out))

# In Keras:
Dense(64, activation='relu', kernel_initializer='he_normal')
Dense(64, activation='tanh', kernel_initializer='glorot_uniform')
📋
CHEAT SHEET — Lec 4

ELEN 521 — Lec 4: Architectures, Init, Regularization

Initialization Rules

Xavier/Glorot: W ~ N(0, σ² = 2/(nᵢₙ+nₒᵤₜ))
Use with: tanh, sigmoid (symmetric)
He: W ~ N(0, σ² = 2/nᵢₙ)
Use with: ReLU, Leaky-ReLU (asymmetric)
Bias: always initialize to 0

L2 Regularization

w ← (1-η·λ)·w - η·∂L/∂w
Shrinks weights each step = "weight decay"

L1 Regularization

Ω(w) = λ·Σ|wᵢ| → sparsity (zeros!)

Condition Number

κ = λ_max / λ_min of Hessian
Large κ = bad conditioning, slow convergence

Architecture Shortcuts

AlexNet: Conv→ReLU→Pool→Norm (deep but wide)
VGG: Only 3×3 filters, very deep, simple
Inception: parallel 1×1, 3×3, 5×5 + MaxPool → concat
ResNet: y = F(x) + x (skip connection)
Solves vanishing gradient. Deeper=better.
MobileNet v2: Expand→DW→Squeeze + skip
Inverted residual on NARROW ends

Cost Comparison

DW-Sep = 1/N + 1/K² of standard (≈1/9 for K=3)

Train/Val/Test

Train: learn weights | Val: tune hyper-params | Test: final eval (use once!)
06
Problems, Transfer Learning & Advanced Applications
Design Patterns · Transfer Learning · Visualization · Style Transfer · YOLO · IoU · NMS · U-Net · DeconvNet
💡
Core Concepts

Transfer Learning

Take a pre-trained network (e.g., ResNet trained on ImageNet) and fine-tune it on your small dataset. Freeze early layers (they learn generic features), only train later layers on your data.

Like hiring someone who already knows how to read and write — you just teach them your specific domain vocabulary.

Freeze-Drop-Path

Freeze layers at the bottom (generic features), use Dropout on the path, and only fine-tune the top layers for your new task.

Image Augmentation

Artificially expand your dataset: mirror images, add distortions, blur, change colors. Forces the network to be invariant to these transformations. Very effective!

U-Net (Segmentation)

Encoder-decoder architecture with skip connections between encoder and decoder at each scale. State-of-the-art for image segmentation. "U" shape = downsampling then upsampling.

DeconvNet

Reverse of convolution — upsamples feature maps back to original image size. Used for segmentation and visualizing what a network has learned.

Neural Style Transfer

Combine CONTENT of one image with STYLE of another. Use content loss (match activations) + style loss (match Gram matrices of activations) and optimize the output image.

YOLO (You Only Look Once)

Object detection: divide image into grid cells, each predicts bounding boxes + class probabilities. Single forward pass = very fast. Outputs: (bx, by, bw, bh, pc, class probs).

Instead of sliding a classifier over every possible region (slow), look at the whole image once with a CNN.

IoU — Intersection over Union

Measures overlap between predicted box and ground truth box. IoU = Area(intersection) / Area(union). Used to decide if a detection is correct (threshold usually 0.5).

Non-Max Suppression (NMS)

After YOLO predicts many overlapping boxes: (1) discard boxes with confidence <0.6, (2) order by confidence, (3) keep highest, discard any that overlap with it by IoU>0.5, (4) repeat.
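The four steps above can be sketched in plain Python (my own illustration, not course code; boxes are assumed to be (x1, y1, x2, y2) corner coordinates):

```python
def iou(a, b):
    # Intersection over Union of two boxes (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union

def nms(boxes, scores, conf_thresh=0.6, iou_thresh=0.5):
    # step 1: discard low-confidence; step 2: sort high→low
    cand = sorted(((s, b) for s, b in zip(scores, boxes) if s >= conf_thresh),
                  reverse=True)
    kept = []
    while cand:                       # steps 3-4: keep best, drop overlaps, repeat
        _, best = cand.pop(0)
        kept.append(best)
        cand = [(s, b) for s, b in cand if iou(best, b) <= iou_thresh]
    return kept

boxes  = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the two heavily overlapping boxes collapse to one
```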

Double Descent

Modern finding: overfitting is NOT always bad. If you use a very large model (way beyond data capacity), test error can DECREASE again — the model finds smooth interpolating solutions.

Debugging Networks

Print statistics of intermediate layers: min/max of outputs. If a layer has all same values → dead filter. If ranges are wrong → initialization issue or missing normalization.

CNNs Detect Texture, Not Shape

Important finding: CNNs trained on ImageNet primarily learn texture statistics, not object shape. A cat image with dog texture → classified as dog!

Math — Lec 6
Neural Style Transfer — Gram Matrix
GM[l] = M[l] · M[l]ᵀ (entry GM[l]ᵢⱼ = correlation between channels i and j)
M[l]: feature map at layer l, shape = (Nₗ × Mₗ) where Nₗ=channels, Mₗ=H×W. Gram matrix captures correlations between feature maps → encodes "style" (texture patterns).
Neural Style Transfer — Loss Function
Total Loss = α·L_content(C, G) + β·L_style(S, G)
C = content image, S = style image, G = generated image. Optimize G by gradient descent! α, β control balance between content and style.
YOLO — Bounding Box Encoding
Output per cell: [pc, bx, by, bw, bh, c1, c2, ..., ck]
pc = object confidence (0-1). bx, by ∈ [0,1] = box center relative to cell. bw, bh can be > 1 (relative to cell size). c1...ck = class probabilities.
IoU Formula
IoU = Area(B₁ ∩ B₂) / Area(B₁ ∪ B₂)
IoU = 1 → perfect overlap. IoU = 0 → no overlap. Use threshold 0.5 for NMS. Use higher (0.75) for stricter detection quality.
Visualizing Network — Gradient Ascent on Image
loss = MSE(layer[i].output)
x = x + η · ∂loss/∂x_input (gradient ASCENT, not descent)
Start with random/noise image, update it to maximize a neuron's activation. Shows what pattern that neuron looks for.
A
Abbreviations — Lec 6
Abbr | Full Name | Key Point
YOLO | You Only Look Once | Fast object detection (1 forward pass)
IoU | Intersection over Union | Bounding box overlap metric
NMS | Non-Max Suppression | Remove duplicate bounding box predictions
SOTA | State Of The Art | Best known performance on benchmark
U-Net | U-shaped Network | Encoder-decoder with skip connections for segmentation
GM | Gram Matrix | Captures style (feature map correlations)
SSL | Self-Supervised Learning | Use raw data as supervision (no labels)
CLR | Cyclical Learning Rates | Oscillate lr between bounds during training
</>
Code — Lec 6
Transfer Learning with Frozen Layers
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense
from tensorflow.keras import Model

base = ResNet50(weights='imagenet', include_top=False)
base.trainable = False  # FREEZE base layers

x = base.output
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu')(x)
output = Dense(num_classes, activation='softmax')(x)

model = Model(base.input, output)
# Only the new dense layers will be trained!
Neural Style Transfer Loop (Pseudocode)
C_layers = ["relu_3_3"]
S_layers = ["relu_1_2", "relu_2_2", "relu_3_3", "relu_4_3"]
alpha, beta = 0.5, 0.5

for i in range(iterations):
    loss = alpha * C_loss(C, G, C_layers) + beta * S_loss(S, G, S_layers)
    G = G - lr * grad(loss, G)  # Optimize IMAGE G, not weights!
Debugging Layer Statistics
import numpy as np

layer = model.get_layer(layer_name)
debug = Model(inputs=model.inputs, outputs=layer.output)
p = debug.predict(x_train)
for channel in range(p.shape[-1]):
    print(f"[{channel}] min={np.min(p[...,channel]):.4f} max={np.max(p[...,channel]):.4f}")
# Dead filter: all values at exactly 0 or same value!
📋
CHEAT SHEET — Lec 6

ELEN 521 — Lec 6: Transfer Learning & Applications

Transfer Learning

1. Load pre-trained model (ImageNet)
2. Freeze base layers (base.trainable=False)
3. Add new head layers
4. Train only new layers
5. Optional: unfreeze top few layers and fine-tune

Style Transfer Math

Gram = M[l] · M[l]ᵀ (style matrix)
M[l] shape: (channels, H×W)
Loss = α·L_content + β·L_style
Optimize IMAGE, not weights!

IoU

IoU = Area(intersection) / Area(union)
NMS threshold = 0.5

YOLO Output per Cell

[pc, bx, by, bw, bh, c₁..cₖ]
pc=confidence, bx,by∈[0,1], bw,bh can >1

NMS Algorithm

1. Discard boxes with pc < 0.6
2. Sort by pc (high→low)
3. Keep highest; discard IoU>0.5 with it
4. Repeat

Design Patterns (Key 5)

1. Pyramid shape: H↓ W↓ C↑
2. Use skip connections (ResNet style)
3. Normalize layer inputs (BatchNorm)
4. Use pre-trained + finetuning
5. Augment data (mirror, distort, blur)
07
Embeddings, RNNs, LSTMs & GRUs
Word Representations · TF-IDF · Word2Vec · Skip-gram · Cosine Similarity · RNN · LSTM · GRU · Language Models
💡
Core Concepts

One-Hot Encoding / BOW

Each word = a vector of zeros with a 1 at its position in vocabulary. Bag of Words (BOW) = count occurrences. Simple but: very sparse, doesn't capture meaning or similarity.

TF-IDF

Term Frequency × Inverse Document Frequency. Words common in ALL documents (like "the") get weight≈0. Rare but important words get high weight.

TF-IDF kills stop words automatically — "the" appears everywhere so IDF=0, making TF-IDF=0.

Dense Embeddings

Represent words as dense low-dimensional vectors (e.g., 300 dims). Capture semantic meaning: similar words → similar vectors. Better for ML than sparse representations.

Word2Vec

Train a shallow network to predict whether two words appear near each other. Take the learned weights as word vectors. Two approaches: Skip-gram and CBOW.

You are known by the company you keep. Words that appear in similar contexts get similar vectors.

Skip-gram

Given a center word, predict the surrounding context words. Uses negative sampling: treat nearby words as positive, random words as negative, train a binary classifier.

Cosine Similarity

Measures the angle between two vectors. Range: -1 (opposite), 0 (perpendicular), +1 (same direction). For word-frequency vectors (non-negative): range [0,1]. Better than raw dot product because it ignores vector length.

Embedding Analogies

Learned embeddings encode relational meaning: king - man + woman ≈ queen. Paris - France + Italy ≈ Rome. Vector arithmetic can answer analogy questions!

RNN — Recurrent Neural Network

Has hidden state h that carries memory across time steps. hₜ = tanh(W·hₜ₋₁ + U·xₜ). Output: oₜ = softmax(V·hₜ). Processes sequences of ANY length.

Like reading a sentence word by word while taking notes (hidden state) that update with each word.

Vanishing Gradient in RNNs

Gradients must flow backwards through time (BPTT). With many time steps, repeated multiplication of small gradients → vanishes to 0. RNN "forgets" long-ago inputs. LSTMs fix this.

LSTM — Long Short-Term Memory

Has SEPARATE cell state (long-term memory) and hidden state (short-term). Three gates: Forget (what to erase), Input (what new info to store), Output (what to output). Can preserve gradients over many steps.

GRU — Gated Recurrent Unit

Simpler than LSTM: only 2 gates (Reset and Update). Combines cell state and hidden state. Fewer parameters, often similar performance to LSTM.

Language Model

Predict the next word given all previous words: P(xₜ₊₁ = wᵢ | xₜ, xₜ₋₁, ..., x₁). RNNs are natural for this — feed history through hidden state.

Math — Lec 7
TF-IDF Formula
TF(word, doc) = freq(word) / total_words_in_doc
IDF(word) = log(N_docs / N_docs_containing_word)
TF-IDF = TF × IDF
If word appears in ALL docs: IDF = log(1) = 0, so TF-IDF = 0. Common words like "the" are automatically suppressed.
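A toy corpus makes the suppression of "the" concrete (my sketch, not course code; documents are plain token lists):

```python
import math

def tf_idf(word, doc, corpus):
    tf = doc.count(word) / len(doc)              # term frequency in this doc
    df = sum(1 for d in corpus if word in d)     # docs containing the word
    idf = math.log(len(corpus) / df)             # inverse document frequency
    return tf * idf

corpus = [["the", "cat", "sat"],
          ["the", "dog", "ran"],
          ["the", "cat", "slept"]]

print(tf_idf("the", corpus[0], corpus))      # 0.0 — in every doc, IDF = log(1) = 0
print(tf_idf("cat", corpus[0], corpus) > 0)  # True — rarer word gets weight
```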
Cosine Similarity
cos(v, w) = (v · w) / (||v|| · ||w||)
= dot product divided by product of magnitudes. Range: [-1, 1]. For non-negative vectors (word counts): [0, 1].
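A one-liner in NumPy shows why length is ignored: scaling a vector does not change its cosine similarity (my sketch, not course code):

```python
import numpy as np

def cosine_sim(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1.0, 2.0, 3.0])
print(cosine_sim(v, 2 * v))                       # 1.0 — same direction, length ignored
print(cosine_sim(v, np.array([3.0, 0.0, -1.0])))  # 0.0 — orthogonal vectors
```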
Word2Vec Skip-gram — Probability
P(+|t,c) = σ(t · c) = 1 / (1 + e^(-(t·c)))
P(-|t,c) = 1 - P(+|t,c)
t = target word vector, c = context word vector. Use dot product + sigmoid to get probability. Train to maximize P(+) for real pairs, P(-) for noise words.
Skip-gram Training Objective
Maximize: Σ log P(+|t,cᵢ) + Σ log P(-|t,nⱼ)
Sum over positive context words cᵢ (real neighbors) and negative noise words nⱼ (random). Noise words are sampled proportional to P(w)^α with α = 3/4 (boosts rare words).
Simple RNN Equations
hₜ = tanh(W·hₜ₋₁ + U·xₜ) [hidden state update]
oₜ = softmax(V·hₜ) [output]
W = hidden-to-hidden weights. U = input-to-hidden weights. V = hidden-to-output weights. SAME weights at every time step!
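The two equations above can be run forward in NumPy — note the SAME W, U, V are reused at every step (my sketch with random weights; sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, C = 4, 3, 5                  # hidden size, input size, output classes
W  = rng.normal(0, 0.1, (H, H))    # hidden-to-hidden (shared across time)
U  = rng.normal(0, 0.1, (H, D))    # input-to-hidden
Vw = rng.normal(0, 0.1, (C, H))    # hidden-to-output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(H)                            # h_0 = 0
for x_t in rng.normal(size=(6, D)):        # a length-6 input sequence
    h = np.tanh(W @ h + U @ x_t)           # h_t = tanh(W·h_{t-1} + U·x_t)
    o = softmax(Vw @ h)                    # o_t = softmax(V·h_t)

print(o.sum())  # 1.0 — a valid distribution at every step
```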
Language Model Probability
P(xₜ₊₁ = wᵢ | xₜ, xₜ₋₁, ..., x₁)
Full probability of entire sequence: P(w₁, w₂, ..., wₙ) = Π P(wₜ|w₁,...,wₜ₋₁). RNN compresses history into hₜ.
A
Abbreviations — Lec 7
Abbr | Full Name | Key Point
NLP | Natural Language Processing | AI for text/language
BOW | Bag of Words | Count word occurrences, ignore order
TF-IDF | Term Frequency-Inverse Document Frequency | Weight words by rarity in corpus
PPMI | Positive Pointwise Mutual Information | Measure word co-occurrence, sparse
CBOW | Continuous Bag of Words | Predict center word from context
RNN | Recurrent Neural Network | Has hidden state, processes sequences
LSTM | Long Short-Term Memory | Gated RNN, avoids vanishing gradients
GRU | Gated Recurrent Unit | Simpler LSTM (2 gates vs 3)
BPTT | Backpropagation Through Time | Backprop unrolled through time steps
V | Vocabulary | Set of all unique words in corpus
W, U, V | Weight matrices in RNN | W=h-to-h, U=x-to-h, V=h-to-output
GloVe | Global Vectors (Stanford) | Pre-trained word embeddings
xLSTM | Extended LSTM | Modern variant of LSTM
📋
CHEAT SHEET — Lec 7

ELEN 521 — Lec 7: Embeddings, RNN, LSTM, GRU

TF-IDF

TF(w,d) = freq(w) / total_words(d)
IDF(w) = log(N / df(w))
TF-IDF = TF × IDF → 0 for common words!

Cosine Similarity

cos(v,w) = v·w / (||v||·||w||)
Range [-1,1]. 1=same, 0=orthogonal, -1=opposite

Word2Vec Skip-gram

P(+|t,c) = σ(t·c) = 1/(1+e^(-t·c))
Noise sampling: p^(3/4) boosts rare words

Embedding Analogy

king - man + woman ≈ queen
Vector arithmetic captures relations!

Simple RNN

hₜ = tanh(W·hₜ₋₁ + U·xₜ)
oₜ = softmax(V·hₜ)
SAME W,U,V at every time step

LSTM Gates

Forget gate: what to remove from cell state (σ)
Input gate: what new info to add (σ + tanh)
Output gate: what to output (σ × tanh(C))
Cell state C: long-term memory
Hidden state h: short-term memory

GRU (Simpler)

Reset gate: how much past to forget
Update gate: how much new vs old to mix
Fewer params, often ≈ LSTM performance

RNN Types (by I/O shape)

1→1: standard (image classification)
1→N: image captioning
N→1: sentiment analysis
N→N: translation, seq2seq
N↔N: video frame labeling
08
Applications of RNNs & Attention Mechanism
Sentiment Analysis · Text Generation · seq2seq · Bidirectional RNN · Image Captioning · Attention · ELMO · Key/Query/Value
💡
Core Concepts

Sentiment Analysis

Input: text sequence. Output: positive/negative. Architecture: Embed words → feed through LSTM → use final hidden state → Dense(sigmoid). Many-to-one architecture.

Character Generation

Train LSTM on text. At inference: generate one character at a time, feed it back as input for next step. Uses LSTM with `stateful=True` and `reset_states()` between sequences.

seq2seq (Encoder-Decoder)

Encoder LSTM reads input sequence → compresses into context vector (last hidden state). Decoder LSTM takes context vector → generates output sequence one token at a time. Used for translation, chatbots, Q&A.

Encoder = reading a sentence and forming a mental representation. Decoder = translating that mental representation into another language.

seq2seq Bottleneck Problem

The entire input must be compressed into ONE vector (last hidden state of encoder). For long sequences, too much information is lost. Solution: Attention.

Bidirectional RNN

Run one LSTM forward (left→right) and one backward (right→left). Concatenate both hidden states. The model can see both past AND future context at each position.

Understanding a word from both what came before AND after it in the sentence.

Image Captioning

CNN extracts image features → feed as initial context to LSTM → LSTM generates caption word by word starting with <START> token until <END>. One-to-many (one image in, a sequence out).

Attention Mechanism

Instead of compressing input to ONE vector, decoder can "attend" to all encoder hidden states. At each decoder step, compute a weighted sum of ALL encoder states — attention weights tell us which input tokens to focus on.

When translating "dog", the decoder looks back at the encoder states and pays more attention to the part of input that said "Hund" (German for dog).

Keys, Queries, Values

Query: what the decoder is asking. Key: what each encoder state is. Value: what the encoder state contains. Similarity(query, key) → attention weight → weighted sum of values.

ELMO (Contextual Embeddings)

Unlike Word2Vec (one vector per word), ELMo generates DIFFERENT vectors for the same word depending on context. "fair" in "fair game" vs "county fair" → different vectors. Uses bidirectional LSTMs.

Beyond RNNs

Attention → Transformers → ELMo → GPT → BERT (roughly chronological; ELMo still used biLSTMs, the rest are transformer-based). Each step improved long-range dependency handling. Transformers replaced RNNs for most NLP tasks by 2018-2019.

Math — Lec 8 (Attention)
Attention Scores (encoder hidden states = keys, decoder state = query)
eₜᵢ = f_att(aᵢ, hₜ₋₁) [attention score: how relevant encoder state i is at decoder step t]
αₜᵢ = exp(eₜᵢ) / Σₖ exp(eₜₖ) [softmax → attention probabilities]
aᵢ = encoder hidden states (keys/values), hₜ₋₁ = decoder state (query). f_att can be dot product, bilinear, or MLP.
Context Vector (weighted sum)
ẑₜ = Σᵢ αₜᵢ · aᵢ
Context vector ẑₜ is a soft selection over all encoder states. Heavy weight on relevant encoder positions, near-zero elsewhere.
Key-Query-Value Formulation
Attention(Q, K, V) = softmax(Q·Kᵀ / √d) · V
where: Q=query, K=keys, V=values, d=dimension
This is the scaled dot-product attention used in Transformers. √d scales to prevent large dot products from causing softmax to saturate.
Image Captioning with Attention
a = {a₁,...,aₗ}, aᵢ ∈ ℝᴰ [L spatial regions from CNN]
eₜᵢ = f_att(aᵢ, hₜ₋₁) [attention energy]
αₜᵢ = exp(eₜᵢ) / Σₖ exp(eₜₖ) [attention weights: softmax over the L regions]
ẑₜ = Σᵢ αₜᵢ · aᵢ [context vector = attended image region]
At each word generation step, the decoder "looks at" different parts of the image. When generating "dog", αₜᵢ peaks over the dog region.
Abbreviations — Lec 8
Abbr | Full Name | Key Point
seq2seq | Sequence to Sequence | Encoder-decoder for variable-length I/O
ELMo | Embeddings from Language Models | Contextual word embeddings via biLSTM
BERT | Bidirectional Encoder Representations from Transformers | Pre-trained transformer encoder
GPT | Generative Pre-trained Transformer | Autoregressive transformer decoder
Q, K, V | Query, Key, Value | Components of attention mechanism
BiLSTM | Bidirectional LSTM | Forward + backward LSTM concatenated
B, T, F | Batch, Time, Features | Standard tensor shape for sequences
MoE | Mixture of Experts | Multiple specialized sub-networks
Code — Lec 8
LSTM API in Keras — All Variants
# Default: returns last output
x = LSTM(32)(x)  # x: (batch, 32)

# Return all time step outputs
x = LSTM(32, return_sequences=True)(x)  # x: (batch, T, 32)

# Return output + states (h and c)
x, h, c = LSTM(32, return_state=True)(x)  # [y, h, c] each (batch, 32)

# Use initial state (from encoder in seq2seq)
x = LSTM(32)(x, initial_state=[h, c])

# Bidirectional
x = Bidirectional(LSTM(64))(x)  # output doubled: (batch, 128)
Attention in Code (Matrix Form)
# h: encoder states (B, TE, F), s: decoder states (B, TD, F)
h_p = Permute((2, 1))(h)            # (B, F, TE) - swap time and feature axes
e = tf.matmul(s, h_p)               # (B, TD, TE) - attention scores
alpha = tf.nn.softmax(e, axis=-1)   # (B, TD, TE) - weights, softmax over TE
a = tf.matmul(alpha, h)             # (B, TD, F) - context vectors
Character Generation
x = LSTM(32, stateful=True)(x)    # Remembers state between batches!
# ... build model ...

model.reset_states()               # Reset between sequences
x = np.argmax(output)              # Greedy decoding
# OR sample from distribution:
x = np.random.choice(len(output), size=1, p=output)
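The greedy-vs-sampling choice above is often controlled with a temperature parameter. This is a common extension, not part of the lecture code:

```python
import numpy as np

def sample_char(probs, temperature=1.0):
    """Sample the next character index from the softmax output.
    temperature is a common extension (not in the lecture code):
    values < 1 sharpen toward argmax, values > 1 flatten toward uniform."""
    logits = np.log(np.asarray(probs) + 1e-9) / temperature
    p = np.exp(logits - logits.max())
    p = p / p.sum()
    return np.random.choice(len(p), p=p)

# Very low temperature behaves like the greedy np.argmax branch above
```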
📋
CHEAT SHEET — Lec 8

ELEN 521 — Lec 8: RNN Applications & Attention

Attention Math

eₜᵢ = f_att(aᵢ, hₜ₋₁) (score)
αₜᵢ = softmax(eₜᵢ) (weights)
ẑₜ = Σᵢ αₜᵢ·aᵢ (context)
Scaled dot-product: softmax(Q·Kᵀ/√d)·V

seq2seq

Encoder: reads input → context vector
Decoder: context vector → output sequence
Bottleneck: entire input → 1 vector (problem!)
Fix: Attention over all encoder states

Keras LSTM Cheat

LSTM(32): → (B, 32) last output
LSTM(32, return_sequences=True): → (B,T,32)
LSTM(32, return_state=True): → [y,h,c]
Bidirectional(LSTM(64)): → (B, 128)

Applications

Sentiment: N→1 · Translation: N→N · Captioning: 1→N · Generation: N→N (stateful)

ELMo vs Word2Vec

Word2Vec: 1 vector per word (context-free)
ELMo: different vector per context (better!)
09
Unsupervised Learning: Autoencoders, VAEs & GANs
Autoencoders · Denoising AE · VAE · KL Divergence · Reparameterization · GAN · Discriminator · Generator · VQ-VAE · Diffusion
💡
Core Concepts

Why Unsupervised / Generative Models?

Labels are expensive. Unsupervised: learn structure from data without labels. Generative: learn to GENERATE new samples (not just classify). Applications: image synthesis, denoising, style transfer, data augmentation.

Autoencoder

Encoder: compress input x → bottleneck z (latent code). Decoder: reconstruct x from z. Trained to minimize reconstruction error. Forces the network to learn a compact representation.

Like learning to describe an image in 50 words, then reconstruct it from those 50 words.

Denoising Autoencoder

Add noise to input, train to reconstruct clean original. Forces the autoencoder to learn robust features. Input: noisy image. Target: clean image. Architecture: Conv → bottleneck → Conv2DTranspose.

Problem with Regular Autoencoders

Latent space z is not continuous or structured. You can't sample a random z and get a meaningful image. VAEs fix this by imposing a distribution on z.

VAE — Variational Autoencoder

Encoder outputs μ (mean) and σ (std) instead of a single z. Sample z ~ N(μ, σ²). Decoder takes z → output. Training forces z to be close to N(0,1) via KL divergence loss.

Instead of a single point in latent space, each input maps to a REGION (Gaussian). This makes the latent space smooth and traversable.

Reparameterization Trick

Problem: sampling is not differentiable. Solution: z = μ + σ·ε where ε ~ N(0,1). Now the randomness (ε) is separate from the learnable parts (μ, σ) → gradients can flow!

GAN — Generative Adversarial Network

Two networks in competition: Generator G (makes fake images from noise z) vs Discriminator D (tells real from fake). G tries to fool D. D tries to correctly classify. They improve each other.

Forger (G) vs Art Authenticator (D). Forger gets better to fool the authenticator, who gets better to catch the forger.

GAN Problems

1. Nash equilibrium hard to achieve. 2. Vanishing gradient. 3. Mode collapse: G produces only one type of output. 4. No good evaluation metric.

VQ-VAE

Vector Quantized VAE: latent space is discrete (codebook of vectors). Encoder output is "snapped" to nearest codebook entry. Enables high-quality generation. Used in DALL-E.
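The "snapping" step can be sketched in NumPy; the codebook values here are toy numbers.

```python
import numpy as np

def quantize(z_e, codebook):
    """Snap each encoder output to its nearest codebook entry.
    z_e: (N, D) continuous encoder outputs; codebook: (K, D) embeddings."""
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K) sq. distances
    idx = d.argmin(axis=1)            # discrete code index per vector
    return codebook[idx], idx         # quantized vectors, code indices

codebook = np.array([[0., 0.], [1., 1.], [5., 5.]])  # toy 3-entry codebook
z_e = np.array([[0.2, -0.1], [4.8, 5.3]])
z_q, idx = quantize(z_e, codebook)   # idx is [0, 2]: each vector's nearest code
```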

Diffusion Models

Gradually add noise to images (forward process) then train a network to reverse it (denoise step by step). State-of-the-art for image generation (DALL-E 2, Stable Diffusion).
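The forward (noising) process admits a closed form, x_t = √ᾱ_t·x₀ + √(1−ᾱ_t)·ε with ᾱ_t = Π(1−β_s). A sketch with a linear β schedule; the schedule values are illustrative assumptions, not from the lecture:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # illustrative linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)    # alpha_bar[t]: signal fraction kept at step t

def add_noise(x0, t, eps):
    """Jump straight to noise level t: x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.ones(4)
eps = np.random.randn(4)
x_early = add_noise(x0, 10, eps)     # mostly signal
x_late = add_noise(x0, T - 1, eps)   # mostly noise; the denoiser learns to reverse this
```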

VAE-GAN

Combine VAE (structured latent space) with GAN (sharp, realistic outputs). Encoder-decoder from VAE + discriminator from GAN. Best of both worlds.

Feature Addition to Images (VAE)

Find images with and without a feature. Compute latent vectors, subtract to get feature direction. Add this direction to μ of any image to add that feature.

In latent space, directions correspond to semantic attributes (smile, glasses, age). Arithmetic works!
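A NumPy sketch of this latent arithmetic; the 2-D latent vectors and the feature itself are made-up toy values.

```python
import numpy as np

# Toy VAE latent means (2-D for readability); values are illustrative only
mu_with = np.array([[1.0, 2.0], [1.2, 1.8]])      # images WITH the feature
mu_without = np.array([[0.1, 0.0], [-0.1, 0.2]])  # images WITHOUT it

# Direction in latent space that "means" the feature
feature_dir = mu_with.mean(axis=0) - mu_without.mean(axis=0)

mu_new = np.array([0.5, 0.5])       # latent mean of some new image
mu_edited = mu_new + feature_dir    # decode(mu_edited) should add the feature
```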
Math — Lec 9
VAE — ELBO (Evidence Lower BOund) Loss
L = E[log P(x|z)] - KL(Q(z|x) || P(z)) [the ELBO, to be maximized]
Training loss = -L = Reconstruction Loss + KL Divergence Loss [minimized in practice]
Reconstruction: how well we rebuild x from z (cross-entropy for images). KL: how close Q(z|x)=N(μ,σ²) is to the prior P(z)=N(0,1).
KL Divergence — Closed Form for Gaussians
KL(N(μ,σ²) || N(0,1)) = -½ · Σ(1 + log(σ²) - μ² - σ²)
This is the exact formula your prof uses! No need to estimate. This term regularizes the latent space to be close to a standard normal. Minimizing it pulls μ→0, σ→1.
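A quick NumPy check of the closed form, using the log σ² parameterization that also appears in the Code section below:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """-1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2), summed over latent dims."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

kl_zero = kl_to_standard_normal(np.zeros(4), np.zeros(4))  # mu=0, sigma=1 -> KL = 0
kl_off = kl_to_standard_normal(np.ones(4), np.zeros(4))    # mu=1 adds 0.5 per dim
```

KL is exactly zero only at μ=0, σ=1, which is what pulls the latent space toward N(0,1).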
Reparameterization Trick
z = μ + σ · ε, where ε ~ N(0, 1)
Gradient of loss w.r.t. μ and σ can now be computed. Without this trick: z ~ N(μ,σ²) is not differentiable and backprop would fail.
GAN — Objective Function (Minimax)
min_G max_D V(D,G) = E[log D(x)] + E[log(1 - D(G(z)))]
D tries to maximize V (correctly classify real/fake). G tries to minimize V (fool D). At Nash equilibrium: D(x)=0.5 everywhere (can't tell real from fake).
GAN Training Trick — Flip Labels
Instead of minimizing log(1-D(G(z))), maximize log(D(G(z)))
Same game-theoretically, but much better gradient signal early in training when G is bad and D(G(z))≈0 → gradients don't vanish.
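A two-line numeric illustration of why the flip matters early in training (the value of D(G(z)) is illustrative):

```python
# Gradient magnitudes w.r.t. D(G(z)) when D easily rejects fakes (early training)
d_fake = 1e-3  # discriminator output on a fake sample, near 0

grad_saturating = -1.0 / (1.0 - d_fake)  # d/dD log(1 - D): magnitude ~1 (weak)
grad_flipped = 1.0 / d_fake              # d/dD log(D): magnitude ~1000 (strong)
```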
Jensen Inequality (used in VAE derivation)
f(E[X]) ≥ E[f(X)] for concave f (like log)
Used to derive the ELBO: log P(x) ≥ E[log P(x|z)] - KL(Q||P). The ELBO is a lower bound on the true log-likelihood.
Abbreviations — Lec 9
Abbr | Full Name | Key Point
AE | Autoencoder | Compress → reconstruct, learn latent code
VAE | Variational Autoencoder | AE with structured latent space (Gaussian)
GAN | Generative Adversarial Network | Generator vs Discriminator adversarial game
DCGAN | Deep Convolutional GAN | GAN using Conv/Deconv layers
KL | Kullback-Leibler Divergence | Measures difference between distributions
ELBO | Evidence Lower BOund | VAE loss = reconstruction + KL
VQ-VAE | Vector Quantized VAE | Discrete latent space (codebook)
RQ-VAE | Residual Quantized VAE | Stacked codebooks, each quantizing the previous residual
z | Latent Code / Latent Vector | Compressed representation of input
G | Generator | Network that makes fake samples from noise
D | Discriminator | Network that tells real from fake
μ, σ | Mean, Standard Deviation | VAE encoder outputs these per latent dim
ε | Epsilon (noise) | ε ~ N(0,1) in reparameterization trick
Code — Lec 9
Denoising Autoencoder (Conv2DTranspose for upsampling)
# Encoder
x = Input(shape=(28,28,1))
z = Conv2D(32, (3,3), activation='relu', padding='same')(x)
z = MaxPooling2D((2,2))(z)

# Decoder (Conv2DTranspose = learned upsampling)
out = Conv2DTranspose(32, (3,3), strides=(2,2), padding='same', activation='relu')(z)
out = Conv2D(1, (3,3), activation='sigmoid', padding='same')(out)
# Input: noisy image | Target: clean image
VAE — Reparameterization Layer
class Sampling(Layer):
    def call(self, inputs):
        mu, log_var = inputs
        epsilon = tf.random.normal(shape=tf.shape(mu))
        return mu + tf.exp(0.5 * log_var) * epsilon  # z = μ + σ·ε

# VAE loss:
reconstruction_loss = cross_entropy(y_true, y_pred)
kl_loss = -0.5 * tf.reduce_sum(1 + log_var - tf.square(mu) - tf.exp(log_var))
vae_loss = reconstruction_loss + kl_loss
GAN Training Loop
for epoch in range(epochs):
    # Train Discriminator
    real_labels = np.ones((batch_size, 1)) * 0.9  # Label smoothing!
    fake_labels = np.zeros((batch_size, 1)) + 0.1
    z = np.random.normal(0, 1, (batch_size, latent_dim))  # Sample from Gaussian!
    fake_imgs = generator.predict(z)
    discriminator.train_on_batch(real_imgs, real_labels)
    discriminator.train_on_batch(fake_imgs, fake_labels)
    
    # Train Generator (freeze discriminator)
    z = np.random.normal(0, 1, (batch_size, latent_dim))
    # Fool discriminator: label fakes as REAL
    gan.train_on_batch(z, np.ones((batch_size, 1)))
📋
CHEAT SHEET — Lec 9

ELEN 521 — Lec 9: Autoencoders, VAEs, GANs

Autoencoder

Encode x → z (bottleneck) → Decode z → x̂
Loss = ||x - x̂||² (reconstruction)
Denoising: add noise to input, reconstruct clean

VAE Loss (ELBO)

Loss = -E[log P(x|z)] + KL(N(μ,σ²)||N(0,1)) = reconstruction loss + KL loss
KL = -½·Σ(1 + log σ² - μ² - σ²)

Reparameterization Trick

z = μ + σ·ε, ε~N(0,1)
Makes sampling differentiable → backprop works!

Jensen Inequality

log(E[X]) ≥ E[log(X)]
Used to derive ELBO from marginal likelihood

GAN Objective

min_G max_D [E[log D(x)] + E[log(1-D(G(z)))]]
D: maximize (real=1, fake=0)
G: fool D (make D output 1 for fakes)

GAN Tricks (Key ones)

• z~N(0,1) not uniform!
• Label smooth: real=0.9, fake=0.1
• Use tanh as last G activation (output in [-1,1])
• Leaky ReLU + AvgPool (no MaxPool)
• Separate mini-batches for real/fake

GAN Problems

Mode collapse · Vanishing gradient · No good eval metric · Nash equilibrium hard to reach

Hierarchy

AE → VAE → VQ-VAE → Diffusion
GAN → DCGAN → VAE-GAN