
ELEN 521 — Deep Learning Study Guide

Lectures 3 through 9 · Final Exam Prep · Prof. Claudionor N. Coelho Jr
Lec 3: CNNs
Lec 4: Architectures
Lec 6: Transfer & Apps
Lec 7: Embeddings & RNN
Lec 8: RNN Apps
Lec 9: Autoencoders & GANs
03
CNNs & Deep Learning Fundamentals
ML vs DL · DNN Structure · Activation Functions · Loss Functions · Backpropagation · Conv Layer · Pooling · BatchNorm · Dropout · Softmax
⚠ EXAM FOCUS (from your midterm): BatchNorm statistics (E[X], V[X]), ReLU output counting, backprop derivatives with chain rule, operations counting in networks.
💡
Core Concepts — Simple Explanations

ML vs Deep Learning

Traditional ML: YOU manually engineer features (histograms, gradients, shape context), then feed to a model.

Deep Learning: The network learns features automatically from raw data. No manual feature engineering needed.

Analogy: ML = teaching a child what to look for. DL = letting the child figure it out themselves from examples.

Universal Approximation Theorem

A neural network with even ONE hidden layer can approximate ANY continuous function — if you give it enough neurons. In practice, we use multiple layers (deeper = more efficient).

y0·w1 + b1 ≥ 0 defines a half-space (one side of a hyperplane) — relu keeps that side and zeroes the other. Stack enough half-spaces → cut out any shape.

Activation Functions

Non-linearities that let networks learn complex patterns. Without them, stacking linear layers = still one linear layer.

ReLU: max(0,x) — most common. Sigmoid: 1/(1+e⁻ˣ) — squashes to [0,1]. Tanh: squashes to [-1,1]. Softmax: converts scores to probabilities summing to 1.
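The four activations above can be sketched in a few lines of NumPy (my own illustration, not slide code; the function names are mine):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)        # zero out negatives

def sigmoid(x):
    return 1 / (1 + np.exp(-x))    # squash to (0, 1)

def tanh(x):
    return np.tanh(x)              # squash to (-1, 1)

def softmax(x):
    e = np.exp(x - np.max(x))      # subtract max for numerical stability
    return e / e.sum()             # probabilities summing to 1

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))           # [0. 0. 3.]
print(softmax(x).sum())  # 1.0
```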

Loss Functions

Measure how wrong the network is. We minimize these during training.

MSE → regression. Binary Cross-Entropy → 2-class classification. Cross-Entropy → multi-class classification.

Backpropagation

Algorithm to compute gradients of the loss w.r.t. ALL weights. Uses the chain rule to propagate error backwards through layers.

Like tracing blame backwards: "the final error happened because this layer was off... because that earlier layer was off..."

Convolutional Layer

Instead of fully-connected (every input → every neuron), we slide a small filter over the image. Same filter weights used everywhere → translation invariance.

Like a spotlight (filter) scanning a stage (image). It fires when it sees the pattern it was trained to detect — wherever that pattern appears.

Pooling Layer

Downsamples feature maps. MaxPool takes the biggest value in a region. AvgPool takes the average. Reduces spatial size and provides some noise immunity.

Taking the "best" or "average" from a neighborhood — like summarizing a paragraph into its key point.

Batch Normalization

Normalizes the outputs of intermediate layers so they have mean≈0 and variance≈1. Learned scale (γ) and shift (β) let the network un-normalize if needed. Speeds up training dramatically.

Like standardizing test scores to a curve — keeps all students (neurons) on the same scale so nobody dominates.

Dropout

During training, randomly "turn off" neurons with probability p. This forces the network to not rely on any single neuron — acts as an ensemble of smaller networks. Turned OFF at test time.

Like studying for an exam without relying on one specific note — forces your brain to learn redundant pathways.

Softmax (Output Layer)

For multi-class classification, converts raw scores (logits) into probabilities. Sum of all outputs = 1. Pick class with highest probability.

Voting: each class gets votes proportional to its score. Softmax just normalizes so votes add up to 100%.
All Math & Derivations
Every formula from Lec 3
Loss Functions
MSE: L(y,p) = (1/B) Σᵢ (y⁽ⁱ⁾ - p⁽ⁱ⁾)²
BCE: L(y,p) = -(1/B) Σᵢ [y⁽ⁱ⁾ log p⁽ⁱ⁾ + (1-y⁽ⁱ⁾) log(1-p⁽ⁱ⁾)]
CE: L(y,p) = -(1/B) Σᵢ Σₖ yₖ⁽ⁱ⁾ log pₖ⁽ⁱ⁾
B = batch size, y = true label, p = predicted probability, k = class index
Gradient Descent (Weight Update)
w = w - η · ∂L(y,p)/∂w
η (eta) = learning rate. This is how we adjust weights to reduce loss. Stochastic = uses mini-batches, not full dataset.
Backpropagation — Chain Rule
∂g/∂x = (∂g/∂f) · (∂f/∂x)
∂g/∂y = (∂g/∂f) · (∂f/∂y)
If g depends on f, and f depends on x: the gradient "flows backward" through each function. Multiply the upstream gradient by the local gradient at each step.
  • Forward pass: compute and save all intermediate values
  • Backward pass: start from loss, apply chain rule layer by layer backwards
  • Each layer: grad_input = grad_output × local_derivative
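The three steps above can be traced on a tiny two-function example (my own illustration, not slide code): let f = x·y and g = f², so ∂g/∂x = (∂g/∂f)·(∂f/∂x).

```python
# Forward pass: compute and SAVE intermediates
x, y = 3.0, 4.0
f = x * y        # f = 12
g = f ** 2       # g = 144 (plays the role of the loss)

# Backward pass: start from g, apply chain rule backwards
dg_df = 2 * f            # upstream gradient at f
df_dx, df_dy = y, x      # local gradients of f
dg_dx = dg_df * df_dx    # grad_input = grad_output × local_derivative
dg_dy = dg_df * df_dy

print(dg_dx, dg_dy)  # 96.0 72.0
```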
Forward Pass Through a DNN (Layer by Layer)
y0 = x (input)
y1 = relu(y0·w1 + b1) (hidden layer 1)
y2 = relu(y1·w2 + b2) (hidden layer 2)
y3 = softmax(y2·w3 + b3) (output layer)
At the last layer: use linear output for regression, sigmoid for binary class., softmax for multi-class.
Convolutional Layer — Output Size Formula
R_out = floor((R_in - K + P*(K-1)) / S) + 1
R_in=input size, K=kernel size, P=padding (0 for 'valid', 1 for 'same'), S=stride
Example: 32×32 input, 3×3 kernel, stride=1, padding='same' → R_out = floor((32-3+2)/1)+1 = 32
Conv Layer — Number of Operations (MACs)
Ops = R_out × C_out × N_filters × M_filters × K × K
For each output pixel (R×C), for each output filter (N), for each input filter (M), apply K×K kernel. This is the inner loop in the code.
Softmax Formula
P(y=j|x) = e^(x·wⱼ) / Σₖ e^(x·wₖ)
Numerically stable version: subtract max before exponentiating. All outputs sum to 1. Used for multi-class classification output layer.
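The "subtract max" trick can be checked directly — a naive `exp` of these logits would overflow, while the shifted version stays finite (my sketch, not slide code):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)    # shift so the largest logit is 0
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0])  # naive np.exp(logits) overflows
p = softmax(logits)
print(p.sum())     # 1.0
print(p.argmax())  # 2 — shifting never changes the ranking
```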
Dropout — Expected Value Proof
x1 ~ Bernoulli(p), x2 ~ Gaussian(μ, σ)
Output = (x1 · x2) / p
E[Output] = E[x1]·E[x2] / p = p·μ / p = μ
We divide by p to keep the expected value the same at test time (when dropout is off). This is why Keras/TF handle dropout automatically at inference.
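The expected-value argument can be checked by simulation (my sketch; note that p here is the KEEP probability, matching E[x1] = p in the proof — Keras' `Dropout(rate)` argument is the drop probability):

```python
import numpy as np

rng = np.random.default_rng(0)
p, mu, sigma = 0.8, 5.0, 2.0
n = 1_000_000

keep = rng.binomial(1, p, n)      # x1 ~ Bernoulli(p): 1 = neuron kept
x = rng.normal(mu, sigma, n)      # x2 ~ Gaussian(mu, sigma)
out = keep * x / p                # divide by p to preserve the expectation

print(out.mean())                 # ≈ mu = 5.0
```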
Convolutional Layer — 6-Nested Loop (Code Pseudocode)
for row in R:
  for col in C:
    for to in N:        # output channels
      for ti in M:      # input channels
        for i in K:
          for j in K:
            output[to][row][col] += weights[to][ti][i][j] * input[ti][S*row+i][S*col+j]
S = stride. This shows why convolutions are so parallelizable — all output pixels are independent.
A
All Abbreviations — Lec 3
Abbr | Full Name | What It Does
DNN | Deep Neural Network | Network with many layers of neurons
CNN | Convolutional Neural Network | Uses conv filters for spatial data (images)
RNN | Recurrent Neural Network | Has memory, processes sequences
RBM | Restricted Boltzmann Machine | Generative model (energy-based)
ReLU | Rectified Linear Unit | Activation: max(0,x)
ELU | Exponential Linear Unit | Smoother version of ReLU
PReLU | Parametric ReLU | Leaky ReLU with learnable slope
SGD | Stochastic Gradient Descent | Optimizer using mini-batches
MSE | Mean Squared Error | Loss for regression
BCE | Binary Cross Entropy | Loss for 2-class classification
CE | Cross Entropy | Loss for multi-class classification
BN | Batch Normalization | Normalizes layer activations
FC | Fully Connected | Every neuron connected to every input
MaxPool | Max Pooling | Takes max in spatial region
AvgPool | Average Pooling | Takes average in spatial region
η (eta) | Learning Rate | Step size in gradient descent
B | Batch Size | Number of samples per gradient update
K | Kernel Size | Size of conv filter (e.g., 3×3)
S | Stride | Step size when sliding filter
M,N | Input/Output Feature Maps | Number of channels in/out of conv layer
</>
Code Covered — Lec 3
Conv Layer 6-Loop (Pseudocode from Slides)
for row in range(R):
  for col in range(C):
    for to in range(N):   # output channels
      for ti in range(M):   # input channels
        for i in range(K):
          for j in range(K):
            output_fm[to][row][col] += \
              weights[to][ti][i][j] * input_fm[ti][S*row+i][S*col+j]
Building a CNN in Keras (AlexNet-style)
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, BatchNormalization,
                                     Flatten, Dense, Dropout)

x_in = Input(shape=(227, 227, 3))
x = Conv2D(96, (11,11), strides=4, activation='relu')(x_in)
x = MaxPooling2D((3,3), strides=2)(x)
x = BatchNormalization()(x)
x = Conv2D(256, (5,5), padding='same', activation='relu')(x)
x = MaxPooling2D((3,3), strides=2)(x)
x = Flatten()(x)
x = Dense(4096, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(1000, activation='softmax')(x)
model = Model(x_in, x)
Output Size Calculation
import math
def conv_output_size(R_in, K, S, padding):
    P = 0 if padding == 'valid' else 1
    return math.floor((R_in - K + P*(K-1)) / S) + 1

# Example: 32x32 input, 3x3 filter, stride=1, 'same' padding
print(conv_output_size(32, 3, 1, 'same'))  # → 32
print(conv_output_size(32, 3, 1, 'valid')) # → 30
📋
CHEAT SHEET — Lec 3 (Print This!)

ELEN 521 — Lec 3 Cheat Sheet: CNNs & DL Basics

Loss Functions

MSE = (1/B)Σ(y-p)²
Regression
BCE = -(1/B)Σ[y·log(p) + (1-y)·log(1-p)]
Binary classification
CE = -(1/B)ΣΣ yₖ·log(pₖ)
Multi-class classification

Weight Update (SGD)

w = w - η · ∂L/∂w
η = learning rate

Chain Rule (Backprop)

∂g/∂x = (∂g/∂f)·(∂f/∂x)
Multiply upstream grad by local grad

Softmax

P(y=j|x) = e^(x·wⱼ) / Σₖ e^(x·wₖ)
Multi-class output layer

Conv Output Size

R_out = floor((R_in - K + P(K-1))/S) + 1
P=0 valid, P=1 same. S=stride, K=kernel

Dropout (Expected Value)

Output = x1·x2/p, E[Output] = μ
x1~Bernoulli(p), x2~Gaussian(μ,σ). Divide by p to preserve expectation!

Activation Functions

ReLU: max(0,x) | Sigmoid: 1/(1+e⁻ˣ)
Tanh: (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ)

Key Layer Types

Dense/FC Conv2D MaxPool2D BatchNorm Dropout Flatten Softmax

CNN = Conv Layers + FC Layers

Front: Conv layers learn visual features (edges→textures→objects)
Back: FC layers do classification
04
CNN Architectures, Initialization & Regularization
AlexNet/VGG · GoogleNet/Inception · ResNet · MobileNet · Xavier/He Init · L1/L2 Reg · Bias-Variance · Early Stopping · Train/Val/Test
⚠ EXAM FOCUS: Network operations complexity O(N), number of layers L(N), minimizing data requirements, initialization choices (Xavier vs He), regularization effects on weights.
💡
Core Concepts

AlexNet / CaffeNet

Input: 3×227×227. Stack: Conv → ReLU → MaxPool → Norm → ... → FC → Softmax. First deep net to win ImageNet (2012). Used GPU training + Dropout.

VGG

Very deep network using only 3×3 conv filters. The insight: two 3×3 convs = one 5×5 conv's receptive field, but fewer parameters. Goes up to 19 layers.

Key insight: depth beats width. Small filters, stacked deep.

GoogleNet / Inception Module

Instead of choosing 1×1, 3×3, or 5×5 filters — use ALL in parallel and concatenate. Uses 1×1 convolutions as "bottlenecks" to reduce computation before expensive 3×3 and 5×5 convs.

Multiple scales of feature detection happening simultaneously, then combined.

ResNet — Skip Connections

Problem: as networks get deeper, gradients vanish. Solution: add the input directly to the output of a block (skip connection): y = F(x) + x. The block only has to learn the residual F(x), not the whole mapping.

ResNet ≈ solving ODEs with Euler method: each block is a tiny step forward. Connection to differential equations: dy/dt = F(y).

MobileNet — Depthwise Separable Conv

Split standard conv into: (1) Depthwise conv (filter each channel independently) + (2) Pointwise 1×1 conv (combine channels). Dramatically reduces computation.

Instead of one job doing everything, split into two specialized mini-jobs.

MobileNet v2 — Inverted Residuals

Bottleneck blocks that EXPAND first (using 1×1), then depthwise, then COMPRESS (using 1×1). Add skip connection on the compressed (narrow) ends. Opposite of regular bottleneck.

Weight Initialization

Zeros → all neurons learn the same thing (symmetry problem). Random small → works better. Xavier/Glorot → good for symmetric activations (tanh). He → good for ReLU/asymmetric.

Wrong init = students starting with the same wrong answer. Can't unlearn it together.

Vanishing / Exploding Gradients

In deep nets, gradients can shrink to ≈0 (vanishing) or blow up to ∞ (exploding) as they travel backwards. Cause: repeated multiplication of small/large numbers. Fix: proper init, BatchNorm, skip connections.

L2 Regularization (Weight Decay)

Add λ·||w||² to the loss. Effect: weights are pulled towards zero (shrunk) at every update. Also called Tikhonov regularization or ridge regression.

L1 Regularization (Lasso)

Add λ·||w||₁ to the loss. Effect: many weights become exactly 0 (sparsity). Acts as feature selection.

Bias-Variance Trade-off

High bias = underfitting (model too simple). High variance = overfitting (model memorizes training data). Goal: find the sweet spot.

High bias: a rule that says "all dogs are beige". High variance: memorized every specific dog you've ever seen.

Train / Validation / Test Sets

Train: learn weights. Validation: tune hyperparameters (architecture, lr, reg). Test: final evaluation ONCE. Rule: Train > 1M → val/test can be 1% (≈10k samples).

Early Stopping

Stop training when validation loss stops improving. Prevents overfitting. Simple and very effective regularization.

Bagging / Ensemble Methods

Train multiple models on different bootstrap samples of data. Average their predictions. Different models make different mistakes → averaging reduces errors.

All Math & Derivations — Lec 4
Xavier / Glorot Initialization
W ~ Uniform[-√(6/(nᵢₙ+nₒᵤₜ)), √(6/(nᵢₙ+nₒᵤₜ))]
OR: W ~ Normal(0, √(2/(nᵢₙ+nₒᵤₜ)))
nᵢₙ = number of inputs to layer, nₒᵤₜ = number of outputs. Keeps variance constant across layers for symmetric activations (tanh).
He Initialization
W ~ Normal(0, √(2/nᵢₙ))
Use for ReLU/asymmetric activations. The √2 factor accounts for ReLU zeroing out half the values. This maintains variance after ReLU.
L2 Regularization — Weight Update
Ω(w) = (1/2)·λ·wᵀw
∂Ω/∂w = λ·w
w ← w - η·(∂L_train/∂w + λ·w)
w ← (1 - η·λ)·w - η·∂L_train/∂w ← SHRINK then update
The (1-η·λ) factor shrinks weights at every step → "weight decay". This is why it's called that!
L1 Regularization
Ω(w) = λ·||w||₁ = λ·Σᵢ|wᵢ|
Gradient: sign(w). This pushes weights to exactly 0 → sparsity. Feature selection behavior.
Condition Number (Poor Conditioning)
κ = λ_max / λ_min
For Hessian H = U·Diag(λ₁,...,λₙ)·Uᵀ: condition number = ratio of largest to smallest eigenvalue. Large κ → highly elongated loss surface → slow convergence and oscillation in SGD.
Inception Module — Parameter Count Reduction
Standard 5×5 conv: 5×5×M×N params
With 1×1 bottleneck (to B channels): 1×1×M×B + 5×5×B×N params
If B << M: massive savings! E.g., M=128, B=16, N=256: standard=819K, bottleneck=105K → 8× reduction!
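The worked example can be reproduced in a few lines (my sketch; `conv_params` is a made-up helper, and biases are ignored):

```python
def conv_params(k, m, n):
    return k * k * m * n   # kernel² × in-channels × out-channels (no bias)

M, B, N = 128, 16, 256
standard   = conv_params(5, M, N)                         # 5×5 conv directly
bottleneck = conv_params(1, M, B) + conv_params(5, B, N)  # 1×1 reduce, then 5×5

print(standard)              # 819200  (≈ 819K)
print(bottleneck)            # 104448  (≈ 105K)
print(standard / bottleneck) # ≈ 7.8× reduction
```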
MobileNet — Depthwise Separable Conv Cost
Standard conv: K²·M·N (K=kernel, M=in-channels, N=out-channels)
Depthwise: K²·M + Pointwise: M·N
Ratio = 1/N + 1/K² ≈ 1/9 for K=3 (nearly 9× cheaper!)
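The cost ratio formula can be verified numerically (my sketch; per-output-pixel MAC counts, bias terms ignored):

```python
def standard_cost(K, M, N):
    return K * K * M * N        # full conv: every filter sees every channel

def dw_separable_cost(K, M, N):
    return K * K * M + M * N    # depthwise (K²·M) + pointwise 1×1 (M·N)

K, M, N = 3, 128, 128
ratio = dw_separable_cost(K, M, N) / standard_cost(K, M, N)
print(ratio)                    # ≈ 1/N + 1/K² ≈ 0.119 for K=3
```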
A
Abbreviations — Lec 4
Abbr | Full Name | Key Point
VGG | Visual Geometry Group | Deep net, only 3×3 filters (Oxford)
ResNet | Residual Network | Skip connections to fight vanishing gradients
mHC | mHC (doubly stochastic) | Sinkhorn-Knopp normalization for ResNet in 2026
ReLU6 | ReLU capped at 6 | min(max(0,x), 6) — used in MobileNet
GELU | Gaussian Error Linear Unit | Smoother activation, used in Transformers
L1 | L1 norm penalty (Lasso) | Encourages sparsity (exact zeros)
L2 | L2 norm penalty (Ridge/Tikhonov) | Weight decay, drives weights toward 0
i.i.d. | Independent and Identically Distributed | Assumption that train/test come from same distribution
κ | Condition Number | λ_max/λ_min — how badly conditioned the loss surface is
DW | Depthwise Conv | Filter each input channel separately
PW | Pointwise Conv (1×1 Conv) | Combine channels linearly
</>
Code — Lec 4
GoogleNet Inception Module
# Parallel branches, then concatenate
# (assumes: from tensorflow.keras.layers import Conv2D, MaxPool2D, Concatenate)
conv_1x1 = Conv2D(filters_1x1, (1,1), padding='same', activation='relu')(x)
conv_3x3 = Conv2D(filters_3x3_reduce, (1,1), padding='same', activation='relu')(x)
conv_3x3 = Conv2D(filters_3x3, (3,3), padding='same', activation='relu')(conv_3x3)
conv_5x5 = Conv2D(filters_5x5_reduce, (1,1), padding='same', activation='relu')(x)
conv_5x5 = Conv2D(filters_5x5, (5,5), padding='same', activation='relu')(conv_5x5)
pool_proj = MaxPool2D((3,3), strides=(1,1), padding='same')(x)
pool_proj = Conv2D(filters_pool_proj, (1,1), padding='same', activation='relu')(pool_proj)
output = Concatenate(axis=3)([conv_1x1, conv_3x3, conv_5x5, pool_proj])
MobileNet v2 Inverted Residual Block
# assumes: from tensorflow.keras.layers import Conv2D, DepthwiseConv2D, Add
def inverted_residual_block(x, expand=64, squeeze=16):
    # x must already have `squeeze` channels so the skip Add matches shapes
    m = Conv2D(expand, (1,1), activation='relu')(x)                   # EXPAND
    m = DepthwiseConv2D((3,3), padding='same', activation='relu')(m)  # DEPTHWISE
    m = Conv2D(squeeze, (1,1), activation='relu')(m)                  # SQUEEZE
    return Add()([m, x])  # Skip connection on NARROW end
Weight Initialization (NumPy)
import numpy as np

# BAD - zeros initialization (symmetry problem)
w_bad = np.zeros((n_in, n_out))

# Xavier/Glorot - for tanh activation
limit = np.sqrt(6.0 / (n_in + n_out))
w_xavier = np.random.uniform(-limit, limit, (n_in, n_out))

# He initialization - for ReLU activation
w_he = np.random.normal(0, np.sqrt(2.0 / n_in), (n_in, n_out))

# In Keras:
Dense(64, activation='relu', kernel_initializer='he_normal')
Dense(64, activation='tanh', kernel_initializer='glorot_uniform')
📋
CHEAT SHEET — Lec 4

ELEN 521 — Lec 4: Architectures, Init, Regularization

Initialization Rules

Xavier/Glorot: W ~ N(0, σ² = 2/(nᵢₙ+nₒᵤₜ))
Use with: tanh, sigmoid (symmetric)
He: W ~ N(0, σ² = 2/nᵢₙ)
Use with: ReLU, Leaky-ReLU (asymmetric)
Bias: always initialize to 0

L2 Regularization

w ← (1-η·λ)·w - η·∂L/∂w
Shrinks weights each step = "weight decay"

L1 Regularization

Ω(w) = λ·Σ|wᵢ| → sparsity (zeros!)

Condition Number

κ = λ_max / λ_min of Hessian
Large κ = bad conditioning, slow convergence

Architecture Shortcuts

AlexNet: Conv→ReLU→Pool→Norm (deep but wide)
VGG: Only 3×3 filters, very deep, simple
Inception: parallel 1×1, 3×3, 5×5 + MaxPool → concat
ResNet: y = F(x) + x (skip connection)
Solves vanishing gradient. Deeper=better.
MobileNet v2: Expand→DW→Squeeze + skip
Inverted residual on NARROW ends

Cost Comparison

DW-Sep = 1/N + 1/K² of standard (≈1/9 for K=3)

Train/Val/Test

Train: learn weights | Val: tune hyper-params | Test: final eval (use once!)
06
Problems, Transfer Learning & Advanced Applications
Design Patterns · Transfer Learning · Visualization · Style Transfer · YOLO · IoU · NMS · U-Net · DeconvNet
💡
Core Concepts

Transfer Learning

Take a pre-trained network (e.g., ResNet trained on ImageNet) and fine-tune it on your small dataset. Freeze early layers (they learn generic features), only train later layers on your data.

Like hiring someone who already knows how to read and write — you just teach them your specific domain vocabulary.

Freeze-Drop-Path

Freeze layers at the bottom (generic features), use Dropout on the path, and only fine-tune the top layers for your new task.

Image Augmentation

Artificially expand your dataset: mirror images, add distortions, blur, change colors. Forces the network to be invariant to these transformations. Very effective!

U-Net (Segmentation)

Encoder-decoder architecture with skip connections between encoder and decoder at each scale. State-of-the-art for image segmentation. "U" shape = downsampling then upsampling.

DeconvNet

Reverse of convolution — upsamples feature maps back to original image size. Used for segmentation and visualizing what a network has learned.

Neural Style Transfer

Combine CONTENT of one image with STYLE of another. Use content loss (match activations) + style loss (match Gram matrices of activations) and optimize the output image.

YOLO (You Only Look Once)

Object detection: divide image into grid cells, each predicts bounding boxes + class probabilities. Single forward pass = very fast. Outputs: (bx, by, bw, bh, pc, class probs).

Instead of sliding a classifier over every possible region (slow), look at the whole image once with a CNN.

IoU — Intersection over Union

Measures overlap between predicted box and ground truth box. IoU = Area(intersection) / Area(union). Used to decide if a detection is correct (threshold usually 0.5).

Non-Max Suppression (NMS)

After YOLO predicts many overlapping boxes: (1) discard boxes with confidence <0.6, (2) order by confidence, (3) keep highest, discard any that overlap with it by IoU>0.5, (4) repeat.
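The four steps above can be sketched in plain Python (my own illustration, not course code; boxes are assumed to be (x1, y1, x2, y2) corner coordinates):

```python
def iou(a, b):
    # Intersection over Union of two boxes (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union

def nms(boxes, scores, conf_thresh=0.6, iou_thresh=0.5):
    # step 1: discard low-confidence; step 2: sort high→low
    cand = sorted(((s, b) for s, b in zip(scores, boxes) if s >= conf_thresh),
                  reverse=True)
    kept = []
    while cand:                       # steps 3-4: keep best, drop overlaps, repeat
        _, best = cand.pop(0)
        kept.append(best)
        cand = [(s, b) for s, b in cand if iou(best, b) <= iou_thresh]
    return kept

boxes  = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the two heavily overlapping boxes collapse to one
```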

Double Descent

Modern finding: overfitting is NOT always bad. If you use a very large model (way beyond data capacity), test error can DECREASE again — the model finds smooth interpolating solutions.

Debugging Networks

Print statistics of intermediate layers: min/max of outputs. If a layer has all same values → dead filter. If ranges are wrong → initialization issue or missing normalization.

CNNs Detect Texture, Not Shape

Important finding: CNNs trained on ImageNet primarily learn texture statistics, not object shape. A cat image with dog texture → classified as dog!

Math — Lec 6
Neural Style Transfer — Gram Matrix
GM[l] = M[l] · M[l]ᵀ (entry GM[l]ᵢⱼ = correlation between channels i and j)
M[l]: feature map at layer l, shape = (Nₗ × Mₗ) where Nₗ=channels, Mₗ=H×W. Gram matrix captures correlations between feature maps → encodes "style" (texture patterns).
Neural Style Transfer — Loss Function
Total Loss = α·L_content(C, G) + β·L_style(S, G)
C = content image, S = style image, G = generated image. Optimize G by gradient descent! α, β control balance between content and style.
YOLO — Bounding Box Encoding
Output per cell: [pc, bx, by, bw, bh, c1, c2, ..., ck]
pc = object confidence (0-1). bx, by ∈ [0,1] = box center relative to cell. bw, bh can be > 1 (relative to cell size). c1...ck = class probabilities.
IoU Formula
IoU = Area(B₁ ∩ B₂) / Area(B₁ ∪ B₂)
IoU = 1 → perfect overlap. IoU = 0 → no overlap. Use threshold 0.5 for NMS. Use higher (0.75) for stricter detection quality.
Visualizing Network — Gradient Ascent on Image
loss = MSE(layer[i].output)
x = x + η · ∂loss/∂x_input (gradient ASCENT, not descent)
Start with random/noise image, update it to maximize a neuron's activation. Shows what pattern that neuron looks for.
A
Abbreviations — Lec 6
Abbr | Full Name | Key Point
YOLO | You Only Look Once | Fast object detection (1 forward pass)
IoU | Intersection over Union | Bounding box overlap metric
NMS | Non-Max Suppression | Remove duplicate bounding box predictions
SOTA | State Of The Art | Best known performance on benchmark
U-Net | U-shaped Network | Encoder-decoder with skip connections for segmentation
GM | Gram Matrix | Captures style (feature map correlations)
SSL | Self-Supervised Learning | Use raw data as supervision (no labels)
CLR | Cyclical Learning Rates | Oscillate lr between bounds during training
</>
Code — Lec 6
Transfer Learning with Frozen Layers
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense
from tensorflow.keras import Model

base = ResNet50(weights='imagenet', include_top=False)
base.trainable = False  # FREEZE base layers

x = base.output
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu')(x)
output = Dense(num_classes, activation='softmax')(x)

model = Model(base.input, output)
# Only the new dense layers will be trained!
Neural Style Transfer Loop (Pseudocode)
C_layers = ["relu_3_3"]
S_layers = ["relu_1_2", "relu_2_2", "relu_3_3", "relu_4_3"]
alpha, beta = 0.5, 0.5

for i in range(iterations):
    loss = alpha * C_loss(C, G, C_layers) + beta * S_loss(S, G, S_layers)
    G = G - lr * grad(loss, G)  # Optimize IMAGE G, not weights!
Debugging Layer Statistics
import numpy as np

layer = model.get_layer(layer_name)
debug = Model(inputs=model.inputs, outputs=layer.output)
p = debug.predict(x_train)
for channel in range(p.shape[-1]):
    print(f"[{channel}] min={np.min(p[...,channel]):.4f} max={np.max(p[...,channel]):.4f}")
# Dead filter: all values at exactly 0 or same value!
📋
CHEAT SHEET — Lec 6

ELEN 521 — Lec 6: Transfer Learning & Applications

Transfer Learning

1. Load pre-trained model (ImageNet)
2. Freeze base layers (base.trainable=False)
3. Add new head layers
4. Train only new layers
5. Optional: unfreeze top few layers and fine-tune

Style Transfer Math

Gram = M[l] · M[l]ᵀ (style matrix)
M[l] shape: (channels, H×W)
Loss = α·L_content + β·L_style
Optimize IMAGE, not weights!

IoU

IoU = Area(intersection) / Area(union)
NMS threshold = 0.5

YOLO Output per Cell

[pc, bx, by, bw, bh, c₁..cₖ]
pc=confidence, bx,by∈[0,1], bw,bh can >1

NMS Algorithm

1. Discard boxes with pc < 0.6
2. Sort by pc (high→low)
3. Keep highest; discard IoU>0.5 with it
4. Repeat

Design Patterns (Key 5)

1. Pyramid shape: H↓ W↓ C↑
2. Use skip connections (ResNet style)
3. Normalize layer inputs (BatchNorm)
4. Use pre-trained + finetuning
5. Augment data (mirror, distort, blur)
07
Embeddings, RNNs, LSTMs & GRUs
Word Representations · TF-IDF · Word2Vec · Skip-gram · Cosine Similarity · RNN · LSTM · GRU · Language Models
💡
Core Concepts

One-Hot Encoding / BOW

Each word = a vector of zeros with a 1 at its position in vocabulary. Bag of Words (BOW) = count occurrences. Simple but: very sparse, doesn't capture meaning or similarity.

TF-IDF

Term Frequency × Inverse Document Frequency. Words common in ALL documents (like "the") get weight≈0. Rare but important words get high weight.

TF-IDF kills stop words automatically — "the" appears everywhere so IDF=0, making TF-IDF=0.

Dense Embeddings

Represent words as dense low-dimensional vectors (e.g., 300 dims). Capture semantic meaning: similar words → similar vectors. Better for ML than sparse representations.

Word2Vec

Train a shallow network to predict whether two words appear near each other. Take the learned weights as word vectors. Two approaches: Skip-gram and CBOW.

You are known by the company you keep. Words that appear in similar contexts get similar vectors.

Skip-gram

Given a center word, predict the surrounding context words. Uses negative sampling: treat nearby words as positive, random words as negative, train a binary classifier.

Cosine Similarity

Measures the angle between two vectors. Range: -1 (opposite), 0 (perpendicular), +1 (same direction). For word-frequency vectors (non-negative): range [0,1]. Better than raw dot product because it ignores vector length.

Embedding Analogies

Learned embeddings encode relational meaning: king - man + woman ≈ queen. Paris - France + Italy ≈ Rome. Vector arithmetic can answer analogy questions!

RNN — Recurrent Neural Network

Has hidden state h that carries memory across time steps. hₜ = tanh(W·hₜ₋₁ + U·xₜ). Output: oₜ = softmax(V·hₜ). Processes sequences of ANY length.

Like reading a sentence word by word while taking notes (hidden state) that update with each word.

Vanishing Gradient in RNNs

Gradients must flow backwards through time (BPTT). With many time steps, repeated multiplication of small gradients → vanishes to 0. RNN "forgets" long-ago inputs. LSTMs fix this.

LSTM — Long Short-Term Memory

Has SEPARATE cell state (long-term memory) and hidden state (short-term). Three gates: Forget (what to erase), Input (what new info to store), Output (what to output). Can preserve gradients over many steps.

GRU — Gated Recurrent Unit

Simpler than LSTM: only 2 gates (Reset and Update). Combines cell state and hidden state. Fewer parameters, often similar performance to LSTM.

Language Model

Predict the next word given all previous words: P(xₜ₊₁ = wᵢ | xₜ, xₜ₋₁, ..., x₁). RNNs are natural for this — feed history through hidden state.

Math — Lec 7
TF-IDF Formula
TF(word, doc) = freq(word) / total_words_in_doc
IDF(word) = log(N_docs / N_docs_containing_word)
TF-IDF = TF × IDF
If word appears in ALL docs: IDF = log(1) = 0, so TF-IDF = 0. Common words like "the" are automatically suppressed.
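A toy corpus makes the suppression of "the" concrete (my sketch, not course code; documents are plain token lists):

```python
import math

def tf_idf(word, doc, corpus):
    tf = doc.count(word) / len(doc)              # term frequency in this doc
    df = sum(1 for d in corpus if word in d)     # docs containing the word
    idf = math.log(len(corpus) / df)             # inverse document frequency
    return tf * idf

corpus = [["the", "cat", "sat"],
          ["the", "dog", "ran"],
          ["the", "cat", "slept"]]

print(tf_idf("the", corpus[0], corpus))      # 0.0 — in every doc, IDF = log(1) = 0
print(tf_idf("cat", corpus[0], corpus) > 0)  # True — rarer word gets weight
```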
Cosine Similarity
cos(v, w) = (v · w) / (||v|| · ||w||)
= dot product divided by product of magnitudes. Range: [-1, 1]. For non-negative vectors (word counts): [0, 1].
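A one-liner in NumPy shows why length is ignored: scaling a vector does not change its cosine similarity (my sketch, not course code):

```python
import numpy as np

def cosine_sim(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1.0, 2.0, 3.0])
print(cosine_sim(v, 2 * v))                       # 1.0 — same direction, length ignored
print(cosine_sim(v, np.array([3.0, 0.0, -1.0])))  # 0.0 — orthogonal vectors
```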
Word2Vec Skip-gram — Probability
P(+|t,c) = σ(t · c) = 1 / (1 + e^(-(t·c)))
P(-|t,c) = 1 - P(+|t,c)
t = target word vector, c = context word vector. Use dot product + sigmoid to get probability. Train to maximize P(+) for real pairs, P(-) for noise words.
Skip-gram Training Objective
Maximize: Σ log P(+|t,cᵢ) + Σ log P(-|t,nⱼ)
Sum over positive context words cᵢ (real neighbors) and negative noise words nⱼ (random). Noise words are sampled proportional to P(w)^α with α = 3/4 (boosts rare words).
Simple RNN Equations
hₜ = tanh(W·hₜ₋₁ + U·xₜ) [hidden state update]
oₜ = softmax(V·hₜ) [output]
W = hidden-to-hidden weights. U = input-to-hidden weights. V = hidden-to-output weights. SAME weights at every time step!
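The two equations above can be run forward in NumPy — note the SAME W, U, V are reused at every step (my sketch with random weights; sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, C = 4, 3, 5                  # hidden size, input size, output classes
W  = rng.normal(0, 0.1, (H, H))    # hidden-to-hidden (shared across time)
U  = rng.normal(0, 0.1, (H, D))    # input-to-hidden
Vw = rng.normal(0, 0.1, (C, H))    # hidden-to-output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(H)                            # h_0 = 0
for x_t in rng.normal(size=(6, D)):        # a length-6 input sequence
    h = np.tanh(W @ h + U @ x_t)           # h_t = tanh(W·h_{t-1} + U·x_t)
    o = softmax(Vw @ h)                    # o_t = softmax(V·h_t)

print(o.sum())  # 1.0 — a valid distribution at every step
```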
Language Model Probability
P(xₜ₊₁ = wᵢ | xₜ, xₜ₋₁, ..., x₁)
Full probability of entire sequence: P(w₁, w₂, ..., wₙ) = Π P(wₜ|w₁,...,wₜ₋₁). RNN compresses history into hₜ.
A
Abbreviations — Lec 7
Abbr | Full Name | Key Point
NLP | Natural Language Processing | AI for text/language
BOW | Bag of Words | Count word occurrences, ignore order
TF-IDF | Term Frequency-Inverse Document Frequency | Weight words by rarity in corpus
PPMI | Positive Pointwise Mutual Information | Measure word co-occurrence, sparse
CBOW | Continuous Bag of Words | Predict center word from context
RNN | Recurrent Neural Network | Has hidden state, processes sequences
LSTM | Long Short-Term Memory | Gated RNN, avoids vanishing gradients
GRU | Gated Recurrent Unit | Simpler LSTM (2 gates vs 3)
BPTT | Backpropagation Through Time | Backprop unrolled through time steps
V | Vocabulary | Set of all unique words in corpus
W, U, V | Weight matrices in RNN | W=h-to-h, U=x-to-h, V=h-to-output
GloVe | Global Vectors (Stanford) | Pre-trained word embeddings
xLSTM | Extended LSTM | Modern variant of LSTM
📋
CHEAT SHEET — Lec 7

ELEN 521 — Lec 7: Embeddings, RNN, LSTM, GRU

TF-IDF

TF(w,d) = freq(w) / total_words(d)
IDF(w) = log(N / df(w))
TF-IDF = TF × IDF → 0 for common words!

Cosine Similarity

cos(v,w) = v·w / (||v||·||w||)
Range [-1,1]. 1=same, 0=orthogonal, -1=opposite

Word2Vec Skip-gram

P(+|t,c) = σ(t·c) = 1/(1+e^(-t·c))
Noise sampling: p^(3/4) boosts rare words

Embedding Analogy

king - man + woman ≈ queen
Vector arithmetic captures relations!

Simple RNN

hₜ = tanh(W·hₜ₋₁ + U·xₜ)
oₜ = softmax(V·hₜ)
SAME W,U,V at every time step

LSTM Gates

Forget gate: what to remove from cell state (σ)
Input gate: what new info to add (σ + tanh)
Output gate: what to output (σ × tanh(C))
Cell state C: long-term memory
Hidden state h: short-term memory

GRU (Simpler)

Reset gate: how much past to forget
Update gate: how much new vs old to mix
Fewer params, often ≈ LSTM performance

RNN Types (by I/O shape)

1→1: standard (image classification)
1→N: image captioning
N→1: sentiment analysis
N→N: translation, seq2seq
N↔N: video frame labeling
08
Applications of RNNs & Attention Mechanism
Sentiment Analysis · Text Generation · seq2seq · Bidirectional RNN · Image Captioning · Attention · ELMO · Key/Query/Value
💡
Core Concepts

Sentiment Analysis

Input: text sequence. Output: positive/negative. Architecture: Embed words → feed through LSTM → use final hidden state → Dense(sigmoid). Many-to-one architecture.

Character Generation

Train LSTM on text. At inference: generate one character at a time, feed it back as input for next step. Uses LSTM with `stateful=True` and `reset_states()` between sequences.

seq2seq (Encoder-Decoder)

Encoder LSTM reads input sequence → compresses into context vector (last hidden state). Decoder LSTM takes context vector → generates output sequence one token at a time. Used for translation, chatbots, Q&A.

Encoder = reading a sentence and forming a mental representation. Decoder = translating that mental representation into another language.

seq2seq Bottleneck Problem

The entire input must be compressed into ONE vector (last hidden state of encoder). For long sequences, too much information is lost. Solution: Attention.

Bidirectional RNN

Run one LSTM forward (left→right) and one backward (right→left). Concatenate both hidden states. The model can see both past AND future context at each position.

Understanding a word from both what came before AND after it in the sentence.

Image Captioning

CNN extracts image features → feed as initial context to LSTM → LSTM generates caption word by word starting with <START> token until <END>. One-to-many (one image in, a sequence out).

Attention Mechanism

Instead of compressing input to ONE vector, decoder can "attend" to all encoder hidden states. At each decoder step, compute a weighted sum of ALL encoder states — attention weights tell us which input tokens to focus on.

When translating "dog", the decoder looks back at the encoder states and pays more attention to the part of input that said "Hund" (German for dog).

Keys, Queries, Values

Query: what the decoder is asking. Key: what each encoder state is. Value: what the encoder state contains. Similarity(query, key) → attention weight → weighted sum of values.

ELMO (Contextual Embeddings)

Unlike Word2Vec (one vector per word), ELMo generates DIFFERENT vectors for the same word depending on context. "fair" in "fair game" vs "county fair" → different vectors. Uses bidirectional LSTMs.

Beyond RNNs

Attention → Transformers → ELMo → GPT → BERT (roughly chronological; ELMo still used biLSTMs, the rest are transformer-based). Each step improved long-range dependency handling. Transformers replaced RNNs for most NLP tasks by 2018-2019.

Math — Lec 8 (Attention)
Attention Scores (encoder hidden states = keys, decoder state = query)
eₜᵢ = f_att(aᵢ, hₜ₋₁) [attention score: how relevant encoder state i is at decoder step t]
αₜᵢ = exp(eₜᵢ) / Σₖ exp(eₜₖ) [softmax → attention probabilities]
aᵢ = encoder hidden states (keys/values), hₜ₋₁ = decoder state (query). f_att can be dot product, bilinear, or MLP.
Context Vector (weighted sum)
ẑₜ = Σᵢ αₜᵢ · aᵢ
Context vector ẑₜ is a soft selection over all encoder states. Heavy weight on relevant encoder positions, near-zero elsewhere.
Key-Query-Value Formulation
Attention(Q, K, V) = softmax(Q·Kᵀ / √d) · V
where: Q=query, K=keys, V=values, d=dimension
This is the scaled dot-product attention used in Transformers. √d scales to prevent large dot products from causing softmax to saturate.
Image Captioning with Attention
a = {a₁,...,aₗ}, aᵢ ∈ ℝᴰ [L spatial regions from CNN]
eₜᵢ = f_att(aᵢ, hₜ₋₁) [attention energy]
αₜᵢ = exp(eₜᵢ) / Σₖ exp(eₜₖ) [attention weights: softmax over the L regions]
ẑₜ = Σᵢ αₜᵢ · aᵢ [context vector = attended image region]
At each word generation step, the decoder "looks at" different parts of the image. When generating "dog", αₜᵢ peaks over the dog region.
Abbreviations — Lec 8
Abbr | Full Name | Key Point
seq2seq | Sequence to Sequence | Encoder-decoder for variable-length I/O
ELMo | Embeddings from Language Models | Contextual word embeddings via biLSTM
BERT | Bidirectional Encoder Representations from Transformers | Pre-trained transformer encoder
GPT | Generative Pre-trained Transformer | Autoregressive transformer decoder
Q, K, V | Query, Key, Value | Components of attention mechanism
BiLSTM | Bidirectional LSTM | Forward + backward LSTM concatenated
B, T, F | Batch, Time, Features | Standard tensor shape for sequences
MoE | Mixture of Experts | Multiple specialized sub-networks
Code — Lec 8
LSTM API in Keras — All Variants
# Default: returns last output
x = LSTM(32)(x)  # x: (batch, 32)

# Return all time step outputs
x = LSTM(32, return_sequences=True)(x)  # x: (batch, T, 32)

# Return output + states (h and c)
x, h, c = LSTM(32, return_state=True)(x)  # [y, h, c] each (batch, 32)

# Use initial state (from encoder in seq2seq)
x = LSTM(32)(x, initial_state=[h, c])

# Bidirectional
x = Bidirectional(LSTM(64))(x)  # output doubled: (batch, 128)
Attention in Code (Matrix Form)
# h: encoder states (B, TE, F), s: decoder states (B, TD, F)
h_p = Permute((2, 1))(h)            # (B, F, TE) - swap time and feature axes
e = tf.matmul(s, h_p)               # (B, TD, TE) - attention scores
alpha = tf.nn.softmax(e, axis=-1)   # (B, TD, TE) - weights, softmax over TE
a = tf.matmul(alpha, h)             # (B, TD, F) - context vectors
Character Generation
x = LSTM(32, stateful=True)(x)    # Remembers state between batches!
# ... build model ...

model.reset_states()               # Reset between sequences
x = np.argmax(output)              # Greedy decoding
# OR sample from distribution:
x = np.random.choice(len(output), size=1, p=output)
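The greedy-vs-sampling choice above is often controlled with a temperature parameter. This is a common extension, not part of the lecture code:

```python
import numpy as np

def sample_char(probs, temperature=1.0):
    """Sample the next character index from the softmax output.
    temperature is a common extension (not in the lecture code):
    values < 1 sharpen toward argmax, values > 1 flatten toward uniform."""
    logits = np.log(np.asarray(probs) + 1e-9) / temperature
    p = np.exp(logits - logits.max())
    p = p / p.sum()
    return np.random.choice(len(p), p=p)

# Very low temperature behaves like the greedy np.argmax branch above
```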
📋
CHEAT SHEET — Lec 8

ELEN 521 — Lec 8: RNN Applications & Attention

Attention Math

eₜᵢ = f_att(aᵢ, hₜ₋₁) (score)
αₜᵢ = softmax(eₜᵢ) (weights)
ẑₜ = Σᵢ αₜᵢ·aᵢ (context)
Scaled dot-product: softmax(Q·Kᵀ/√d)·V

seq2seq

Encoder: reads input → context vector
Decoder: context vector → output sequence
Bottleneck: entire input → 1 vector (problem!)
Fix: Attention over all encoder states

Keras LSTM Cheat

LSTM(32): → (B, 32) last output
LSTM(32, return_sequences=True): → (B,T,32)
LSTM(32, return_state=True): → [y,h,c]
Bidirectional(LSTM(64)): → (B, 128)

Applications

Sentiment: N→1 · Translation: N→N · Captioning: 1→N · Generation: N→N (stateful)

ELMo vs Word2Vec

Word2Vec: 1 vector per word (context-free)
ELMo: different vector per context (better!)
09
Unsupervised Learning: Autoencoders, VAEs & GANs
Autoencoders · Denoising AE · VAE · KL Divergence · Reparameterization · GAN · Discriminator · Generator · VQ-VAE · Diffusion
💡
Core Concepts

Why Unsupervised / Generative Models?

Labels are expensive. Unsupervised: learn structure from data without labels. Generative: learn to GENERATE new samples (not just classify). Applications: image synthesis, denoising, style transfer, data augmentation.

Autoencoder

Encoder: compress input x → bottleneck z (latent code). Decoder: reconstruct x from z. Trained to minimize reconstruction error. Forces the network to learn a compact representation.

Like learning to describe an image in 50 words, then reconstruct it from those 50 words.

Denoising Autoencoder

Add noise to input, train to reconstruct clean original. Forces the autoencoder to learn robust features. Input: noisy image. Target: clean image. Architecture: Conv → bottleneck → Conv2DTranspose.

Problem with Regular Autoencoders

Latent space z is not continuous or structured. You can't sample a random z and get a meaningful image. VAEs fix this by imposing a distribution on z.

VAE — Variational Autoencoder

Encoder outputs μ (mean) and σ (std) instead of a single z. Sample z ~ N(μ, σ²). Decoder takes z → output. Training forces z to be close to N(0,1) via KL divergence loss.

Instead of a single point in latent space, each input maps to a REGION (Gaussian). This makes the latent space smooth and traversable.

Reparameterization Trick

Problem: sampling is not differentiable. Solution: z = μ + σ·ε where ε ~ N(0,1). Now the randomness (ε) is separate from the learnable parts (μ, σ) → gradients can flow!

GAN — Generative Adversarial Network

Two networks in competition: Generator G (makes fake images from noise z) vs Discriminator D (tells real from fake). G tries to fool D. D tries to correctly classify. They improve each other.

Forger (G) vs Art Authenticator (D). Forger gets better to fool the authenticator, who gets better to catch the forger.

GAN Problems

1. Nash equilibrium hard to achieve. 2. Vanishing gradient. 3. Mode collapse: G produces only one type of output. 4. No good evaluation metric.

VQ-VAE

Vector Quantized VAE: latent space is discrete (codebook of vectors). Encoder output is "snapped" to nearest codebook entry. Enables high-quality generation. Used in DALL-E.
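The "snapping" step can be sketched in NumPy; the codebook values here are toy numbers.

```python
import numpy as np

def quantize(z_e, codebook):
    """Snap each encoder output to its nearest codebook entry.
    z_e: (N, D) continuous encoder outputs; codebook: (K, D) embeddings."""
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K) sq. distances
    idx = d.argmin(axis=1)            # discrete code index per vector
    return codebook[idx], idx         # quantized vectors, code indices

codebook = np.array([[0., 0.], [1., 1.], [5., 5.]])  # toy 3-entry codebook
z_e = np.array([[0.2, -0.1], [4.8, 5.3]])
z_q, idx = quantize(z_e, codebook)   # idx is [0, 2]: each vector's nearest code
```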

Diffusion Models

Gradually add noise to images (forward process) then train a network to reverse it (denoise step by step). State-of-the-art for image generation (DALL-E 2, Stable Diffusion).
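The forward (noising) process admits a closed form, x_t = √ᾱ_t·x₀ + √(1−ᾱ_t)·ε with ᾱ_t = Π(1−β_s). A sketch with a linear β schedule; the schedule values are illustrative assumptions, not from the lecture:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # illustrative linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)    # alpha_bar[t]: signal fraction kept at step t

def add_noise(x0, t, eps):
    """Jump straight to noise level t: x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.ones(4)
eps = np.random.randn(4)
x_early = add_noise(x0, 10, eps)     # mostly signal
x_late = add_noise(x0, T - 1, eps)   # mostly noise; the denoiser learns to reverse this
```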

VAE-GAN

Combine VAE (structured latent space) with GAN (sharp, realistic outputs). Encoder-decoder from VAE + discriminator from GAN. Best of both worlds.

Feature Addition to Images (VAE)

Find images with and without a feature. Compute latent vectors, subtract to get feature direction. Add this direction to μ of any image to add that feature.

In latent space, directions correspond to semantic attributes (smile, glasses, age). Arithmetic works!
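A NumPy sketch of this latent arithmetic; the 2-D latent vectors and the feature itself are made-up toy values.

```python
import numpy as np

# Toy VAE latent means (2-D for readability); values are illustrative only
mu_with = np.array([[1.0, 2.0], [1.2, 1.8]])      # images WITH the feature
mu_without = np.array([[0.1, 0.0], [-0.1, 0.2]])  # images WITHOUT it

# Direction in latent space that "means" the feature
feature_dir = mu_with.mean(axis=0) - mu_without.mean(axis=0)

mu_new = np.array([0.5, 0.5])       # latent mean of some new image
mu_edited = mu_new + feature_dir    # decode(mu_edited) should add the feature
```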
Math — Lec 9
VAE — ELBO (Evidence Lower BOund) Loss
L = E[log P(x|z)] - KL(Q(z|x) || P(z)) [the ELBO, to be maximized]
Training loss = -L = Reconstruction Loss + KL Divergence Loss [minimized in practice]
Reconstruction: how well we rebuild x from z (cross-entropy for images). KL: how close Q(z|x)=N(μ,σ²) is to the prior P(z)=N(0,1).
KL Divergence — Closed Form for Gaussians
KL(N(μ,σ²) || N(0,1)) = -½ · Σ(1 + log(σ²) - μ² - σ²)
This is the exact formula your prof uses! No need to estimate. This term regularizes the latent space to be close to a standard normal. Minimizing it pulls μ→0, σ→1.
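A quick NumPy check of the closed form, using the log σ² parameterization that also appears in the Code section below:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """-1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2), summed over latent dims."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

kl_zero = kl_to_standard_normal(np.zeros(4), np.zeros(4))  # mu=0, sigma=1 -> KL = 0
kl_off = kl_to_standard_normal(np.ones(4), np.zeros(4))    # mu=1 adds 0.5 per dim
```

KL is exactly zero only at μ=0, σ=1, which is what pulls the latent space toward N(0,1).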
Reparameterization Trick
z = μ + σ · ε, where ε ~ N(0, 1)
Gradient of loss w.r.t. μ and σ can now be computed. Without this trick: z ~ N(μ,σ²) is not differentiable and backprop would fail.
GAN — Objective Function (Minimax)
min_G max_D V(D,G) = E[log D(x)] + E[log(1 - D(G(z)))]
D tries to maximize V (correctly classify real/fake). G tries to minimize V (fool D). At Nash equilibrium: D(x)=0.5 everywhere (can't tell real from fake).
GAN Training Trick — Flip Labels
Instead of minimizing log(1-D(G(z))), maximize log(D(G(z)))
Same game-theoretically, but much better gradient signal early in training when G is bad and D(G(z))≈0 → gradients don't vanish.
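A two-line numeric illustration of why the flip matters early in training (the value of D(G(z)) is illustrative):

```python
# Gradient magnitudes w.r.t. D(G(z)) when D easily rejects fakes (early training)
d_fake = 1e-3  # discriminator output on a fake sample, near 0

grad_saturating = -1.0 / (1.0 - d_fake)  # d/dD log(1 - D): magnitude ~1 (weak)
grad_flipped = 1.0 / d_fake              # d/dD log(D): magnitude ~1000 (strong)
```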
Jensen Inequality (used in VAE derivation)
f(E[X]) ≥ E[f(X)] for concave f (like log)
Used to derive the ELBO: log P(x) ≥ E[log P(x|z)] - KL(Q||P). The ELBO is a lower bound on the true log-likelihood.
Abbreviations — Lec 9
Abbr | Full Name | Key Point
AE | Autoencoder | Compress → reconstruct, learn latent code
VAE | Variational Autoencoder | AE with structured latent space (Gaussian)
GAN | Generative Adversarial Network | Generator vs Discriminator adversarial game
DCGAN | Deep Convolutional GAN | GAN using Conv/Deconv layers
KL | Kullback-Leibler Divergence | Measures difference between distributions
ELBO | Evidence Lower BOund | VAE loss = reconstruction + KL
VQ-VAE | Vector Quantized VAE | Discrete latent space (codebook)
RQ-VAE | Residual Quantized VAE | Stacked codebooks, each quantizing the previous residual
z | Latent Code / Latent Vector | Compressed representation of input
G | Generator | Network that makes fake samples from noise
D | Discriminator | Network that tells real from fake
μ, σ | Mean, Standard Deviation | VAE encoder outputs these per latent dim
ε | Epsilon (noise) | ε ~ N(0,1) in reparameterization trick
Code — Lec 9
Denoising Autoencoder (Conv2DTranspose for upsampling)
# Encoder
x = Input(shape=(28,28,1))
z = Conv2D(32, (3,3), activation='relu', padding='same')(x)
z = MaxPooling2D((2,2))(z)

# Decoder (Conv2DTranspose = learned upsampling)
out = Conv2DTranspose(32, (3,3), strides=(2,2), padding='same', activation='relu')(z)
out = Conv2D(1, (3,3), activation='sigmoid', padding='same')(out)
# Input: noisy image | Target: clean image
VAE — Reparameterization Layer
class Sampling(Layer):
    def call(self, inputs):
        mu, log_var = inputs
        epsilon = tf.random.normal(shape=tf.shape(mu))
        return mu + tf.exp(0.5 * log_var) * epsilon  # z = μ + σ·ε

# VAE loss:
reconstruction_loss = cross_entropy(y_true, y_pred)
kl_loss = -0.5 * tf.reduce_sum(1 + log_var - tf.square(mu) - tf.exp(log_var))
vae_loss = reconstruction_loss + kl_loss
GAN Training Loop
for epoch in range(epochs):
    # Train Discriminator
    real_labels = np.ones((batch_size, 1)) * 0.9  # Label smoothing!
    fake_labels = np.zeros((batch_size, 1)) + 0.1
    z = np.random.normal(0, 1, (batch_size, latent_dim))  # Sample from Gaussian!
    fake_imgs = generator.predict(z)
    discriminator.train_on_batch(real_imgs, real_labels)
    discriminator.train_on_batch(fake_imgs, fake_labels)
    
    # Train Generator (freeze discriminator)
    z = np.random.normal(0, 1, (batch_size, latent_dim))
    # Fool discriminator: label fakes as REAL
    gan.train_on_batch(z, np.ones((batch_size, 1)))
📋
CHEAT SHEET — Lec 9

ELEN 521 — Lec 9: Autoencoders, VAEs, GANs

Autoencoder

Encode x → z (bottleneck) → Decode z → x̂
Loss = ||x - x̂||² (reconstruction)
Denoising: add noise to input, reconstruct clean

VAE Loss (ELBO)

Loss = -E[log P(x|z)] + KL(N(μ,σ²)||N(0,1)) = reconstruction loss + KL loss
KL = -½·Σ(1 + log σ² - μ² - σ²)

Reparameterization Trick

z = μ + σ·ε, ε~N(0,1)
Makes sampling differentiable → backprop works!

Jensen Inequality

log(E[X]) ≥ E[log(X)]
Used to derive ELBO from marginal likelihood

GAN Objective

min_G max_D [E[log D(x)] + E[log(1-D(G(z)))]]
D: maximize (real=1, fake=0)
G: fool D (make D output 1 for fakes)

GAN Tricks (Key ones)

• z~N(0,1) not uniform!
• Label smooth: real=0.9, fake=0.1
• Use tanh as last G activation (output in [-1,1])
• Leaky ReLU + AvgPool (no MaxPool)
• Separate mini-batches for real/fake

GAN Problems

Mode collapse · Vanishing gradient · No good eval metric · Nash equilibrium hard to reach

Hierarchy

AE → VAE → VQ-VAE → Diffusion
GAN → DCGAN → VAE-GAN