Applications of RNNs & Attention
How to Translate · seq2seq · Attention · ELMO → Transformers

1 · Embeddings Review (What each dimension represents)
▼E[day] · E[sunny] ≈ 1
Stockholm after 9pm in summer still has daylight (midnight sun), and days in a Stockholm summer are described as sunny. Because these words co-occur constantly in the training text, their embedding vectors point in similar directions → high dot product ≈ 1. The embedding space learns REAL-WORLD RELATIONS.
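A toy illustration of this (the 3-dimensional vectors below are invented for the example, not trained embeddings): words that co-occur constantly end up with nearly parallel vectors, so their normalised dot product approaches 1.

```python
import numpy as np

# Toy 3-d "embeddings" (made up for illustration, not trained vectors)
e_day   = np.array([0.9, 0.4, 0.1])
e_sunny = np.array([0.8, 0.5, 0.2])

# Cosine similarity = dot product of unit-normalised vectors
cos = e_day @ e_sunny / (np.linalg.norm(e_day) * np.linalg.norm(e_sunny))
print(round(cos, 2))  # close to 1 → the vectors point in similar directions
```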
Polysemy Problem (Word2Vec fails here)
From slides: "Will word2vec work well for a word that has different meanings depending on context?"
"I went to the fair last Saturday." → fair = festival
"I do not believe your attitude is fair." → fair = just
Word2Vec gives ONE vector for "fair" — an average of all usages. Captures neither meaning well. Solution: ELMO, BERT → contextual embeddings.
Domain Adaptation
From slides: "Vectors need to be adapted for different domains (medical, legal, etc.)" General embeddings trained on Wikipedia fail for medical text — "cold" means something different in "I have a cold" vs a medical paper. Fine-tune embeddings on domain-specific text.
2 · LSTM & SimpleRNN Keras API — All Variants & Shapes
▼
```python
# W = hidden-to-hidden, U = input-to-hidden, V = hidden-to-output
# The same W, U, V are reused at EVERY time step t
# (the T bracket in the slide figure shows all steps sharing weights)
# Input shape: (batch, T, |x|)

x = LSTM(32)(x)                         # (batch, 32)    ← last-timestep output only
x = LSTM(32, return_sequences=True)(x)  # (batch, T, 32) ← outputs at ALL timesteps
y, h, c = LSTM(32, return_state=True)(x)
# y = last output (= h), h = last hidden state, c = last cell state
x = LSTM(32)(x, initial_state=[h, c])
# pass the encoder's [h, c] as the decoder's initial state (seq2seq handoff)
```
```python
x = Dense(32)(x)
x = Dense(64)(x)

# x: (B, T, E)
x = LSTM(32, return_sequences=True)(x)  # → (B, T, 32)
x = LSTM(64)(x)                         # → (B, 64)

# Making a 2*T-length sequence:
# x: (B, T, E)
y, h, c = LSTM(32, return_state=True)(x)  # y, h, c each (B, 32)
x = LSTM(32)(x, initial_state=[h, c])     # → (B, 32)
```
3 · Sentiment Analysis — N→1 Architecture
▼Architecture (N→1)
Input x: (B, T, F)
LSTM[0] → LSTM[t] → LSTM[T-1]
Only y[T-1] is used (last output)
→ Dense → pos/neg
N→1: read entire sequence, output one label at the end.
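A minimal Keras sketch of this N→1 architecture (the feature size 64 and hidden size 32 below are arbitrary choices for the example):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(None, 64))            # (B, T, F): variable-length sequence, F=64
h = layers.LSTM(32)(inp)                        # no return_sequences → only y[T-1] kept: (B, 32)
out = layers.Dense(1, activation='sigmoid')(h)  # single pos/neg score
model = Model(inp, out)

model.compile(optimizer='adam', loss='binary_crossentropy')
```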
Real Example from Slides
"Joker is brutal, dark and a compelling origin story of a man's painful journey to self destruction."
"Led by Phoenix's ferocious, feral performance, this especially dark, gritty comic book movie is a character drama..."
→ Positive (critics liked it despite the dark tone)
4 · Character & Text Generation — Code, LaTeX, HTML, Patents
▼
```python
# x: (B, 1, F) — one token at a time
x = LSTM(32, stateful=True)(x)   # state preserved between batches
# model …                        # (model assembly / training elided on the slide)
model.reset_states()             # reset between independent sequences

# Two sampling strategies:
idx = np.argmax(output)                                # greedy: always pick the max-probability token
idx = np.random.choice(len(output), size=1, p=output)  # stochastic: sample proportionally
```
Character generation
Generate text character by character. Vocabulary = alphabet + punctuation.
Code generation
Karpathy's RNN generating Python/C code. Vocabulary = code tokens.
LaTeX / HTML / Patents
Same pattern — stateful LSTM, one token at a time. Only vocabulary differs.
Key concept: stateful=True
Normally Keras resets LSTM state between batches. stateful=True keeps the state — so the LSTM remembers across batches. Required for long sequence generation where the sequence is longer than one batch. Must call reset_states() manually between independent sequences.
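A toy sketch of the stateful generation loop (10-token vocabulary with random, untrained weights, purely to show the mechanics; batch size must be fixed for `stateful=True`):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

V = 10                                            # toy vocabulary size (assumption)
inp = layers.Input(batch_shape=(1, 1, V))         # batch=1, ONE token per call
h = layers.LSTM(32, stateful=True)(inp)           # state survives across predict() calls
out = layers.Dense(V, activation='softmax')(h)
model = Model(inp, out)

token = np.eye(V, dtype='float32')[None, :1]      # one-hot start token, shape (1, 1, V)
for _ in range(5):                                # generate 5 tokens; state carries over
    p = model.predict(token, verbose=0)[0]
    idx = np.random.choice(V, p=p / p.sum())      # stochastic sampling
    token = np.eye(V, dtype='float32')[None, idx:idx + 1]

model.reset_states()                              # fresh state for the next sequence
```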
5 · seq2seq — Encoder→Decoder Architecture
▼Training mode (Teacher forcing)
Encoder reads source. Decoder receives the ground-truth target words (shifted by one) as input and predicts the next word. This is "teacher forcing" — the decoder always gets the correct previous word, not its own prediction.
Prediction / Inference mode
Encoder reads source. Decoder starts with <START> token. At each step, decoder's own output is fed back as the next input. Stops when <END> token is generated.
```python
# ENCODER
enc_in = Input((T_E, V))
enc_out, h, c = LSTM(H, return_state=True)(enc_in)
# enc_out = last output = h; h = last hidden state; c = last cell state

# DECODER
dec_in = Input((T_D, V))
dec_out = LSTM(H, return_sequences=True)(dec_in, initial_state=[h, c])  # ← encoder state passed here!
output = Dense(V, activation='softmax')(dec_out)
```
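The inference loop can be sketched in plain NumPy with a stubbed decoder step (the `step` function below is a random stand-in for a trained decoder, invented here just to show the feed-back mechanics and the <START>/<END> control flow):

```python
import numpy as np

V, END, MAX_LEN = 8, 7, 20          # toy vocabulary; token 7 plays the role of <END> (assumptions)
rng = np.random.default_rng(0)

def step(token, state):
    """Stand-in for one decoder LSTM step: returns next-token probs and new state."""
    state = np.tanh(state + token / V)           # fake state update
    logits = rng.normal(size=V) + state.sum()
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs, state

state = np.zeros(4)                 # "encoder" (h, c) stand-in
token = 0                           # <START>
decoded = []
for _ in range(MAX_LEN):
    probs, state = step(token, state)
    token = int(np.argmax(probs))   # greedy: decoder's own output fed back as next input
    if token == END:                # stop when <END> is generated
        break
    decoded.append(token)
```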
6 · Bidirectional RNNs
▼
```python
x = Bidirectional(LSTM(64))(x)
# Output shape: (B, 128) ← doubled! 64 from forward + 64 from backward

# Architecture from slides:
# Forward:  LSTM[0] → LSTM[t] → LSTM[N-1] → y[N-1]
# Backward: LSTM[N-1] → LSTM[t] → LSTM[0]  (same inputs, read in reverse)
# x[0], x[t], ..., x[N-1] are fed to both directions
```
When to use Bidirectional
When the FULL sequence is available at inference time (not generation). Good for: sentiment analysis (see whole review), named entity recognition, translation encoder. NOT useful for generation (you don't have future tokens yet).
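A quick shape check of the doubling (sequence length 10 and feature size 8 are arbitrary choices here):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(10, 8))                 # (B, T=10, F=8)
out = layers.Bidirectional(layers.LSTM(64))(inp)  # forward 64 + backward 64, concatenated
model = Model(inp, out)

y = model.predict(np.random.rand(2, 10, 8).astype('float32'), verbose=0)
print(y.shape)  # (2, 128) ← 64 doubled
```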
7 · Image Captioning + Image Captioning with Attention
▼8 · ELMO — Contextual Embeddings
▼How ELMO works
Train a deep bidirectional LSTM on a language model task. Use the internal representations (all layers) as word vectors. The same word "fair" gets different vectors in "county fair" vs "fair treatment" — because the surrounding words change the LSTM's hidden state.
The progression beyond ELMO
From slides: ELMO → Attention → Transformers → BERT → GPT
Each step improved how context is used. Transformers removed the RNN entirely and used attention at every layer. BERT trained bidirectional Transformer. GPT trained autoregressive Transformer. These are the foundations of ChatGPT.
9 · Bottleneck Problem in seq2seq
▼The Problem
The encoder reads the ENTIRE source sequence and must compress ALL information into a SINGLE fixed-size state vector (h,c) — dimension H. For long sentences, this vector can't hold all the detail. Information is lost. Translation quality drops for long sequences.
Why attention solves it
Instead of passing only the final (h,c), pass ALL T_E encoder hidden states. At each decoder step, the decoder can attend to any of the encoder states — not just the last one. Long-range dependencies can be directly accessed.
10 · Attention — All Equations (Slides 53 & 54)
▼Alignment Matrix (Slide 52)
When attention is visualised as a matrix (source words on X-axis, target words on Y-axis), it forms a near-diagonal for simple sentences (word-for-word alignment) and a more complex pattern for sentences with different word orders (e.g. English vs French adjective placement). Bright = high attention weight.
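The equations from slides 53–54 are not transcribed in these notes; for reference, the standard dot-product attention they correspond to (matching the shapes in the Keras code of section 11) is:

```latex
e_{t,i} = s_t \cdot h_i
  \qquad \text{(score: decoder state } t \text{ vs encoder state } i\text{)}

\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T_E} \exp(e_{t,j})}
  \qquad \text{(softmax over encoder steps)}

a_t = \sum_{i=1}^{T_E} \alpha_{t,i}\, h_i
  \qquad \text{(context vector for decoder step } t\text{)}
```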
11 · Attention — Keras Code (Slide 55)
▼
```python
# h: (B, TE, F) — all encoder hidden states
# s: (B, TD, F) — all decoder hidden states
# s[t]: (B, F)  — decoder state at step t

h_p = Permute((2, 1))(h)            # (B, F, TE) ← transpose the encoder states

# (B, TD, TE) = (B, TD, F) @ (B, F, TE)
e = tf.matmul(s, h_p)               # (B, TD, TE) ← attention scores
alpha = tf.nn.softmax(e, axis=-1)   # (B, TD, TE) ← attention weights over encoder steps

# (B, TD, F) = (B, TD, TE) @ (B, TE, F)
a = tf.matmul(alpha, h)             # (B, TD, F) ← context vectors
```
Permute
Transposes the encoder output from (B,TE,F) to (B,F,TE). Needed for matrix multiply to compute dot products between each decoder step and each encoder step.
matmul(s, h_p)
(B,TD,F)·(B,F,TE) = (B,TD,TE). Each of the TD decoder steps scores against all TE encoder steps. One big batch matrix multiply.
matmul(alpha, h)
(B,TD,TE)·(B,TE,F) = (B,TD,F). Weights the encoder states by attention and sums. Output: one context vector per decoder step.
12 · "What We Have Seen So Far" — Tables · Dictionaries · Gates
▼| Classical Computing | Deep Learning Equivalent | Key difference |
|---|---|---|
| Table lookup | Embedding layer | Integer index → dense row. Exact, hard lookup. O(1). |
| Dictionary / Hash map | Attention mechanism | Soft lookup. Query vs all keys → weighted sum of values. Differentiable. |
| if-then-else | LSTM gates / Mixture of Experts | Sigmoid = differentiable switch 0→1. Backprop flows through it. |
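The "soft lookup" row of the table can be made concrete in a few lines of NumPy (a toy two-key store with invented numbers):

```python
import numpy as np

keys   = np.array([[1.0, 0.0], [0.0, 1.0]])      # two stored keys
values = np.array([[10.0], [20.0]])              # their associated values
query  = np.array([0.9, 0.1])                    # close to key 0

scores  = keys @ query                           # dot-product match against ALL keys
weights = np.exp(scores) / np.exp(scores).sum()  # softmax → differentiable "lookup"
answer  = weights @ values                       # weighted SUM of values, not a hard pick
# answer is pulled toward 10 (key 0 matches better) but blended with 20
```

Unlike a hash map, every key contributes to the result, and gradients flow through the weights, which is exactly what attention exploits.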
Autoencoders · VAEs · GANs · Generative Models
How to CREATE · From compression to generation

1 · Autoencoders — Architecture & Math
▼Standard Autoencoder vs VAE (slide 26)
Standard AE: encoder maps x to a SINGLE POINT z in latent space. Latent space is fragmented — different classes map to scattered, disconnected points. No structure.
VAE: encoder maps x to a GAUSSIAN REGION (μ, σ). KL loss forces all regions to overlap near N(0,1). Latent space is smooth and continuous.
Application: Autoencoders
- Dimensionality reduction
- Anomaly detection (high reconstruction error = anomaly)
- Denoising (train on noisy input, clean target)
- Feature learning without labels
2 · Denoising Autoencoders — Full Code
▼
```python
# ENCODER
input_img = Input(shape=(28, 28, 1))  # adapt for channels_first format
x = Conv2D(32, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)
# representation is (7, 7, 32)

# DECODER ← uses UpSampling2D (or Conv2DTranspose)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)  # ← tf.keras.layers.Conv2DTranspose on the slide
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
```
Conv2DTranspose vs UpSampling2D
UpSampling2D: simple nearest-neighbour upscaling (no learned parameters). Conv2DTranspose: learned transposed convolution — can produce sharper upsampling. Slide labels the UpSampling block as "tf.keras.layers.Conv2DTranspose" — meaning use Conv2DTranspose in practice for better quality.
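The difference is easy to verify directly (the filter count 32 mirrors the code above; both layers upsample 7×7 → 14×14, but only Conv2DTranspose has learnable weights):

```python
import tensorflow as tf
from tensorflow.keras import layers

up = layers.UpSampling2D((2, 2))
ct = layers.Conv2DTranspose(32, (3, 3), strides=2, padding='same')

x = tf.zeros((1, 7, 7, 32))
print(up(x).shape, ct(x).shape)              # both produce (1, 14, 14, 32)
print(up.count_params(), ct.count_params())  # 0 vs learned kernel + bias
```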
3 · Variational Autoencoders — Architecture & Intuition
▼Generative use (slide 23)
z ~ N(0, 1) ← sample from standard normal
x = Decoder(z) ← generate new image
Because KL loss forces the latent space toward N(0,1), any randomly sampled z maps to a valid image. This is generation.
Latent space structure (slide 31)
Left plot: scattered clusters (standard AE style). Right plot: overlapping, continuous cloud (VAE style). The VAE latent space is smooth — travel in any direction and get a valid image. This is what enables interpolation and feature arithmetic.
4 · Reparameterisation Trick — Code from Slides
▼
```python
def sampling(args):
    z_mean, z_log_sigma = args
    epsilon = K.random_normal(shape=(batch_size, latent_dim),
                              mean=0., stddev=epsilon_std)  # (old Keras spelled this std=)
    return z_mean + K.exp(z_log_sigma) * epsilon
    # z = μ + exp(log σ) * ε = μ + σ * ε

# note: "output_shape" isn't necessary with the TensorFlow backend,
# so you could write Lambda(sampling)([z_mean, z_log_sigma])
z = Lambda(sampling, output_shape=(latent_dim,))([z_mean, z_log_sigma])
```
Why log_sigma instead of sigma?
The network outputs log σ (`z_log_sigma` in the code) instead of σ directly. Reason: σ is a standard deviation, so it must be positive, but a linear layer can output any real number. Exponentiating the output, σ = exp(log σ), is always positive, so no explicit positivity constraint is needed. (Note the slide code mixes conventions: the sampling function treats `z_log_sigma` as log σ, while the KL loss below treats it as log σ².)
5 · VAE Loss Function — Full Code & All Formulas (Slide 30)
▼
```python
def vae_loss(x, x_decoded_mean):
    xent_loss = objectives.binary_crossentropy(x, x_decoded_mean)
    kl_loss = -0.5 * K.mean(1 + z_log_sigma - K.square(z_mean) - K.exp(z_log_sigma),
                            axis=-1)
    return xent_loss + kl_loss
```
Reconstruction loss (cross-entropy)
binary_crossentropy between original x and reconstructed x_decoded_mean. Measures how well the decoder rebuilt the input. Low = good reconstruction.
KL divergence
-0.5 * (1 + log_σ² − μ² − σ²). KL=0 only when μ=0 and σ²=1. Minimising pushes encoder toward N(0,1). Controls how structured the latent space is.
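A quick numerical check of the per-dimension formula (plain NumPy, toy values):

```python
import numpy as np

def kl(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, 1) ) per latent dimension
    return -0.5 * (1 + log_var - mu**2 - np.exp(log_var))

print(kl(0.0, 0.0))  # 0.0 → exactly N(0,1), no penalty
print(kl(2.0, 0.0))  # 2.0 → moving the mean away from 0 is penalised
```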
6 · ELBO Derivation — Jensen Inequality (Slides 32 & 33)
▼log p(x) = log ∫ p(x|z) p(z) dz = log E_{q(z|x)}[ p(x|z) p(z) / q(z|x) ]
≥ E_{q(z|x)}[log p(x|z)] − KL( q(z|x) ‖ p(z) )    (Jensen's inequality: log E[·] ≥ E[log ·])
= Reconstruction likelihood − KL divergence
7 · How to Add Features to a Picture (Slide 34)
▼Why this works
Because the KL term forced the latent space to be continuous and structured. Every point in the space corresponds to a valid image. Directions are consistent — "smile direction" always adds a smile regardless of who the person is.
8 · Beyond VAE — VQ-VAE & RQ-VAE (Slides 36 & 37)
▼VQ-VAE — Vector Quantized VAE (slide 36)
From slides: "embedding learns | latent encoding space learns"
Uses a DISCRETE latent space — a codebook of K learned vectors instead of continuous z. Encoder output is "snapped" to the nearest codebook entry. Result: higher quality generation, discrete representation (like tokens). Foundation of DALL-E.
RQ-VAE — Residual Quantized VAE (slide 37)
RQ-VAE = Boosted Embeddings + VQ-VAE
Stacks multiple VQ-VAE quantisation stages. Each stage encodes the residual error of the previous one. Achieves much higher quality than single VQ-VAE. Like boosting applied to discrete latent codes.
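A minimal NumPy sketch of the residual-quantisation idea (the tiny codebooks below are made up; a real RQ-VAE learns them):

```python
import numpy as np

def quantize(x, codebook):
    """Snap x to its nearest codebook vector (the VQ step)."""
    idx = np.argmin(np.linalg.norm(codebook - x, axis=1))
    return codebook[idx]

x = np.array([0.9, 0.35])                  # vector to encode
codebooks = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),    # stage 1: coarse codes
    np.array([[-0.1, 0.3], [0.1, -0.3]]),  # stage 2: codes for the residual
]

approx = np.zeros_like(x)
residual = x.copy()
for cb in codebooks:          # each stage quantises what the previous stages missed
    q = quantize(residual, cb)
    approx += q
    residual = residual - q
# two stages approximate x much better than stage 1 alone
```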
9 · GAN — Architecture, Objective & Training Algorithm (Slides 38–40)
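The slide content for this section is not reproduced in these notes; for reference, the standard GAN minimax objective (Goodfellow et al.) that the problems below refer to is:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] +
  \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

D is trained to push V up (classify real vs fake correctly); G is trained to push it down (fool D).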
▼10 · Problems of GANs (Slide 41)
▼1. Hard to achieve Nash equilibrium
G and D are trained alternately. The landscape is non-stationary — each update changes what the other network is optimising. Convergence is not guaranteed.
2. Vanishing gradient
Early in training D is very good (G is terrible), so D(G(z)) ≈ 0. The generator's loss term log(1 − D(G(z))) then sits flat at log 1 = 0: it saturates, its gradient with respect to G's parameters vanishes, and G gets no training signal, so it can't improve.
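The saturation can be seen numerically by treating D's output as a sigmoid over a logit a (the "flip labels" trick from section 11 corresponds to the non-saturating loss below):

```python
import numpy as np

sigmoid = lambda a: 1 / (1 + np.exp(-a))

a = -10.0  # D's logit for a fake sample: D(G(z)) = sigmoid(a) ≈ 0

# gradient of the saturating loss  log(1 - D)  w.r.t. the logit a is  -sigmoid(a)
grad_saturating = -sigmoid(a)           # ≈ 0 → G receives almost no signal

# gradient of the non-saturating loss  -log(D)  w.r.t. a is  -(1 - sigmoid(a))
grad_nonsaturating = -(1 - sigmoid(a))  # ≈ -1 → strong signal even when D wins
```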
3. Mode collapse
Generator collapses to producing only one type of output — the one that reliably fools D. Ignores all other modes of the real data distribution. E.g. only generates one type of face.
4. Lack of proper evaluation metric
Hard to know when training is succeeding. Loss values don't directly indicate image quality. Metrics like FID (Fréchet Inception Distance) help but are expensive to compute.
11 · How to Improve GAN — Full List from Slide 42
▼Core training tricks
- Normalize inputs between -1 and +1 (use tanh as last activation of G)
- Use max log(D(G(z))) instead of min log(1-D(G(z))) — "flip labels" trick
- Sample z from Gaussian, NOT uniform
- Construct different mini-batches for real and fake (all real, then all fake)
Architecture tricks
- Avoid sparse gradients: NO ReLU, NO MaxPool in D → use Leaky ReLU + AvgPool instead
- Use label smoothing: real = 0.9, fake = 0.1 (not hard 0 and 1)
- Use DCGAN when you can, or VAE-GAN
Optimiser & stability tricks
- Use stability tricks from RL: checkpoint and replay
- Use the ADAM optimizer for G, and SGD for the discriminator
- Track failures early
- Don't balance loss via statistics (# Gs followed by # Ds)
Data & regularisation tricks
- If you have labels, use them
- Add noise to inputs, decay over time
- Use Dropouts in G in both train AND test phase
12 · VAE-GAN & Diffusion Models (Slides 43 & 45)
▼VAE-GAN (slide 43)
Combine VAE and GAN. VAE provides structured, continuous latent space. GAN discriminator forces sharp realistic outputs. VAE alone = blurry. GAN alone = unstable latent space. Together = sharp + structured. Foundation of many modern image generators.
Diffusion Models (slide 45)
Forward: gradually add Gaussian noise over T steps until image = pure noise. Reverse: train network to denoise step by step. Generate: start from pure noise, run reverse T times. Used in DALL-E 2, Stable Diffusion, Midjourney. More stable training than GANs.
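A minimal NumPy sketch of the forward (noising) process, using the standard DDPM closed form x_t = √ᾱ_t·x_0 + √(1−ᾱ_t)·ε (the schedule values here are made up; real models tune them):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4,))             # a toy "image"

T = 100
betas = np.linspace(1e-4, 0.2, T)      # toy linear noise schedule (assumption)
alpha_bar = np.cumprod(1 - betas)      # cumulative signal-keep factor ᾱ_t

def forward(x0, t):
    """Sample x_t directly from x_0 (closed-form forward process)."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

x_early = forward(x0, 5)      # still mostly the image
x_late = forward(x0, T - 1)   # nearly pure noise: alpha_bar[T-1] ≈ 0
```

The reverse process is what the network learns: predict (and remove) the noise at each step, starting from pure Gaussian noise.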