Applications of RNNs & Attention
How to Translate · seq2seq · Attention · ELMO → Transformers

1 · Embeddings Review (What each dimension represents)
▼E[day] · E[sunny] ≈ 1
Stockholm after 9pm in summer still has daylight (midnight sun), and days in a Stockholm summer are described as sunny. Because these words co-occur constantly in the training text, their embedding vectors point in similar directions → high dot product ≈ 1. The embedding space learns REAL-WORLD RELATIONS.
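A toy illustration of this (the 3-dimensional vectors below are invented for the example, not trained embeddings): words that co-occur constantly end up with nearly parallel vectors, so their normalised dot product approaches 1.

```python
import numpy as np

# Toy 3-d "embeddings" (made up for illustration, not trained vectors)
e_day   = np.array([0.9, 0.4, 0.1])
e_sunny = np.array([0.8, 0.5, 0.2])

# Cosine similarity = dot product of unit-normalised vectors
cos = e_day @ e_sunny / (np.linalg.norm(e_day) * np.linalg.norm(e_sunny))
print(round(cos, 2))  # close to 1 → the vectors point in similar directions
```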
Polysemy Problem (Word2Vec fails here)
From slides: "Will word2vec work well for a word that has different meanings depending on context?"
"I went to the fair last Saturday." → fair = festival
"I do not believe your attitude is fair." → fair = just
Word2Vec gives ONE vector for "fair" — an average of all usages. Captures neither meaning well. Solution: ELMO, BERT → contextual embeddings.
Domain Adaptation
From slides: "Vectors need to be adapted for different domains (medical, legal, etc.)" General embeddings trained on Wikipedia fail for medical text — "cold" means something different in "I have a cold" vs a medical paper. Fine-tune embeddings on domain-specific text.
2 · LSTM & SimpleRNN Keras API — All Variants & Shapes
▼
```python
# W = hidden-to-hidden, U = input-to-hidden, V = hidden-to-output
# The same W, U, V are reused at EVERY time step t
# (the T bracket in the slide figure shows all steps sharing weights)
# Input shape: (batch, T, |x|)

x = LSTM(32)(x)                         # (batch, 32)    ← last-timestep output only
x = LSTM(32, return_sequences=True)(x)  # (batch, T, 32) ← outputs at ALL timesteps
y, h, c = LSTM(32, return_state=True)(x)
# y = last output (= h), h = last hidden state, c = last cell state
x = LSTM(32)(x, initial_state=[h, c])
# pass the encoder's [h, c] as the decoder's initial state (seq2seq handoff)
```
```python
x = Dense(32)(x)
x = Dense(64)(x)

# x: (B, T, E)
x = LSTM(32, return_sequences=True)(x)  # → (B, T, 32)
x = LSTM(64)(x)                         # → (B, 64)

# Making a 2*T-length sequence:
# x: (B, T, E)
y, h, c = LSTM(32, return_state=True)(x)  # y, h, c each (B, 32)
x = LSTM(32)(x, initial_state=[h, c])     # → (B, 32)
```
3 · Sentiment Analysis — N→1 Architecture
▼Architecture (N→1)
Input x: (B, T, F)
LSTM[0] → LSTM[t] → LSTM[T-1]
Only y[T-1] is used (last output)
→ Dense → pos/neg
N→1: read entire sequence, output one label at the end.
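A minimal Keras sketch of this N→1 architecture (the feature size 64 and hidden size 32 below are arbitrary choices for the example):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(None, 64))            # (B, T, F): variable-length sequence, F=64
h = layers.LSTM(32)(inp)                        # no return_sequences → only y[T-1] kept: (B, 32)
out = layers.Dense(1, activation='sigmoid')(h)  # single pos/neg score
model = Model(inp, out)

model.compile(optimizer='adam', loss='binary_crossentropy')
```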
Real Example from Slides
"Joker is brutal, dark and a compelling origin story of a man's painful journey to self destruction."
"Led by Phoenix's ferocious, feral performance, this especially dark, gritty comic book movie is a character drama..."
→ Positive (critics liked it despite the dark tone)
4 · Character & Text Generation — Code, LaTeX, HTML, Patents
▼
```python
# x: (B, 1, F) — one token at a time
x = LSTM(32, stateful=True)(x)   # state preserved between batches
# model …                        # (model assembly / training elided on the slide)
model.reset_states()             # reset between independent sequences

# Two sampling strategies:
idx = np.argmax(output)                                # greedy: always pick the max-probability token
idx = np.random.choice(len(output), size=1, p=output)  # stochastic: sample proportionally
```
Character generation
Generate text character by character. Vocabulary = alphabet + punctuation.
Code generation
Karpathy's RNN generating Python/C code. Vocabulary = code tokens.
LaTeX / HTML / Patents
Same pattern — stateful LSTM, one token at a time. Only vocabulary differs.
Key concept: stateful=True
Normally Keras resets LSTM state between batches. stateful=True keeps the state — so the LSTM remembers across batches. Required for long sequence generation where the sequence is longer than one batch. Must call reset_states() manually between independent sequences.
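A toy sketch of the stateful generation loop (10-token vocabulary with random, untrained weights, purely to show the mechanics; batch size must be fixed for `stateful=True`):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

V = 10                                            # toy vocabulary size (assumption)
inp = layers.Input(batch_shape=(1, 1, V))         # batch=1, ONE token per call
h = layers.LSTM(32, stateful=True)(inp)           # state survives across predict() calls
out = layers.Dense(V, activation='softmax')(h)
model = Model(inp, out)

token = np.eye(V, dtype='float32')[None, :1]      # one-hot start token, shape (1, 1, V)
for _ in range(5):                                # generate 5 tokens; state carries over
    p = model.predict(token, verbose=0)[0]
    idx = np.random.choice(V, p=p / p.sum())      # stochastic sampling
    token = np.eye(V, dtype='float32')[None, idx:idx + 1]

model.reset_states()                              # fresh state for the next sequence
```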
5 · seq2seq — Encoder→Decoder Architecture
▼Training mode (Teacher forcing)
Encoder reads source. Decoder receives the ground-truth target words (shifted by one) as input and predicts the next word. This is "teacher forcing" — the decoder always gets the correct previous word, not its own prediction.
Prediction / Inference mode
Encoder reads source. Decoder starts with <START> token. At each step, decoder's own output is fed back as the next input. Stops when <END> token is generated.
```python
# ENCODER
enc_in = Input((T_E, V))
enc_out, h, c = LSTM(H, return_state=True)(enc_in)
# enc_out = last output = h; h = last hidden state; c = last cell state

# DECODER
dec_in = Input((T_D, V))
dec_out = LSTM(H, return_sequences=True)(dec_in, initial_state=[h, c])  # ← encoder state passed here!
output = Dense(V, activation='softmax')(dec_out)
```
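The inference loop can be sketched in plain NumPy with a stubbed decoder step (the `step` function below is a random stand-in for a trained decoder, invented here just to show the feed-back mechanics and the <START>/<END> control flow):

```python
import numpy as np

V, END, MAX_LEN = 8, 7, 20          # toy vocabulary; token 7 plays the role of <END> (assumptions)
rng = np.random.default_rng(0)

def step(token, state):
    """Stand-in for one decoder LSTM step: returns next-token probs and new state."""
    state = np.tanh(state + token / V)           # fake state update
    logits = rng.normal(size=V) + state.sum()
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs, state

state = np.zeros(4)                 # "encoder" (h, c) stand-in
token = 0                           # <START>
decoded = []
for _ in range(MAX_LEN):
    probs, state = step(token, state)
    token = int(np.argmax(probs))   # greedy: decoder's own output fed back as next input
    if token == END:                # stop when <END> is generated
        break
    decoded.append(token)
```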
6 · Bidirectional RNNs
▼
```python
x = Bidirectional(LSTM(64))(x)
# Output shape: (B, 128) ← doubled! 64 from forward + 64 from backward

# Architecture from slides:
# Forward:  LSTM[0] → LSTM[t] → LSTM[N-1] → y[N-1]
# Backward: LSTM[N-1] → LSTM[t] → LSTM[0]  (same inputs, read in reverse)
# x[0], x[t], ..., x[N-1] are fed to both directions
```
When to use Bidirectional
When the FULL sequence is available at inference time (not generation). Good for: sentiment analysis (see whole review), named entity recognition, translation encoder. NOT useful for generation (you don't have future tokens yet).
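A quick shape check of the doubling (sequence length 10 and feature size 8 are arbitrary choices here):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(10, 8))                 # (B, T=10, F=8)
out = layers.Bidirectional(layers.LSTM(64))(inp)  # forward 64 + backward 64, concatenated
model = Model(inp, out)

y = model.predict(np.random.rand(2, 10, 8).astype('float32'), verbose=0)
print(y.shape)  # (2, 128) ← 64 doubled
```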
7 · Image Captioning + Image Captioning with Attention
▼8 · ELMO — Contextual Embeddings
▼How ELMO works
Train a deep bidirectional LSTM on a language model task. Use the internal representations (all layers) as word vectors. The same word "fair" gets different vectors in "county fair" vs "fair treatment" — because the surrounding words change the LSTM's hidden state.
The progression beyond ELMO
From slides: ELMO → Attention → Transformers → BERT → GPT
Each step improved how context is used. Transformers removed the RNN entirely and used attention at every layer. BERT trained bidirectional Transformer. GPT trained autoregressive Transformer. These are the foundations of ChatGPT.
9 · Bottleneck Problem in seq2seq
▼The Problem
The encoder reads the ENTIRE source sequence and must compress ALL information into a SINGLE fixed-size state vector (h,c) — dimension H. For long sentences, this vector can't hold all the detail. Information is lost. Translation quality drops for long sequences.
Why attention solves it
Instead of passing only the final (h,c), pass ALL T_E encoder hidden states. At each decoder step, the decoder can attend to any of the encoder states — not just the last one. Long-range dependencies can be directly accessed.
10 · Attention — All Equations (Slides 53 & 54)
▼Alignment Matrix (Slide 52)
When attention is visualised as a matrix (source words on X-axis, target words on Y-axis), it forms a near-diagonal for simple sentences (word-for-word alignment) and a more complex pattern for sentences with different word orders (e.g. English vs French adjective placement). Bright = high attention weight.
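The equations from slides 53–54 are not transcribed in these notes; for reference, the standard dot-product attention they correspond to (matching the shapes in the Keras code of section 11) is:

```latex
e_{t,i} = s_t \cdot h_i
  \qquad \text{(score: decoder state } t \text{ vs encoder state } i\text{)}

\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T_E} \exp(e_{t,j})}
  \qquad \text{(softmax over encoder steps)}

a_t = \sum_{i=1}^{T_E} \alpha_{t,i}\, h_i
  \qquad \text{(context vector for decoder step } t\text{)}
```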
11 · Attention — Keras Code (Slide 55)
▼
```python
# h: (B, TE, F) — all encoder hidden states
# s: (B, TD, F) — all decoder hidden states
# s[t]: (B, F)  — decoder state at step t

h_p = Permute((2, 1))(h)            # (B, F, TE) ← transpose the encoder states

# (B, TD, TE) = (B, TD, F) @ (B, F, TE)
e = tf.matmul(s, h_p)               # (B, TD, TE) ← attention scores
alpha = tf.nn.softmax(e, axis=-1)   # (B, TD, TE) ← attention weights over encoder steps

# (B, TD, F) = (B, TD, TE) @ (B, TE, F)
a = tf.matmul(alpha, h)             # (B, TD, F) ← context vectors
```
Permute
Transposes the encoder output from (B,TE,F) to (B,F,TE). Needed for matrix multiply to compute dot products between each decoder step and each encoder step.
matmul(s, h_p)
(B,TD,F)·(B,F,TE) = (B,TD,TE). Each of the TD decoder steps scores against all TE encoder steps. One big batch matrix multiply.
matmul(alpha, h)
(B,TD,TE)·(B,TE,F) = (B,TD,F). Weights the encoder states by attention and sums. Output: one context vector per decoder step.
12 · "What We Have Seen So Far" — Tables · Dictionaries · Gates
▼| Classical Computing | Deep Learning Equivalent | Key difference |
|---|---|---|
| Table lookup | Embedding layer | Integer index → dense row. Exact, hard lookup. O(1). |
| Dictionary / Hash map | Attention mechanism | Soft lookup. Query vs all keys → weighted sum of values. Differentiable. |
| if-then-else | LSTM gates / Mixture of Experts | Sigmoid = differentiable switch 0→1. Backprop flows through it. |
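The "soft lookup" row of the table can be made concrete in a few lines of NumPy (a toy two-key store with invented numbers):

```python
import numpy as np

keys   = np.array([[1.0, 0.0], [0.0, 1.0]])      # two stored keys
values = np.array([[10.0], [20.0]])              # their associated values
query  = np.array([0.9, 0.1])                    # close to key 0

scores  = keys @ query                           # dot-product match against ALL keys
weights = np.exp(scores) / np.exp(scores).sum()  # softmax → differentiable "lookup"
answer  = weights @ values                       # weighted SUM of values, not a hard pick
# answer is pulled toward 10 (key 0 matches better) but blended with 20
```

Unlike a hash map, every key contributes to the result, and gradients flow through the weights, which is exactly what attention exploits.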
Autoencoders · VAEs · GANs · Generative Models
How to CREATE · From compression to generation

1 · Autoencoders — Architecture & Math
▼Standard Autoencoder vs VAE (slide 26)
Standard AE: encoder maps x to a SINGLE POINT z in latent space. Latent space is fragmented — different classes map to scattered, disconnected points. No structure.
VAE: encoder maps x to a GAUSSIAN REGION (μ, σ). KL loss forces all regions to overlap near N(0,1). Latent space is smooth and continuous.
Application: Autoencoders
- Dimensionality reduction
- Anomaly detection (high reconstruction error = anomaly)
- Denoising (train on noisy input, clean target)
- Feature learning without labels
2 · Denoising Autoencoders — Full Code
▼
```python
# ENCODER
input_img = Input(shape=(28, 28, 1))  # adapt for channels_first format
x = Conv2D(32, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)
# representation is (7, 7, 32)

# DECODER ← uses UpSampling2D (or Conv2DTranspose)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)  # ← tf.keras.layers.Conv2DTranspose on the slide
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
```
Conv2DTranspose vs UpSampling2D
UpSampling2D: simple nearest-neighbour upscaling (no learned parameters). Conv2DTranspose: learned transposed convolution — can produce sharper upsampling. Slide labels the UpSampling block as "tf.keras.layers.Conv2DTranspose" — meaning use Conv2DTranspose in practice for better quality.
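The difference is easy to verify directly (the filter count 32 mirrors the code above; both layers upsample 7×7 → 14×14, but only Conv2DTranspose has learnable weights):

```python
import tensorflow as tf
from tensorflow.keras import layers

up = layers.UpSampling2D((2, 2))
ct = layers.Conv2DTranspose(32, (3, 3), strides=2, padding='same')

x = tf.zeros((1, 7, 7, 32))
print(up(x).shape, ct(x).shape)              # both produce (1, 14, 14, 32)
print(up.count_params(), ct.count_params())  # 0 vs learned kernel + bias
```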
3 · Variational Autoencoders — Architecture & Intuition
▼Generative use (slide 23)
z ~ N(0, 1) ← sample from standard normal
x = Decoder(z) ← generate new image
Because KL loss forces the latent space toward N(0,1), any randomly sampled z maps to a valid image. This is generation.
Latent space structure (slide 31)
Left plot: scattered clusters (standard AE style). Right plot: overlapping, continuous cloud (VAE style). The VAE latent space is smooth — travel in any direction and get a valid image. This is what enables interpolation and feature arithmetic.
4 · Reparameterisation Trick — Code from Slides
▼
```python
def sampling(args):
    z_mean, z_log_sigma = args
    epsilon = K.random_normal(shape=(batch_size, latent_dim),
                              mean=0., stddev=epsilon_std)  # (old Keras spelled this std=)
    return z_mean + K.exp(z_log_sigma) * epsilon
    # z = μ + exp(log σ) * ε = μ + σ * ε

# note: "output_shape" isn't necessary with the TensorFlow backend,
# so you could write Lambda(sampling)([z_mean, z_log_sigma])
z = Lambda(sampling, output_shape=(latent_dim,))([z_mean, z_log_sigma])
```
Why log_sigma instead of sigma?
The network outputs log σ (`z_log_sigma` in the code) instead of σ directly. Reason: σ is a standard deviation, so it must be positive, but a linear layer can output any real number. Exponentiating the output, σ = exp(log σ), is always positive, so no explicit positivity constraint is needed. (Note the slide code mixes conventions: the sampling function treats `z_log_sigma` as log σ, while the KL loss below treats it as log σ².)
5 · VAE Loss Function — Full Code & All Formulas (Slide 30)
▼
```python
def vae_loss(x, x_decoded_mean):
    xent_loss = objectives.binary_crossentropy(x, x_decoded_mean)
    kl_loss = -0.5 * K.mean(1 + z_log_sigma - K.square(z_mean) - K.exp(z_log_sigma),
                            axis=-1)
    return xent_loss + kl_loss
```
Reconstruction loss (cross-entropy)
binary_crossentropy between original x and reconstructed x_decoded_mean. Measures how well the decoder rebuilt the input. Low = good reconstruction.
KL divergence
-0.5 * (1 + log_σ² − μ² − σ²). KL=0 only when μ=0 and σ²=1. Minimising pushes encoder toward N(0,1). Controls how structured the latent space is.
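A quick numerical check of the per-dimension formula (plain NumPy, toy values):

```python
import numpy as np

def kl(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, 1) ) per latent dimension
    return -0.5 * (1 + log_var - mu**2 - np.exp(log_var))

print(kl(0.0, 0.0))  # 0.0 → exactly N(0,1), no penalty
print(kl(2.0, 0.0))  # 2.0 → moving the mean away from 0 is penalised
```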
6 · ELBO Derivation — Jensen Inequality (Slides 32 & 33)
▼log p(x) = log ∫ p(x|z) p(z) dz = log E_{q(z|x)}[ p(x|z) p(z) / q(z|x) ]
≥ E_{q(z|x)}[log p(x|z)] − KL( q(z|x) ‖ p(z) )    (Jensen's inequality: log E[·] ≥ E[log ·])
= Reconstruction likelihood − KL divergence
7 · How to Add Features to a Picture (Slide 34)
▼Why this works
Because the KL term forced the latent space to be continuous and structured. Every point in the space corresponds to a valid image. Directions are consistent — "smile direction" always adds a smile regardless of who the person is.
8 · Beyond VAE — VQ-VAE & RQ-VAE (Slides 36 & 37)
▼VQ-VAE — Vector Quantized VAE (slide 36)
From slides: "embedding learns | latent encoding space learns"
Uses a DISCRETE latent space — a codebook of K learned vectors instead of continuous z. Encoder output is "snapped" to the nearest codebook entry. Result: higher quality generation, discrete representation (like tokens). Foundation of DALL-E.
RQ-VAE — Residual Quantized VAE (slide 37)
RQ-VAE = Boosted Embeddings + VQ-VAE
Stacks multiple VQ-VAE quantisation stages. Each stage encodes the residual error of the previous one. Achieves much higher quality than single VQ-VAE. Like boosting applied to discrete latent codes.
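A minimal NumPy sketch of the residual-quantisation idea (the tiny codebooks below are made up; a real RQ-VAE learns them):

```python
import numpy as np

def quantize(x, codebook):
    """Snap x to its nearest codebook vector (the VQ step)."""
    idx = np.argmin(np.linalg.norm(codebook - x, axis=1))
    return codebook[idx]

x = np.array([0.9, 0.35])                  # vector to encode
codebooks = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),    # stage 1: coarse codes
    np.array([[-0.1, 0.3], [0.1, -0.3]]),  # stage 2: codes for the residual
]

approx = np.zeros_like(x)
residual = x.copy()
for cb in codebooks:          # each stage quantises what the previous stages missed
    q = quantize(residual, cb)
    approx += q
    residual = residual - q
# two stages approximate x much better than stage 1 alone
```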
9 · GAN — Architecture, Objective & Training Algorithm (Slides 38–40)
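The slide content for this section is not reproduced in these notes; for reference, the standard GAN minimax objective (Goodfellow et al.) that the problems below refer to is:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] +
  \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

D is trained to push V up (classify real vs fake correctly); G is trained to push it down (fool D).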
▼10 · Problems of GANs (Slide 41)
▼1. Hard to achieve Nash equilibrium
G and D are trained alternately. The landscape is non-stationary — each update changes what the other network is optimising. Convergence is not guaranteed.
2. Vanishing gradient
Early in training D is very good (G is terrible), so D(G(z)) ≈ 0. The generator's loss term log(1 − D(G(z))) then sits flat at log 1 = 0: it saturates, its gradient with respect to G's parameters vanishes, and G gets no training signal, so it can't improve.
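The saturation can be seen numerically by treating D's output as a sigmoid over a logit a (the "flip labels" trick from section 11 corresponds to the non-saturating loss below):

```python
import numpy as np

sigmoid = lambda a: 1 / (1 + np.exp(-a))

a = -10.0  # D's logit for a fake sample: D(G(z)) = sigmoid(a) ≈ 0

# gradient of the saturating loss  log(1 - D)  w.r.t. the logit a is  -sigmoid(a)
grad_saturating = -sigmoid(a)           # ≈ 0 → G receives almost no signal

# gradient of the non-saturating loss  -log(D)  w.r.t. a is  -(1 - sigmoid(a))
grad_nonsaturating = -(1 - sigmoid(a))  # ≈ -1 → strong signal even when D wins
```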
3. Mode collapse
Generator collapses to producing only one type of output — the one that reliably fools D. Ignores all other modes of the real data distribution. E.g. only generates one type of face.
4. Lack of proper evaluation metric
Hard to know when training is succeeding. Loss values don't directly indicate image quality. Metrics like FID (Fréchet Inception Distance) help but are expensive to compute.
11 · How to Improve GAN — Full List from Slide 42
▼Core training tricks
- Normalize inputs between -1 and +1 (use tanh as last activation of G)
- Use max log(D(G(z))) instead of min log(1-D(G(z))) — "flip labels" trick
- Sample z from Gaussian, NOT uniform
- Construct different mini-batches for real and fake (all real, then all fake)
Architecture tricks
- Avoid sparse gradients: NO ReLU, NO MaxPool in D → use Leaky ReLU + AvgPool instead
- Use label smoothing: real = 0.9, fake = 0.1 (not hard 0 and 1)
- Use DCGAN when you can, or VAE-GAN
Optimiser & stability tricks
- Use stability tricks from RL: checkpoint and replay
- Use the ADAM optimizer for G, and SGD for the discriminator
- Track failures early
- Don't balance loss via statistics (# Gs followed by # Ds)
Data & regularisation tricks
- If you have labels, use them
- Add noise to inputs, decay over time
- Use Dropouts in G in both train AND test phase
12 · VAE-GAN & Diffusion Models (Slides 43 & 45)
▼VAE-GAN (slide 43)
Combine VAE and GAN. VAE provides structured, continuous latent space. GAN discriminator forces sharp realistic outputs. VAE alone = blurry. GAN alone = unstable latent space. Together = sharp + structured. Foundation of many modern image generators.
Diffusion Models (slide 45)
Forward: gradually add Gaussian noise over T steps until image = pure noise. Reverse: train network to denoise step by step. Generate: start from pure noise, run reverse T times. Used in DALL-E 2, Stable Diffusion, Midjourney. More stable training than GANs.
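A minimal NumPy sketch of the forward (noising) process, using the standard DDPM closed form x_t = √ᾱ_t·x_0 + √(1−ᾱ_t)·ε (the schedule values here are made up; real models tune them):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4,))             # a toy "image"

T = 100
betas = np.linspace(1e-4, 0.2, T)      # toy linear noise schedule (assumption)
alpha_bar = np.cumprod(1 - betas)      # cumulative signal-keep factor ᾱ_t

def forward(x0, t):
    """Sample x_t directly from x_0 (closed-form forward process)."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

x_early = forward(x0, 5)      # still mostly the image
x_late = forward(x0, T - 1)   # nearly pure noise: alpha_bar[T-1] ≈ 0
```

The reverse process is what the network learns: predict (and remove) the noise at each step, starting from pure Gaussian noise.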