LEC 6

Problems, Solutions & Advanced Applications

Transfer Learning · Visualising · Style Transfer · YOLO

1 · Design Principles & Patterns

Your prof gave 13 architecture design patterns. These are the rules top researchers use when building CNNs. Think of them as "if you follow these, your network will likely work well."

4 Core Design Principles

  • Reduce filter sizes — except possibly at the lowest layer. Factorize aggressively (3×3 over 5×5)
  • Use 1×1 convolutions to reduce and expand feature maps (bottlenecks)
  • Use skip connections and/or create multiple paths (ResNet, GoogLeNet)
  • Use auxiliary functions whenever possible (extra loss heads during training)

Why These Principles?

Smaller filters = fewer parameters = less overfitting. 1×1 convs are "free" channel mixers. Skip connections let gradients flow freely. Multiple paths give the network flexibility — it can route information through whichever path helps most.

Like building a city with many roads: if one road jams, traffic finds another route. Skip connections are highway bypasses.
The 13 Design Patterns from the slides
1. Architecture Follows Application — choose the architecture for your task
2. Proliferate Paths — GoogLeNet + ResNet50 style
3. Strive for Simplicity — simpler is faster and better
4. Increase Symmetry — Conv + augmentation
5. Pyramid Shape — H↓, W↓, C↑ as you go deeper
6. Overtrain — train on HARDER problems than the test
7. Cover the Problem Space — augmentation + noise
8. Incremental Feature Construction — delta/residual changes
9. Normalize Layer Inputs — BatchNormalization
10. Input Transition — first Dense/Conv handles raw input
11. Available Resources Guide Layer Widths — accuracy + GPU
12. Summation Joining — ResNet50-like skip/add
13. Down-sampling Transition — MaxPool + strides

Plus 6 training tricks: pre-trained + fine-tuning · freeze-drop-path · cyclical learning rates · bootstrapping noisy labels · use ELUs/GELUs not ReLUs · experiment with init
What does "Pyramid Shape" mean as a design pattern?
✓ Correct! H↓ W↓ C↑ — the spatial dimensions shrink via pooling/strides while the number of feature channels grows. This is the standard CNN pattern.
✗ Not quite. Pyramid = spatial dims SHRINK (H↓, W↓) while channels GROW (C↑). Think of it like zooming out but seeing more "types" of features.
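The pyramid pattern is easy to state in code. A minimal sketch of the shape arithmetic (the stage count and starting sizes are my own picks, not from the slides):

```python
# "Pyramid Shape" in numbers: each stage halves H and W (stride-2
# downsampling) while doubling the channel count C.
# Hypothetical stage count / sizes, chosen for illustration.

def pyramid_shapes(h, w, c, stages):
    """Return (H, W, C) after each stride-2, channel-doubling stage."""
    shapes = [(h, w, c)]
    for _ in range(stages):
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes

shapes = pyramid_shapes(224, 224, 64, stages=4)
# H and W shrink while C grows: (224, 224, 64) -> (14, 14, 1024)
```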

2 · Self-Supervised Learning & Image Augmentation

Sometimes you don't have enough labelled data. Two solutions: (1) Self-supervised learning — use the data itself as supervision. (2) Augmentation — artificially create more training examples.

Self-Supervised Learning

When you don't have labels, you can still learn useful representations. Create a "pretext task" from the data itself — e.g., predict which rotation an image was rotated by, or predict a missing patch. The network learns rich features without any human labels.

Teaching a child by asking them to complete a puzzle — they learn to understand objects without anyone telling them what each piece is.

Image Augmentation (from the slides)

  • Mirrored / flipped images
  • Distorted images (geometric warp)
  • Blurred images
  • Color changes (brightness, contrast, hue)
You have 1000 dog photos. Augmentation gives you 8000+ by flipping, cropping, brightening each one. The network sees more variety → less overfitting.
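A minimal NumPy sketch of the augmentation idea. Only flip and brightness from the slide's list are shown; the function name and parameter ranges are mine:

```python
import numpy as np

# Toy augmentation: random horizontal mirror + random brightness.
# (The slides also list geometric distortion, blur, and other colour
# changes; this sketch keeps just two for brevity.)

def augment(img, rng):
    """Return a randomly flipped / brightened copy of img (H, W, C)."""
    out = img.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                              # horizontal mirror
    out = np.clip(out * rng.uniform(0.8, 1.2), 0.0, 1.0)  # brightness jitter
    return out

rng = np.random.default_rng(0)
img = np.ones((4, 4, 3)) * 0.5
batch = [augment(img, rng) for _ in range(8)]  # 1 photo -> 8 variants
```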
Getting Better Results — from the slides
Better-result formula (binary cross-entropy loss):

L = −[ y log p + (1 − y) log(1 − p) ]

"Getting better results" means optimising this loss over your augmented dataset. More augmented data → more of the input space is covered → the model overfits less and generalises better.

3 · Bias-Variance Revisited + Double Descent

The classical "overfitting is bad" story is incomplete. Your prof showed a modern finding: very large networks can get past the variance peak and generalise again.

Classical View (U-curve)

As model capacity increases:
— Too simple → High Bias (underfitting) → test error HIGH
— Just right → Sweet spot → test error LOW
— Too complex → High Variance (overfitting) → test error HIGH again

This is the standard story. True for classical ML (SVMs, polynomials).

Modern Finding: Double Descent

For very large neural networks, there's a SECOND descent after the overfitting peak:

Small model → underfit (high bias)
Medium model → overfits (high variance)
LARGE model → generalises again! ← new finding

The slide says: "If network is even larger, it seems to be using a smoothing function again." Massive models find smooth solutions that happen to pass cleanly through all training points without wild oscillations.
Why double descent happens (conceptual derivation)
At the "interpolation threshold" (params ≈ data points):
→ only ONE perfect fit exists
→ it wiggles wildly
→ high test error

Beyond the threshold (params >> data points):
→ INFINITELY many perfect fits exist
→ gradient descent finds the MINIMUM-NORM solution
→ min-norm = smoothest → generalises well
Implication: don't stop at the classical sweet spot. In deep learning, bigger is often better!
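The min-norm claim can be checked in a few lines: with more parameters than data points, `np.linalg.pinv` returns the minimum-norm interpolating solution, and any other exact fit has a larger norm. A toy linear-regression sketch (the sizes are arbitrary):

```python
import numpy as np

# Overparameterised least squares: 5 data points, 20 parameters.
# Infinitely many exact fits exist; pinv picks the min-norm one.

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))
y = rng.normal(size=5)

w_min = np.linalg.pinv(X) @ y        # min-norm interpolating solution
assert np.allclose(X @ w_min, y)     # fits the training data exactly

# Any other interpolating solution = w_min + (something in null(X)),
# and it has a strictly larger norm:
null_dir = np.eye(20)[0] - np.linalg.pinv(X) @ (X @ np.eye(20)[0])
w_other = w_min + null_dir
assert np.allclose(X @ w_other, y)   # also fits exactly...
assert np.linalg.norm(w_other) > np.linalg.norm(w_min)  # ...but is "wigglier"
```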

4 · Debugging Networks (Printing Layer Statistics)

When your network isn't learning, the first thing to do is print what's happening INSIDE each layer. Dead filters, vanishing activations, and bad initialisations all show up here.
Code from the slides (imports added)
import numpy as np
from tensorflow.keras.models import Model

# Get intermediate layer output
layer = model.get_layer(layer_name)   # model, layer_name defined elsewhere
debug = Model(inputs=model.inputs, outputs=layer.output)
p = debug.predict(x_train)

# Print min/max stats for each channel/filter
for channel in range(p.shape[-1]):
    print("layer {}[{}] min={:.4f} max={:.4f}".format(
        layer_name, channel,
        np.min(p[..., channel]),
        np.max(p[..., channel])))
The prof showed actual output from a real network. Notice filter [0] vs the others:
Real output from slides — spot the dead filter!
layer conv2d_0_m[0]  min=-0.0030  max=0.0030   ← 🚨 DEAD FILTER — near-zero range!
layer conv2d_0_m[1]  min=-2.4972  max=1.5494   ← healthy
layer conv2d_0_m[2]  min=-2.4191  max=1.7075   ← healthy
layer conv2d_0_m[7]  min=-1.8190  max=-1.4244  ← 🚨 ALL NEGATIVE — dead after ReLU!

Dead Filter

min ≈ max ≈ 0. The filter learned nothing — outputs are always near zero. After ReLU, this will always output 0. Fix: better init, lower learning rate, check for vanishing gradients.

All-Negative Filter

max is negative. After ReLU (max(0,x)), this filter always outputs 0. The filter is "blocked." Fix: check for dying ReLU problem, try Leaky ReLU instead.

Healthy Filter

Has both negative and positive values with reasonable range (say ±2). After ReLU, the positive part survives. This filter is learning real features.

Also: CNNs detect patterns, NOT shapes (from slides)
Key finding [Geirhos et al. 2019]: CNNs are biased toward TEXTURE, not shape. A cat image with dog-like texture → classified as dog. Human babies use shape. CNNs use texture. This is a fundamental limitation. Augmentation with style transfer can help fix this bias.

5 · Breaking CNNs (Adversarial Examples)

CNNs are surprisingly easy to fool. Small, invisible perturbations to an image can make a CNN classify it with high confidence as something completely different.

Fooling a Linear Classifier

The prof showed: to fool a linear classifier, just add a small multiple of the weight vector to the input:

x → x + α·w

α is a tiny number. w is the weight vector. The change is INVISIBLE to the human eye but completely changes the prediction. Why? Because the classifier's decision boundary is linear — the weight vector points DIRECTLY toward misclassification.

Deep Networks Easily Fooled

[Nguyen et al. CVPR 2015]: showed random noise patterns that deep networks classify with 99%+ confidence as meaningful objects.

[Szegedy ICLR 2014]: added imperceptible noise to real images → confident wrong prediction.

A panda photo + tiny noise → 99% confident "gibbon." You can't see any difference. The CNN sees something completely different.
Adversarial example derivation
Original prediction: argmax f(x) = "cat"
Goal: find a perturbation δ such that argmax f(x + δ) = "dog"
Constraint: ||δ||_∞ < ε   (perturbation is imperceptibly small)

Simple method (FGSM — Fast Gradient Sign Method):
δ = ε · sign(∂L/∂x)   ← go in the direction that INCREASES the loss
x_adversarial = x + ε · sign(∇_x L(x, y_true))
This maximises the loss w.r.t. the TRUE label → pushes toward wrong prediction.
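A toy FGSM sketch on a logistic-regression "network" (all numbers are made up). For logistic regression with the true label y, the loss gradient w.r.t. the input is (p − y)·w, and stepping along its sign lowers the true-class confidence:

```python
import numpy as np

# FGSM on a fixed logistic model: perturb the INPUT, not the weights.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.9, -0.4, 0.7])   # fixed model weights (illustrative)
x = np.array([0.2,  0.5, -0.1])  # input; true label y = 1
y = 1.0

p = sigmoid(w @ x)               # confidence in the true class
grad_x = (p - y) * w             # dL/dx for binary cross-entropy

eps = 0.3
x_adv = x + eps * np.sign(grad_x)   # FGSM step: increase the loss

p_adv = sigmoid(w @ x_adv)
# confidence in the true class drops, with |x_adv - x| <= eps everywhere
```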

6 · Transfer Learning

ResNet needs LOTS of training data. But you can reuse weights trained on a big dataset (like ImageNet) for your small dataset. The early layers learn universal features — edges, textures, shapes — that work for ANY image task.

The Problem Without Transfer Learning

Training ResNet from scratch needs millions of images. If you only have 500 photos of your specific product, the network will overfit badly and not generalise.

The slide says: "Can we use the pre-trained weights from cats and horses to train the network to recognize dogs?" Yes — the early features are universal.

The Solution: Freeze + Fine-tune

Take a pre-trained ResNet (trained on ImageNet with 1M+ images). FREEZE all the base layers. Add a new classification head. Train ONLY the new head on your small dataset. Need much less data!

Slide quote: "need less data" when using pre-trained weights.

Pre-trained ResNet (ImageNet)
Freeze base layers
Add new head
Train only head
Works on small dataset!
Transfer learning — Keras code
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# Load pre-trained base (no top/classification head)
base = ResNet50(weights='imagenet', include_top=False)
base.trainable = False  # FREEZE — don't update base weights

# Add YOUR classification head
x = base.output
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu')(x)
output = Dense(num_classes, activation='softmax')(x)

model = Model(base.input, output)
# Now train — only the 2 Dense layers update!

# Optional fine-tuning: unfreeze TOP few layers later
for layer in base.layers[-20:]:
    layer.trainable = True
# Re-compile with very small learning rate, then train again
What "freeze-drop-path" means (from training patterns)
Freeze: base.trainable = False (base layers stay fixed)
Drop: Dropout(0.5) on the new head (prevent overfitting on small data)
Path: use a specific path through the network for gradient flow

Cyclical Learning Rates: oscillate lr between a low and high bound during training. Helps escape local minima.
lr = lr_min + (lr_max − lr_min) × triangle_wave(step)
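The triangular schedule in the cyclical-learning-rate formula can be sketched as follows (the bounds and step size are illustrative, not from the slides):

```python
# Triangular cyclical learning rate: lr rises from lr_min to lr_max
# over step_size steps, then falls back, and repeats.

def triangular_lr(step, lr_min=1e-4, lr_max=1e-2, step_size=100):
    cycle_pos = step % (2 * step_size)
    frac = cycle_pos / step_size          # 0..2 within one full cycle
    tri = frac if frac <= 1.0 else 2.0 - frac   # triangle wave, 0..1
    return lr_min + (lr_max - lr_min) * tri

# step 0 -> lr_min, step 100 -> lr_max (peak), step 200 -> lr_min again
schedule = [triangular_lr(s) for s in range(0, 201, 50)]
```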

7 · Visualizing Networks (4 Methods)

How do you know what your CNN has learned? You can't look at millions of weights. So you use clever tricks to visualise what each layer is doing. The prof covered 4 methods.

Method 1: Occlusion Experiments

[Zeiler & Fergus 2014]: Slide a grey square over different parts of the image and record the network's confidence for the true class at each position. Where confidence DROPS most → that region matters most.

Cover parts of a painting and ask "does covering THIS spot change what it is?" Where the answer is yes → that's the important part.
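A minimal occlusion-map sketch. Here `score_fn` stands in for a real CNN's true-class probability; the toy "model" just reads the top-left corner of the image, so occluding that corner should hurt the score most:

```python
import numpy as np

# Slide a grey patch over the image; record the score at each position.
# Positions where the score DROPS are the important regions.

def occlusion_map(img, score_fn, patch=2, grey=0.5):
    h, w = img.shape
    heat = np.zeros((h - patch + 1, w - patch + 1))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            occluded = img.copy()
            occluded[i:i+patch, j:j+patch] = grey
            heat[i, j] = score_fn(occluded)   # low score = important spot
    return heat

score = lambda im: im[:2, :2].sum()   # toy "model": cares about top-left
img = np.ones((6, 6))
heat = occlusion_map(img, score)
# the minimum of the heat map sits exactly on the "important" region
```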

Method 2: First Filter Visualisation

Just plot the weights of first-layer conv filters as images. They should look like edge/colour detectors. Easy for layer 1 (filters are image-sized). Gets harder for deeper layers (filters in abstract space).

After ReLU: a filter whose pre-activations are always < 0 outputs a constant 0 — instant dead-filter detection!

Method 3: Which Image Maximally Activates a Neuron?

Go through your entire dataset. For each image, record the activation of a specific neuron. Find the top-9 images that cause the HIGHEST activation. These images show you what that neuron "likes."

Key: receptive field becomes larger as you look at layers closer to the output!

Method 4: Gradient Ascent on Image

Start with a random/noise image. Compute gradient of a neuron's activation w.r.t. the INPUT IMAGE. Step in the POSITIVE gradient direction. Repeat. The image evolves into what maximally activates that neuron.

[Yosinski et al. 2015] — "Deep Visualisation." Instead of asking "which real image fires this?" — ask "what IMAGINARY image would fire it hardest?"
Method 4 — Gradient Ascent Maths
loss = MSE( layer[i].output )   ← activation we want to maximise
gradient = ∂loss / ∂x_input     ← how does changing the IMAGE affect it?
x = x + η · gradient            ← move the IMAGE toward MORE activation
Repeat for many iterations, starting from a random image.
If the result looks like noise → the network hasn't learned anything meaningful for that filter. A bad sign.
Also: "What image maximises a class score?" — same method but loss = log probability of the desired class. Produces dream-like images.
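A toy version of the gradient-ascent loop. The "neuron" here is a single fixed linear filter, so its gradient w.r.t. the image is just the filter itself; a real CNN would need backprop through the network to get that gradient:

```python
import numpy as np

# Gradient ascent ON THE IMAGE: start from near-noise, repeatedly step
# toward higher activation of one (toy, linear) neuron.

rng = np.random.default_rng(0)
filt = rng.normal(size=(8, 8))        # stands in for a learned filter
x = rng.normal(size=(8, 8)) * 0.01    # start from a (near-)noise image

def activation(img):
    return float((filt * img).sum())  # toy "neuron" response

eta = 0.1
before = activation(x)
for _ in range(50):
    grad = filt                       # d(activation)/d(image) for a linear neuron
    x = x + eta * grad                # step toward MORE activation
after = activation(x)
# the image now excites the neuron far more than the noise did
```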

8 · Reconstructing Images from CNN Codes

The question from the slides: "Given a CNN code (the internal representation of an image), is it possible to reconstruct the original image?" — Yes. And how well you can reconstruct tells you how much information that layer preserves.
The Reconstruction Problem (Mahendran & Vedaldi, 2014)
Goal: find an image x* such that
1) φ(x*) ≈ φ(x₀)       ← CNN code of x* matches the target code
2) x* "looks natural"   ← add an image-prior regularisation R(x)

Objective:  x* = argmin_x ||φ(x) − φ(x₀)||² + λ·R(x)

φ(x) = the CNN's representation of image x at a particular layer. We optimise over the IMAGE x (not the weights). Start from noise (or the original) and use gradient descent.
We're doing gradient descent on the IMAGE to minimise the difference between its CNN code and the target code.

Experiment 1: Reconstruct from softmax output (1000 class probs)

Start from the 1000 log-probabilities of ImageNet classes and try to reconstruct the original image. Result: very rough, blurry blobs. Most spatial information is GONE by the time you reach softmax — the network only knows "it's a cat" not WHERE the cat is.

Experiment 2: Reconstruct from last pooling layer

Reconstruct from the representation just before the first fully-connected layer (after last pooling). Result: much sharper, original image largely recoverable! Spatial information still exists at this layer.

Early layers = photographs. Last layers = abstract descriptions. You can reconstruct from a photograph but not from "it's a cat with pointy ears."

Key Takeaway for the Exam

Reconstruction quality tells you WHAT each layer preserves. Early CNN layers → full spatial detail → perfect reconstruction. Deep layers → abstract class info → terrible reconstruction. This is why transfer learning works: early layers are universal, deep layers are task-specific.

9 · Neural Style Transfer

Take the CONTENT of one image (a photo) and apply the STYLE of another (a Van Gogh painting). The trick: content lives in activations, style lives in correlations between activations (Gram matrices).

Three images involved

  • C = Content image (the photo)
  • S = Style image (the painting)
  • G = Generated image (what we're creating)

You OPTIMISE G to have the content of C and the style of S. The weights of the CNN stay FROZEN. Only G changes.

What "content" and "style" mean

Content = what objects are where. Captured by neuron activations at a deep layer (relu_3_3). If G's activations match C's at that layer → G has the same content.

Style = textures, colours, brushstroke patterns. Captured by the Gram matrix of activations at multiple layers.

Gram Matrix — how to capture "style"
M[l] = feature map at layer l, shape (Nₗ, Mₗ)
  Nₗ = number of channels (C_out)   ← how many filters
  Mₗ = H_out × W_out                ← spatial positions, flattened

GM[l] = M[l] · M[l]ᵀ   ← shape (Nₗ, Nₗ)

GM[l]ᵢⱼ = how much filter i and filter j activate together across spatial locations. If filter i (vertical edges) often activates where filter j (blue colour) activates → that co-occurrence IS the style.
Gram matrix = pairwise correlations between all filters. This captures texture without capturing WHERE things are → style, not content.
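The Gram computation is two lines of NumPy. The sketch below also checks the "style ignores WHERE" claim: shifting the feature map spatially leaves the Gram matrix unchanged (the values are random, for illustration only):

```python
import numpy as np

# Gram matrix following the slide's shapes: flatten the feature map to
# (N_l channels, M_l spatial positions), then GM = M @ M.T.

def gram(feature_map):
    """feature_map: (C, H, W) activations -> (C, C) Gram matrix."""
    C = feature_map.shape[0]
    M = feature_map.reshape(C, -1)    # (N_l, M_l)
    return M @ M.T                    # (N_l, N_l) filter co-occurrences

rng = np.random.default_rng(0)
F = rng.normal(size=(4, 5, 5))
G = gram(F)                           # symmetric (C, C) matrix

# Rolling the activations spatially permutes positions identically for
# every channel, so the co-occurrence sums (and the Gram) don't change:
G_shift = gram(np.roll(F, 2, axis=2))
```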
Total Loss Function
L_total = α · L_content(C, G) + β · L_style(S, G)

L_content = Σ_l ||activations_C[l] − activations_G[l]||²
L_style   = Σ_l ||GM_S[l] − GM_G[l]||²

α controls the content weight, β the style weight. α = β = 0.5 gives an equal balance. Increase β → more painterly. Increase α → more photo-like.
Style transfer loop — exact code from slides
C_layers = ["relu_3_3"]                         # 1 layer for content
S_layers = ["relu_1_2", "relu_2_2",            # multiple layers for style
            "relu_3_3", "relu_4_3"]
alpha = 0.5
beta = 0.5

for i in range(iterations):
    loss = alpha * C_loss(C, G, C_layers) + \
           beta  * S_loss(S, G, S_layers)
    minimize(loss).change(G)    # ← update IMAGE G, NOT weights!
In neural style transfer, what do you optimise during the loop?
✓ Exactly right! The CNN weights are frozen. You run gradient descent on the PIXELS of G to minimise the combined content + style loss.
✗ The CNN weights stay frozen! Only the pixels of the generated image G are updated. This is gradient descent on an image, not on weights.

10 · YOLO — You Only Look Once

Object detection (finding WHERE objects are and WHAT they are) in a single forward pass. The key insight: divide the image into a grid, make each cell predict boxes and classes simultaneously.

How YOLO Works

  1. Use an all-convolutional network (like GoogLeNet)
  2. Split the image into a grid (e.g., 7×7 cells)
  3. Each cell predicts B bounding boxes + class probabilities
  4. Apply NMS to remove duplicate detections
Instead of sliding a detector window everywhere (slow), YOLO looks at the whole image ONCE and predicts all boxes simultaneously.

Bounding Box Encoding

Each cell predicts, for each box:

pc = confidence (is there an object?)
bx = x center (relative to cell, 0–1)
by = y center (relative to cell, 0–1)
bw = width (relative to cell, CAN be > 1)
bh = height (relative to cell, CAN be > 1)
c₁..cₖ = class probabilities

Grid origin: top-left = (0,0), bottom-right = (1,1)

IoU — Intersection over Union
IoU = Area(B₁ ∩ B₂) / Area(B₁ ∪ B₂)
B₁ = predicted box, B₂ = ground-truth box (or another predicted box).
IoU = 1 → perfect overlap. IoU = 0 → no overlap.
Used as: (1) detection quality metric, and (2) in NMS to remove duplicates.
If IoU > 0.5 with the best box → this box is a duplicate → discard it (NMS).
NMS — Non-Max Suppression (exact algorithm from slides)
Step 1: Discard ALL boxes where pc < 0.6
Step 2: Order the remaining boxes from highest to lowest pc
Step 3: Pick the box with the largest pc (keep it)
Step 4: Discard any remaining box with IoU > 0.5 with the kept box
Step 5: Repeat Steps 3–4 until no boxes remain
Result: at most one box per object. The one with highest confidence survives.
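IoU and the NMS steps above, sketched in NumPy. One assumption: boxes here are corner-encoded (x1, y1, x2, y2) for simplicity, whereas YOLO itself predicts centers and sizes:

```python
import numpy as np

# IoU + NMS with the slides' thresholds: pc >= 0.6 to survive,
# IoU > 0.5 with a kept box means "duplicate, discard".

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, pc_thresh=0.6, iou_thresh=0.5):
    keep = []
    idx = [i for i in np.argsort(scores)[::-1] if scores[i] >= pc_thresh]
    while idx:
        best = idx.pop(0)                 # highest remaining confidence
        keep.append(int(best))
        idx = [i for i in idx if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [(0, 0, 2, 2), (0, 0, 2, 2.2), (5, 5, 7, 7)]
scores = [0.9, 0.8, 0.7]
# box 1 overlaps box 0 heavily -> suppressed; box 2 is separate -> kept
survivors = nms(boxes, scores)
```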
In YOLO, bw and bh (box width and height) are relative to the grid cell. Can they be greater than 1?
✓ Correct! bx, by are always 0–1 (center position within the cell). But bw, bh CAN be > 1 because a large object (e.g., a bus) spans multiple cells.
✗ The slides explicitly say: "bx, by: between 0 and 1. bw, bh: can be greater than 1." A box that spans multiple cells needs bw or bh > 1.
LEC 7

Embeddings, RNNs, LSTMs & GRUs

Word Representations · Word2Vec · Language Models · Sequence Networks

1 · How to Represent Words (4 Methods)

Before a neural network can process text, words need to become numbers. The prof showed 4 ways to do this, from simplest to best. Each has real tradeoffs.
Key insight from the slides: vocabulary can be ANY set of objects, and "words" can be any element of that set — letters, pixels, tokens, genes.
Method              | How it works                                  | Size         | Problem
One-hot / BOW       | [0,0,1,0,0,...] — 1 at the word's position    | |V| = 50,000 | Sparse, no meaning, "cat" and "kitten" unrelated
Counting            | Count how often each word appears in each doc | |V| = 50,000 | Sparse, common words dominate
TF-IDF              | Count × rarity score                          | |V| = 50,000 | Still sparse, no synonymy
Embeddings (dense)  | Learned ~300-dim vectors                      | 300          | Best! Captures meaning and synonymy

Why Dense Vectors Win (from slides)

  • Short vectors → fewer weights to tune in downstream ML
  • May generalise better than storing explicit counts
  • Captures synonymy: "car" and "automobile" get similar vectors
  • Words used in similar contexts get similar vectors automatically
  • "In practice, they work better"

BOW Example (from slides)

Sentence 1: [0,0,1,1,0,1,1,1, ...]
Sentence 2: [1,1,0,0,1,0,1,1, ...]
↑ each position = one vocab word

Doc 1: [0,0,3,1,0,5,1,2, ...]
Doc 2: [3,1,0,0,1,0,6,1, ...]
↑ counts per document

BOW = a bag where you throw all words in. You know what's in the bag, not the order. "I love dogs, not cats" and "I love cats, not dogs" look the same to BOW.

2 · TF-IDF — with the Prof's Exact Example

TF-IDF weights words by how often they appear in a document AND how rare they are across all documents. Common words like "the" get zero weight automatically.
TF-IDF Formula
TF(word, document) = (frequency of word in document) / (number of words in document)
IDF(word) = log( number of documents / number of documents containing word )
TF-IDF(word, document) = TF × IDF
Prof's exact worked example (from slides)
Document 1: "This" + other words = 1+1+2+4 = 8 total words
Document 2: "This" + other words = 1+2+1+1 = 5 total words
There are 2 documents total, and "This" appears in both.

TF("This", Doc1) = 1/8
TF("This", Doc2) = 1/5
IDF("This") = log(2/2) = log(1) = 0    ← "This" is in ALL docs!
TF-IDF("This", Doc1) = 1/8 × 0 = 0     ← useless word, weight = 0
"This" appears in EVERY document → IDF = 0 → TF-IDF = 0. Common stop words are automatically suppressed. Only rare but important words get high weight.
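The worked example can be verified in code. The filler words below are invented to match the slide's word counts (8 words in Doc1, 5 in Doc2, "this" once in each):

```python
import math

# TF-IDF exactly as defined above; "this" appears in both of the two
# documents, so its IDF (and hence TF-IDF) is zero.

def tf(word, doc):
    return doc.count(word) / len(doc)

def idf(word, docs):
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

doc1 = ["this"] + ["w1"] * 1 + ["w2"] * 2 + ["w3"] * 4   # 8 words total
doc2 = ["this"] + ["w4"] * 2 + ["w5"] * 1 + ["w6"] * 1   # 5 words total
docs = [doc1, doc2]
# TF = 1/8 and 1/5, IDF = log(2/2) = 0, so TF-IDF("this") = 0
```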

TF-IDF and PPMI are SPARSE representations

From the slides: "tf-idf and PPMI vectors are long (length |V| = 20,000 to 50,000) and sparse (most elements are zero)." This is why we need dense embeddings — 50,000-dimensional vectors are impractical.

3 · Cosine Similarity — with Visualisation Example

Cosine Similarity Formula
cos(v, w) = (v · w) / (||v|| × ||w||) = dot product / (product of vector lengths)

||v|| = √(Σᵢ vᵢ²)   ← L2 norm (length of the vector)

cos = +1

Vectors point in SAME direction. Words always appear in same contexts. Maximum similarity.

cos = 0

Vectors are ORTHOGONAL. Words have nothing in common. Zero similarity.

cos = -1

Vectors point in OPPOSITE directions. For word frequencies (non-negative values), this can't happen — range is 0 to 1.

Prof's 2D visualisation example (from slides)
2D space with axes: Dim 1 = 'large', Dim 2 = 'data'

Word "digital":     ≈ [1, 1]   (low 'large' count, low 'data' count)
Word "information": ≈ [2, 3]   (medium counts for both)
Word "apricot":     ≈ [6, 1]   (high 'large' count, low 'data' count)

"digital" and "information" have similar angles → higher cosine.
"apricot" points in a different direction → lower cosine with both.
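The 2D example checks out numerically. A sketch with the vectors above:

```python
import numpy as np

# Cosine similarity on the prof's 2D example: similar direction means
# high cosine, regardless of vector length.

def cosine(v, w):
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

digital     = np.array([1.0, 1.0])
information = np.array([2.0, 3.0])
apricot     = np.array([6.0, 1.0])

sim_di = cosine(digital, information)   # ≈ 0.98, similar angle
sim_da = cosine(digital, apricot)       # ≈ 0.81
sim_ia = cosine(information, apricot)   # ≈ 0.68, different direction
# all values land in [0, 1] because the vectors are non-negative
```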
Window size C matters! Small C (±2 words) → syntactic similarity. Large C (±5 words) → semantic/topical similarity. The slides show: C=±2 "Hogwarts" → Sunnydale, Evernight. C=±5 → Dumbledore, Malfoy, halfblood.
For word frequency vectors (all values ≥ 0), what is the range of cosine similarity?
✓ Correct! The slides say "Frequency is non-negative, so cosine range 0–1." If all values are ≥ 0, the dot product is always ≥ 0, so cosine can't be negative.
✗ For general vectors, cosine is [-1, +1]. But for frequency/count vectors (non-negative), it's [0, +1]. The slides explicitly state this.

4 · Word2Vec — Predict, Don't Count

The brilliant insight of Word2Vec: instead of counting how often words appear together, train a classifier to PREDICT whether two words appear near each other. Then throw away the classifier and keep the weights — those weights ARE the embeddings.

The Idea

"Instead of counting how often each word w occurs near 'apricot' — train a classifier on a binary prediction task: Is w likely to show up near 'apricot'?"

We don't care about this task. We'll take the learned classifier weights as the word embeddings.

Self-Supervised Genius

"Brilliant insight: Use running text as implicitly supervised training data!" A word near 'apricot' acts as the gold correct answer — "Is word w likely near apricot?" NO human labels needed. Text itself provides supervision.

Reading billions of sentences, the network figures out that "apricot" and "jam" go together without anyone telling it what apricots or jam are.

Skip-gram: Given center, predict context

Input: target word t = "apricot"
Predict: context words nearby
window = ±2 words
→ positive examples: real neighbors
→ negative examples: random words

CBOW: Given context, predict center

Input: context words around a gap
Predict: the missing center word
"a tablespoon of ___ jam" → predict "apricot"
Average the context vectors

CBOW = fill in the blank. Skip-gram = given a word, guess what surrounded it.
The 4-step Skip-gram Algorithm (from slides)
1. Treat target word + neighboring context word as POSITIVE example
2. Randomly sample other words in lexicon → NEGATIVE samples
3. Use logistic regression to train a classifier distinguishing the two cases
4. Use the learned weights as the embeddings
Prof's exact training data example
Sentence: "... lemon, a tablespoon of apricot jam a pinch ..."

Window size ±2 around the target "apricot":
c1 = "tablespoon", c2 = "of", c3 = "jam", c4 = "a"

Positive pairs: (apricot, tablespoon), (apricot, of), (apricot, jam), (apricot, a)

For k = 2 negative examples per positive pair: add 2 random words (anything but "apricot").
Negative pairs: (apricot, aardvark), (apricot, telephone), ...
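Pair generation for this sentence is a short function. The noise vocabulary below is invented, echoing the slide's "aardvark"/"telephone" placeholders:

```python
import random

# Skip-gram training pairs: positives from a ±2 window around the
# target, negatives sampled from a (made-up) noise vocabulary.

def positive_pairs(tokens, target_idx, window=2):
    lo = max(0, target_idx - window)
    hi = min(len(tokens), target_idx + window + 1)
    return [(tokens[target_idx], tokens[i])
            for i in range(lo, hi) if i != target_idx]

tokens = ["lemon", "a", "tablespoon", "of", "apricot", "jam", "a", "pinch"]
pos = positive_pairs(tokens, tokens.index("apricot"))
# -> (apricot, tablespoon), (apricot, of), (apricot, jam), (apricot, a)

k = 2                                   # negatives per positive example
random.seed(0)
noise_vocab = ["aardvark", "telephone", "zebra", "pinch"]
neg = [("apricot", random.choice(noise_vocab))
       for _ in pos for _ in range(k)]  # 4 positives -> 8 negatives
```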

5 · Skip-gram — All the Maths & Training

Step 1 — Computing P(+|t,c) — turning dot product into probability
Intuition: words likely near each other → similar vectors → high dot product

Similarity(t, c) ∝ t · c   (dot product)

Problem: the dot product ∈ (−∞, +∞) — not a probability!
Solution: pass it through a sigmoid:

P(+|t, c) = σ(t · c) = 1 / (1 + e^(−t·c))
P(−|t, c) = 1 − P(+|t, c)
σ squashes any value to (0,1). High dot product → near 1 (likely context). Low dot product → near 0 (unlikely context).
Step 2 — For ALL context words (independence assumption)
Assume all context words c₁, c₂, ..., c_L are INDEPENDENT given the target t:

P(+|t, c₁, ..., c_L) = Πᵢ P(+|t, cᵢ) = Πᵢ σ(t · cᵢ)

Taking the log (log of a product = sum of logs):

log P(+|t, context) = Σᵢ log σ(t · cᵢ)
Step 3 — Full Objective (Maximize)
For one target word t, with L positive context words and k negative noise words:

J(t) = Σᵢ log P(+|t, cᵢ) + Σⱼ log P(−|t, nⱼ)
     = Σᵢ log σ(t·cᵢ) + Σⱼ log(1 − σ(t·nⱼ))
     = Σᵢ log σ(t·cᵢ) + Σⱼ log σ(−t·nⱼ)
Maximise this over ALL words in the training corpus using gradient descent. This pulls target and context vectors closer, pushes target and noise vectors apart.
Negative Sampling — Noise Words
Instead of sampling noise words by raw frequency P(w), use p_α(w) with α = 3/4:

p_α(w) = P(w)^(3/4) / Σᵤ P(u)^(3/4)

Why α = 3/4? It gives RARE words slightly higher probability.

Example: P(a) = 0.99, P(b) = 0.01
0.99^(3/4) ≈ 0.992,  0.01^(3/4) ≈ 0.032
After normalising: p_α(a) ≈ 0.97, p_α(b) ≈ 0.03
The ratio drops from 0.99/0.01 = 99 to roughly 0.97/0.03 ≈ 32.
The 3/4 power smooths the distribution — less dominated by very common words, more fair to rare words.
The Weight Matrices — Setup
Represent each word as a vector of length D = 300, randomly initialised.

We actually learn TWO matrices:
W = target/word embeddings  (|V| × D)
C = context embeddings      (|V| × D)

Total initial parameters: 2 × |V| × 300 (a target vector AND a context vector per word).
After training: use W and discard C, or average W and C.
Train with gradient descent. Positive pairs → t·c↑ (pull together). Negative pairs → t·n↓ (push apart).
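A single-pair SGNS training loop in NumPy, using the gradient of J = log σ(t·c) + log σ(−t·n) with respect to t, which is (1 − σ(t·c))·c − σ(t·n)·n. Dimensions, scales, and the learning rate are made up for illustration:

```python
import numpy as np

# One target t, one positive context c, one noise word n. Gradient
# ASCENT on J pulls t toward c and pushes it away from n.

rng = np.random.default_rng(0)
D = 8
t = rng.normal(scale=0.5, size=D)   # target embedding (a row of W)
c = rng.normal(scale=0.5, size=D)   # positive context (a row of C)
n = rng.normal(scale=0.5, size=D)   # sampled noise word (a row of C)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta = 0.5
for _ in range(200):
    # dJ/dt = (1 - σ(t·c))·c − σ(t·n)·n
    grad_t = (1 - sigmoid(t @ c)) * c - sigmoid(t @ n) * n
    t = t + eta * grad_t            # ascent: maximise J

# after training: the positive pair is judged likely, the noise pair unlikely
```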

6 · Embedding Properties — Window Size & Analogies

Window size C affects what kind of similarity is captured
C = ±2 (small): nearest words to "Hogwarts" → Sunnydale, Evernight
  → syntactic / nearby-context similarity
C = ±5 (large): nearest words to "Hogwarts" → Dumbledore, Malfoy, halfblood
  → semantic / topical similarity

Small window → words that appear right next to each other grammatically.
Large window → words that appear in similar paragraphs/topics.
Analogy Arithmetic (from slides)
vector('king') − vector('man') + vector('woman') ≈ vector('queen')
vector('Paris') − vector('France') + vector('Italy') ≈ vector('Rome')
Embeddings capture RELATIONAL MEANING. The direction "man→woman" in vector space is the same as "king→queen". The direction "country→capital" is consistent across countries.

Pre-trained embeddings you can download (from slides)

  • Word2Vec (Mikolov et al.) — code.google.com/archive/p/word2vec/
  • FastText — fasttext.cc
  • GloVe (Pennington, Socher, Manning) — nlp.stanford.edu/projects/glove/
Don't train from scratch unless you have massive data. Download pre-trained vectors and fine-tune — same idea as transfer learning for images.

7 · Why We Need RNNs — Limitations of CNNs + Simple RNN

CNNs only look at the CURRENT input. They can't remember what came before. For video, audio, and text — history matters. RNNs fix this by feeding their output back as input at the next step.

Limitations of CNN-based networks (from slides)

  • We only look at the present to make a decision
  • Examples have fixed length

Examples where history matters: video processing, audio, text analysis.

A CNN watching a movie classifies each frame in isolation. It doesn't know the character who just walked in was a villain in frame 5. An RNN does.

The Echo Game (from slides)

Prof's intuition builder: "Select a shift amount from 0 to N=3. Can ML discover the sequence?" You hear a sound and must repeat it 3 steps later. A CNN can't — it only sees one step. An RNN can — it carries the history in its hidden state.

Sliding window analysis: you could use a fixed window, but that's limited. RNN = infinite memory that decays naturally.
Language Modelling — the RNN task (from slides)
Vocabulary V = {w₁, w₂, ..., w_|V|}   (all possible words)

At each step t, predict the next word:

P(x_{t+1} = wᵢ | xₜ, x_{t−1}, ..., x₁)

Given all previous words, what's the probability that the next word is wᵢ? The RNN compresses the entire history into hₜ, then predicts from hₜ.
Simple RNN Equations (from slides — exact notation)
hₜ = tanh( W·hₜ₋₁ + U·xₜ )   ← hidden state update
oₜ = softmax( V·hₜ )         ← output (next-word probabilities)

W = H×H matrix (hidden-to-hidden)    ← memory connections
U = H×F matrix (input-to-hidden)     ← input connections
V = |V|×H matrix (hidden-to-output)

The SAME W, U, V are used at EVERY time step t!
The shared weights mean: the same "rule" for updating memory is used at every step. This is what lets RNNs handle sequences of any length.
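The two equations transcribe directly to NumPy; note the same W, U, V at every step. All sizes are toy values:

```python
import numpy as np

# Simple RNN forward pass: h_t = tanh(W h_{t-1} + U x_t),
# o_t = softmax(V h_t). The same three matrices at every step.

rng = np.random.default_rng(0)
H, F, V_size = 4, 3, 5                        # hidden, input, vocab sizes
W = rng.normal(scale=0.1, size=(H, H))        # hidden-to-hidden
U = rng.normal(scale=0.1, size=(H, F))        # input-to-hidden
V = rng.normal(scale=0.1, size=(V_size, H))   # hidden-to-output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs):
    h = np.zeros(H)
    outs = []
    for x in xs:                              # same weights every step
        h = np.tanh(W @ h + U @ x)
        outs.append(softmax(V @ h))           # next-word distribution
    return outs

xs = [rng.normal(size=F) for _ in range(6)]   # a length-6 "sentence"
outs = rnn_forward(xs)                        # 6 probability vectors
```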
Vanishing/Exploding Gradient Problem (from slides)
During backprop through time (BPTT):

∂hₜ/∂h₁ = Π_{k=1}^{t−1} W · diag(1 − hₖ²)

If ||W|| < 1: the product → 0   (vanishing gradient — memory fades)
If ||W|| > 1: the product → ∞   (exploding gradient)
The gradient must travel backwards through t time steps. Each step multiplies by W. With long sequences, this either vanishes to 0 (network forgets early inputs) or explodes to infinity (training crashes).
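A miniature of the repeated-multiplication problem, using diagonal W so the effect is exact (the tanh term diag(1 − h²) ≤ 1 only shrinks the product further):

```python
import numpy as np

# Gradient magnitude after t steps scales like ||W||^t:
# ||W|| < 1 vanishes, ||W|| > 1 explodes.

H = 16

def grad_norm_after(W, steps):
    g = np.eye(H)
    for _ in range(steps):           # one factor of W per time step
        g = W @ g
    return np.linalg.norm(g)

W_small = 0.5 * np.eye(H)            # ||W|| < 1
W_big   = 1.5 * np.eye(H)            # ||W|| > 1

vanished = grad_norm_after(W_small, 50)   # ~0.5^50: numerically zero
exploded = grad_norm_after(W_big, 50)     # ~1.5^50: astronomically large
```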
RNN Type | Input    | Output      | Example
1→1      | Fixed    | Fixed       | Image classification (no RNN needed)
1→N      | Fixed    | Sequence    | Image captioning
N→1      | Sequence | Fixed       | Sentiment analysis
N→N      | Sequence | Sequence    | Machine translation (seq2seq)
N↔N      | Sequence | Synced seq  | Video frame labelling

8 · LSTM — Long Short-Term Memory (All 4 Gates)

LSTM solves the vanishing gradient problem by introducing a SEPARATE cell state C that carries long-term memory, and GATES that decide what to remember, write, and output. The gradient flows through C almost unchanged.

Two Types of Memory

Cell state C = long-term memory. Like a conveyor belt — information flows along it with minimal interference. Gradients can flow back through it easily.

Hidden state h = short-term memory / output. What gets passed to the next time step AND used as output.

C = your long-term memory (what you learned in school). h = your working memory (what you're actively thinking about right now).

What the gates do

Forget gate f: what to ERASE from cell state

Input gate i: what NEW information to WRITE

Candidate C̃: what new content to potentially add

Output gate o: what to OUTPUT from cell state

Reading a book: forget = stop tracking the character who left the story. Input = remember the new character's name. Output = use current page's info to answer a question.
All 6 LSTM Equations (from slides)
fₜ = σ( Wf·[hₜ₋₁, xₜ] + bf )     ← FORGET gate (0 = forget, 1 = keep)
iₜ = σ( Wi·[hₜ₋₁, xₜ] + bi )     ← INPUT gate (0 = ignore, 1 = write)
C̃ₜ = tanh( Wc·[hₜ₋₁, xₜ] + bc )  ← CANDIDATE (new content to add)
Cₜ = fₜ ⊙ Cₜ₋₁ + iₜ ⊙ C̃ₜ         ← CELL STATE UPDATE
oₜ = σ( Wo·[hₜ₋₁, xₜ] + bo )     ← OUTPUT gate
hₜ = oₜ ⊙ tanh(Cₜ)               ← HIDDEN STATE

⊙ = element-wise (Hadamard) product. σ = sigmoid. [hₜ₋₁, xₜ] = concatenation.

The key: the Cₜ equation uses ONLY addition and element-wise multiplication → the gradient flows back through time with near-constant magnitude. No repeated matrix multiplications!
Why LSTM fixes vanishing gradients
Simple RNN: ∂hₜ/∂h₁ = Πₖ W·diag(1−hₖ²)  ← repeated matrix multiplication → vanishes
LSTM:       ∂Cₜ/∂C₁ = Πₖ fₖ              ← element-wise product of forget gates (≈ 1 for important info)
If the forget gate fₜ ≈ 1, gradients flow back almost unmodified. The cell state C is like a "gradient highway" that bypasses the vanishing problem.
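The "gradient highway" claim can be checked numerically. The sketch below (illustrative numbers only: T = 50 steps, H = 8 units, forget gate fixed at 0.99) compares a product of 50 RNN-style matrix multiplications against a product of 50 LSTM-style forget gates:

```python
import numpy as np

rng = np.random.default_rng(0)
T, H = 50, 8

# Simple RNN: the gradient is a product of T Jacobians involving W.
W = rng.normal(size=(H, H))
W = 0.5 * W / np.linalg.norm(W, 2)   # rescale so spectral norm is exactly 0.5
J = np.eye(H)
for _ in range(T):
    J = J @ W                        # tanh' ≤ 1 would only shrink it further
rnn_grad_norm = np.linalg.norm(J)    # bounded by 0.5**50: vanished

# LSTM: ∂Cₜ/∂C₁ is an element-wise product of forget gates.
f = np.full(H, 0.99)                 # forget gate saturated near 1
lstm_grad = np.ones(H)
for _ in range(T):
    lstm_grad = lstm_grad * f        # 0.99**50 ≈ 0.6 per component
lstm_grad_norm = np.linalg.norm(lstm_grad)
```

After 50 steps the RNN gradient norm is astronomically small while the LSTM's is still of order 1, which is exactly the Π W vs Π f contrast above.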
The LSTM has two memory vectors at each time step. What are they and what's the difference?
✓ Correct! Cₜ = long-term memory (the gradient highway, flows with minimal interference). hₜ = short-term memory and current output (passed to next step AND used as prediction).
✗ The two memory vectors are Cₜ (cell state = long-term memory) and hₜ (hidden state = short-term memory/output). fₜ and iₜ are gates that control what goes into C.

9 · GRU — Gated Recurrent Unit (Simpler LSTM)

GRU is a simplified LSTM. Instead of 3 gates and 2 memory vectors, it has 2 gates and 1 memory vector. Fewer parameters, often similar performance.
GRU Equations
zₜ = σ( Wz·[hₜ₋₁, xₜ] )          ← UPDATE gate (how much old vs new)
rₜ = σ( Wr·[hₜ₋₁, xₜ] )          ← RESET gate (how much past to forget)
h̃ₜ = tanh( W·[rₜ ⊙ hₜ₋₁, xₜ] )   ← candidate new hidden state
hₜ = (1 − zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ   ← new hidden state
No separate cell state C — GRU combines long and short-term into one h. Update gate z interpolates between old h and new candidate.
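For symmetry with the LSTM step, here is the same kind of minimal numpy sketch for one GRU step (biases omitted, matching the equations above; weight shapes H × (H+F) are assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, W):
    """One GRU time step: 2 gates, 1 memory vector."""
    zc = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ zc)                              # update gate
    r = sigmoid(Wr @ zc)                              # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]))
    return (1 - z) * h_prev + z * h_tilde             # interpolate old vs new
```

The last line is the whole trick: z smoothly interpolates between keeping the old h and accepting the new candidate.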

LSTM vs GRU

 | LSTM | GRU
Gates | 3 (f, i, o) | 2 (z, r)
Memory vectors | 2 (C, h) | 1 (h)
Parameters | More | Fewer
Performance | Slightly better on long sequences | Often similar

xLSTM (from slides)

The prof mentioned xLSTM, an extended, modern LSTM from the xLSTM paper; mLSTM, one of its building blocks, was also mentioned. These more recent variants scale better for large language models: conceptually the same gating idea, different implementation details.

GRU : LSTM :: MobileNet : ResNet — simpler, faster, slightly less powerful but often good enough.
ELEN 521 — HANDWRITTEN CHEAT SHEET · Lec 6 & 7 · Copy these to your notes

LEC 6 — Style Transfer

GM[l] = M[l] · M[l]ᵀ  (Gram matrix = style)
M[l] shape: (Nₗ channels, H_o × W_o positions)

L_total = α·L_content + β·L_style
Optimise the IMAGE G, not the weights. The CNN stays frozen.

C_layers = ["relu_3_3"]  (1 layer for content)
S_layers = ["relu_1_2", "relu_2_2", "relu_3_3", "relu_4_3"]
loss = α·C_loss(C, G) + β·S_loss(S, G)
Minimise the loss by updating the pixels of G (gradient descent on the image).
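The Gram-matrix step above is a one-liner once the feature map is flattened. A minimal numpy sketch, assuming a feature map of shape (H, W, C) as produced by a conv layer:

```python
import numpy as np

def gram_matrix(features):
    """Style (Gram) matrix from a conv feature map of shape (H, W, C).

    Flatten spatial positions so M has shape (C, H*W); GM = M · Mᵀ is (C, C).
    It records which channels co-activate, discarding WHERE they activate,
    which is why it captures style but not content layout.
    """
    H, W, C = features.shape
    M = features.reshape(H * W, C).T   # (channels, positions)
    return M @ M.T
```

The result is symmetric and C × C regardless of the image size, so style can be compared between images of different resolutions.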

LEC 6 — YOLO & IoU

Output per cell: [pc, bx, by, bw, bh, c₁..cₖ]
bx, by ∈ [0,1]; bw, bh CAN be > 1 (a box may be larger than its grid cell). pc = confidence.

IoU = Area(B₁ ∩ B₂) / Area(B₁ ∪ B₂)

NMS:
1) discard all boxes with pc < 0.6
2) sort the rest by pc, high → low
3) keep the top box; discard any remaining box with IoU > 0.5 against it
4) repeat from step 3 with the next surviving box
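The IoU formula and the 4 NMS steps above fit in a few lines of Python. This is a plain sketch (boxes assumed to be (x1, y1, x2, y2) corner tuples, which is an assumption; YOLO itself predicts centre/width/height):

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, pc_thresh=0.6, iou_thresh=0.5):
    """Non-max suppression following the 4 steps above; returns kept indices."""
    idx = [i for i in np.argsort(scores)[::-1] if scores[i] >= pc_thresh]
    keep = []
    while idx:
        best = idx.pop(0)                 # highest remaining confidence
        keep.append(best)
        idx = [i for i in idx             # drop boxes overlapping the winner
               if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

Two heavily overlapping detections of the same object collapse to the single highest-confidence box, while a distant box survives untouched.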

LEC 6 — Visualising & Reconstructing

Gradient ascent: x ← x + η · ∂(MSE(layer_output))/∂x
Move the image in the direction that maximises a neuron's activation.

Reconstruction: x* = argmin_x ||φ(x) − φ(x₀)||² + λ·R(x)
φ(x) = CNN code at a chosen layer; R(x) = image prior. Optimise the image x.
Last pooling layer: good reconstruction. Softmax: terrible (info lost).

Adversarial (FGSM): x_adv = x + ε · sign(∂L/∂x)
A tiny perturbation produces a wrong prediction.
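FGSM needs nothing but a gradient of the loss with respect to the input. To keep the sketch self-contained (no CNN), here it is on a hypothetical toy logistic model p = σ(w·x + b), where the cross-entropy gradient w.r.t. x is (p − y)·w; the numbers are made up for illustration:

```python
import numpy as np

def fgsm(x, w, b, y, eps):
    """FGSM on a toy logistic model: step eps in the sign of dL/dx."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    grad = (p - y) * w                 # gradient of cross-entropy w.r.t. input
    return x + eps * np.sign(grad)

w = np.array([2.0, -1.0])
b = 0.0
x = np.array([0.3, 0.1])              # w·x = 0.5 > 0 → predicted class 1
x_adv = fgsm(x, w, b, y=1.0, eps=0.4)
# score flips sign: w·x = 0.5 but w·x_adv = -0.7 → predicted class 0
```

The perturbation is only ±0.4 per coordinate, yet the decision flips, which is the whole point of the attack.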

LEC 6 — Transfer Learning Code

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D

base = ResNet50(weights='imagenet', include_top=False)
base.trainable = False                     # FREEZE the pre-trained backbone
x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation='relu')(x)
out = Dense(N, activation='softmax')(x)    # N = number of target classes

LEC 6 — Debug Code

debug = Model(inputs=model.inputs, outputs=layer.output)
p = debug.predict(x_train)
Check np.min(p[..., ch]) and np.max(p[..., ch]) for each channel ch.
Dead: min ≈ max ≈ 0. Blocked: max < 0. Healthy: has both +ve and −ve values.
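The per-channel min/max check above doesn't actually need Keras; once you have the activation array it is pure numpy. A hypothetical helper (the name `channel_health` and the tolerance are my own, and the dead/blocked/healthy labels follow the criteria above):

```python
import numpy as np

def channel_health(p, tol=1e-7):
    """Classify each channel of an activation array p of shape (N, H, W, C).

    dead:    min ≈ max ≈ 0  (the channel never fires)
    blocked: max < 0        (pre-activation stuck negative)
    healthy: anything else  (the channel has real dynamic range)
    """
    labels = []
    for ch in range(p.shape[-1]):
        lo, hi = np.min(p[..., ch]), np.max(p[..., ch])
        if abs(lo) < tol and abs(hi) < tol:
            labels.append("dead")
        elif hi < 0:
            labels.append("blocked")
        else:
            labels.append("healthy")
    return labels
```

Run it on the `p = debug.predict(x_train)` output to spot layers where most channels come back dead.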

LEC 7 — TF-IDF

TF(w,d) = freq(w,d) / total_words(d)
IDF(w) = log( N_docs / df(w) )
TF-IDF(w,d) = TF × IDF
df(w) = number of docs containing word w. IDF = 0 if a word appears in every doc.
Example: IDF("This") = log(2/2) = 0 → TF-IDF = 0
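The three formulas above in plain Python (a minimal sketch over pre-tokenised documents; the two-document example mirrors the IDF("This") = 0 case):

```python
import math

def tf_idf(docs):
    """TF-IDF scores for a list of tokenised documents."""
    N = len(docs)
    df = {}                                    # document frequency per word
    for d in docs:
        for w in set(d):
            df[w] = df.get(w, 0) + 1
    scores = []
    for d in docs:
        tf = {w: d.count(w) / len(d) for w in set(d)}
        scores.append({w: tf[w] * math.log(N / df[w]) for w in tf})
    return scores

docs = [["this", "movie", "rocks"], ["this", "movie", "stinks"]]
s = tf_idf(docs)
# "this" appears in both docs → IDF = log(2/2) = 0 → TF-IDF = 0
```

Words shared by every document score exactly 0, so only the discriminative words ("rocks", "stinks") carry weight.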

LEC 7 — Cosine Similarity

cos(v,w) = (v·w) / (||v||·||w||)
Range: [−1, +1]. For frequency vectors (non-negative entries): [0, +1].
Small context window C → syntactic similarity. Large C → semantic similarity.
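The formula above, written out in plain Python for two word-count vectors:

```python
import math

def cosine(v, w):
    """cos(v, w) = v·w / (||v|| ||w||), the angle-based similarity above."""
    dot = sum(a * b for a, b in zip(v, w))
    nv = math.sqrt(sum(a * a for a in v))
    nw = math.sqrt(sum(b * b for b in w))
    return dot / (nv * nw)
```

Because it normalises by vector length, a long document and a short one with the same word proportions score 1.0: cosine measures direction, not magnitude.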

LEC 7 — Word2Vec Skip-gram

P(+|t,c) = σ(t·c) = 1/(1+e^(-t·c)) P(-|t,c) = 1 - P(+|t,c)
Objective (maximize): J = Σᵢ log σ(t·cᵢ) + Σⱼ log σ(-t·nⱼ) cᵢ = positive (real) context. nⱼ = noise words.
Noise sampling: p_α(w) ∝ P(w)^(3/4) α=3/4 boosts rare words. Prevents common words dominating.
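The α = 3/4 flattening above is easy to see numerically. A small sketch (counts 1000 vs 10 are made-up illustration values):

```python
import numpy as np

def noise_distribution(counts, alpha=0.75):
    """p_alpha(w) ∝ P(w)^alpha: flattens the unigram distribution."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                 # raw unigram probabilities P(w)
    p_a = p ** alpha                # raise to alpha < 1 → flattens
    return p_a / p_a.sum()          # renormalise

# Rare word's sampling probability rises relative to its raw frequency:
p = noise_distribution([1000, 10])
```

With raw counts 1000 : 10 the rare word has unigram probability ≈ 0.01, but after the 3/4 power its sampling probability roughly triples, so frequent words stop dominating the negative samples.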
Analogy: king - man + woman ≈ queen Paris - France + Italy ≈ Rome

LEC 7 — Simple RNN

hₜ = tanh( W·hₜ₋₁ + U·xₜ )
oₜ = softmax( V·hₜ )
W is H×H, U is H×F, V is |V|×H. The SAME weights are used at every t.

Vanishing: ||W|| < 1 → gradient → 0 (forgets inputs from long ago)
Exploding: ||W|| > 1 → gradient → ∞ (training crashes)
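The simple-RNN recurrence unrolled in numpy, a minimal forward-pass sketch (note the single W and U reused at every step, which is exactly what makes the backward product Π W dangerous):

```python
import numpy as np

def rnn_forward(xs, W, U, h0):
    """Unroll hₜ = tanh(W·hₜ₋₁ + U·xₜ) over a sequence xs.

    The SAME W (H×H) and U (H×F) are applied at every time step.
    """
    h = h0
    hs = []
    for x in xs:
        h = np.tanh(W @ h + U @ x)
        hs.append(h)
    return hs
```

The output projection oₜ = softmax(V·hₜ) would be applied to each element of `hs`; it is omitted here to keep the recurrence itself in focus.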

LEC 7 — LSTM (all 6 equations)

fₜ = σ( Wf·[hₜ₋₁, xₜ] + bf )     FORGET gate
iₜ = σ( Wi·[hₜ₋₁, xₜ] + bi )     INPUT gate
C̃ₜ = tanh( Wc·[hₜ₋₁, xₜ] + bc )  CANDIDATE
Cₜ = fₜ ⊙ Cₜ₋₁ + iₜ ⊙ C̃ₜ         CELL STATE
oₜ = σ( Wo·[hₜ₋₁, xₜ] + bo )     OUTPUT gate
hₜ = oₜ ⊙ tanh(Cₜ)               HIDDEN STATE
⊙ = element-wise. C = long-term memory, h = short-term memory. No vanishing!

LEC 7 — GRU

zₜ = σ( Wz·[hₜ₋₁, xₜ] )          UPDATE gate
rₜ = σ( Wr·[hₜ₋₁, xₜ] )          RESET gate
h̃ₜ = tanh( W·[rₜ ⊙ hₜ₋₁, xₜ] )   CANDIDATE
hₜ = (1 − zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ
Simpler than LSTM: 2 gates, 1 memory vector. Often the same performance.

RNN Input/Output Types

1→1: image classification
1→N: image captioning
N→1: sentiment analysis
N→N: translation
N↔N: video frame labelling (synced)

Language Model

P(x_{t+1} = wᵢ | xₜ, x_{t−1}, ..., x₁)
The RNN compresses the entire history into hₜ, then predicts the next word.
ELEN 521 · Lec 6 & 7 Deep Dive · All content from your uploaded PDFs only · Good luck tomorrow!