Problems, Solutions & Advanced Applications
Transfer Learning · Visualising · Style Transfer · YOLO
1 · Design Principles & Patterns
▼4 Core Design Principles
- Reduce filter sizes — except possibly at the lowest layer. Factorize aggressively (3×3 over 5×5)
- Use 1×1 convolutions to reduce and expand feature maps (bottlenecks)
- Use skip connections and/or create multiple paths (ResNet, GoogleNet)
- Use auxiliary functions whenever possible (extra loss heads during training)
Why These Principles?
Smaller filters = fewer parameters = less overfitting. 1×1 convs are "free" channel mixers. Skip connections let gradients flow freely. Multiple paths give the network flexibility — it can route information through whichever path helps most.
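The arithmetic behind the first two principles can be checked directly. A minimal sketch (the channel counts 64 and 256 are made-up examples): two stacked 3×3 convolutions cover a 5×5 receptive field with fewer parameters, and a 1×1 bottleneck makes a 3×3 convolution on 256 channels far cheaper.

```python
def conv_params(k, c_in, c_out, bias=True):
    """Number of weights in a k x k convolution mapping c_in -> c_out channels."""
    return k * k * c_in * c_out + (c_out if bias else 0)

# Factorisation: two stacked 3x3 convs see a 5x5 receptive field
# but need fewer parameters than one 5x5 conv.
p_5x5 = conv_params(5, 64, 64, bias=False)            # 102400
p_3x3_twice = 2 * conv_params(3, 64, 64, bias=False)  # 73728

# 1x1 bottleneck: squeeze 256 -> 64 channels, do the 3x3 there, expand back.
p_direct = conv_params(3, 256, 256, bias=False)       # 589824
p_bottleneck = (conv_params(1, 256, 64, bias=False)
                + conv_params(3, 64, 64, bias=False)
                + conv_params(1, 64, 256, bias=False))  # 69632
```

The bottleneck version needs under 12% of the parameters of the direct 3×3 on 256 channels, which is exactly why ResNet and GoogLeNet-style blocks use it.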
2 · Self-Supervised Learning & Image Augmentation
▼Self-Supervised Learning
When you don't have labels, you can still learn useful representations. Create a "pretext task" from the data itself — e.g., predict which rotation an image was rotated by, or predict a missing patch. The network learns rich features without any human labels.
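The rotation pretext task can be set up in a few lines. A sketch with NumPy; the 4-class labelling (0°/90°/180°/270°) is an assumption matching the common variant of this task:

```python
import numpy as np

def make_rotation_task(images):
    """Turn unlabeled images into a 4-class pretext dataset.

    Label k means the image was rotated by k * 90 degrees."""
    xs, ys = [], []
    for img in images:
        for k in range(4):
            xs.append(np.rot90(img, k))
            ys.append(k)
    return xs, ys

imgs = [np.arange(16).reshape(4, 4)]   # stand-in for real images
xs, ys = make_rotation_task(imgs)
print(len(xs), ys)  # 4 [0, 1, 2, 3]
```

A network trained to predict `k` must learn what "upright" looks like, which forces it to learn object structure, with no human labels.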
Image Augmentation (from the slides)
- Mirrored / flipped images
- Distorted images (geometric warp)
- Blurred images
- Color changes (brightness, contrast, hue)
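Each of these augmentations is a one-liner on a NumPy image array. A sketch; a real pipeline would use a library such as `tf.image`, and the crude neighbour-averaging "blur" here is just an illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))          # dummy H x W x C image in [0, 1]

flipped = img[:, ::-1, :]              # horizontal mirror
brighter = np.clip(img * 1.2, 0, 1)    # brightness change, clipped to range
# crude blur: average each pixel with its right neighbour
blurred = (img[:, :-1] + img[:, 1:]) / 2
```

Because each transform yields a new valid training example with the same label, a 500-image dataset can behave like a much larger one.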
3 · Bias-Variance Revisited + Double Descent
▼Classical View (U-curve)
As model capacity increases:
— Too simple → High Bias (underfitting) → test error HIGH
— Just right → Sweet spot → test error LOW
— Too complex → High Variance (overfitting) → test error HIGH again
This is the standard story. True for classical ML (SVMs, polynomials).
Modern Finding: Double Descent
For very large neural networks, there's a SECOND descent after the overfitting peak:
Small model → underfit (high bias)
Medium model → overfits (high variance)
LARGE model → generalises again! ← new finding
4 · Debugging Networks (Printing Layer Statistics)
▼import numpy as np
from tensorflow.keras.models import Model

# Get intermediate layer output
layer = model.get_layer(layer_name)
debug = Model(inputs=model.inputs, outputs=layer.output)
p = debug.predict(x_train)

# Print stats for each channel/filter
for channel in range(p.shape[-1]):
    print("layer {}[{}] min={:.4f} max={:.4f}".format(
        layer_name, channel,
        np.min(p[..., channel]), np.max(p[..., channel])))
layer conv2d_0_m[0] min=-0.0030 max=0.0030  ← 🚨 DEAD FILTER — near-zero range!
layer conv2d_0_m[1] min=-2.4972 max=1.5494  ← healthy
layer conv2d_0_m[2] min=-2.4191 max=1.7075  ← healthy
layer conv2d_0_m[7] min=-1.8190 max=-1.4244 ← 🚨 ALL NEGATIVE — dead after ReLU!
Dead Filter
min ≈ max ≈ 0. The filter learned nothing — outputs are always near zero. After ReLU, this will always output 0. Fix: better init, lower learning rate, check for vanishing gradients.
All-Negative Filter
max is negative. After ReLU (max(0,x)), this filter always outputs 0. The filter is "blocked." Fix: check for dying ReLU problem, try Leaky ReLU instead.
Healthy Filter
Has both negative and positive values with reasonable range (say ±2). After ReLU, the positive part survives. This filter is learning real features.
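The three diagnoses above can be automated from the same min/max statistics. A sketch; the threshold `eps` is an assumption you would tune for your network:

```python
import numpy as np

def filter_health(p, eps=0.01):
    """Classify each channel of an activation tensor p (..., C) by min/max."""
    labels = []
    for c in range(p.shape[-1]):
        lo, hi = p[..., c].min(), p[..., c].max()
        if max(abs(lo), abs(hi)) < eps:
            labels.append("dead")      # min ~ max ~ 0: filter learned nothing
        elif hi < 0:
            labels.append("blocked")   # all negative: ReLU always outputs 0
        else:
            labels.append("healthy")
    return labels

p = np.stack([np.full((8, 8), 0.001),                 # dead
              np.linspace(-2, 2, 64).reshape(8, 8),   # healthy
              np.full((8, 8), -1.5)], axis=-1)        # blocked
print(filter_health(p))  # ['dead', 'healthy', 'blocked']
```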
5 · Breaking CNNs (Adversarial Examples)
▼Fooling a Linear Classifier
The prof showed: to fool a linear classifier, just add a small multiple of the weight vector to the input:
x → x + α·w
α is a tiny number. w is the weight vector. The change is INVISIBLE to the human eye but completely changes the prediction. Why? Because the classifier's decision boundary is linear — the weight vector points DIRECTLY toward misclassification.
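For a linear classifier scored by w·x this flip can be demonstrated in two lines. A sketch with made-up weights; the input is constructed so the classifier starts just on the negative side of the boundary:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100)          # weight vector of a linear classifier
x = -0.01 * w                     # an input the classifier scores negative

alpha = 0.05
x_adv = x + alpha * w             # tiny step along the weight vector

# The score shifts by alpha * ||w||^2, crossing the decision boundary.
print(w @ x > 0, w @ x_adv > 0)   # False True: the prediction flips
```

The step size `alpha = 0.05` moves each pixel only slightly, yet the score moves by α·‖w‖², which grows with the input dimension. High-dimensional inputs are what make tiny perturbations so effective.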
Deep Networks Easily Fooled
[Nguyen et al. CVPR 2015]: showed random noise patterns that deep networks classify with 99%+ confidence as meaningful objects.
[Szegedy ICLR 2014]: added imperceptible noise to real images → confident wrong prediction.
The perturbation is found by gradient ascent on the input: it maximises the loss w.r.t. the TRUE label, which pushes the prediction toward a wrong class.
6 · Transfer Learning
▼The Problem Without Transfer Learning
Training ResNet from scratch needs millions of images. If you only have 500 photos of your specific product, the network will overfit badly and not generalise.
The Solution: Freeze + Fine-tune
Take a pre-trained ResNet (trained on ImageNet with 1M+ images). FREEZE all the base layers. Add a new classification head. Train ONLY the new head on your small dataset. Need much less data!
Slide quote: "need less data" when using pre-trained weights.
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# Load pre-trained base (no top/classification head)
base = ResNet50(weights='imagenet', include_top=False)
base.trainable = False  # FREEZE — don't update base weights

# Add YOUR classification head
x = base.output
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu')(x)
output = Dense(num_classes, activation='softmax')(x)
model = Model(base.input, output)
# Now train — only the 2 Dense layers update!

# Optional fine-tuning: unfreeze TOP few layers later
for layer in base.layers[-20:]:
    layer.trainable = True
# Re-compile with a very small learning rate, then train again
7 · Visualizing Networks (4 Methods)
▼Method 1: Occlusion Experiments
[Zeiler & Fergus 2014]: Slide a grey square over different parts of the image and record the network's confidence for the true class at each position. Where confidence DROPS most → that region matters most.
Method 2: First Filter Visualisation
Just plot the weights of first-layer conv filters as images. They should look like edge/colour detectors. Easy for layer 1 (filters are image-sized). Gets harder for deeper layers (filters in abstract space).
After ReLU: some filters will always be 0 (< 0 inputs) → dead filter detection!
Method 3: Which Image Maximally Activates a Neuron?
Go through your entire dataset. For each image, record the activation of a specific neuron. Find the top-9 images that cause the HIGHEST activation. These images show you what that neuron "likes."
Key: receptive field becomes larger as you look at layers closer to the output!
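Method 3 reduces to an argsort over the recorded activations. A sketch; `acts` stands in for one neuron's response to each dataset image:

```python
import numpy as np

def top_k_images(activations, k=9):
    """Indices of the k images that excite the neuron most."""
    order = np.argsort(activations)[::-1]   # sort descending
    return order[:k]

acts = np.array([0.1, 3.2, 0.0, 5.5, 1.1, 4.8])
print(top_k_images(acts, k=3))  # [3 5 1]
```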
Method 4: Gradient Ascent on Image
Start with a random/noise image. Compute gradient of a neuron's activation w.r.t. the INPUT IMAGE. Step in the POSITIVE gradient direction. Repeat. The image evolves into what maximally activates that neuron.
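For a single linear "neuron" a = w·x the gradient w.r.t. the input is just w, so the loop is easy to sketch (a toy example; a real network would backprop the gradient through all layers):

```python
import numpy as np

w = np.array([1.0, -2.0, 0.5])       # toy 'neuron' weights
x = np.zeros(3)                      # start from a blank image
lr = 0.1

for _ in range(100):
    grad = w                         # d(w @ x)/dx = w for a linear neuron
    x = x + lr * grad                # ASCENT: step along +gradient

print(w @ x)                         # activation after 100 steps, ~52.5
```

Each step pushes the input toward whatever excites the neuron; with a deep network the same loop turns noise into dream-like feature images.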
8 · Reconstructing Images from CNN Codes
▼Experiment 1: Reconstruct from softmax output (1000 class probs)
Start from the 1000 log-probabilities of ImageNet classes and try to reconstruct the original image. Result: very rough, blurry blobs. Most spatial information is GONE by the time you reach softmax — the network only knows "it's a cat" not WHERE the cat is.
Experiment 2: Reconstruct from last pooling layer
Reconstruct from the representation just before the first fully-connected layer (after last pooling). Result: much sharper, original image largely recoverable! Spatial information still exists at this layer.
Key Takeaway for the Exam
Reconstruction quality tells you WHAT each layer preserves. Early CNN layers → full spatial detail → perfect reconstruction. Deep layers → abstract class info → terrible reconstruction. This is why transfer learning works: early layers are universal, deep layers are task-specific.
9 · Neural Style Transfer
▼Three images involved
- C = Content image (the photo)
- S = Style image (the painting)
- G = Generated image (what we're creating)
You OPTIMISE G to have the content of C and the style of S. The weights of the CNN stay FROZEN. Only G changes.
What "content" and "style" mean
Content = what objects are where. Captured by neuron activations at a deep layer (relu_3_3). If G's activations match C's at that layer → G has the same content.
Style = textures, colours, brushstroke patterns. Captured by the Gram matrix of activations at multiple layers.
C_layers = ["relu_3_3"]              # 1 layer for content
S_layers = ["relu_1_2", "relu_2_2",  # multiple layers for style
            "relu_3_3", "relu_4_3"]
alpha = 0.5
beta = 0.5

for i in range(iterations):
    loss = alpha * C_loss(C, G, C_layers) + \
           beta * S_loss(S, G, S_layers)
    minimize(loss).change(G)  # ← update IMAGE G, NOT weights!
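The Gram matrix inside S_loss takes only a few lines. A NumPy sketch; the H × W × C feature-map layout is an assumption:

```python
import numpy as np

def gram_matrix(features):
    """C x C matrix of channel-to-channel correlations.

    Spatial position is summed out, which is exactly why the Gram
    matrix captures texture/style but not WHERE things are."""
    h, w, c = features.shape
    f = features.reshape(h * w, c)
    return f.T @ f

feats = np.random.default_rng(0).random((4, 4, 8))  # dummy feature map
g = gram_matrix(feats)
print(g.shape)  # (8, 8)
```

S_loss then compares G's Gram matrices to S's at each style layer, typically with a mean squared error.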
10 · YOLO — You Only Look Once
▼How YOLO Works
- Use an all-convolutional network (like GoogleNet)
- Split the image into a grid (e.g., 7×7 cells)
- Each cell predicts B bounding boxes + class probabilities
- Apply NMS to remove duplicate detections
Bounding Box Encoding
Each cell predicts, for each box:
pc = confidence (is there an object?)
bx = x center (relative to cell, 0–1)
by = y center (relative to cell, 0–1)
bw = width (relative to cell, CAN be > 1)
bh = height (relative to cell, CAN be > 1)
c₁..cₖ = class probabilities
Grid origin: top-left = (0,0), bottom-right = (1,1)
IoU = 1 → perfect overlap. IoU = 0 → no overlap.
Used as: (1) detection quality metric, and (2) in NMS to remove duplicates.
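IoU is a short function over box corners. A sketch; representing boxes as (x1, y1, x2, y2) corners is an assumption, since YOLO itself predicts centers and sizes:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))   # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))   # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0: perfect overlap
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0: no overlap
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, partial overlap
```

NMS then keeps the highest-confidence box and discards any other box whose IoU with it exceeds a threshold.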
Embeddings, RNNs, LSTMs & GRUs
Word Representations · Word2Vec · Language Models · Sequence Networks
1 · How to Represent Words (4 Methods)
▼| Method | How it works | Size | Problem |
|---|---|---|---|
| One-hot / BOW | [0,0,1,0,0,...] — 1 at word's position | \|V\| = 50,000 | Sparse, no meaning, "cat" and "kitten" unrelated |
| Counting | Count how often each word appears in each doc | \|V\| = 50,000 | Sparse, common words dominate |
| TF-IDF | Count × rarity score | \|V\| = 50,000 | Still sparse, no synonymy |
| Embeddings (dense) | Learned ~300-dim vectors | 300 | Best! Captures meaning and synonymy |
Why Dense Vectors Win (from slides)
- Short vectors → fewer weights to tune in downstream ML
- May generalise better than storing explicit counts
- Captures synonymy: "car" and "automobile" get similar vectors
- Words used in similar contexts get similar vectors automatically
- "In practice, they work better"
BOW Example (from slides)
Sentence 1: [0,0,1,1,0,1,1,1, ...]
Sentence 2: [1,1,0,0,1,0,1,1, ...]
↑ each position = one vocab word
Doc 1: [0,0,3,1,0,5,1,2, ...]
Doc 2: [3,1,0,0,1,0,6,1, ...]
↑ counts per document
2 · TF-IDF — with the Prof's Exact Example
▼TF-IDF and PPMI are SPARSE representations
From the slides: "tf-idf and PPMI vectors are long (length |V| = 20,000 to 50,000) and sparse (most elements are zero)." This is why we need dense embeddings — 50,000-dimensional vectors are impractical.
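TF-IDF can be computed by hand on a tiny corpus. A sketch using the standard idf = log(N / df) variant; real implementations differ in log base and smoothing, and the three documents are made-up examples:

```python
import math

docs = [["cat", "sat", "mat"],
        ["cat", "cat", "hat"],
        ["dog", "sat"]]
N = len(docs)

def tf_idf(word, doc):
    tf = doc.count(word)                   # raw count in this document
    df = sum(word in d for d in docs)      # how many documents contain it
    return tf * math.log(N / df)           # frequent-everywhere words shrink

# "cat" appears in 2 of 3 docs, so its idf is low; "hat" is rarer.
print(tf_idf("cat", docs[1]))  # 2 * log(3/2)
print(tf_idf("hat", docs[1]))  # 1 * log(3/1)
```

Note that "hat" outscores "cat" in doc 2 despite appearing half as often: rarity beats raw frequency.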
3 · Cosine Similarity — with Visualisation Example
▼cos = +1
Vectors point in SAME direction. Words always appear in same contexts. Maximum similarity.
cos = 0
Vectors are ORTHOGONAL. Words have nothing in common. Zero similarity.
cos = -1
Vectors point in OPPOSITE directions. For word frequencies (non-negative values), this can't happen — range is 0 to 1.
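The three cases above in NumPy (a sketch with made-up count vectors):

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([3.0, 0.0, 1.0])   # word counts are non-negative
b = np.array([6.0, 0.0, 2.0])   # same direction, different length
c = np.array([0.0, 5.0, 0.0])   # no shared contexts

print(cosine(a, b))   # 1.0: same direction, maximum similarity
print(cosine(a, c))   # 0.0: orthogonal, nothing in common
```

Since counts are non-negative, every dot product is ≥ 0, which is why cosine similarity between frequency vectors lives in [0, 1].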
4 · Word2Vec — Predict, Don't Count
▼The Idea
"Instead of counting how often each word w occurs near 'apricot' — train a classifier on a binary prediction task: Is w likely to show up near 'apricot'?"
We don't care about this task. We'll take the learned classifier weights as the word embeddings.
Self-Supervised Genius
"Brilliant insight: Use running text as implicitly supervised training data!" A word near 'apricot' acts as the gold correct answer — "Is word w likely near apricot?" NO human labels needed. Text itself provides supervision.
Skip-gram: Given center, predict context
Input: target word t = "apricot"
Predict: context words nearby
window = ±2 words
→ positive examples: real neighbors
→ negative examples: random words
CBOW: Given context, predict center
Input: context words around a gap
Predict: the missing center word
"___ apricot ___" → predict "apricot"
Average the context vectors
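Generating skip-gram's positive examples from running text is a window scan. A sketch using the ±2 window from the slide:

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) positive examples from raw text."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                        # skip the target itself
                pairs.append((target, tokens[j]))
    return pairs

sent = "a tablespoon of apricot jam".split()
pairs = skipgram_pairs(sent)
print([c for t, c in pairs if t == "apricot"])
# ['tablespoon', 'of', 'jam']
```

Negative examples are then drawn by pairing each target with random vocabulary words that did NOT appear in its window.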
5 · Skip-gram — All the Maths & Training
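▼Skip-gram with Negative Sampling (SGNS)
The classifier at the heart of skip-gram: the probability that c is a genuine context word of target t is σ(c·t), where σ is the sigmoid. Training maximises this for real (target, context) pairs and minimises it for k randomly sampled negative words. A minimal NumPy sketch of the per-pair loss (the vector dimension and the aligned/anti-aligned example vectors are made up; this is not the full training loop):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(t, c_pos, c_negs):
    """-log sigma(c_pos . t) - sum over negatives of log sigma(-c_neg . t)."""
    loss = -np.log(sigmoid(c_pos @ t))          # pull real context toward t
    for c in c_negs:
        loss -= np.log(sigmoid(-(c @ t)))       # push fake contexts away
    return loss

rng = np.random.default_rng(0)
t = rng.normal(size=50) * 0.1
good = t.copy()       # a context vector aligned with the target
bad = -t              # a context vector pointing away from it

# Aligned real context + anti-aligned negative gives the lower loss.
print(sgns_loss(t, good, [bad]) < sgns_loss(t, bad, [good]))  # True
```

Minimising this loss with SGD over all windows yields the target-vector matrix, which is exactly the embedding table we keep.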
6 · Embedding Properties — Window Size & Analogies
▼Pre-trained embeddings you can download (from slides)
- Word2Vec (Mikolov et al.) — code.google.com/archive/p/word2vec/
- FastText — fasttext.cc
- GloVe (Pennington, Socher, Manning) — nlp.stanford.edu/projects/glove/
7 · Why We Need RNNs — Limitations of CNNs + Simple RNN
▼Limitations of CNN-based networks (from slides)
- We only look at the present to make a decision
- Examples have fixed length
Examples where history matters: video processing, audio, text analysis.
The Echo Game (from slides)
Prof's intuition builder: "Select a shift amount from 0 to N=3. Can ML discover the sequence?" You hear a sound and must repeat it 3 steps later. A CNN can't — it only sees one step. An RNN can — it carries the history in its hidden state.
8 · LSTM — Long Short-Term Memory (All 4 Gates)
▼Two Types of Memory
Cell state C = long-term memory. Like a conveyor belt — information flows along it with minimal interference. Gradients can flow back through it easily.
Hidden state h = short-term memory / output. What gets passed to the next time step AND used as output.
What the gates do
Forget gate f: what to ERASE from cell state
Input gate i: what NEW information to WRITE
Candidate C̃: what new content to potentially add
Output gate o: what to OUTPUT from cell state
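One LSTM time step with the four computations above. A NumPy sketch; the tiny random weights, the stacked-gate weight layout, and the sizes are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, C, W, b):
    """One time step. W maps [h; x] to the 4 stacked gate pre-activations."""
    z = W @ np.concatenate([h, x]) + b
    n = len(h)
    f = sigmoid(z[0*n:1*n])          # forget gate: what to ERASE from C
    i = sigmoid(z[1*n:2*n])          # input gate: what to WRITE
    C_tilde = np.tanh(z[2*n:3*n])    # candidate content
    o = sigmoid(z[3*n:4*n])          # output gate: what to reveal
    C_new = f * C + i * C_tilde      # conveyor-belt update of long-term memory
    h_new = o * np.tanh(C_new)       # short-term memory / output
    return h_new, C_new

n, m = 4, 3                          # hidden size, input size
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * n, n + m)) * 0.1
b = np.zeros(4 * n)
h, C = np.zeros(n), np.zeros(n)
h, C = lstm_step(rng.normal(size=m), h, C, W, b)
print(h.shape, C.shape)              # (4,) (4,)
```

Note the additive form of `C_new`: gradients flowing back through the cell state pass through a multiplication by `f` only, which is what lets long-range information survive.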
9 · GRU — Gated Recurrent Unit (Simpler LSTM)
▼LSTM vs GRU
| | LSTM | GRU |
|---|---|---|
| Gates | 3 (f, i, o) | 2 (z, r) |
| Memory vectors | 2 (C, h) | 1 (h) |
| Parameters | More | Fewer |
| Performance | Slightly better on long sequences | Often similar |
xLSTM (from slides)
The prof mentioned xLSTM, an extended/modern version of LSTM introduced in the xLSTM paper, and mLSTM, the matrix-memory cell used inside it. These are recent variants that scale better for large language models: conceptually the same gating idea, with different implementation details.