diff --git a/docs/machine-learning/deep-learning/attention-mechanisms/multi-head-attention.mdx b/docs/machine-learning/deep-learning/attention-mechanisms/multi-head-attention.mdx
index e69de29..54aab1d 100644
--- a/docs/machine-learning/deep-learning/attention-mechanisms/multi-head-attention.mdx
+++ b/docs/machine-learning/deep-learning/attention-mechanisms/multi-head-attention.mdx
@@ -0,0 +1,110 @@
+---
+title: "Multi-Head Attention: Parallelizing Insight"
+sidebar_label: Multi-Head Attention
+description: "Understanding how multiple attention 'heads' allow Transformers to capture diverse linguistic and spatial relationships simultaneously."
+tags: [deep-learning, attention, multi-head-attention, transformers, nlp]
+---
+
+While [Self-Attention](./self-attention) is powerful, a single attention head often averages out the relationships between words. **Multi-Head Attention** solves this by running multiple self-attention operations in parallel, allowing the model to focus on different aspects of the input simultaneously.
+
+## 1. The Concept: Why Multiple Heads?
+
+If we use only one attention head, the model might focus entirely on the strongest relationship (e.g., the subject of a sentence). However, a word often has multiple relationships:
+* **Head 1:** Might focus on the **Grammar** (Subject-Verb agreement).
+* **Head 2:** Might focus on the **Context** (What does "it" refer to?).
+* **Head 3:** Might focus on the **Visual/Spatial** relations (Is the object "on" or "under" the table?).
+
+By using multiple heads, we allow the model to "attend" to these different representation subspaces at once.
+
+## 2. How it Works: Split, Attend, Concatenate
+
+The process of Multi-Head Attention follows four distinct steps:
+
+1. **Linear Projection (Split):** The input Query ($Q$), Key ($K$), and Value ($V$) are projected into $h$ different, lower-dimensional versions using learned weight matrices.
+2. **Parallel Attention:** We apply the [Scaled Dot-Product Attention](./self-attention#3-the-calculation-process) to each of the $h$ heads independently.
+3. **Concatenation:** The outputs from all heads are concatenated back into a single vector.
+4. **Final Linear Projection:** A final weight matrix ($W^O$) is applied to the concatenated vector to bring it back to the expected output dimension.
+
+## 3. Mathematical Representation
+
+For each head $i$, the attention is calculated as:
+
+$$
+\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
+$$
+
+The final output is the concatenation of these heads multiplied by an output weight matrix:
+
+$$
+\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O
+$$
+
+## 4. Advanced Logic Flow (Mermaid)
+
+The following diagram visualizes how the model splits a single high-dimensional embedding into multiple "heads" to process information in parallel.
+
+```mermaid
+graph TD
+    Input[Input Q, K, V] --> Split{Linear Split into 'h' Heads}
+
+    subgraph Parallel_Heads [Parallel Processing]
+        Head1[Head 1: Scaled Dot-Product]
+        Head2[Head 2: Scaled Dot-Product]
+        HeadN[Head 'h': Scaled Dot-Product]
+    end
+
+    Split --> Head1
+    Split --> Head2
+    Split --> HeadN
+
+    Head1 --> Concat[Concatenate Results]
+    Head2 --> Concat
+    HeadN --> Concat
+
+    Concat --> FinalLinear[Final Linear Projection WO]
+    FinalLinear --> Output[Multi-Head Output]
+
+```
+
+## 5. Key Advantages
+
+* **Ensemble Effect:** It acts like an ensemble of models, where each head learns something unique.
+* **Stable Training:** By dividing the embedding dimension by the number of heads, the dimensionality each head works with stays manageable, preventing the dot products from growing too large.
+* **Resolution:** It improves the "resolution" of the attention map, making it less likely that one dominant word will "wash out" the influence of others.
+
+## 6. Implementation with PyTorch
+
+Using the `nn.MultiheadAttention` module is the standard way to implement this in production.
+
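+Before reaching for the built-in layer, it helps to see the split-attend-concatenate recipe from Sections 2 and 3 written out by hand. The following is a minimal, illustrative sketch (the dimensions are arbitrary, and the code simplifies what the library actually does, e.g. it omits dropout and masking):
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+# Illustrative sizes; any values with embed_dim % num_heads == 0 work
+batch_size, seq_len = 1, 20
+embed_dim, num_heads = 128, 8
+head_dim = embed_dim // num_heads                 # 16 dimensions per head
+
+x = torch.randn(batch_size, seq_len, embed_dim)   # token embeddings
+
+# 1. Learned projections W^Q, W^K, W^V and the output projection W^O
+w_q = nn.Linear(embed_dim, embed_dim)
+w_k = nn.Linear(embed_dim, embed_dim)
+w_v = nn.Linear(embed_dim, embed_dim)
+w_o = nn.Linear(embed_dim, embed_dim)
+
+q, k, v = w_q(x), w_k(x), w_v(x)
+
+# 2. Split into heads: (batch, seq, embed) -> (batch, heads, seq, head_dim)
+def split_heads(t):
+    return t.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
+
+q, k, v = split_heads(q), split_heads(k), split_heads(v)
+
+# 3. Scaled dot-product attention, computed for every head in parallel
+scores = q @ k.transpose(-2, -1) / head_dim ** 0.5   # (batch, heads, seq, seq)
+weights = F.softmax(scores, dim=-1)
+heads = weights @ v                                   # (batch, heads, seq, head_dim)
+
+# 4. Concatenate the heads and apply the final projection W^O
+concat = heads.transpose(1, 2).reshape(batch_size, seq_len, embed_dim)
+output = w_o(concat)
+
+print(output.shape)   # torch.Size([1, 20, 128])
+```
+
+In practice, the optimized built-in module below collapses all of these steps into a single layer.
+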
+```python
+import torch
+import torch.nn as nn
+
+# Parameters
+embed_dim = 128   # Dimension of the model
+num_heads = 8     # Number of parallel attention heads
+# Note: embed_dim must be divisible by num_heads (128/8 = 16 per head)
+
+mha_layer = nn.MultiheadAttention(embed_dim, num_heads)
+
+# Input shape: (sequence_length, batch_size, embed_dim)
+query = torch.randn(20, 1, 128)
+key = torch.randn(20, 1, 128)
+value = torch.randn(20, 1, 128)
+
+# attn_output: the projected result; attn_weights: the attention map
+attn_output, attn_weights = mha_layer(query, key, value)
+
+print(f"Output size: {attn_output.shape}")         # [20, 1, 128]
+print(f"Attention weights: {attn_weights.shape}")  # [1, 20, 20]
+
+```
+
+## References
+
+* **Original Paper:** [Attention Is All You Need (Vaswani et al.)](https://arxiv.org/abs/1706.03762)
+* **Visualizing Attention:** [A Survey of Attention Mechanisms](https://arxiv.org/abs/2101.02257)
+
+---
+
+**Multi-Head Attention is the engine. But how do we organize these engines into a structure that can actually translate languages or generate text?**
\ No newline at end of file
diff --git a/docs/machine-learning/deep-learning/attention-mechanisms/self-attention.mdx b/docs/machine-learning/deep-learning/attention-mechanisms/self-attention.mdx
index e69de29..5c75183 100644
--- a/docs/machine-learning/deep-learning/attention-mechanisms/self-attention.mdx
+++ b/docs/machine-learning/deep-learning/attention-mechanisms/self-attention.mdx
@@ -0,0 +1,110 @@
+---
+title: "Self-Attention: The Core of Transformers"
+sidebar_label: Self-Attention
+description: "Understanding how models weigh the importance of different parts of an input sequence using Queries, Keys, and Values."
+tags: [deep-learning, attention, transformers, nlp, self-attention]
+---
+
+**Self-Attention** (also known as Intra-Attention) is the mechanism that allows a model to look at other words in an input sequence to get a better encoding for the word it is currently processing.
+
+Unlike [RNNs](../rnn/rnn-basics), which process words one by one, Self-Attention allows every word to "talk" to every other word simultaneously, regardless of their distance.
+
+## 1. Why do we need Self-Attention?
+
+Consider the sentence: *"The animal didn't cross the street because **it** was too tired."*
+
+When a model processes the word **"it"**, it needs to know what "it" refers to. Is it the animal or the street?
+* In a standard RNN, if the sentence is long, the model might "forget" about the animal by the time it reaches "it".
+* In **Self-Attention**, the model calculates a score that links "it" strongly to "animal" and weakly to "street".
+
+## 2. The Three Vectors: Query, Key, and Value
+
+To calculate self-attention, we create three vectors from every input word (embedding) by multiplying it by three weight matrices ($W^Q, W^K, W^V$) that are learned during training.
+
+| Vector | Analogy (The Library) | Purpose |
+| :--- | :--- | :--- |
+| **Query ($Q$)** | The topic you are searching for. | Represents the current word looking at other words. |
+| **Key ($K$)** | The label on the spine of the book. | Represents the "relevance" tag of all other words. |
+| **Value ($V$)** | The information inside the book. | Represents the actual content of the word. |
+
+## 3. The Calculation Process
+
+The attention score is calculated through a series of matrix operations:
+
+1. **Dot Product:** We multiply the Query of the current word by the Keys of all other words.
+2. **Scaling:** We divide by the square root of the dimension of the key ($\sqrt{d_k}$) to keep gradients stable.
+3. **Softmax:** We apply a Softmax function to turn scores into probabilities (weights) that sum to 1.
+4. **Weighted Sum:** We multiply the weights by the Value vectors to get the final output for that word.
+
+$$
+\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
+$$
+
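+As a rough sketch, these four steps map almost line-for-line onto tensor operations. The sizes below are purely illustrative, and the `nn.Linear` layers play the role of the learned matrices $W^Q, W^K, W^V$:
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+d_model, d_k, seq_len = 512, 64, 10
+
+x = torch.randn(seq_len, d_model)         # one sentence of 10 token embeddings
+
+# Learned projection matrices W^Q, W^K, W^V
+w_q = nn.Linear(d_model, d_k, bias=False)
+w_k = nn.Linear(d_model, d_k, bias=False)
+w_v = nn.Linear(d_model, d_k, bias=False)
+
+Q, K, V = w_q(x), w_k(x), w_v(x)          # each has shape (10, 64)
+
+scores = Q @ K.T / d_k ** 0.5             # steps 1-2: dot product + scaling -> (10, 10)
+weights = F.softmax(scores, dim=-1)       # step 3: each row sums to 1
+output = weights @ V                      # step 4: weighted sum of the values -> (10, 64)
+
+print(weights[0].sum())                   # sums to 1: a probability distribution
+print(output.shape)                       # torch.Size([10, 64])
+```
+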
+## 4. Advanced Flow Logic (Mermaid)
+
+The following diagram represents how an input embedding is transformed into an Attention output.
+
+```mermaid
+graph TD
+    Input[Input Embedding $$\ X$$] --> WQ[Weight Matrix $$\ W^Q$$]
+    Input --> WK[Weight Matrix $$\ W^K$$]
+    Input --> WV[Weight Matrix $$\ W^V$$]
+
+    WQ --> Q[Query $$\ Q$$]
+    WK --> K[Key $$\ K$$]
+    WV --> V[Value $$\ V$$]
+
+    Q --> Dot[Dot Product $$\ Q·K$$]
+    K --> Dot
+
+    Dot --> Scale["Scale by $$\ 1/\sqrt {d_k}$$"]
+    Scale --> Softmax[Softmax Layer]
+
+    Softmax --> WeightSum[Weighted Sum with $$\ V$$]
+    V --> WeightSum
+
+    WeightSum --> Final[Attention Output]
+
+```
+
+## 5. Multi-Head Attention
+
+In practice, we don't just use one self-attention mechanism. We use **Multi-Head Attention**. This involves running several self-attention calculations (heads) in parallel.
+
+* One head might focus on the **subject-verb** relationship.
+* Another head might focus on **adjectives**.
+* Another head might focus on **contextual references**.
+
+By combining these, the model gets a much richer understanding of the text.
+
+## 6. Implementation with PyTorch
+
+Modern deep learning frameworks provide highly optimized modules for this.
+
+```python
+import torch
+import torch.nn as nn
+
+# Embedding dim = 512, Number of heads = 8
+multihead_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8)
+
+# Input shape: (sequence_length, batch_size, embed_dim)
+query = torch.randn(10, 1, 512)
+key = torch.randn(10, 1, 512)
+value = torch.randn(10, 1, 512)
+
+attn_output, attn_weights = multihead_attn(query, key, value)
+
+print(f"Output shape: {attn_output.shape}") # [10, 1, 512]
+
+```
+
+## References
+
+* **Original Paper:** [Attention Is All You Need (2017)](https://arxiv.org/abs/1706.03762)
+* **The Illustrated Transformer:** [Jay Alammar's Blog](https://jalammar.github.io/illustrated-transformer/)
+* **Harvard NLP:** [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
+
+---
+
+**Self-Attention allows the model to understand the context of a sequence. But how do we stack these layers to build the most powerful models in AI today?**
\ No newline at end of file
diff --git a/docs/machine-learning/deep-learning/attention-mechanisms/transformers.mdx b/docs/machine-learning/deep-learning/attention-mechanisms/transformers.mdx
index e69de29..e1a743c 100644
--- a/docs/machine-learning/deep-learning/attention-mechanisms/transformers.mdx
+++ b/docs/machine-learning/deep-learning/attention-mechanisms/transformers.mdx
@@ -0,0 +1,119 @@
+---
+title: "Transformer Architecture: The Foundation of Modern AI"
+sidebar_label: Transformers
+description: "A comprehensive deep dive into the Transformer architecture, including Encoder-Decoder stacks and Positional Encoding."
+tags: [deep-learning, transformers, nlp, attention, gpt, bert]
+---
+
+Introduced in the 2017 paper *"Attention Is All You Need"*, the **Transformer** shifted the paradigm of sequence modeling. By removing recurrence (RNNs) and convolutions (CNNs) entirely and relying solely on [Self-Attention](./self-attention), Transformers allowed for massive parallelization and state-of-the-art performance in NLP and beyond.
+
+## 1. High-Level Architecture
+
+The Transformer follows an **Encoder-Decoder** structure:
+* **The Encoder (Left):** Maps an input sequence to a sequence of continuous representations.
+* **The Decoder (Right):** Uses the encoder's representation and previous outputs to generate an output sequence, one element at a time.
+
+## 2. The Encoder Stack
+
+An encoder consists of a stack of identical layers (typically 6). Each layer has two sub-layers:
+1. **Multi-Head Self-Attention:** Allows the encoder to look at other words in the input sentence as it encodes a specific word.
+2. **Position-wise Feed-Forward Network (FFN):** A simple fully connected network applied to each position independently and identically.
+
+:::info Key Feature
+Each sub-layer uses **Residual Connections** (Add) followed by **Layer Normalization** (Norm). This is often abbreviated as `Add & Norm`.
+:::
+
+## 3. The Decoder Stack
+
+The decoder also has a stack of identical layers, but it includes a third sub-layer:
+1. **Masked Multi-Head Attention:** Ensures that the prediction for a specific position can only depend on the known outputs at positions before it (preventing the model from "cheating" by looking ahead).
+2. **Encoder-Decoder Attention:** Performs attention over the encoder's output. This helps the decoder focus on relevant parts of the input sequence.
+3. **Feed-Forward Network (FFN):** Similar to the encoder's FFN.
+
+## 4. Positional Encoding
+
+Since Transformers do not use RNNs, they have no inherent sense of the **order** of words. To fix this, we add **Positional Encodings** to the input embeddings. These are vectors that follow a specific mathematical pattern (often sine and cosine functions) to give the model information about the relative or absolute position of words.
+
+$$
+PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})
+$$
+$$
+PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})
+$$
+
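+As a quick, illustrative sketch, the table of encodings defined by these formulas can be generated directly (the sizes are arbitrary):
+
+```python
+import torch
+
+def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
+    """Build the (max_len, d_model) matrix of sine/cosine positional encodings."""
+    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
+    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even indices 2i
+    div_term = 10000 ** (two_i / d_model)                               # 10000^(2i/d_model)
+
+    pe = torch.zeros(max_len, d_model)
+    pe[:, 0::2] = torch.sin(position / div_term)   # even dimensions use sine
+    pe[:, 1::2] = torch.cos(position / div_term)   # odd dimensions use cosine
+    return pe
+
+pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
+print(pe.shape)   # torch.Size([50, 512])
+# These vectors are simply added to the token embeddings before the first layer.
+```
+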
+## 5. Transformer Data Flow (Mermaid)
+
+This diagram visualizes how a single token moves through the Transformer stack.
+
+```mermaid
+graph TD
+    Input[Input Tokens] --> Embed[Input Embedding]
+    Pos[Positional Encoding] --> Embed
+    Embed --> EncStack[Encoder Stack]
+
+    subgraph EncoderLayer [Encoder Layer]
+        SelfAttn[Multi-Head Self-Attention] --> AddNorm1[Add & Norm]
+        AddNorm1 --> FFN[Feed Forward]
+        FFN --> AddNorm2[Add & Norm]
+    end
+
+    EncStack --> DecStack[Decoder Stack]
+
+    subgraph DecoderLayer [Decoder Layer]
+        MaskAttn[Masked Self-Attention] --> AddNorm3[Add & Norm]
+        AddNorm3 --> CrossAttn[Encoder-Decoder Attention]
+        CrossAttn --> AddNorm4[Add & Norm]
+        AddNorm4 --> DecFFN[Feed Forward]
+        DecFFN --> AddNorm5[Add & Norm]
+    end
+
+    DecStack --> Linear[Linear Layer]
+    Linear --> Softmax[Softmax]
+    Softmax --> Output[Predicted Token]
+
+```
+
+## 6. Why Transformers Won
+
+| Feature | RNNs / LSTMs | Transformers |
+| --- | --- | --- |
+| **Processing** | Sequential (Slow) | Parallel (Fast on GPUs) |
+| **Long-Range Dependencies** | Difficult (Vanishing Gradient) | Easy (Direct Attention) |
+| **Scaling** | Hard to scale to massive data | Designed for massive data & parameters |
+| **Example Models** | ELMo | BERT, GPT-4, Llama 3 |
+
+## 7. Simple Implementation (PyTorch)
+
+PyTorch provides a high-level `nn.Transformer` module, but you can also access the individual components:
+
+```python
+import torch
+import torch.nn as nn
+
+# Parameters
+d_model = 512
+nhead = 8
+num_encoder_layers = 6
+
+# Define Encoder Layer
+encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
+# Define Transformer Encoder
+transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_encoder_layers)
+
+# Input shape: (S, N, E) where S is seq_length, N is batch, E is d_model
+src = torch.randn(10, 32, 512)
+out = transformer_encoder(src)
+
+print(f"Output shape: {out.shape}") # [10, 32, 512]
+
+```
+
+## References
+
+* **Original Paper:** [Attention Is All You Need (Vaswani et al.)](https://arxiv.org/abs/1706.03762)
+* **Visual Guide:** [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
+* **DeepLearning.AI:** [Transformer Network (C5W4L06)](https://www.youtube.com/watch?v=AFkGPmU16QA)
+
+---
+
+**The Transformer architecture is the engine. But how do we train it? Does it read the whole internet at once?**
\ No newline at end of file
diff --git a/docs/machine-learning/deep-learning/autoencoders.mdx b/docs/machine-learning/deep-learning/autoencoders.mdx
index e69de29..213364f 100644
--- a/docs/machine-learning/deep-learning/autoencoders.mdx
+++ b/docs/machine-learning/deep-learning/autoencoders.mdx
@@ -0,0 +1,117 @@
+---
+title: "Autoencoders: Self-Supervised Learning"
+sidebar_label: Autoencoders
+description: "Understanding the Encoder-Decoder architecture used for dimensionality reduction and feature learning."
+tags: [deep-learning, unsupervised-learning, autoencoders, neural-networks, compression]
+---
+
+An **Autoencoder** is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a lower-dimensional representation (encoding) for a higher-dimensional dataset, typically for **dimensionality reduction**, **denoising**, or **feature extraction**.
+
+Unlike traditional networks, the "labels" for an autoencoder are the input data itself. It tries to reconstruct its own input.
+
+## 1. The Architecture: The Hourglass Design
+
+An autoencoder is composed of two main parts connected by a "bottleneck":
+
+1. **The Encoder:** This part of the network compresses the input into a latent-space representation. It reduces the dimensionality of the data.
+2. **The Bottleneck (Latent Space):** This is the layer that contains the compressed representation of the input data. It represents the "knowledge" the network has captured.
+3. **The Decoder:** This part of the network aims to reconstruct the input from the latent space representation as closely as possible.
+
+## 2. The Objective Function
+
+The network is trained to minimize the **Reconstruction Loss**, which measures the difference between the original input ($x$) and the reconstructed output ($\hat{x}$).
+
+If the input is continuous, we typically use **Mean Squared Error (MSE)**:
+
+$$
+L(x, \hat{x}) = \|x - \hat{x}\|^2
+$$
+
+## 3. Advanced Structural Logic (Mermaid)
+
+The following diagram illustrates how the information is squeezed through the bottleneck to force the network to prioritize the most important features.
+
+```mermaid
+graph LR
+    subgraph Input_Layer [Input]
+        X1(( ))
+        X2(( ))
+        X3(( ))
+        X4(( ))
+    end
+
+    subgraph Encoder [Encoder]
+        E1(( ))
+        E2(( ))
+    end
+
+    subgraph Bottleneck [Latent Space Z]
+        Z1(( ))
+    end
+
+    subgraph Decoder [Decoder]
+        D1(( ))
+        D2(( ))
+    end
+
+    subgraph Output_Layer [Reconstruction]
+        R1(( ))
+        R2(( ))
+        R3(( ))
+        R4(( ))
+    end
+
+    Input_Layer --> Encoder
+    Encoder --> Bottleneck
+    Bottleneck --> Decoder
+    Decoder --> Output_Layer
+
+```
+
+## 4. Common Types of Autoencoders
+
+| Type | Purpose | Mechanism |
+| --- | --- | --- |
+| **Undercomplete** | Dimensionality Reduction | The latent space is smaller than the input space, forcing compression. |
+| **Denoising (DAE)** | Feature Robustness | Takes a partially corrupted input and learns to recover the original undistorted version. |
+| **Sparse** | Feature Selection | Adds a penalty to the loss function that encourages the network to activate only a small number of neurons. |
+| **Variational (VAE)** | Generative Modeling | Instead of a single point, the encoder predicts a probability distribution (mean and variance) in the latent space. |
+
+## 5. Use Cases
+
+* **Dimensionality Reduction:** A non-linear alternative to PCA (Principal Component Analysis).
+* **Image Denoising:** Removing "grain" or noise from photographs or medical scans.
+* **Anomaly Detection:** If a model is trained to reconstruct "normal" data, it will have a high reconstruction error when it sees "anomalous" data.
+* **Recommendation Systems:** Learning latent user preferences (similar to [Collaborative Deep Learning](./cnn-applications/recommendation-systems)).
+
+## 6. Implementation with Keras
+
+Building a simple undercomplete autoencoder for the MNIST dataset:
+
+```python
+import tensorflow as tf
+from tensorflow.keras import layers, models
+
+# Input size: 784 (28x28 flattened)
+input_img = layers.Input(shape=(784,))
+
+# Encoder: Compress to 32 dimensions
+encoded = layers.Dense(32, activation='relu')(input_img)
+
+# Decoder: Reconstruct back to 784
+decoded = layers.Dense(784, activation='sigmoid')(encoded)
+
+# The Autoencoder Model
+autoencoder = models.Model(input_img, decoded)
+
+autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
+
+```
+
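+Training this model and applying it to the anomaly-detection use case from Section 5 might look like the sketch below. It assumes the `autoencoder` defined above and uses MNIST purely as example data; the 99th-percentile threshold is an arbitrary illustrative choice:
+
+```python
+import numpy as np
+from tensorflow.keras.datasets import mnist
+
+# Load and flatten MNIST digits, scaled to [0, 1]
+(x_train, _), (x_test, _) = mnist.load_data()
+x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
+x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
+
+# The "labels" are the inputs themselves
+autoencoder.fit(x_train, x_train,
+                epochs=10, batch_size=256,
+                validation_data=(x_test, x_test))
+
+# Reconstruction error as an anomaly score: unusual inputs reconstruct poorly
+errors = np.mean((x_test - autoencoder.predict(x_test)) ** 2, axis=1)
+threshold = np.quantile(errors, 0.99)
+print(f"{(errors > threshold).sum()} samples flagged as potential anomalies")
+```
+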
+## References
+
+* **Keras Blog:** [Building Autoencoders in Keras](https://blog.keras.io/building-autoencoders-in-keras.html)
+
+---
+
+**Standard autoencoders are great for compression, but what if you want to generate *new* data that looks like the training set?**
\ No newline at end of file
diff --git a/docs/machine-learning/deep-learning/gans.mdx b/docs/machine-learning/deep-learning/gans.mdx
index e69de29..44330c3 100644
--- a/docs/machine-learning/deep-learning/gans.mdx
+++ b/docs/machine-learning/deep-learning/gans.mdx
@@ -0,0 +1,106 @@
+---
+title: "GANs: Generative Adversarial Networks"
+sidebar_label: GANs
+description: "Understanding the competitive framework between Generators and Discriminators to create realistic synthetic data."
+tags: [deep-learning, gans, generative-ai, computer-vision, neural-networks]
+---
+
+Introduced by Ian Goodfellow in 2014, **Generative Adversarial Networks (GANs)** are a class of machine learning frameworks where two neural networks contest with each other in a game. This framework allows the model to learn how to generate new, synthetic data that is indistinguishable from real data.
+
+## 1. The Adversarial Concept: The Forger and the Detective
+
+A GAN consists of two distinct models that are trained simultaneously through competition:
+
+1. **The Generator ($G$):** Think of this as a **forger**. Its goal is to create realistic images (or data) from random noise to trick the discriminator.
+2. **The Discriminator ($D$):** Think of this as a **detective**. Its goal is to distinguish between "real" data (from the training set) and "fake" data (produced by the generator).
+
+## 2. The Training Process: A Zero-Sum Game
+
+The GAN training process is a "minimax" game where the Generator tries to minimize the probability that the Discriminator is correct, while the Discriminator tries to maximize it.
+
+1. **The Generator** takes random noise as input and produces a synthetic sample.
+2. **The Discriminator** receives both real samples and synthetic samples.
+3. **Feedback Loop:**
+   * If the Detective (D) catches the Forger (G), G learns how to improve its forgery.
+   * If the Forger (G) tricks the Detective (D), D learns how to be a better investigator.
+
+Eventually, the Generator becomes so good that the Discriminator can only guess with 50% accuracy (equivalent to a coin flip).
+
+## 3. Mathematical Objective
+
+The entire system can be described by the following value function $V(D, G)$:
+
+$$
+\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
+$$
+
+* $D(x)$: Discriminator's estimate of the probability that real data $x$ is real.
+* $G(z)$: The Generator's output for a given noise $z$.
+* $D(G(z))$: Discriminator's estimate of the probability that a fake sample is real.
+
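+In practice, this minimax game is trained by alternating gradient steps on $D$ and $G$. The following is a compressed, illustrative sketch of a single training iteration, with tiny placeholder networks and random tensors standing in for a real dataset:
+
+```python
+import torch
+import torch.nn as nn
+
+# Tiny stand-in networks (see Section 7 for slightly larger versions)
+G = nn.Sequential(nn.Linear(100, 784), nn.Tanh())
+D = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())
+
+opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
+opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
+bce = nn.BCELoss()
+
+real_batch = torch.rand(64, 784)      # placeholder for a batch of real data
+real_labels = torch.ones(64, 1)
+fake_labels = torch.zeros(64, 1)
+
+# 1. Discriminator step: push D(x) towards 1 and D(G(z)) towards 0
+z = torch.randn(64, 100)
+fake_batch = G(z).detach()            # detach so this step does not update G
+d_loss = bce(D(real_batch), real_labels) + bce(D(fake_batch), fake_labels)
+opt_d.zero_grad()
+d_loss.backward()
+opt_d.step()
+
+# 2. Generator step: try to fool D into labelling fakes as real
+z = torch.randn(64, 100)
+g_loss = bce(D(G(z)), real_labels)    # the common non-saturating generator loss
+opt_g.zero_grad()
+g_loss.backward()
+opt_g.step()
+```
+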
+## 4. Architectural Flow (Mermaid)
+
+The following diagram illustrates the interaction between the two networks and the data sources.
+
+```mermaid
+graph LR
+    Noise[Random Noise Z] --> Gen[Generator G]
+    Gen --> Fake[Fake Samples]
+    Real[Real Dataset X] --> Disc[Discriminator D]
+    Fake --> Disc
+
+    Disc --> Prediction{Real or Fake?}
+    Prediction -- Error Feedback --> Disc
+    Prediction -- Loss Signal --> Gen
+
+```
+
+## 5. Challenges in Training GANs
+
+Training GANs is notoriously difficult because of the delicate balance required between the two models:
+
+* **Mode Collapse:** The Generator discovers a single "type" of output that tricks the Discriminator and keeps producing only that (e.g., a model supposed to generate all digits only generates the number "7").
+* **Vanishing Gradients:** If the Discriminator is too good, the Generator doesn't get enough feedback to learn.
+* **Convergence:** Unlike standard models, GANs may never reach a stable point, instead oscillating back and forth.
+
+## 6. Popular GAN Variants
+
+| Variant | Key Feature | Use Case |
+| --- | --- | --- |
+| **DCGAN** | Uses Convolutional layers instead of Dense layers. | Generating high-quality images. |
+| **CycleGAN** | Learns to translate images from one domain to another without paired data. | Unpaired domain translation (e.g., horses to zebras, photos to Monet-style paintings). |
+| **StyleGAN** | Allows control over specific "styles" (age, hair color, etc.). | Generating hyper-realistic human faces. |
+| **Pix2Pix** | Conditional GAN for image-to-image translation. | Converting sketches into realistic photos. |
+
+## 7. Implementation Sketch (PyTorch)
+
+```python
+import torch
+import torch.nn as nn
+
+# Simple Discriminator
+discriminator = nn.Sequential(
+    nn.Linear(784, 128),
+    nn.LeakyReLU(0.2),
+    nn.Linear(128, 1),
+    nn.Sigmoid()
+)
+
+# Simple Generator
+generator = nn.Sequential(
+    nn.Linear(100, 256),
+    nn.ReLU(),
+    nn.Linear(256, 784),
+    nn.Tanh() # Outputs pixels between -1 and 1
+)
+
+```
+
+## References
+
+* **Original Paper:** [Generative Adversarial Networks (Goodfellow et al.)](https://arxiv.org/abs/1406.2661)
+* **Google Developers:** [GANs Course](https://developers.google.com/machine-learning/gan)
+* **This Person Does Not Exist:** [A showcase of StyleGAN capabilities](https://thispersondoesnotexist.com/)
+
+---
+
+**GANs are masters of generation, but they are hard to control. What if we wanted a model that can gradually "denoise" an image into existence?**
\ No newline at end of file