deepai/Lab1_2/Lab1&2_Transformers-base.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Cv-9Vzunb_tf"
},
"source": [
"# Import Necessary Library"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "4f-K54nHb-Uq"
},
"outputs": [],
"source": [
"import torch\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"import torch.optim as optim\n",
"import torch.utils.data as data\n",
"import math\n",
"import os\n",
"import urllib.request\n",
"import pandas as pd\n",
"from functools import partial\n",
"from urllib.error import HTTPError\n",
"from datetime import datetime"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BtW5eDFocsMA"
},
"source": [
"# What is Attention?\n",
"\n",
"Attention in neural networks, particularly relevant for sequential tasks, refers to a mechanism that selectively focuses on certain parts of input data. This concept has gained significant interest in recent years. In essence, attention computes a weighted average of elements in a sequence, with the weights being dynamically determined based on the relevance of each element to a specific query. This allows the model to prioritize certain inputs over others.\n",
"\n",
"The attention mechanism consists of four primary components:\n",
"\n",
"* **Query**: A feature vector representing the target of the attention, essentially indicating the information the model seeks within the sequence.\n",
"* **Keys**: Feature vectors corresponding to each input element, describing the content or relevance of the elements. The keys help the model identify which elements to focus on, relative to the query.\n",
"* **Values**: Feature vectors representing the actual content from each input element that the model should aggregate.\n",
"* **Score function**: A function used to calculate attention weights, representing the relevance of each key-query pair. Common implementations include simple operations like the dot product or more complex structures like a small neural network.\n",
"\n",
"The attention mechanism operates by first computing scores between the query and each key using the score function. These scores determine the attention weights through a softmax function, ensuring that they sum to one and are non-negative. The output is then calculated as the weighted sum of the value vectors, with weights corresponding to the calculated attention scores.\n",
"\n",
"Mathematically, this process can be represented as:\n",
"\n",
"$$\n",
"\\alpha_i = \\frac{\\exp\\left(f_{attn}\\left(\\text{key}_i, \\text{query}\\right)\\right)}{\\sum_j \\exp\\left(f_{attn}\\left(\\text{key}_j, \\text{query}\\right)\\right)}, \\hspace{5mm} \\text{out} = \\sum_i \\alpha_i \\cdot \\text{value}_i\n",
"$$\n",
"\n",
"In practice, attention mechanisms can vary based on the choice of queries, the definition of key and value vectors, and the specific score function used. A prominent example is the **self-attention** mechanism used in the Transformer architecture, where each element in a sequence provides its own key, value, and query. The self-attention mechanism allows each element to attend to all elements in the sequence, including itself, resulting in a representation that incorporates information from the entire sequence.\n",
"\n",
"The above explanation provides a conceptual understanding of the attention mechanism, highlighting its components and operational principles without delving into the specific details of any particular implementation, such as the scaled dot product attention used in Transformers."
]
},
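{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal numeric sketch of this weighted-average view (assuming a plain dot product as the score function $f_{attn}$ and made-up tensors): a single query attends over three key/value pairs, and the output is the softmax-weighted sum of the values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch of attention as a weighted average (dot-product score, single query).\n",
"# The tensors below are illustrative only.\n",
"import torch\n",
"import torch.nn.functional as F\n",
"\n",
"query = torch.tensor([1.0, 0.0])                               # what we are looking for\n",
"keys = torch.tensor([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])      # one key per element\n",
"values = torch.tensor([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])  # content to aggregate\n",
"\n",
"scores = keys @ query             # f_attn(key_i, query) as a dot product\n",
"alpha = F.softmax(scores, dim=0)  # non-negative weights summing to one\n",
"out = alpha @ values              # weighted average of the values\n",
"\n",
"print(alpha)  # the key most similar to the query gets the largest weight\n",
"print(out)"
]
},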
{
"cell_type": "markdown",
"metadata": {
"id": "1DFh9Ic8dp-u"
},
"source": [
"### Scaled Dot Product Attention\n",
"\n",
"The scaled dot product attention is a fundamental component of the self-attention mechanism, enabling elements within a sequence to efficiently attend to one another. It operates on queries $Q\\in\\mathbb{R}^{T\\times d_k}$, keys $K\\in\\mathbb{R}^{T\\times d_k}$, and values $V\\in\\mathbb{R}^{T\\times d_v}$, where $T$ represents the sequence length and $d_k$, $d_v$ denote the dimensions of queries/keys and values, respectively.\n",
"\n",
"The mechanism calculates the attention values based on the dot product similarity between each query $Q_i$ and key $K_j$, and scales the results by the square root of the dimensionality of the keys, $d_k$. The formula for this calculation is:\n",
"\n",
"$$\\text{Attention}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V$$\n",
"\n",
"Here, the matrix product $QK^T$ computes the dot product between all pairs of queries and keys, forming a $T\\times T$ matrix where each entry represents the attention score from one element to another. After applying the softmax function, these scores are used as weights to compute a weighted average of the value vectors.\n",
"\n",
"The scaling factor $1/\\sqrt{d_k}$ is critical for maintaining the variance of the attention scores at an appropriate level. Without this scaling, the variance of the dot products could become too large, leading to a situation where the softmax function saturates, with most of its output concentrated on a single element. This would hinder learning by resulting in gradients that are almost zero.\n",
"\n",
"Additionally, the mechanism can include an optional masking step (denoted as `Mask (opt.)` in the diagram), useful in situations like batch processing of sequences of varying lengths. Padding is used to equalize the lengths of sequences, and the mask ensures that the padded positions do not affect the attention calculation, typically by assigning a very low value to these positions in the attention scores.\n",
"\n",
"In summary, the scaled dot product attention efficiently enables each element in a sequence to attend to all others, considering the relevance of each element, and is crucial for models that rely on self-attention, such as Transformers."
]
},
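{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick numerical check of the $1/\\sqrt{d_k}$ argument above, the sketch below (with arbitrary random tensors) compares the softmax of raw dot products against scaled ones for a moderately large $d_k$: the raw scores have variance close to $d_k$ and push the softmax towards a one-hot distribution, while the scaled scores keep it well-behaved."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Why the scores are scaled by 1/sqrt(d_k): for q, k ~ N(0, 1) the dot product\n",
"# has variance d_k, so unscaled scores grow with the key dimensionality.\n",
"import math\n",
"import torch\n",
"import torch.nn.functional as F\n",
"\n",
"torch.manual_seed(0)\n",
"d_k, T = 256, 8\n",
"q = torch.randn(T, d_k)\n",
"k = torch.randn(T, d_k)\n",
"\n",
"raw = q @ k.T                   # variance roughly d_k\n",
"scaled = raw / math.sqrt(d_k)   # variance roughly 1\n",
"\n",
"print(raw.var().item(), scaled.var().item())\n",
"print(F.softmax(raw, dim=-1)[0].max().item())     # close to 1: softmax saturates\n",
"print(F.softmax(scaled, dim=-1)[0].max().item())  # much flatter distribution"
]
},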
{
"cell_type": "markdown",
"metadata": {
"id": "VKvaGxqIdvba"
},
"source": [
"### Implementing Scaled Dot Product Attention\n",
"\n",
"Scaled dot product attention is a core mechanism allowing each element in a sequence to consider all other elements efficiently, which is fundamental in self-attention models like Transformers. Here's a detailed guide to implementing scaled dot product attention, breaking down the components and the steps involved.\n",
"\n",
"#### Inputs to the Attention Mechanism\n",
"The attention function takes three inputs:\n",
"1. **Queries (Q)**: $Q\\in\\mathbb{R}^{T\\times d_k}$, where $T$ is the sequence length and $d_k$ is the dimensionality of the queries and keys.\n",
"2. **Keys (K)**: $K\\in\\mathbb{R}^{T\\times d_k}$.\n",
"3. **Values (V)**: $V\\in\\mathbb{R}^{T\\times d_v}$, where $d_v$ is the dimensionality of the values.\n",
"\n",
"#### Step-by-Step Calculation\n",
"1. **Dot Product of Queries and Keys**: Calculate the dot product between each query and all keys to obtain a measure of compatibility or relevance between each query-key pair. This results in a matrix of shape $T \\times T$, where each element $(i, j)$ represents the dot product between query $i$ and key $j$.\n",
" \n",
" $$\\text{Score Matrix} = QK^T$$\n",
"\n",
"2. **Scaling**: Scale the scores obtained in the previous step by dividing by $\\sqrt{d_k}$ to ensure stable gradients, as larger values of $d_k$ can lead to extremely small gradients, which can slow down learning and model convergence.\n",
"\n",
" $$\\text{Scaled Score Matrix} = \\frac{\\text{Score Matrix}}{\\sqrt{d_k}}$$\n",
"\n",
"3. **Optional Masking**: If masking is required (e.g., for padded positions in a batch of sequences), apply the mask by setting the scores for masked positions to a very large negative value, ensuring that they have minimal impact after the softmax step.\n",
"\n",
"4. **Softmax**: Apply the softmax function to the scaled scores along each row. This step converts the scores into probabilities, indicating the importance of each key relative to each query.\n",
"\n",
" $$\\text{Attention Weights} = \\text{softmax}(\\text{Scaled Score Matrix})$$\n",
"\n",
"5. **Output Calculation**: Multiply the attention weights by the value vectors to obtain the final output. This step computes a weighted average of the value vectors, where the weights are determined by the attention scores.\n",
"\n",
" $$\\text{Output} = \\text{Attention Weights} \\times V$$\n",
"\n",
"#### Implementation Tips\n",
"- **Dimensionality**: Ensure the dimensions of your matrices are correct. Matrix multiplication will not be possible if the inner dimensions do not match.\n",
"- **Numerical Stability**: When implementing the softmax function, ensure numerical stability by subtracting the maximum value in each row of the scores matrix before applying the exponential function.\n",
"- **Batch Processing**: If implementing attention in batch, include an additional batch dimension in your matrices (e.g., $Q\\in\\mathbb{R}^{B\\times T\\times d_k}$ for a batch size of $B$) and ensure your implementation supports this.\n",
"- **Testing**: Verify the correctness of your implementation with simple test cases to ensure it behaves as expected.\n",
"\n",
"This framework should provide a clear structure for students to implement scaled dot product attention, enhancing their understanding of its role and functionality in self-attention models."
]
},
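{
"cell_type": "markdown",
"metadata": {},
"source": [
"The optional masking step (step 3 above) is usually realized by writing a large negative number into the masked positions of the scaled score matrix before the softmax, so that they receive essentially zero attention weight. A small sketch, assuming the convention that a mask value of 1 means attend and 0 means ignore:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of the optional masking step: masked positions get a large negative score,\n",
"# so the softmax assigns them (numerically) zero weight.\n",
"import torch\n",
"import torch.nn.functional as F\n",
"\n",
"scaled_scores = torch.randn(1, 4, 4)         # [Batch, T, T] scaled score matrix\n",
"mask = torch.tensor([[1, 1, 1, 0]]).bool()   # last position is padding\n",
"scaled_scores = scaled_scores.masked_fill(~mask.unsqueeze(1), -1e9)\n",
"\n",
"attn = F.softmax(scaled_scores, dim=-1)\n",
"print(attn[0])  # last column is ~0 in every row: the padded position is never attended to"
]
},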
{
"cell_type": "markdown",
"metadata": {
"id": "jsFoInPLeFk9"
},
"source": [
"# Task: Please implement a scaled dot product function"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2, 2)"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dk = 2\n",
"t = 3\n",
"v = torch.randn(t, dk)\n",
"\n",
"len(v[0]),v.shape[-1]"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(tensor([[5., 5.],\n",
" [5., 5.],\n",
" [5., 5.]]),\n",
" tensor([[5., 5.],\n",
" [5., 5.],\n",
" [5., 5.]]),\n",
" tensor([[5., 5.],\n",
" [5., 5.],\n",
" [5., 5.]]),\n",
" tensor([[5., 5., 5.],\n",
" [5., 5., 5.]]),\n",
" tensor([[50., 50., 50.],\n",
" [50., 50., 50.],\n",
" [50., 50., 50.]]),\n",
" tensor([[35.3553, 35.3553, 35.3553],\n",
" [35.3553, 35.3553, 35.3553],\n",
" [35.3553, 35.3553, 35.3553]]),\n",
" tensor([[0.3333, 0.3333, 0.3333],\n",
" [0.3333, 0.3333, 0.3333],\n",
" [0.3333, 0.3333, 0.3333]]),\n",
" tensor([[5., 5.],\n",
" [5., 5.],\n",
" [5., 5.]]),\n",
" 2,\n",
" tensor([5., 5.]))"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dk = 2\n",
"t = 3\n",
"\n",
"a = torch.zeros(t, dk) + 5\n",
"b = torch.zeros(t, dk) + 5\n",
"v = torch.zeros(t, dk) + 5\n",
"\n",
"bt = b.mT\n",
"\n",
"dot = torch.mm(a, bt)\n",
"\n",
"scaled = dot/math.sqrt(dk)\n",
"\n",
"softmax = torch.nn.functional.softmax(scaled)\n",
"\n",
"output = torch.mm(softmax, v)\n",
"\n",
"a,b,v,bt, dot, scaled, softmax, output, len(a[0]),a[1]"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "XCv8_IzSdut4"
},
"outputs": [],
"source": [
"def scaled_dot_product(q, k, v, mask=None):\n",
" # implemented by the student, you can ignore the mask implementation currently\n",
" # just assignment all the mask is on\n",
"\n",
" shape_len = len(k.shape)\n",
"\n",
" transpose = k.mT\n",
" d = k.shape[-1]\n",
"\n",
" score_scale = torch.matmul(q, transpose)/math.sqrt(d)\n",
"\n",
" attention_weight = torch.nn.functional.softmax(score_scale, 1)\n",
"\n",
" output = torch.matmul(attention_weight, v)\n",
"\n",
" return output, attention_weight"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"Q = torch.Tensor([[-0.19737370312213898, -1.0540887117385864, 0.02383515052497387, 0.46185705065727234], [-1.2415547370910645, 0.8366656303405762, 0.3741966784000397, 0.9099264740943909], [0.3436168134212494, 0.6154376268386841, 1.1926648616790771, 1.6477248668670654]])\n",
"K = torch.Tensor([[1.9663442373275757, 0.15551914274692535, -0.8715013861656189, 0.32070425152778625], [-5.85474967956543, 1.7047394514083862, -1.0024793148040771, 1.3307985067367554], [0.06319630891084671, -2.030783176422119, -5.436811447143555, -0.42979586124420166]])\n",
"V = torch.Tensor([[-82.127197265625, 0.9534303545951843, -28.78610610961914, -10.762138366699219], [-16.467313766479492, 60.92831802368164, -36.08392333984375, 31.648052215576172], [20.485767364501953, 45.4570198059082, 15.208494186401367, 31.43212890625]])\n",
"\n",
"ans = scaled_dot_product(Q, K, V)[0].tolist()\n",
"\n",
"pf = pd.read_csv(\"A1_template_template.csv\")\n",
"\n",
"pf.loc[0] = [6644818, Q.tolist(), K.tolist(), V.tolist(), ans]\n",
"\n",
"pf.to_csv('A1_template.csv', sep=',', index=False)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "wMIShH5wcrUK",
"outputId": "b5e6f270-0cae-4f2c-d388-e0e26ed28b6a"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([[ 1.2840, 0.9623],\n",
" [ 1.0821, -0.2264],\n",
" [ 0.4840, -1.0348]])\n",
"tensor([[ 0.0392, 0.2658],\n",
" [ 3.1410, 1.9842],\n",
" [ 1.2559, -1.1543]])\n",
"tensor([[ 0.2172, -0.7752],\n",
" [-1.0788, -1.9513],\n",
" [ 0.9364, -1.2229]])\n",
"tensor([[-1.0142, -1.9154],\n",
" [-0.4535, -1.6679],\n",
" [ 0.5474, -1.2476]])\n",
"tensor(-1.2095e-05)\n"
]
}
],
"source": [
"# Test case\n",
"seq_len, d_k = 3, 2\n",
"torch.manual_seed(3025)\n",
"q = torch.randn(seq_len, d_k)\n",
"k = torch.randn(seq_len, d_k)\n",
"v = torch.randn(seq_len, d_k)\n",
"valid = torch.tensor([[-1.0142, -1.9154],\n",
" [-0.4535, -1.6679],\n",
" [ 0.5474, -1.2476]])\n",
"output, attention_weight = scaled_dot_product(q,k,v)\n",
"differences = (output - valid).mean()\n",
"print(q)\n",
"print(k)\n",
"print(v)\n",
"print(output)\n",
"print(differences)\n",
"assert torch.abs(differences) < 0.0001, 'the product must be similar output as expected'"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DnDq4vT7kEGE"
},
"source": [
"# Multi-Head Attention\n",
"\n",
"Multi-Head Attention is an advancement over the scaled dot product attention, enabling the model to concurrently attend to information from different representation subspaces at different positions. This is particularly useful when dealing with complex data where different elements of the sequence may have different types of relevance or relationships to other elements.\n",
"\n",
"#### Concept\n",
"Instead of a single attention \"head,\" Multi-Head Attention uses multiple sets of Query, Key, and Value weight matrices to project the input into different subspaces, allowing the model to capture various aspects of the information. Each set of projections is referred to as a \"head.\" The attention outputs from each head are then concatenated and linearly transformed into the expected dimension.\n",
"\n",
"#### Mathematical Representation\n",
"Given Query, Key, and Value matrices (Q, K, V), the process can be mathematically described as:\n",
"\n",
"$$\n",
"\\begin{split}\n",
" \\text{Multihead}(Q,K,V) & = \\text{Concat}(\\text{head}_1,...,\\text{head}_h)W^{O}\\\\\n",
" \\text{where } \\text{head}_i & = \\text{Attention}(QW_i^Q,KW_i^K, VW_i^V)\n",
"\\end{split}\n",
"$$\n",
"\n",
"In this formula:\n",
"- $W_i^Q \\in \\mathbb{R}^{D \\times d_k}$, $W_i^K \\in \\mathbb{R}^{D \\times d_k}$, and $W_i^V \\in \\mathbb{R}^{D \\times d_v}$ are parameter matrices for the $i$-th attention head.\n",
"- $W^O \\in \\mathbb{R}^{h \\cdot d_k \\times d_{out}}$ is the parameter matrix for the linear transformation after concatenating the heads.\n",
"- $D$ is the dimensionality of the input, $h$ is the number of heads, and $d_{out}$ is the output dimensionality.\n",
"\n",
"#### Integration in Neural Networks\n",
"In a neural network, the Multi-Head Attention layer is typically applied to a feature map $X \\in \\mathbb{R}^{B \\times T \\times d_{\\text{model}}}$, where $B$ is the batch size, $T$ is the sequence length, and $d_{\\text{model}}$ is the dimensionality of the model's hidden layer. Here, $X$ serves as $Q$, $K$, and $V$. The transformation to query, key, and value representations is done using separate learnable weight matrices $W^Q$, $W^K$, and $W^V$.\n",
"\n",
"#### Implementation Notes\n",
"- **Heads**: Each head captures different aspects of the input data. More heads allow the model to simultaneously focus on different subspaces.\n",
"- **Dimensionality**: Ensure the dimensions of your weight matrices and inputs align correctly.\n",
"- **Efficiency**: Despite the increased complexity, Multi-Head Attention can be efficiently parallelized, making it suitable for large-scale problems.\n",
"\n",
"By utilizing Multi-Head Attention, models can gain a more nuanced understanding of the data, capturing various types of relationships within the sequence. This is especially beneficial in complex tasks like language understanding, where different words or phrases may have different kinds of relationships with others in the sequence."
]
},
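{
"cell_type": "markdown",
"metadata": {},
"source": [
"To connect the formula above to code before looking at the efficient implementation below, here is a deliberately naive sketch that materializes the per-head projections $W_i^Q$, $W_i^K$, $W_i^V$ and the output projection $W^O$ explicitly, reusing the `scaled_dot_product` function from earlier. The dimensions are illustrative; the class below fuses all of these projections into single linear layers instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Naive multi-head attention written directly from the formula, one head at a time.\n",
"# In practice the per-head projections are fused into one large linear layer (see below).\n",
"import torch\n",
"import torch.nn as nn\n",
"\n",
"D, h, d_head = 16, 4, 4    # input dim, number of heads, per-head dim (d_k = d_v = d_head)\n",
"T = 5\n",
"x = torch.randn(T, D)      # in self-attention, x provides Q, K and V\n",
"\n",
"W_q = [nn.Linear(D, d_head, bias=False) for _ in range(h)]\n",
"W_k = [nn.Linear(D, d_head, bias=False) for _ in range(h)]\n",
"W_v = [nn.Linear(D, d_head, bias=False) for _ in range(h)]\n",
"W_o = nn.Linear(h * d_head, D, bias=False)\n",
"\n",
"heads = []\n",
"for i in range(h):\n",
"    head_i, _ = scaled_dot_product(W_q[i](x), W_k[i](x), W_v[i](x))  # Attention(Q W_i^Q, K W_i^K, V W_i^V)\n",
"    heads.append(head_i)\n",
"\n",
"out = W_o(torch.cat(heads, dim=-1))  # Concat(head_1, ..., head_h) W^O\n",
"print(out.shape)                     # torch.Size([5, 16])"
]
},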
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(tensor([0, 1, 2, 3]), tensor([4, 5, 6, 7]), tensor([ 8, 9, 10]))"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"torch.arange(11).chunk(3, dim=-1)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"id": "zOiDz_FkkDDm"
},
"outputs": [],
"source": [
"class MultiheadAttention(nn.Module):\n",
" def __init__(self, input_dim, embed_dim, num_heads):\n",
" super().__init__()\n",
" assert embed_dim % num_heads == 0, \"Embedding dimension must be 0 modulo number of heads.\"\n",
"\n",
" self.embed_dim = embed_dim\n",
" self.num_heads = num_heads\n",
" self.head_dim = embed_dim // num_heads\n",
" self.qkv_proj = nn.Linear(input_dim, 3 * embed_dim)\n",
" self.o_proj = nn.Linear(embed_dim, embed_dim)\n",
"\n",
" self._reset_parameters()\n",
"\n",
" def _reset_parameters(self):\n",
" # Original Transformer initialization, see PyTorch documentation\n",
" nn.init.xavier_uniform_(self.qkv_proj.weight)\n",
" self.qkv_proj.bias.data.fill_(0)\n",
" nn.init.xavier_uniform_(self.o_proj.weight)\n",
" self.o_proj.bias.data.fill_(0)\n",
"\n",
" def forward(self, x, mask=None, return_attention=False):\n",
" batch_size, seq_length, embed_dim = x.size()\n",
" qkv = self.qkv_proj(x)\n",
" qkv = qkv.reshape(batch_size, seq_length, self.num_heads, 3 * self.head_dim)\n",
" qkv = qkv.permute(0, 2, 1, 3) # [Batch, Head, SeqLen, Dims]\n",
" q, k, v = qkv.chunk(3, dim=-1)\n",
" values, attention = scaled_dot_product(q, k, v, mask=mask)\n",
" values = values.permute(0, 2, 1, 3) # [Batch, SeqLen, Head, Dims]\n",
" values = values.reshape(batch_size, seq_length, embed_dim)\n",
" o = self.o_proj(values)\n",
"\n",
" if return_attention:\n",
" return o, attention\n",
" else:\n",
" return o"
]
},
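{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick shape check of the layer above on a random batch (the dimensions are arbitrary): the output keeps the input's `[Batch, SeqLen, embed_dim]` shape, and the attention map contains one `[SeqLen, SeqLen]` matrix per head."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check of the MultiheadAttention layer on random data.\n",
"import torch\n",
"\n",
"mha = MultiheadAttention(input_dim=32, embed_dim=32, num_heads=4)\n",
"x = torch.randn(2, 7, 32)   # [Batch, SeqLen, input_dim]\n",
"out, attn = mha(x, return_attention=True)\n",
"print(out.shape)    # torch.Size([2, 7, 32])\n",
"print(attn.shape)   # torch.Size([2, 4, 7, 7]) -> [Batch, Head, SeqLen, SeqLen]"
]
},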
{
"cell_type": "markdown",
"metadata": {
"id": "sLI_NEVtlSNI"
},
"source": [
"# Transformer Encoder\n",
"\n",
"The Transformer Encoder plays a crucial role in transforming input sequences into rich, attention-based representations, primarily used in Sequence-to-Sequence tasks like machine translation. While the original Transformer model consists of both encoder and decoder, the encoder alone has been foundational in numerous advances in NLP and beyond. This section focuses on the encoder's architecture, function, and key components.\n",
"\n",
"#### Overview\n",
"The Transformer Encoder is composed of a stack of $N$ identical layers, each containing two main sub-layers:\n",
"\n",
"1. **Multi-Head Attention Mechanism**: Enables the model to attend to different positions of the input sequence simultaneously.\n",
"2. **Position-wise Feed-Forward Networks**: Consists of fully connected layers applied to each position separately, allowing for individual processing of each sequence element.\n",
"\n",
"#### Encoder Architecture\n",
"Each layer in the encoder includes the following steps:\n",
"\n",
"1. **Input Processing**: The input $x$ (where $x$ can be $Q$, $K$, and $V$) is first passed through the Multi-Head Attention mechanism.\n",
"2. **Residual Connection and Layer Normalization**: The output from the Multi-Head Attention is then added back to the input $x$ through a residual connection, followed by layer normalization:\n",
" \n",
" $$\\text{LayerNorm}(x + \\text{Multihead}(x, x, x))$$\n",
"\n",
" The residual connections help in maintaining the flow of the original input information through the network and are crucial for training deeper models by improving gradient flow. Layer Normalization is used to stabilize the learning process and ensure consistent feature magnitude across sequence elements.\n",
"\n",
"3. **Position-wise Feed-Forward Networks (FFN)**: Each position is processed individually by a two-layered feed-forward network with ReLU activation in between:\n",
" \n",
" $$\n",
" \\begin{split}\n",
" \\text{FFN}(x) & = \\max(0, xW_1 + b_1)W_2 + b_2\\\\\n",
" x & = \\text{LayerNorm}(x + \\text{FFN}(x))\n",
" \\end{split}\n",
" $$\n",
"\n",
" This component allows for further processing of the information added by the attention mechanism, preparing it for the next layer.\n",
"\n",
"#### Considerations in Design\n",
"- **Layer Normalization**: Chosen over Batch Normalization due to its independence from batch size and better performance in language tasks.\n",
"- **Dimensionality of MLP in FFN**: Typically 2-8 times larger than the dimensionality of the input $x$ ($d_{\\text{model}}$), allowing for more complex transformations and faster parallelizable execution.\n",
"- **Dropout**: Applied in MLP and on the outputs of MLP and Multi-Head Attention for regularization.\n",
"\n",
"The Transformer Encoder's architecture, with its repetitive yet intricate structure, allows for effective processing and transformation of sequence data, making it a powerful tool in various sequence modeling tasks. The next steps involve implementing the encoder block, paying close attention to the integration of Multi-Head Attention, residual connections, layer normalization, and feed-forward networks within each layer."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"id": "a1HiddBnlW4J"
},
"outputs": [],
"source": [
"class EncoderBlock(nn.Module):\n",
" def __init__(self, input_dim, num_heads, dim_feedforward, dropout=0.0):\n",
" \"\"\"EncoderBlock.\n",
"\n",
" Args:\n",
" input_dim: Dimensionality of the input\n",
" num_heads: Number of heads to use in the attention block\n",
" dim_feedforward: Dimensionality of the hidden layer in the MLP\n",
" dropout: Dropout probability to use in the dropout layers\n",
" \"\"\"\n",
" super().__init__()\n",
"\n",
" # Attention layer\n",
" self.self_attn = MultiheadAttention(input_dim, input_dim, num_heads)\n",
"\n",
" # Two-layer MLP\n",
" self.linear_net = nn.Sequential(\n",
" nn.Linear(input_dim, dim_feedforward),\n",
" nn.Dropout(dropout),\n",
" nn.ReLU(inplace=True),\n",
" nn.Linear(dim_feedforward, input_dim),\n",
" )\n",
"\n",
" # Layers to apply in between the main layers\n",
" self.norm1 = nn.LayerNorm(input_dim)\n",
" self.norm2 = nn.LayerNorm(input_dim)\n",
" self.dropout = nn.Dropout(dropout)\n",
"\n",
" def forward(self, x, mask=None):\n",
" # Attention part\n",
" attn_out = self.self_attn(x, mask=mask)\n",
" x = x + self.dropout(attn_out)\n",
" x = self.norm1(x)\n",
"\n",
" # MLP part\n",
" linear_out = self.linear_net(x)\n",
" x = x + self.dropout(linear_out)\n",
" x = self.norm2(x)\n",
"\n",
" return x\n",
"\n",
"\n",
"\n",
"\n",
"class TransformerEncoder(nn.Module):\n",
" def __init__(self, num_layers, **block_args):\n",
" super().__init__()\n",
" self.layers = nn.ModuleList([EncoderBlock(**block_args) for _ in range(num_layers)])\n",
"\n",
" def forward(self, x, mask=None):\n",
" for layer in self.layers:\n",
" x = layer(x, mask=mask)\n",
" return x\n",
"\n",
" def get_attention_maps(self, x, mask=None):\n",
" attention_maps = []\n",
" for layer in self.layers:\n",
" _, attn_map = layer.self_attn(x, mask=mask, return_attention=True)\n",
" attention_maps.append(attn_map)\n",
" x = layer(x)\n",
" return attention_maps\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"class PositionalEncoding(nn.Module):\n",
" def __init__(self, d_model, max_len=5000):\n",
" \"\"\"Positional Encoding.\n",
"\n",
" Args:\n",
" d_model: Hidden dimensionality of the input.\n",
" max_len: Maximum length of a sequence to expect.\n",
" \"\"\"\n",
" super().__init__()\n",
"\n",
" # Create matrix of [SeqLen, HiddenDim] representing the positional encoding for max_len inputs\n",
" pe = torch.zeros(max_len, d_model)\n",
" position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)\n",
" div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))\n",
" pe[:, 0::2] = torch.sin(position * div_term)\n",
" pe[:, 1::2] = torch.cos(position * div_term)\n",
" pe = pe.unsqueeze(0)\n",
"\n",
" # register_buffer => Tensor which is not a parameter, but should be part of the modules state.\n",
" # Used for tensors that need to be on the same device as the module.\n",
" # persistent=False tells PyTorch to not add the buffer to the state dict (e.g. when we save the model)\n",
" self.register_buffer(\"pe\", pe, persistent=False)\n",
"\n",
" def forward(self, x):\n",
" x = x + self.pe[:, : x.size(1)]\n",
" return x"
]
},
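{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short shape check of the modules above (arbitrary dimensions): positional encodings are simply added to the input, the encoder preserves the `[Batch, SeqLen, input_dim]` shape, and `get_attention_maps` returns one attention tensor per layer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Shape check for PositionalEncoding and TransformerEncoder on a random batch.\n",
"import torch\n",
"\n",
"encoder = TransformerEncoder(num_layers=2, input_dim=64, num_heads=4, dim_feedforward=128, dropout=0.1)\n",
"pos_enc = PositionalEncoding(d_model=64)\n",
"\n",
"x = torch.randn(8, 16, 64)   # [Batch, SeqLen, input_dim]\n",
"x = pos_enc(x)               # add positional information\n",
"out = encoder(x)\n",
"attn_maps = encoder.get_attention_maps(x)\n",
"\n",
"print(out.shape)                           # torch.Size([8, 16, 64])\n",
"print(len(attn_maps), attn_maps[0].shape)  # 2 torch.Size([8, 4, 16, 16])"
]
},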
{
"cell_type": "markdown",
"metadata": {
"id": "bKeMF9xLmQH0"
},
"source": [
"# Sequence to Sequence Tasks\n",
"\n",
"Sequence to Sequence (Seq2Seq) tasks involve converting an input sequence into an output sequence, where the input and output may vary in length. This model structure is commonly used in applications like machine translation, text summarization, and more. Typically, a Seq2Seq model comprises an encoder to interpret the input sequence and a decoder to generate the output sequence autoregressively.\n",
"\n",
"#### Simplified Task: Sequence Reversal\n",
"For educational purposes, we'll focus on a simplified Seq2Seq task: reversing a sequence of numbers. Despite its simplicity, this task is a good testbed for understanding Seq2Seq models, especially since it requires capturing long-term dependencies, something traditional RNNs might struggle with, but Transformers are well-equipped to handle.\n",
"\n",
"#### Task Description:\n",
"- **Input**: A sequence of $N$ numbers ranging from $0$ to $M$.\n",
"- **Output**: The reversed sequence of the input.\n",
"\n",
"In Numpy, if our input sequence is $x$, the desired output is $x$[::-1]. Although straightforward, this task provides a clear demonstration of a model's ability to handle sequences and understand dependencies across positions.\n",
"\n",
"#### Implementation Steps:\n",
"- **Create a Dataset Class**: The first step is to create a dataset class that can generate sequences of numbers and their reversed counterparts. This class will be used to train and evaluate the Seq2Seq model.\n",
"\n",
"By starting with this simple task, we can focus on the mechanics and capabilities of the Transformer encoder in handling sequences, setting the stage for tackling more complex Seq2Seq tasks in the future."
]
},
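{
"cell_type": "markdown",
"metadata": {},
"source": [
"One practical note, relevant to the dataset implementation below: PyTorch tensors do not support the NumPy-style negative-step slice `x[::-1]`, so the dataset uses `torch.flip` along the sequence dimension instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Reversing a sequence: negative-step slicing is not supported on torch tensors,\n",
"# so torch.flip along the sequence dimension is used instead.\n",
"import torch\n",
"\n",
"x = torch.tensor([3, 1, 4, 1, 5, 9])\n",
"print(torch.flip(x, dims=(0,)))  # tensor([9, 5, 1, 4, 1, 3])"
]
},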
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"id": "PSBkeOmtmPhX"
},
"outputs": [
{
"data": {
"text/plain": [
"(16, 10)"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"class ReverseDataset(data.Dataset):\n",
" def __init__(self, num_categories, seq_len, size):\n",
" super().__init__()\n",
" self.num_categories = num_categories\n",
" self.seq_len = seq_len\n",
" self.size = size\n",
"\n",
" self.data = torch.randint(self.num_categories, size=(self.size, self.seq_len))\n",
"\n",
" def __len__(self):\n",
" return self.size\n",
"\n",
" def __getitem__(self, idx):\n",
" inp_data = self.data[idx]\n",
" labels = torch.flip(inp_data, dims=(0,))\n",
" return inp_data, labels\n",
"\n",
"seq_len = 16\n",
"num_categories = 10\n",
"batch_size = 128\n",
"dataset = partial(ReverseDataset, num_categories, seq_len)\n",
"train_loader = data.DataLoader(dataset(10000), batch_size=batch_size, shuffle=True, drop_last=True, pin_memory=True)\n",
"val_loader = data.DataLoader(dataset(1000), batch_size=64, drop_last=True, shuffle=False)\n",
"\n",
"seq_len, num_categories"
]
},
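{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick look at a single sample confirms that the label sequence is the input sequence reversed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect one sample from the training dataset: labels = reversed inputs.\n",
"inp, lab = train_loader.dataset[0]\n",
"print(inp)\n",
"print(lab)\n",
"assert torch.equal(lab, torch.flip(inp, dims=(0,)))"
]
},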
{
"cell_type": "markdown",
"metadata": {
"id": "VZ52A-Hhma4b"
},
"source": [
"# Compose the network"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"id": "JxlCGvdomaDJ"
},
"outputs": [],
"source": [
"class TransformerPredictor(nn.Module):\n",
" def __init__(\n",
" self,\n",
" input_dim,\n",
" model_dim,\n",
" num_classes,\n",
" num_heads,\n",
" num_layers,\n",
" dropout=0.0,\n",
" input_dropout=0.0,\n",
" ):\n",
" \"\"\"TransformerPredictor.\n",
"\n",
" Args:\n",
" input_dim: Hidden dimensionality of the input\n",
" model_dim: Hidden dimensionality to use inside the Transformer\n",
" num_classes: Number of classes to predict per sequence element\n",
" num_heads: Number of heads to use in the Multi-Head Attention blocks\n",
" num_layers: Number of encoder blocks to use.\n",
" dropout: Dropout to apply inside the model\n",
" input_dropout: Dropout to apply on the input features\n",
" \"\"\"\n",
" super().__init__()\n",
" # Input dim -> Model dim\n",
" self.input_net = nn.Sequential(\n",
" nn.Dropout(input_dropout),\n",
" nn.Linear(input_dim, model_dim)\n",
" )\n",
" # Positional encoding for sequences\n",
" self.positional_encoding = PositionalEncoding(d_model=model_dim)\n",
" # Transformer\n",
" self.transformer = TransformerEncoder(\n",
" num_layers=num_layers,\n",
" input_dim=model_dim,\n",
" dim_feedforward=2 * model_dim,\n",
" num_heads=num_heads,\n",
" dropout=dropout,\n",
" )\n",
" # Output classifier per sequence lement\n",
" self.output_net = nn.Sequential(\n",
" nn.Linear(model_dim, model_dim),\n",
" nn.LayerNorm(model_dim),\n",
" nn.ReLU(inplace=True),\n",
" nn.Dropout(dropout),\n",
" nn.Linear(model_dim, num_classes),\n",
" )\n",
"\n",
" def forward(self, x, mask=None, add_positional_encoding=True):\n",
" \"\"\"\n",
" Args:\n",
" x: Input features of shape [Batch, SeqLen, input_dim]\n",
" mask: Mask to apply on the attention outputs (optional)\n",
" add_positional_encoding: If True, we add the positional encoding to the input.\n",
" Might not be desired for some tasks.\n",
" \"\"\"\n",
" x = self.input_net(x)\n",
" if add_positional_encoding:\n",
" x = self.positional_encoding(x)\n",
" x = self.transformer(x, mask=mask)\n",
" x = self.output_net(x)\n",
" return x\n"
]
},
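{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check before writing the training loop, a freshly initialized model can be run on one batch of one-hot encoded inputs from the loader defined above (the small `model_dim` here is arbitrary); it should return one score per class for every sequence position."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Forward-pass shape check: [Batch, SeqLen] token ids -> [Batch, SeqLen, num_classes] logits.\n",
"import torch.nn.functional as F\n",
"\n",
"test_model = TransformerPredictor(input_dim=num_categories, model_dim=32,\n",
"                                  num_classes=num_categories, num_heads=4, num_layers=1)\n",
"inp, _ = next(iter(train_loader))                                # [Batch, SeqLen] integers\n",
"inp_one_hot = F.one_hot(inp, num_classes=num_categories).float()\n",
"logits = test_model(inp_one_hot)\n",
"print(logits.shape)  # torch.Size([128, 16, 10])"
]
},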
{
"cell_type": "markdown",
"metadata": {
"id": "uUuW7DbBnjsS"
},
"source": [
"# Task: Writing Training Loop"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"torch.cuda.device_count()"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'NVIDIA GeForce RTX 3090 Ti'"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"torch.cuda.get_device_name(0)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "cZaOx-7qni7y",
"outputId": "a181d978-f0e0-451b-9e95-587ce9d8c2bd"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"EPOCH 1:\n",
"LOSS train 2.3047391891479494\n",
"EPOCH 2:\n",
"LOSS train 2.3059343338012694\n",
"EPOCH 3:\n",
"LOSS train 2.303365612030029\n",
"EPOCH 4:\n",
"LOSS train 2.278448724746704\n",
"EPOCH 5:\n",
"LOSS train 2.271419906616211\n",
"EPOCH 6:\n",
"LOSS train 2.272932434082031\n",
"EPOCH 7:\n",
"LOSS train 2.2925734519958496\n",
"EPOCH 8:\n",
"LOSS train 2.273321104049683\n",
"EPOCH 9:\n",
"LOSS train 2.2358773708343507\n",
"EPOCH 10:\n",
"LOSS train 2.213832139968872\n",
"EPOCH 11:\n",
"LOSS train 2.1974945068359375\n",
"EPOCH 12:\n",
"LOSS train 2.138561820983887\n",
"EPOCH 13:\n",
"LOSS train 2.131034755706787\n",
"EPOCH 14:\n",
"LOSS train 2.092240905761719\n",
"EPOCH 15:\n",
"LOSS train 2.028573417663574\n",
"EPOCH 16:\n",
"LOSS train 2.009480619430542\n",
"EPOCH 17:\n",
"LOSS train 2.0119842529296874\n",
"EPOCH 18:\n",
"LOSS train 1.966892457008362\n",
"EPOCH 19:\n",
"LOSS train 1.9739716291427611\n",
"EPOCH 20:\n",
"LOSS train 1.9691336631774903\n",
"EPOCH 21:\n",
"LOSS train 1.9692431688308716\n",
"EPOCH 22:\n",
"LOSS train 1.9720533609390258\n",
"EPOCH 23:\n",
"LOSS train 1.9636199712753295\n",
"EPOCH 24:\n",
"LOSS train 2.0134324550628664\n",
"EPOCH 25:\n",
"LOSS train 1.9760711431503295\n",
"EPOCH 26:\n",
"LOSS train 1.9674718379974365\n",
"EPOCH 27:\n",
"LOSS train 1.9762607574462892\n",
"EPOCH 28:\n",
"LOSS train 1.9764994859695435\n",
"EPOCH 29:\n",
"LOSS train 1.9699514150619506\n",
"EPOCH 30:\n",
"LOSS train 1.95670006275177\n",
"EPOCH 31:\n",
"LOSS train 1.946057105064392\n",
"EPOCH 32:\n",
"LOSS train 1.9565371990203857\n",
"EPOCH 33:\n",
"LOSS train 1.9599705457687377\n",
"EPOCH 34:\n",
"LOSS train 1.9696622133255004\n",
"EPOCH 35:\n",
"LOSS train 1.9993957996368408\n",
"EPOCH 36:\n",
"LOSS train 1.9636467695236206\n",
"EPOCH 37:\n",
"LOSS train 1.980830192565918\n",
"EPOCH 38:\n",
"LOSS train 1.9654539108276368\n",
"EPOCH 39:\n",
"LOSS train 1.9689129829406737\n",
"EPOCH 40:\n",
"LOSS train 1.955962347984314\n",
"EPOCH 41:\n",
"LOSS train 1.9647478580474853\n",
"EPOCH 42:\n",
"LOSS train 1.9532663106918335\n",
"EPOCH 43:\n",
"LOSS train 1.9503717422485352\n",
"EPOCH 44:\n",
"LOSS train 1.9499874591827393\n",
"EPOCH 45:\n",
"LOSS train 1.9529696941375732\n",
"EPOCH 46:\n",
"LOSS train 1.9518198251724244\n",
"EPOCH 47:\n",
"LOSS train 1.9523835182189941\n",
"EPOCH 48:\n",
"LOSS train 1.9561205148696899\n",
"EPOCH 49:\n",
"LOSS train 1.9675297260284423\n",
"EPOCH 50:\n",
"LOSS train 2.123178768157959\n",
"EPOCH 51:\n",
"LOSS train 1.970911931991577\n",
"EPOCH 52:\n",
"LOSS train 1.9587018251419068\n",
"EPOCH 53:\n",
"LOSS train 1.9622526168823242\n",
"EPOCH 54:\n",
"LOSS train 1.9551706790924073\n",
"EPOCH 55:\n",
"LOSS train 1.953707218170166\n",
"EPOCH 56:\n",
"LOSS train 1.9466333389282227\n",
"EPOCH 57:\n",
"LOSS train 1.9582770824432374\n",
"EPOCH 58:\n",
"LOSS train 1.9466321229934693\n",
"EPOCH 59:\n",
"LOSS train 1.9557215929031373\n",
"EPOCH 60:\n",
"LOSS train 1.9505679607391357\n",
"EPOCH 61:\n",
"LOSS train 1.9520682334899901\n",
"EPOCH 62:\n",
"LOSS train 1.955586814880371\n",
"EPOCH 63:\n",
"LOSS train 1.9475157499313354\n",
"EPOCH 64:\n",
"LOSS train 1.9377191305160522\n",
"EPOCH 65:\n",
"LOSS train 1.938973307609558\n",
"EPOCH 66:\n",
"LOSS train 1.9429319143295287\n",
"EPOCH 67:\n",
"LOSS train 1.9438214540481566\n",
"EPOCH 68:\n",
"LOSS train 1.9364233016967773\n",
"EPOCH 69:\n",
"LOSS train 1.9589627027511596\n",
"EPOCH 70:\n",
"LOSS train 1.9416004180908204\n",
"EPOCH 71:\n",
"LOSS train 1.9382025003433228\n",
"EPOCH 72:\n",
"LOSS train 1.9310474157333375\n",
"EPOCH 73:\n",
"LOSS train 1.939470887184143\n",
"EPOCH 74:\n",
"LOSS train 1.9370584964752198\n",
"EPOCH 75:\n",
"LOSS train 1.9400960445404052\n",
"EPOCH 76:\n",
"LOSS train 1.9455738306045531\n",
"EPOCH 77:\n",
"LOSS train 1.9308057308197022\n",
"EPOCH 78:\n",
"LOSS train 1.9302024841308594\n",
"EPOCH 79:\n",
"LOSS train 1.9345638751983643\n",
"EPOCH 80:\n",
"LOSS train 1.9394316673278809\n",
"EPOCH 81:\n",
"LOSS train 1.9305338621139527\n",
"EPOCH 82:\n",
"LOSS train 1.9336636304855346\n",
"EPOCH 83:\n",
"LOSS train 1.921869659423828\n",
"EPOCH 84:\n",
"LOSS train 1.9273949146270752\n",
"EPOCH 85:\n",
"LOSS train 1.916986870765686\n",
"EPOCH 86:\n",
"LOSS train 1.9170085191726685\n",
"EPOCH 87:\n",
"LOSS train 1.9086025476455688\n",
"EPOCH 88:\n",
"LOSS train 1.911173439025879\n",
"EPOCH 89:\n",
"LOSS train 1.8953789472579956\n",
"EPOCH 90:\n",
"LOSS train 1.9057527542114259\n",
"EPOCH 91:\n",
"LOSS train 1.883132243156433\n",
"EPOCH 92:\n",
"LOSS train 1.895125651359558\n",
"EPOCH 93:\n",
"LOSS train 1.8788848161697387\n",
"EPOCH 94:\n",
"LOSS train 1.870681929588318\n",
"EPOCH 95:\n",
"LOSS train 1.8580753803253174\n",
"EPOCH 96:\n",
"LOSS train 1.8578021287918092\n",
"EPOCH 97:\n",
"LOSS train 1.8478908300399781\n",
"EPOCH 98:\n",
"LOSS train 1.8258821010589599\n",
"EPOCH 99:\n",
"LOSS train 1.8657661914825439\n",
"EPOCH 100:\n",
"LOSS train 1.8161733627319336\n"
]
}
],
"source": [
"input_dim = 10 # This needs to be 10 because yes\n",
"model_dim = 1024 # size of the hidden layer (transformers)\n",
"num_classes = train_loader.dataset.num_categories\n",
"num_heads = 8\n",
"num_layers = 1\n",
"with torch.cuda.device(torch.device('cuda')):\n",
" \n",
" # please create the model\n",
" model = TransformerPredictor(input_dim, model_dim, num_classes, num_heads, num_layers).cuda()\n",
"\n",
" # please create the optimizer\n",
" optimizer = torch.optim.Adam(model.parameters())\n",
" loss_fn = torch.nn.CrossEntropyLoss()\n",
" # please train the model, with the whole training pipeline\n",
" \n",
" def train(epoch_index, tb_writer):\n",
" running_loss = 0\n",
" last_loss = 0\n",
" \n",
" for i, data in enumerate(train_loader):\n",
" inputs, labels = data\n",
" \n",
" # inputs = inputs.to(torch.float32)\n",
" inputs = F.one_hot(inputs, num_classes=num_classes).float().cuda()\n",
"\n",
" labels = labels.cuda()\n",
" \n",
" outputs = model(inputs)\n",
" \n",
" loss = loss_fn(outputs.view(-1, 10), labels.view(-1))\n",
" optimizer.zero_grad()\n",
" loss.backward()\n",
" \n",
" # Adjust learning weights\n",
" optimizer.step()\n",
" \n",
" running_loss += loss.item()\n",
" if i % 5 == 0:\n",
" last_loss = running_loss / 5 # loss per batch\n",
" # print(' batch {} loss: {}'.format(i + 1, last_loss))\n",
" tb_x = epoch_index * len(train_loader) + i + 1\n",
" # tb_writer.add_scalar('Loss/train', last_loss, tb_x)\n",
" running_loss = 0.\n",
" \n",
" return last_loss\n",
" \n",
" \n",
" timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')\n",
" # writer = torch.utils.tensorboard.writer.SummaryWriter('runs/fashion_trainer_{}'.format(timestamp))\n",
" epoch_number = 0\n",
" \n",
" EPOCHS = 100\n",
" \n",
" best_vloss = 1_000_000.\n",
" \n",
" for epoch in range(EPOCHS):\n",
" print('EPOCH {}:'.format(epoch_number + 1))\n",
" \n",
" # Make sure gradient tracking is on, and do a pass over the data\n",
" model.train(True)\n",
" avg_loss = train(epoch_number, None)\n",
" \n",
" avg_vloss = 0\n",
" print('LOSS train {}'.format(avg_loss))\n",
" \n",
" epoch_number += 1\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NVqbotkCrCSy"
},
"source": [
"# Evaluation\n",
"Here is the evaluation code, can you do better than 2.0?"
]
},
{
"cell_type": "code",
"execution_count": 218,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "vkRNkGBspuZh",
"outputId": "f5a0ff6d-e24c-4a94-d5f8-8113956f3b18"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Validation Loss: 2.3046957651774087\n"
]
}
],
"source": [
"# Validating the validation loss\n",
"criterion = nn.CrossEntropyLoss()\n",
"# Validation loop\n",
"model.eval()\n",
"with torch.no_grad():\n",
" val_loss = 0\n",
" for inputs, labels in val_loader:\n",
" inp_data = F.one_hot(inputs, num_classes=10).float().cuda()\n",
" outputs = model(inp_data)\n",
" loss = criterion(outputs.view(1024,10), labels.view(-1).cuda())\n",
" val_loss += loss.item()\n",
" print(f\"Validation Loss: {val_loss / len(val_loader)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"colab": {
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}