# Import Necessary Libraries

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as data
import math
import os
import urllib.request
import pandas as pd
from functools import partial
from urllib.error import HTTPError
from datetime import datetime

# What is Attention?

Attention in neural networks, particularly relevant for sequential tasks, refers to a mechanism that selectively focuses on certain parts of input data. This concept has gained significant interest in recent years. In essence, attention computes a weighted average of elements in a sequence, with the weights being dynamically determined based on the relevance of each element to a specific query. This allows the model to prioritize certain inputs over others.

The attention mechanism consists of four primary components:

* **Query**: A feature vector representing the target of the attention, essentially indicating the information the model seeks within the sequence.
* **Keys**: Feature vectors corresponding to each input element, describing the content or relevance of the elements. The keys help the model identify which elements to focus on, relative to the query.
* **Values**: Feature vectors representing the actual content from each input element that the model should aggregate.
* **Score function**: A function used to calculate attention weights, representing the relevance of each key-query pair. Common implementations include simple operations like the dot product or more complex structures like a small neural network.

The attention mechanism operates by first computing scores between the query and each key using the score function. These scores determine the attention weights through a softmax function, ensuring that they sum to one and are non-negative. The output is then calculated as the weighted sum of the value vectors, with weights corresponding to the calculated attention scores.

Mathematically, this process can be represented as:

$$
\alpha_i = \frac{\exp\left(f_{attn}\left(\text{key}_i, \text{query}\right)\right)}{\sum_j \exp\left(f_{attn}\left(\text{key}_j, \text{query}\right)\right)}, \hspace{5mm} \text{out} = \sum_i \alpha_i \cdot \text{value}_i
$$
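
The formula above can be sketched for a single query in plain Python (no PyTorch, so every step is visible). The helper names `softmax`, `dot` and `attend`, and the toy vectors, are made up for illustration; the dot product is only one possible choice of $f_{attn}$.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to one."""
    m = max(scores)  # subtracting the max does not change the result
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot-product score function (one possible f_attn)."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values, score_fn=dot):
    """Weighted average of `values`, weighted by softmax(score_fn(key, query))."""
    weights = softmax([score_fn(key, query) for key in keys])
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights

# Toy data (hypothetical): the query is most similar to the second key.
query = [1.0, 0.0]
keys = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

out, weights = attend(query, keys, values)
print(weights)  # the second weight is the largest
print(out)      # the output is pulled towards the second value vector
```

Because the weights sum to one, the output always lies inside the convex hull of the value vectors.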

In practice, attention mechanisms can vary based on the choice of queries, the definition of key and value vectors, and the specific score function used. A prominent example is the **self-attention** mechanism used in the Transformer architecture, where each element in a sequence provides its own key, value, and query. The self-attention mechanism allows each element to attend to all elements in the sequence, including itself, resulting in a representation that incorporates information from the entire sequence.

The above explanation provides a conceptual understanding of the attention mechanism, highlighting its components and operational principles without delving into the specific details of any particular implementation, such as the scaled dot product attention used in Transformers.

### Scaled Dot Product Attention

The scaled dot product attention is a fundamental component of the self-attention mechanism, enabling elements within a sequence to efficiently attend to one another. It operates on queries $Q\in\mathbb{R}^{T\times d_k}$, keys $K\in\mathbb{R}^{T\times d_k}$, and values $V\in\mathbb{R}^{T\times d_v}$, where $T$ represents the sequence length and $d_k$, $d_v$ denote the dimensions of queries/keys and values, respectively.

The mechanism calculates the attention values from the dot product similarity between each query $Q_i$ and key $K_j$, and divides the results by $\sqrt{d_k}$, the square root of the key dimensionality. The formula for this calculation is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Here, the matrix product $QK^T$ computes the dot product between all pairs of queries and keys, forming a $T\times T$ matrix where each entry represents the attention score from one element to another. After applying the softmax function, these scores are used as weights to compute a weighted average of the value vectors.

The scaling factor $1/\sqrt{d_k}$ is critical for maintaining the variance of the attention scores at an appropriate level. Without this scaling, the variance of the dot products could become too large, leading to a situation where the softmax function saturates, with most of its output concentrated on a single element. This would hinder learning by resulting in gradients that are almost zero.
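
A small simulation can make the variance argument concrete. It assumes that the components of each query and key are independent with zero mean and unit variance; under that assumption the dot product has variance $d_k$, and dividing by $\sqrt{d_k}$ brings it back to roughly one. The function name `dot_product_variance` and the sample sizes are illustrative choices.

```python
import math
import random

def dot_product_variance(d_k, n_samples=2000, seed=0):
    """Empirical variance of q.k and of q.k / sqrt(d_k) for standard normal q, k."""
    rng = random.Random(seed)
    raw, scaled = [], []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        s = sum(a * b for a, b in zip(q, k))
        raw.append(s)
        scaled.append(s / math.sqrt(d_k))

    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)

    return variance(raw), variance(scaled)

for d_k in (4, 64):
    var_raw, var_scaled = dot_product_variance(d_k)
    # The raw variance grows with d_k; the scaled variance stays near 1.
    print(d_k, round(var_raw, 2), round(var_scaled, 2))
```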

Additionally, the mechanism can include an optional masking step (denoted as `Mask (opt.)` in the diagram), useful in situations like batch processing of sequences of varying lengths. Padding is used to equalize the lengths of sequences, and the mask ensures that the padded positions do not affect the attention calculation, typically by assigning a very low value to these positions in the attention scores.
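
As a minimal sketch of the masking idea, the plain-Python snippet below sets the scores of padded positions to negative infinity before the softmax, so they receive exactly zero weight. The function name `masked_softmax` and the `keep` flags are hypothetical; a very large negative constant would have practically the same effect, and the sketch assumes at least one position is kept.

```python
import math

def masked_softmax(scores, keep):
    """Softmax over `scores`, ignoring positions where `keep` is False.

    Assumes at least one entry of `keep` is True.
    """
    masked = [s if k else -math.inf for s, k in zip(scores, keep)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# A sequence of true length 2, padded to length 4 (toy values).
scores = [2.0, 1.0, 3.0, 3.0]
keep = [True, True, False, False]

weights = masked_softmax(scores, keep)
print(weights)  # the last two weights are exactly 0.0
```

Note that the padded positions had the highest raw scores here; without the mask they would have dominated the weighted average.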

In summary, the scaled dot product attention efficiently enables each element in a sequence to attend to all others, considering the relevance of each element, and is crucial for models that rely on self-attention, such as Transformers.

### Implementing Scaled Dot Product Attention

Scaled dot product attention is a core mechanism allowing each element in a sequence to consider all other elements efficiently, which is fundamental in self-attention models like Transformers. Here's a detailed guide to implementing scaled dot product attention, breaking down the components and the steps involved.

#### Inputs to the Attention Mechanism
The attention function takes three inputs:
1. **Queries (Q)**: $Q\in\mathbb{R}^{T\times d_k}$, where $T$ is the sequence length and $d_k$ is the dimensionality of the queries and keys.
2. **Keys (K)**: $K\in\mathbb{R}^{T\times d_k}$.
3. **Values (V)**: $V\in\mathbb{R}^{T\times d_v}$, where $d_v$ is the dimensionality of the values.

#### Step-by-Step Calculation
1. **Dot Product of Queries and Keys**: Calculate the dot product between each query and all keys to obtain a measure of compatibility or relevance between each query-key pair. This results in a matrix of shape $T \times T$, where each element $(i, j)$ represents the dot product between query $i$ and key $j$.
   
   $$\text{Score Matrix} = QK^T$$

2. **Scaling**: Scale the scores obtained in the previous step by dividing by $\sqrt{d_k}$. For large $d_k$ the unscaled dot products grow in magnitude and push the softmax into its saturated region, where gradients become extremely small and slow down learning and convergence.

   $$\text{Scaled Score Matrix} = \frac{\text{Score Matrix}}{\sqrt{d_k}}$$

3. **Optional Masking**: If masking is required (e.g., for padded positions in a batch of sequences), apply the mask by setting the scores for masked positions to a very large negative value, ensuring that they have minimal impact after the softmax step.

4. **Softmax**: Apply the softmax function to the scaled scores along each row. This step converts the scores into probabilities, indicating the importance of each key relative to each query.

   $$\text{Attention Weights} = \text{softmax}(\text{Scaled Score Matrix})$$

5. **Output Calculation**: Multiply the attention weights by the value vectors to obtain the final output. This step computes a weighted average of the value vectors, where the weights are determined by the attention scores.

   $$\text{Output} = \text{Attention Weights} \times V$$

#### Implementation Tips
- **Dimensionality**: Ensure the dimensions of your matrices are correct. Matrix multiplication will not be possible if the inner dimensions do not match.
- **Numerical Stability**: When implementing the softmax function, ensure numerical stability by subtracting the maximum value in each row of the scores matrix before applying the exponential function.
- **Batch Processing**: If implementing attention in batch, include an additional batch dimension in your matrices (e.g., $Q\in\mathbb{R}^{B\times T\times d_k}$ for a batch size of $B$) and ensure your implementation supports this.
- **Testing**: Verify the correctness of your implementation with simple test cases to ensure it behaves as expected.
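
The numerical stability tip can be illustrated with a short plain-Python comparison. The function names are made up for this sketch; library softmax implementations generally take care of this for you, so the point here is only to show why the trick matters.

```python
import math

def naive_softmax(scores):
    """Direct translation of the formula; overflows for large scores."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(scores):
    """Subtract the row maximum first; mathematically the same result."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]  # same differences, shifted by a constant

try:
    naive_softmax(large)
    naive_failed = False
except OverflowError:
    # math.exp(1000.0) exceeds the range of a Python float
    naive_failed = True

print(naive_failed)
print(stable_softmax(small))
print(stable_softmax(large))  # identical to the line above
```

Softmax is invariant to adding a constant to all scores, which is why shifting by the maximum is safe.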

This framework should provide a clear structure for students to implement scaled dot product attention, enhancing their understanding of its role and functionality in self-attention models.

# Task: Please implement a scaled dot product function

In [30]:
dk = 2
t = 3
v = torch.randn(t, dk)

len(v[0]),v.shape[-1]

(2, 2)

In [31]:
dk = 2
t = 3

a = torch.zeros(t, dk) + 5
b = torch.zeros(t, dk) + 5
v = torch.zeros(t, dk) + 5

bt = b.mT

dot = torch.mm(a, bt)

scaled = dot/math.sqrt(dk)

softmax = torch.nn.functional.softmax(scaled, dim=-1)  # normalize each row over the keys

output = torch.mm(softmax, v)

a,b,v,bt, dot, scaled, softmax, output, len(a[0]),a[1]


(tensor([[5., 5.],
         [5., 5.],
         [5., 5.]]),
 tensor([[5., 5.],
         [5., 5.],
         [5., 5.]]),
 tensor([[5., 5.],
         [5., 5.],
         [5., 5.]]),
 tensor([[5., 5., 5.],
         [5., 5., 5.]]),
 tensor([[50., 50., 50.],
         [50., 50., 50.],
         [50., 50., 50.]]),
 tensor([[35.3553, 35.3553, 35.3553],
         [35.3553, 35.3553, 35.3553],
         [35.3553, 35.3553, 35.3553]]),
 tensor([[0.3333, 0.3333, 0.3333],
         [0.3333, 0.3333, 0.3333],
         [0.3333, 0.3333, 0.3333]]),
 tensor([[5., 5.],
         [5., 5.],
         [5., 5.]]),
 2,
 tensor([5., 5.]))

In [3]:
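def scaled_dot_product(q, k, v, mask=None):
    # q, k: [..., SeqLen, d_k], v: [..., SeqLen, d_v]; leading dims (batch, heads) are optional.
    d_k = q.shape[-1]

    # Scaled similarity between every query and every key: [..., SeqLen, SeqLen]
    scores = torch.matmul(q, k.mT) / math.sqrt(d_k)

    # Optional masking. Assumed convention: positions where mask == 0 are ignored.
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -9e15)

    # Normalize over the key dimension (the last one). Using dim=1 would
    # normalize over the heads once the input is [Batch, Head, SeqLen, d_k].
    attention_weight = F.softmax(scores, dim=-1)

    output = torch.matmul(attention_weight, v)

    return output, attention_weight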

In [33]:
Q = torch.Tensor([[-0.19737370312213898, -1.0540887117385864, 0.02383515052497387, 0.46185705065727234], [-1.2415547370910645, 0.8366656303405762, 0.3741966784000397, 0.9099264740943909], [0.3436168134212494, 0.6154376268386841, 1.1926648616790771, 1.6477248668670654]])
K = torch.Tensor([[1.9663442373275757, 0.15551914274692535, -0.8715013861656189, 0.32070425152778625], [-5.85474967956543, 1.7047394514083862, -1.0024793148040771, 1.3307985067367554], [0.06319630891084671, -2.030783176422119, -5.436811447143555, -0.42979586124420166]])
V = torch.Tensor([[-82.127197265625, 0.9534303545951843, -28.78610610961914, -10.762138366699219], [-16.467313766479492, 60.92831802368164, -36.08392333984375, 31.648052215576172], [20.485767364501953, 45.4570198059082, 15.208494186401367, 31.43212890625]])

ans = scaled_dot_product(Q, K, V)[0].tolist()

pf = pd.read_csv("A1_template_template.csv")

pf.loc[0] = [6644818, Q.tolist(), K.tolist(), V.tolist(), ans]

pf.to_csv('A1_template.csv', sep=',', index=False)

In [34]:
# Test case
seq_len, d_k = 3, 2
torch.manual_seed(3025)
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)
valid = torch.tensor([[-1.0142, -1.9154],
        [-0.4535, -1.6679],
        [ 0.5474, -1.2476]])
output, attention_weight = scaled_dot_product(q,k,v)
differences = (output - valid).mean()
print(q)
print(k)
print(v)
print(output)
print(differences)
# Compare element-wise: a signed mean can hide errors that cancel each other out
assert torch.allclose(output, valid, atol=1e-4), 'the output must match the expected values'

tensor([[ 1.2840,  0.9623],
        [ 1.0821, -0.2264],
        [ 0.4840, -1.0348]])
tensor([[ 0.0392,  0.2658],
        [ 3.1410,  1.9842],
        [ 1.2559, -1.1543]])
tensor([[ 0.2172, -0.7752],
        [-1.0788, -1.9513],
        [ 0.9364, -1.2229]])
tensor([[-1.0142, -1.9154],
        [-0.4535, -1.6679],
        [ 0.5474, -1.2476]])
tensor(-1.2095e-05)


# Multi-Head Attention

Multi-Head Attention is an advancement over the scaled dot product attention, enabling the model to concurrently attend to information from different representation subspaces at different positions. This is particularly useful when dealing with complex data where different elements of the sequence may have different types of relevance or relationships to other elements.

#### Concept
Instead of a single attention "head," Multi-Head Attention uses multiple sets of Query, Key, and Value weight matrices to project the input into different subspaces, allowing the model to capture various aspects of the information. Each set of projections is referred to as a "head." The attention outputs from each head are then concatenated and linearly transformed into the expected dimension.

#### Mathematical Representation
Given Query, Key, and Value matrices (Q, K, V), the process can be mathematically described as:

$$
\begin{split}
    \text{Multihead}(Q,K,V) & = \text{Concat}(\text{head}_1,...,\text{head}_h)W^{O}\\
    \text{where } \text{head}_i & = \text{Attention}(QW_i^Q,KW_i^K, VW_i^V)
\end{split}
$$

In this formula:
- $W_i^Q \in \mathbb{R}^{D \times d_k}$, $W_i^K \in \mathbb{R}^{D \times d_k}$, and $W_i^V \in \mathbb{R}^{D \times d_v}$ are parameter matrices for the $i$-th attention head.
- $W^O \in \mathbb{R}^{h \cdot d_v \times d_{out}}$ is the parameter matrix for the linear transformation after concatenating the heads (each head outputs $d_v$ features, so the concatenation has $h \cdot d_v$).
- $D$ is the dimensionality of the input, $h$ is the number of heads, and $d_{out}$ is the output dimensionality.

#### Integration in Neural Networks
In a neural network, the Multi-Head Attention layer is typically applied to a feature map $X \in \mathbb{R}^{B \times T \times d_{\text{model}}}$, where $B$ is the batch size, $T$ is the sequence length, and $d_{\text{model}}$ is the dimensionality of the model's hidden layer. Here, $X$ serves as $Q$, $K$, and $V$. The transformation to query, key, and value representations is done using separate learnable weight matrices $W^Q$, $W^K$, and $W^V$.

#### Implementation Notes
- **Heads**: Each head captures different aspects of the input data. More heads allow the model to simultaneously focus on different subspaces.
- **Dimensionality**: Ensure the dimensions of your weight matrices and inputs align correctly.
- **Efficiency**: Despite the increased complexity, Multi-Head Attention can be efficiently parallelized, making it suitable for large-scale problems.

By utilizing Multi-Head Attention, models can gain a more nuanced understanding of the data, capturing various types of relationships within the sequence. This is especially beneficial in complex tasks like language understanding, where different words or phrases may have different kinds of relationships with others in the sequence.

In [35]:
torch.arange(11).chunk(3, dim=-1)

(tensor([0, 1, 2, 3]), tensor([4, 5, 6, 7]), tensor([ 8,  9, 10]))

In [36]:
class MultiheadAttention(nn.Module):
    def __init__(self, input_dim, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "Embedding dimension must be divisible by the number of heads."

        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv_proj = nn.Linear(input_dim, 3 * embed_dim)
        self.o_proj = nn.Linear(embed_dim, embed_dim)

        self._reset_parameters()

    def _reset_parameters(self):
        # Original Transformer initialization, see PyTorch documentation
        nn.init.xavier_uniform_(self.qkv_proj.weight)
        self.qkv_proj.bias.data.fill_(0)
        nn.init.xavier_uniform_(self.o_proj.weight)
        self.o_proj.bias.data.fill_(0)

    def forward(self, x, mask=None, return_attention=False):
        batch_size, seq_length, _ = x.size()
        qkv = self.qkv_proj(x)
        # Split the fused projection into heads; each head holds its own q, k and v
        qkv = qkv.reshape(batch_size, seq_length, self.num_heads, 3 * self.head_dim)
        qkv = qkv.permute(0, 2, 1, 3)  # [Batch, Head, SeqLen, 3 * head_dim]
        q, k, v = qkv.chunk(3, dim=-1)
        values, attention = scaled_dot_product(q, k, v, mask=mask)
        values = values.permute(0, 2, 1, 3)  # [Batch, SeqLen, Head, head_dim]
        # Use self.embed_dim: the last dimension of x is input_dim, which may differ
        values = values.reshape(batch_size, seq_length, self.embed_dim)
        o = self.o_proj(values)

        if return_attention:
            return o, attention
        else:
            return o
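
The `reshape` followed by `chunk(3, dim=-1)` in `forward` above implies a specific layout of the fused projection output: the features are grouped by head first, and within each head into query, key and value. The plain-Python sketch below only tracks feature indices to make that layout visible; `split_qkv_indices` is a hypothetical helper, not part of the class.

```python
def split_qkv_indices(embed_dim, num_heads):
    """Show which of the 3 * embed_dim projected features end up as q, k, v per head."""
    assert embed_dim % num_heads == 0
    head_dim = embed_dim // num_heads
    features = list(range(3 * embed_dim))  # stand-in for one token's projection

    layout = []
    for h in range(num_heads):
        # reshape(..., num_heads, 3 * head_dim): consecutive blocks, one per head
        block = features[h * 3 * head_dim:(h + 1) * 3 * head_dim]
        # chunk(3, dim=-1): three equal slices of each block
        layout.append({
            "q": block[:head_dim],
            "k": block[head_dim:2 * head_dim],
            "v": block[2 * head_dim:],
        })
    return layout

layout = split_qkv_indices(embed_dim=4, num_heads=2)
for h, parts in enumerate(layout):
    print(h, parts)
```

With `embed_dim=4` and `num_heads=2`, head 0 uses features 0 to 5 and head 1 uses features 6 to 11, each split into two query, two key and two value features.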

# Transformer Encoder

The Transformer Encoder plays a crucial role in transforming input sequences into rich, attention-based representations, primarily used in Sequence-to-Sequence tasks like machine translation. While the original Transformer model consists of both encoder and decoder, the encoder alone has been foundational in numerous advances in NLP and beyond. This section focuses on the encoder's architecture, function, and key components.

#### Overview
The Transformer Encoder is composed of a stack of $N$ identical layers, each containing two main sub-layers:

1. **Multi-Head Attention Mechanism**: Enables the model to attend to different positions of the input sequence simultaneously.
2. **Position-wise Feed-Forward Networks**: Consists of fully connected layers applied to each position separately, allowing for individual processing of each sequence element.

#### Encoder Architecture
Each layer in the encoder includes the following steps:

1. **Input Processing**: The input $x$, which serves as query, key, and value at the same time, is first passed through the Multi-Head Attention mechanism.
2. **Residual Connection and Layer Normalization**: The output from the Multi-Head Attention is then added back to the input $x$ through a residual connection, followed by layer normalization:
   
   $$\text{LayerNorm}(x + \text{Multihead}(x, x, x))$$

    The residual connections help in maintaining the flow of the original input information through the network and are crucial for training deeper models by improving gradient flow. Layer Normalization is used to stabilize the learning process and ensure consistent feature magnitude across sequence elements.

3. **Position-wise Feed-Forward Networks (FFN)**: Each position is processed individually by a two-layered feed-forward network with ReLU activation in between:
   
   $$
   \begin{split}
       \text{FFN}(x) & = \max(0, xW_1 + b_1)W_2 + b_2\\
       x & = \text{LayerNorm}(x + \text{FFN}(x))
   \end{split}
   $$

    This component allows for further processing of the information added by the attention mechanism, preparing it for the next layer.
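
The "add and normalize" pattern used after both sub-layers can be sketched for a single position in plain Python. This is a simplified version under stated assumptions: it omits the learnable scale and shift that `nn.LayerNorm` also applies, and `layer_norm`, `add_and_norm` and the toy vectors are illustrative names and values.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """LayerNorm(x + Sublayer(x)) for one position."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [1.0, 2.0, 3.0, 4.0]               # features of one sequence element
sublayer_out = [0.5, -0.5, 0.5, -0.5]  # e.g. attention or FFN output (toy values)

y = add_and_norm(x, sublayer_out)
print(y)  # approximately [-1, -1, 1, 1]
```

The statistics are computed per sequence element over its features, which is why the result does not depend on the batch size.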

#### Considerations in Design
- **Layer Normalization**: Chosen over Batch Normalization due to its independence from batch size and better performance in language tasks.
- **Dimensionality of MLP in FFN**: Typically 2-8 times larger than the dimensionality of the input $x$ ($d_{\text{model}}$). A wider hidden layer adds capacity for more complex transformations, and a single wide layer parallelizes well.
- **Dropout**: Applied in MLP and on the outputs of MLP and Multi-Head Attention for regularization.

The Transformer Encoder's architecture, with its repetitive yet intricate structure, allows for effective processing and transformation of sequence data, making it a powerful tool in various sequence modeling tasks. The next steps involve implementing the encoder block, paying close attention to the integration of Multi-Head Attention, residual connections, layer normalization, and feed-forward networks within each layer.

In [37]:
class EncoderBlock(nn.Module):
    def __init__(self, input_dim, num_heads, dim_feedforward, dropout=0.0):
        """EncoderBlock.

        Args:
            input_dim: Dimensionality of the input
            num_heads: Number of heads to use in the attention block
            dim_feedforward: Dimensionality of the hidden layer in the MLP
            dropout: Dropout probability to use in the dropout layers
        """
        super().__init__()

        # Attention layer
        self.self_attn = MultiheadAttention(input_dim, input_dim, num_heads)

        # Two-layer MLP
        self.linear_net = nn.Sequential(
            nn.Linear(input_dim, dim_feedforward),
            nn.Dropout(dropout),
            nn.ReLU(inplace=True),
            nn.Linear(dim_feedforward, input_dim),
        )

        # Layers to apply in between the main layers
        self.norm1 = nn.LayerNorm(input_dim)
        self.norm2 = nn.LayerNorm(input_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Attention part
        attn_out = self.self_attn(x, mask=mask)
        x = x + self.dropout(attn_out)
        x = self.norm1(x)

        # MLP part
        linear_out = self.linear_net(x)
        x = x + self.dropout(linear_out)
        x = self.norm2(x)

        return x




class TransformerEncoder(nn.Module):
    def __init__(self, num_layers, **block_args):
        super().__init__()
        self.layers = nn.ModuleList([EncoderBlock(**block_args) for _ in range(num_layers)])

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask=mask)
        return x

    def get_attention_maps(self, x, mask=None):
        attention_maps = []
        for layer in self.layers:
            _, attn_map = layer.self_attn(x, mask=mask, return_attention=True)
            attention_maps.append(attn_map)
            x = layer(x, mask=mask)  # keep the mask so later layers see the same masked input
        return attention_maps







class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        """Positional Encoding.

        Args:
            d_model: Hidden dimensionality of the input.
            max_len: Maximum length of a sequence to expect.
        """
        super().__init__()

        # Create matrix of [SeqLen, HiddenDim] representing the positional encoding for max_len inputs
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)

        # register_buffer => Tensor which is not a parameter, but should be part of the modules state.
        # Used for tensors that need to be on the same device as the module.
        # persistent=False tells PyTorch to not add the buffer to the state dict (e.g. when we save the model)
        self.register_buffer("pe", pe, persistent=False)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)]
        return x
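
The table built in `PositionalEncoding.__init__` above can be reproduced for a tiny case in plain Python, which makes the sine/cosine pattern easy to inspect. The function below is a sketch that mirrors the same formula and, like the class, assumes an even `d_model`.

```python
import math

def positional_encoding(max_len, d_model):
    """Sinusoidal encoding table of shape [max_len][d_model] (d_model assumed even)."""
    assert d_model % 2 == 0
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same frequency term as `div_term` in the class above
            freq = math.exp(i * (-math.log(10000.0) / d_model))
            pe[pos][i] = math.sin(pos * freq)      # even feature indices
            pe[pos][i + 1] = math.cos(pos * freq)  # odd feature indices
    return pe

pe = positional_encoding(max_len=4, d_model=4)
for row in pe:
    print([round(v, 4) for v in row])
```

Position 0 always encodes to alternating zeros and ones, and later feature pairs oscillate more slowly than earlier ones.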

# Sequence to Sequence Tasks

Sequence to Sequence (Seq2Seq) tasks involve converting an input sequence into an output sequence, where the input and output may vary in length. This model structure is commonly used in applications like machine translation, text summarization, and more. Typically, a Seq2Seq model comprises an encoder to interpret the input sequence and a decoder to generate the output sequence autoregressively.

#### Simplified Task: Sequence Reversal
For educational purposes, we'll focus on a simplified Seq2Seq task: reversing a sequence of numbers. Despite its simplicity, this task is a good testbed for understanding Seq2Seq models, especially since it requires capturing long-term dependencies, something traditional RNNs might struggle with, but Transformers are well-equipped to handle.

#### Task Description:
- **Input**: A sequence of $N$ integers, each in the range $0$ to $M-1$ (so there are $M$ categories).
- **Output**: The reversed sequence of the input.

In NumPy, if our input sequence is `x`, the desired output is `x[::-1]`. Although straightforward, this task provides a clear demonstration of a model's ability to handle sequences and understand dependencies across positions.

#### Implementation Steps:
- **Create a Dataset Class**: The first step is to create a dataset class that can generate sequences of numbers and their reversed counterparts. This class will be used to train and evaluate the Seq2Seq model.

By starting with this simple task, we can focus on the mechanics and capabilities of the Transformer encoder in handling sequences, setting the stage for tackling more complex Seq2Seq tasks in the future.
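
To make the task concrete, here is a small plain-Python sketch of one input/label pair together with a one-hot encoding of the input (the training code later in this notebook encodes inputs with `F.one_hot`). The helper names `make_example` and `one_hot` are illustrative only.

```python
import random

def make_example(seq_len, num_categories, rng):
    """Return a random sequence and its reversed copy as the label."""
    x = [rng.randrange(num_categories) for _ in range(seq_len)]
    return x, x[::-1]

def one_hot(tokens, num_categories):
    """Encode each integer token as a vector with a single 1.0."""
    return [[1.0 if c == t else 0.0 for c in range(num_categories)] for t in tokens]

rng = random.Random(0)
x, y = make_example(seq_len=5, num_categories=10, rng=rng)
features = one_hot(x, num_categories=10)

print(x)
print(y)                                 # x read from right to left
print(len(features), len(features[0]))   # 5 10
```

Each output position depends on exactly one input position, at the mirrored index, so the model has to rely on positional information to solve the task.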

In [38]:
class ReverseDataset(data.Dataset):
    def __init__(self, num_categories, seq_len, size):
        super().__init__()
        self.num_categories = num_categories
        self.seq_len = seq_len
        self.size = size

        self.data = torch.randint(self.num_categories, size=(self.size, self.seq_len))

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        inp_data = self.data[idx]
        labels = torch.flip(inp_data, dims=(0,))
        return inp_data, labels

seq_len = 16
num_categories = 10
batch_size = 128
dataset = partial(ReverseDataset, num_categories, seq_len)
train_loader = data.DataLoader(dataset(10000), batch_size=batch_size, shuffle=True, drop_last=True, pin_memory=True)
val_loader = data.DataLoader(dataset(1000), batch_size=64, drop_last=True, shuffle=False)

seq_len, num_categories

(16, 10)

# Compose the network

In [39]:
class TransformerPredictor(nn.Module):
    def __init__(
        self,
        input_dim,
        model_dim,
        num_classes,
        num_heads,
        num_layers,
        dropout=0.0,
        input_dropout=0.0,
    ):
        """TransformerPredictor.

        Args:
            input_dim: Hidden dimensionality of the input
            model_dim: Hidden dimensionality to use inside the Transformer
            num_classes: Number of classes to predict per sequence element
            num_heads: Number of heads to use in the Multi-Head Attention blocks
            num_layers: Number of encoder blocks to use.
            dropout: Dropout to apply inside the model
            input_dropout: Dropout to apply on the input features
        """
        super().__init__()
        # Input dim -> Model dim
        self.input_net = nn.Sequential(
            nn.Dropout(input_dropout),
            nn.Linear(input_dim, model_dim)
        )
        # Positional encoding for sequences
        self.positional_encoding = PositionalEncoding(d_model=model_dim)
        # Transformer
        self.transformer = TransformerEncoder(
            num_layers=num_layers,
            input_dim=model_dim,
            dim_feedforward=2 * model_dim,
            num_heads=num_heads,
            dropout=dropout,
        )
        # Output classifier per sequence element
        self.output_net = nn.Sequential(
            nn.Linear(model_dim, model_dim),
            nn.LayerNorm(model_dim),
            nn.ReLU(inplace=True),
            nn.Dropout(dropout),
            nn.Linear(model_dim, num_classes),
        )

    def forward(self, x, mask=None, add_positional_encoding=True):
        """
        Args:
            x: Input features of shape [Batch, SeqLen, input_dim]
            mask: Mask to apply on the attention outputs (optional)
            add_positional_encoding: If True, we add the positional encoding to the input.
                                      Might not be desired for some tasks.
        """
        x = self.input_net(x)
        if add_positional_encoding:
            x = self.positional_encoding(x)
        x = self.transformer(x, mask=mask)
        x = self.output_net(x)
        return x


# Task: Writing Training Loop

In [40]:
torch.cuda.device_count()

1

In [41]:
torch.cuda.get_device_name(0)

'NVIDIA GeForce RTX 3090 Ti'

In [48]:
input_dim = 10 # must equal num_categories: the inputs are one-hot encoded
model_dim = 1024 # size of the hidden layer (transformers)
num_classes = train_loader.dataset.num_categories
num_heads = 8
num_layers = 1
with torch.cuda.device(torch.device('cuda')):
    
    # please create the model
    model = TransformerPredictor(input_dim, model_dim, num_classes, num_heads, num_layers).cuda()

    # please create the optimizer
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = torch.nn.CrossEntropyLoss()
    # please train the model, with the whole training pipeline
    
    def train(epoch_index, tb_writer):
        running_loss = 0
        last_loss = 0
    
        for i, data in enumerate(train_loader):
            inputs, labels = data
    
            # inputs = inputs.to(torch.float32)
            inputs = F.one_hot(inputs, num_classes=num_classes).float().cuda()

            labels = labels.cuda()
      
            outputs = model(inputs)
    
            loss = loss_fn(outputs.view(-1, num_classes), labels.view(-1))
            optimizer.zero_grad()
            loss.backward()
    
            # Adjust learning weights
            optimizer.step()
    
            running_loss += loss.item()
            if (i + 1) % 5 == 0:
                last_loss = running_loss / 5 # average loss over the last 5 batches
                # print('  batch {} loss: {}'.format(i + 1, last_loss))
                tb_x = epoch_index * len(train_loader) + i + 1
                # tb_writer.add_scalar('Loss/train', last_loss, tb_x)
                running_loss = 0.
    
        return last_loss
    
    
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    # writer = torch.utils.tensorboard.writer.SummaryWriter('runs/fashion_trainer_{}'.format(timestamp))
    epoch_number = 0
    
    EPOCHS = 100
    
    best_vloss = 1_000_000.
    
    for epoch in range(EPOCHS):
        print('EPOCH {}:'.format(epoch_number + 1))
    
        # Make sure gradient tracking is on, and do a pass over the data
        model.train(True)
        avg_loss = train(epoch_number, None)
        
        avg_vloss = 0
        print('LOSS train {}'.format(avg_loss))
    
        epoch_number += 1


EPOCH 1:
LOSS train 2.3047391891479494
EPOCH 2:
LOSS train 2.3059343338012694
EPOCH 3:
LOSS train 2.303365612030029
EPOCH 4:
LOSS train 2.278448724746704
EPOCH 5:
LOSS train 2.271419906616211
EPOCH 6:
LOSS train 2.272932434082031
EPOCH 7:
LOSS train 2.2925734519958496
EPOCH 8:
LOSS train 2.273321104049683
EPOCH 9:
LOSS train 2.2358773708343507
EPOCH 10:
LOSS train 2.213832139968872
EPOCH 11:
LOSS train 2.1974945068359375
EPOCH 12:
LOSS train 2.138561820983887
EPOCH 13:
LOSS train 2.131034755706787
EPOCH 14:
LOSS train 2.092240905761719
EPOCH 15:
LOSS train 2.028573417663574
EPOCH 16:
LOSS train 2.009480619430542
EPOCH 17:
LOSS train 2.0119842529296874
EPOCH 18:
LOSS train 1.966892457008362
EPOCH 19:
LOSS train 1.9739716291427611
EPOCH 20:
LOSS train 1.9691336631774903
EPOCH 21:
LOSS train 1.9692431688308716
EPOCH 22:
LOSS train 1.9720533609390258
EPOCH 23:
LOSS train 1.9636199712753295
EPOCH 24:
LOSS train 2.0134324550628664
EPOCH 25:
LOSS train 1.9760711431503295
EPOCH 26:
LOSS train 

# Evaluation
Here is the evaluation code. Can you do better than a loss of 2.0?

In [218]:
# Compute the average validation loss
criterion = nn.CrossEntropyLoss()
model.eval()
with torch.no_grad():
    val_loss = 0
    for inputs, labels in val_loader:
        inp_data = F.one_hot(inputs, num_classes=num_classes).float().cuda()
        outputs = model(inp_data)
        # Flatten [Batch, SeqLen, Classes] to [Batch * SeqLen, Classes]
        loss = criterion(outputs.view(-1, outputs.size(-1)), labels.view(-1).cuda())
        val_loss += loss.item()
    print(f"Validation Loss: {val_loss / len(val_loader)}")

Validation Loss: 2.3046957651774087
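
For reference when reading these numbers: a model that assigns equal probability to all 10 classes has a cross-entropy of $\ln 10 \approx 2.3026$, so a loss around 2.30 corresponds to chance level. The plain-Python sketch below computes that baseline; `cross_entropy` is a simplified single-example helper written for this illustration.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target])

num_classes = 10
uniform = [1.0 / num_classes] * num_classes

# The target index does not matter for a uniform prediction.
chance_loss = cross_entropy(uniform, target=3)
print(round(chance_loss, 4))  # 2.3026
```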
