{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5d3Ei9rwxz3K"
      },
      "source": [
        "# Assignment 10 - NLP using Deep Learning"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OfN36KJsxz3M"
      },
      "source": [
        "## Goals"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QinuznERxz3N"
      },
      "source": [
        "In this assignment you will get to work with recurrent network architectures with application to language processing tasks and observe behaviour of the learning using tensorboard visualization.\n",
        "\n",
        "You'll learn to use\n",
        "\n",
        " * word embeddings\n",
        " * LSTMs\n",
        " * performance monitoring via tensorboard + lightning\n",
        " * state-of-the-art cross-language transformers\n",
        "\n",
        "While the notebook contains a lot of code, the actual **TODO**s for you are lightweight and easy to find. Use Google colab or the lab machines and provided environment to get started and finish quickly.\n",
        "\n",
        "The goal of this exercise is to provide you with entry points to approach common NLP tasks with simple and elaborate methods."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RmfTfo9oxz3Q"
      },
      "source": [
        "### Deep learning environment in the lab"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9Gd4ekuXxz3S"
      },
      "source": [
        "With the same kind of preparation as in [Assignment 6](../A6/A6.ipynb) we are going to use **[pytorch](http://pytorch.org)** for the deep learning aspects of the assignment.\n",
        "\n",
        "There is a pytorch setup in the big data under the globally available anaconda installation.\n",
        "However, it is recommended that you use the custom **gt** conda environment that contains all python package dependencies that are relevant for this assignment (and also tensorflow, etc.)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CM27gT2Mxz3S"
      },
      "source": [
        "You could load it directly\n",
        "```\n",
        "source activate /usr/shared/CMPT/big-data/condaenv/nn11\n",
        "```\n",
        "Once activated, you couls also add it as a user kernel to your jupyter installation\n",
        "```\n",
        "python -m ipykernel install --user --name=\"py-nn11\"\n",
        "```\n",
        "and then choose it as kernel when running this notebook.\n",
        "To reproduce this environment on your own system, you could use `conda env export > environment.yml` and then use `mamba env update --prefix wherever_you_want_to_create_yours -f environment.yml` to make your own instance of this environment."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DPvzUIoG9MTc"
      },
      "source": [
        "### Google colab VM setup"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4DvJcayEhMoV"
      },
      "outputs": [],
      "source": [
        "try:\n",
        "  import portalocker\n",
        "except ModuleNotFoundError:\n",
        "  !pip install portalocker\n",
        "  import portalocker\n",
        "\n",
        "update_torchtext = False\n",
        "try:\n",
        "  import torchtext\n",
        "  update_torchtext = torchtext.__version__ < \"0.15\"\n",
        "except ModuleNotFoundError:\n",
        "  update_torchtext = True\n",
        "\n",
        "if update_torchtext:\n",
        "  !pip uninstall --yes fastai\n",
        "  import re\n",
        "  cudaver = !nvcc --version | grep release\n",
        "  cudaver = re.search(r\".*release (.*),.*\", cudaver[0]).group(1)\n",
        "  print(f\"Found CUDA version {cudaver}\")\n",
        "  cudaver_nodot = cudaver.replace(\".\",\"\")\n",
        "  !pip install -U torch torchvision torchaudio \"torchtext>=0.15\" --index-url https://download.pytorch.org/whl/cu{cudaver_nodot}\n",
        "  !pip install tensorboardX lightning"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "C8rZWAaXxz3T"
      },
      "outputs": [],
      "source": [
        "import torch\n",
        "import torch.nn as nn\n",
        "import numpy as np\n",
        "import torchtext.functional as F\n",
        "\n",
        "from IPython.display import Markdown\n",
        "import pandas as pd\n",
        "\n",
        "DEVICE = torch.device(\"cuda\") if torch.cuda.is_available() else torch.device(\"cpu\")\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "J7UlGxAxx3DU"
      },
      "outputs": [],
      "source": [
        "# location of \"GoogleNews-vectors-negative300.bin.gz\", only required if word2vec embedding is chosen\n",
        "from pathlib import Path\n",
        "bdenv_loc = Path('/usr/shared/CMPT/big-data')\n",
        "bdata = bdenv_loc / 'data'"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "PrFTYOvqjLyE"
      },
      "outputs": [],
      "source": [
        "!nvidia-smi"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "JLXGrSCYx_gB"
      },
      "outputs": [],
      "source": [
        "torch.__version__, DEVICE"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "81w01bGqxz3V"
      },
      "source": [
        "# Task 1: Explore Word Embeddings"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fNGhZ6YTxz3W"
      },
      "source": [
        "Word embeddings are mappings between words and multi-dimensional vectors, where the difference between two word vectors has some relationship with the meaning of the corresponding words, i.e. words that are similar in meaning are mapped closely together (ideally). This part of the assignment should enable you to\n",
        "\n",
        "* Load a pretrained word embedding\n",
        "* Perform basic operations, such as distance queries and evaluate simple analogies"
      ]
    },
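    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Distance queries and analogies both reduce to comparing vectors, commonly by cosine similarity. The following is a minimal sketch of that idea, not the torchtext API: the 3-dimensional `toy_vectors` and the helpers `cosine_similarity` and `nearest_words` are invented for illustration, and real GloVe vectors will behave less cleanly.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "\n",
        "# Toy 3-dimensional vectors, invented purely for illustration (not taken from GloVe).\n",
        "toy_vectors = {\n",
        "    'king':  np.array([0.9, 0.8, 0.1]),\n",
        "    'queen': np.array([0.9, 0.1, 0.8]),\n",
        "    'man':   np.array([0.1, 0.9, 0.1]),\n",
        "    'woman': np.array([0.1, 0.1, 0.9]),\n",
        "    'apple': np.array([0.5, 0.5, 0.5]),\n",
        "}\n",
        "\n",
        "def cosine_similarity(a, b):\n",
        "    # Cosine of the angle between two vectors; 1.0 means same direction.\n",
        "    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))\n",
        "\n",
        "def nearest_words(query, vectors, k=3, exclude=()):\n",
        "    # Rank all words (except the excluded ones) by similarity to the query vector.\n",
        "    scored = [(word, cosine_similarity(query, vec))\n",
        "              for word, vec in vectors.items() if word not in exclude]\n",
        "    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]\n",
        "\n",
        "# Analogy by vector arithmetic: king - man + woman\n",
        "analogy = toy_vectors['king'] - toy_vectors['man'] + toy_vectors['woman']\n",
        "analogy_result = nearest_words(analogy, toy_vectors, k=1,\n",
        "                               exclude=('king', 'man', 'woman'))\n",
        "print(analogy_result)\n",
        "```\n",
        "\n",
        "With these toy values the closest remaining word is `queen`. The input words are excluded from the ranking because they are usually among the nearest neighbours of the combined vector."
      ]
    },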
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "cKpqMsEsxz3Y"
      },
      "outputs": [],
      "source": [
        "# Load pre-trained GloVe model, trained on news articles"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wSkjLELD_tL3"
      },
      "outputs": [],
      "source": [
        "from torchtext.vocab import GloVe\n",
        "glove_vectors = GloVe(name=\"6B\")\n",
        "EMBEDDING_DIM = glove_vectors.vectors.shape[1]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "YEJe6F-M_6sY"
      },
      "outputs": [],
      "source": [
        "# from google.colab import drive\n",
        "# drive.mount('/content/drive')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8nUEgKYBxz3a"
      },
      "outputs": [],
      "source": [
        "# read up about the vocab\n",
        "# obtain vector representations for two or more words of your choice\n",
        "\n",
        "# TODO ...\n",
        "\n",
        "# to confirm that this worked, print out the number of elements of the vector\n",
        "# and make a line plot that shows each vector as a line graph"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "c8Qx8LqExz3c"
      },
      "outputs": [],
      "source": [
        "# determine the 10 words that are closest in the embedding to the first word vector your produced above\n",
        "\n",
        "# TODO ...\n",
        "\n",
        "# are the nearest neighbours similar in meaning?\n",
        "# try different seed words, until you find one whose neighbourhood looks OK"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "oWGjY1snxz3c"
      },
      "outputs": [],
      "source": [
        "# using a combination of positive and negative words, find out which word is most\n",
        "# similar to woman + king - man\n",
        "\n",
        "# TODO ..."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0ZzxYWLuxz3d"
      },
      "outputs": [],
      "source": [
        "# You may find that the results of most word analogy combinations don't work well sometimes, but not in all cases.\n",
        "# However, explore a bit and find two more cases where the output of your word vector algebra makes sense.\n",
        "\n",
        "# TODO ..."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xGuycMsFxz3e"
      },
      "source": [
        "# Task 2: Sequence modeling with RNNs or transformers"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wNHXspW1xz3e"
      },
      "source": [
        "In this task you will get to use a learning and a rule-based model of text sentiment analysis. To keep things simple, you will receive almost all the code and are just left with the task to tune the given algorithms, see the part about instrumentation below.\n",
        "Look for *TODO* to find places where your input is required."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KMzlMDtkxz3f"
      },
      "source": [
        "## SST-2 Binary text classification with XLM-RoBERTa model and LSTMs\n",
        "\n",
        "The XLM-RoBERTa related portions of this notebook are from [a tutorial](https://pytorch.org/text/main/tutorials/sst2_classification_non_distributed.html) authored by `Parmeet Bhatia <parmeetbhatia@fb.com>`\n",
        "\n",
        "Adaptation of the modern torchtext pipeline to also allow switching to recurrent model with different pre-trained word embeddings by `Steven Bergner <sbergner@sfu.ca>`"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uDNjT94Cxz3f"
      },
      "source": [
        "The steps below demonstrate how to train a text classifier on SST-2 binary dataset using a pre-trained XLM-RoBERTa (XLM-R) model. Customizations to switch parts of the pipeline to different models are also enabled.\n",
        "\n",
        "We will show how to use torchtext library to:\n",
        "\n",
        "1. build text pre-processing pipeline for XLM-R model\n",
        "2. read SST-2 dataset and transform it using text and label transformation\n",
        "3. instantiate a classification model using pre-trained XLM-R encoder\n",
        "4. change pipeline components to swap out any part of the data and model pipeline\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GKYp60QBxz3g"
      },
      "source": [
        "## Data Transformation\n",
        "\n",
        "Models like XLM-R cannot work directly with raw text. The first step in training\n",
        "these models is to transform input text into tensor (numerical) form such that it\n",
        "can then be processed by models to make predictions. A standard way to process text is:\n",
        "\n",
        "1. Tokenize text\n",
        "2. Convert tokens into (integer) IDs\n",
        "3. Add any special tokens IDs\n",
        "\n",
        "XLM-R uses sentencepiece model for text tokenization. Below, we use pre-trained sentence piece\n",
        "model along with corresponding vocabulary to build text pre-processing pipeline using torchtext's transforms.\n",
        "The transforms are pipelined using :py:func:`torchtext.transforms.Sequential` which is similar to :py:func:`torch.nn.Sequential`\n",
        "but is torchscriptable. Note that the transforms support both batched and non-batched text inputs i.e, one\n",
        "can either pass a single sentence or list of sentences.\n",
        "\n",
        "\n"
      ]
    },
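    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The three processing steps can be sketched in plain Python. This is an illustration under stated assumptions, not the torchtext pipeline: `toy_vocab`, `toy_text_transform` and the whitespace tokenizer are hypothetical stand-ins for the SentencePiece model and XLM-R vocabulary, and `UNK_IDX = 3` is an assumed ID for unknown tokens in this toy. The BOS, padding and EOS indices match the values used in this notebook.\n",
        "\n",
        "```python\n",
        "# Special token IDs; BOS/PAD/EOS follow this notebook, UNK_IDX is an assumption for the toy.\n",
        "BOS_IDX, PAD_IDX, EOS_IDX, UNK_IDX = 0, 1, 2, 3\n",
        "\n",
        "# Hypothetical four-word vocabulary, invented for illustration.\n",
        "toy_vocab = {'the': 4, 'movie': 5, 'was': 6, 'great': 7}\n",
        "\n",
        "def toy_text_transform(text, max_seq_len=6):\n",
        "    tokens = text.lower().split()                          # 1. tokenize (whitespace only)\n",
        "    ids = [toy_vocab.get(tok, UNK_IDX) for tok in tokens]  # 2. tokens -> integer IDs\n",
        "    ids = ids[:max_seq_len - 2]                            # truncate, leaving room for BOS and EOS\n",
        "    return [BOS_IDX] + ids + [EOS_IDX]                     # 3. add special token IDs\n",
        "\n",
        "# Non-batched use: a single sentence\n",
        "single = toy_text_transform('The movie was great')\n",
        "# Batched use: a list of sentences\n",
        "batch = [toy_text_transform(s) for s in ['The movie was great', 'Great movie']]\n",
        "print(single)\n",
        "print(batch)\n",
        "```\n",
        "\n",
        "Truncating to `max_seq_len - 2` before adding the two special tokens keeps the final sequence within `max_seq_len`, which mirrors the `T.Truncate(max_seq_len - 2)` step used in this notebook."
      ]
    },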
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "H4xQySubxz3g"
      },
      "source": [
        "Caution: If you want to learn more about torchtext, be careful to **not** read the docs at:\n",
        "https://torchtext.readthedocs.io/en/latest/\n",
        "They claim to be \"latest\", but are of version 0.4.0\n",
        "\n",
        "Instead, find **current docs** here: https://pytorch.org/text/stable/index.html\n",
        "or simply keep reading, as this tutorial shows how to use the recent version."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "yiVXWjvbxz3g"
      },
      "outputs": [],
      "source": [
        "import torchtext.transforms as T\n",
        "from torch.hub import load_state_dict_from_url\n",
        "from torch.utils.data import DataLoader\n",
        "\n",
        "padding_idx = 1\n",
        "bos_idx = 0\n",
        "eos_idx = 2\n",
        "max_seq_len = 256\n",
        "xlmr_vocab_path = r\"https://download.pytorch.org/models/text/xlmr.vocab.pt\"\n",
        "xlmr_spm_model_path = r\"https://download.pytorch.org/models/text/xlmr.sentencepiece.bpe.model\"\n",
        "\n",
        "text_transform = T.Sequential(\n",
        "    T.SentencePieceTokenizer(xlmr_spm_model_path),\n",
        "    T.VocabTransform(load_state_dict_from_url(xlmr_vocab_path)),\n",
        "    T.Truncate(max_seq_len - 2),\n",
        "    T.AddToken(token=bos_idx, begin=True),\n",
        "    T.AddToken(token=eos_idx, begin=False),\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bHwGKnxOxz3l"
      },
      "source": [
        "Alternately, we could also use transform shipped with pre-trained model that does all of the above out-of-the-box\n",
        "```\n",
        "  text_transform = XLMR_BASE_ENCODER.transform()\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "dvPY3Lq2xz3i"
      },
      "outputs": [],
      "source": [
        "# obtain the vocabulary of the data pipeline, so that we can convert word <--> word_index\n",
        "# allowing us to plug in different word embeddings\n",
        "vocab = text_transform[1].vocab.vocab\n",
        "word_to_idx = vocab.get_stoi()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cd_eLUY7xz3i"
      },
      "source": [
        "## Model and training parameters\n",
        "\n",
        "In addition to the transformer model, we also create an LSTM based model for text classification.\n",
        "\n",
        "Change the parameters below to switch between models and make adjustments to the training."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "dTLMJf7Kxz3i"
      },
      "outputs": [],
      "source": [
        "import time\n",
        "\n",
        "# TODO make adjustments here to achieve acceptable training performance with LSTMs\n",
        "# Also, try out the Roberta model for comparison\n",
        "\n",
        "EPOCHS = 8\n",
        "USE_GPU = torch.cuda.is_available()\n",
        "DROPOUT = .1\n",
        "timestamp = str(int(time.time()))\n",
        "best_dev_acc = 0.0\n",
        "\n",
        "do_use_roberta_model = False\n",
        "if do_use_roberta_model:\n",
        "    LEARNING_RATE = 1e-5\n",
        "    EPOCHS = 1\n",
        "    BATCH_SIZE = 128\n",
        "    EMBEDDING_TYPE = 'built-in'\n",
        "else:\n",
        "    #EMBEDDING_TYPE = 'word2vec'\n",
        "    EMBEDDING_TYPE = 'glove'\n",
        "    #EMBEDDING_TYPE = 'glovefull'\n",
        "    EMBEDDING_DIM = 300\n",
        "    HIDDEN_DIM = 50\n",
        "    BATCH_SIZE = 128\n",
        "    USE_BILSTM = True\n",
        "    LEARNING_RATE = 1e-5\n",
        "    do_freeze_embedding = False\n",
        "    do_use_roberta_classifier = False\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6Uv243-6xz3i"
      },
      "outputs": [],
      "source": [
        "def maybe_gpu(v):\n",
        "    return v.cuda() if USE_GPU else v"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "O1EPpIjoxz3i"
      },
      "outputs": [],
      "source": [
        "from torch.autograd import Variable\n",
        "import torch.nn.functional as nnF\n",
        "\n",
        "class LSTMSentiment(nn.Module):\n",
        "\n",
        "    def __init__(self, embedding_dim, hidden_dim, vocab_size, label_size,\n",
        "                 use_gpu, batch_size, dropout=0.5, bidirectional=False, classifier_head=None):\n",
        "        \"\"\"Prepare individual layers\"\"\"\n",
        "        super(LSTMSentiment, self).__init__()\n",
        "        self.hidden_dim = hidden_dim\n",
        "        self.use_gpu = use_gpu\n",
        "        self.batch_size = batch_size\n",
        "        self.dropout = dropout\n",
        "        self.num_directions = 2 if bidirectional else 1\n",
        "        self.embeddings = nn.Embedding(vocab_size, embedding_dim)\n",
        "        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, bidirectional=bidirectional)\n",
        "        self.hidden2label = nn.Linear(hidden_dim*self.num_directions, label_size)\n",
        "        self.hidden = self.init_hidden()\n",
        "        self.classifier_head = classifier_head\n",
        "\n",
        "    def init_hidden(self, batch_size=None):\n",
        "        \"\"\"Choose appropriate size and type of hidden layer\"\"\"\n",
        "        if not batch_size:\n",
        "            batch_size = self.batch_size\n",
        "        #what = torch.randn\n",
        "        what = torch.zeros\n",
        "        # first is the hidden h\n",
        "        # second is the cell c\n",
        "        return (maybe_gpu(Variable(what(self.num_directions, batch_size, self.hidden_dim))),\n",
        "                maybe_gpu(Variable(what(self.num_directions, batch_size, self.hidden_dim))))\n",
        "\n",
        "    def classify(self, features):\n",
        "        y = self.hidden2label(features)\n",
        "        log_probs = nnF.log_softmax(y, dim=1)\n",
        "        return log_probs\n",
        "\n",
        "    def forward(self, sentence):\n",
        "        \"\"\"Use the layers of this model to propagate input and return class log probabilities\"\"\"\n",
        "        if self.use_gpu:\n",
        "            sentence = sentence.cuda()\n",
        "        x = self.embeddings(sentence).permute(1,0,2)\n",
        "        batch_size = x.shape[1]\n",
        "        self.hidden = self.init_hidden(batch_size=batch_size)\n",
        "        lstm_out, self.hidden = self.lstm(x, self.hidden)\n",
        "        features = lstm_out[-1]\n",
        "        if self.classifier_head:\n",
        "            #unsqueeze: introduce dummy second dimension, so that classifier_head can drop it\n",
        "            return self.classifier_head(torch.unsqueeze(features, 1))\n",
        "        else:\n",
        "            return self.classify(features)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NMq5cABlxz3k"
      },
      "source": [
        "Choose and load a word embedding that provides the feature input to the RNN/LSTM."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Ua1m0a8ixz3k"
      },
      "outputs": [],
      "source": [
        "if 'glove' == EMBEDDING_TYPE:\n",
        "    from torchtext.vocab import GloVe\n",
        "    glove_vectors = GloVe(name=\"6B\")\n",
        "    EMBEDDING_DIM = glove_vectors.vectors.shape[1]\n",
        "    use_embedding_directly = False\n",
        "    if use_embedding_directly:\n",
        "        pretrained_embeddings = maybe_gpu(glove_vectors.vectors)\n",
        "    else:\n",
        "        # prepare random embedding, then fill in glove vectors\n",
        "        pretrained_embeddings = np.random.uniform(-0.25, 0.25, (len(vocab), EMBEDDING_DIM)).astype('f')\n",
        "        pretrained_embeddings[0] = 0\n",
        "        for word, wi in glove_vectors.stoi.items():\n",
        "            try:\n",
        "                pretrained_embeddings[word_to_idx[word]-1] = glove_vectors.__getitem__(word)\n",
        "            except KeyError:\n",
        "                pass\n",
        "        pretrained_embeddings = maybe_gpu(torch.from_numpy(pretrained_embeddings))\n",
        "elif 'glovefull' == EMBEDDING_TYPE:\n",
        "    from torchtext.vocab import GloVe\n",
        "    glove_vectors = GloVe(cache=\"/usr/shared/CMPT/big-data/dot_torch_shared/.vector_cache/\")\n",
        "    # set freeze to false if you want them to be trainable\n",
        "    pretrained_embeddings = maybe_gpu(glove_vectors.vectors)\n",
        "    #my_embeddings = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=True)\n",
        "elif 'word2vec' == EMBEDDING_TYPE:\n",
        "    pretrained_embeddings = np.random.uniform(-0.25, 0.25, (len(vocab), EMBEDDING_DIM)).astype('f')\n",
        "    pretrained_embeddings[0] = 0\n",
        "    try:\n",
        "        word2vec\n",
        "    except:\n",
        "        print('Load word embeddings...')\n",
        "        import gensim\n",
        "        word2vec = gensim.models.KeyedVectors.load_word2vec_format(\n",
        "                         bdata / 'GoogleNews-vectors-negative300.bin.gz', binary=True)\n",
        "        EMBEDDING_DIM = 300\n",
        "    for word, wi in word2vec.key_to_index.items():\n",
        "        try:\n",
        "            pretrained_embeddings[word_to_idx[word]-1] = word2vec.vectors[wi]\n",
        "        except KeyError:\n",
        "            pass\n",
        "    # text_field.vocab.load_vectors(wv_type='', wv_dim=300)\n",
        "    pretrained_embeddings = maybe_gpu(torch.from_numpy(pretrained_embeddings))\n",
        "else:\n",
        "    if not do_use_roberta_model:\n",
        "        print('Unknown embedding type {}'.format(EMBEDDING_TYPE))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ef-M5nGoxz3k"
      },
      "source": [
        "## Model preparation LSTM\n",
        "Initialize the RNN model, if the above configuration is set to use it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "o4K_XRWTxz3k"
      },
      "outputs": [],
      "source": [
        "num_classes = 2\n",
        "\n",
        "if not do_use_roberta_model:\n",
        "    lstm_model = LSTMSentiment(embedding_dim=EMBEDDING_DIM, hidden_dim=HIDDEN_DIM,\n",
        "                                vocab_size=len(vocab), label_size=num_classes,\\\n",
        "                                use_gpu=USE_GPU, batch_size=BATCH_SIZE, dropout=DROPOUT, bidirectional=USE_BILSTM)\n",
        "    lstm_model.embeddings = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=do_freeze_embedding)\n",
        "    model = lstm_model"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "U7b98n5Sxz3l"
      },
      "source": [
        "## Dataset\n",
        "torchtext provides several standard NLP datasets. For complete list, refer to documentation\n",
        "at https://pytorch.org/text/stable/datasets.html. These datasets are build using composable torchdata\n",
        "datapipes and hence support standard flow-control and mapping/transformation using user defined functions\n",
        "and transforms. Below, we demonstrate how to use text and label processing transforms to pre-process the\n",
        "SST-2 dataset.\n",
        "\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ktywr1Auxz3l"
      },
      "outputs": [],
      "source": [
        "from torchtext.datasets import SST2\n",
        "from torch.utils.data import DataLoader\n",
        "\n",
        "batch_size = BATCH_SIZE\n",
        "\n",
        "train_datapipe = SST2(split=\"train\")\n",
        "dev_datapipe = SST2(split=\"dev\")\n",
        "\n",
        "def ttf_first(x):\n",
        "  #return (F.to_tensor(text_transform(x[0]), padding_value=padding_idx), x[1])\n",
        "  return (text_transform(x[0]), x[1])\n",
        "\n",
        "# Transform the raw dataset using non-batched API (i.e apply transformation line by line)\n",
        "#train_datapipe = train_datapipe.map(ttf_first)\n",
        "train_datapipe = train_datapipe.map(lambda x: (text_transform(x[0]), x[1]))\n",
        "train_datapipe = train_datapipe.batch(batch_size)\n",
        "train_datapipe = train_datapipe.set_length(len(list(train_datapipe)))\n",
        "train_datapipe = train_datapipe.rows2columnar([\"token_ids\", \"target\"])\n",
        "train_dataloader = DataLoader(train_datapipe, batch_size=None)\n",
        "\n",
        "#dev_datapipe = dev_datapipe.map(ttf_first)\n",
        "dev_datapipe = dev_datapipe.map(lambda x: (text_transform(x[0]), x[1]))\n",
        "dev_datapipe = dev_datapipe.batch(batch_size)\n",
        "dev_datapipe = dev_datapipe.set_length(len(list(dev_datapipe)))\n",
        "dev_datapipe = dev_datapipe.rows2columnar([\"token_ids\", \"target\"])\n",
        "dev_dataloader = DataLoader(dev_datapipe, batch_size=None)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EXFin5khxz3m"
      },
      "source": [
        "## Model preparation - RoBERTa\n",
        "\n",
        "torchtext provides SOTA pre-trained models that can be used to fine-tune on downstream NLP tasks.\n",
        "Below we use pre-trained XLM-R encoder with standard base architecture and attach a classifier head to fine-tune it\n",
        "on SST-2 binary classification task. We shall use standard Classifier head from the library, but users can define\n",
        "their own appropriate task head and attach it to the pre-trained encoder. For additional details on available pre-trained models,\n",
        "please refer to documentation at https://pytorch.org/text/main/models.html\n",
        "\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "BD3KzsY4xz3m"
      },
      "outputs": [],
      "source": [
        "num_classes = 2\n",
        "\n",
        "from torchtext.models import RobertaClassificationHead, XLMR_BASE_ENCODER\n",
        "\n",
        "def make_roberta_classification_model(num_classes):\n",
        "    roberta_input_dim = 768\n",
        "    classifier_head = RobertaClassificationHead(num_classes=num_classes, input_dim=roberta_input_dim)\n",
        "    return XLMR_BASE_ENCODER.get_model(head=classifier_head)\n",
        "\n",
        "if do_use_roberta_model:\n",
        "    model = make_roberta_classification_model(num_classes)\n",
        "else:\n",
        "    model = lstm_model\n",
        "    if do_use_roberta_classifier:\n",
        "        feature_dim = model.hidden_dim + (USE_BILSTM * model.hidden_dim)\n",
        "        classifier_head = RobertaClassificationHead(num_classes=num_classes, input_dim=feature_dim)\n",
        "        model.classifier_head = classifier_head\n",
        "\n",
        "model.to(DEVICE);"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "isOHG_2fxz3m"
      },
      "source": [
        "## Training methods\n",
        "\n",
        "Let's now define the standard optimizer and training criteria as well as some helper functions\n",
        "for training and evaluation. The methods below work for either choice of model.\n",
        "\n",
        "\n"
      ]
    },
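    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The training and evaluation code in this notebook calls `F.to_tensor(..., padding_value=padding_idx)` on each batch, because sentences in a batch have different lengths and must be padded to a common length before they can form one tensor. A rough numpy sketch of that padding step follows; `pad_batch` is a hypothetical helper written for illustration and is not part of torchtext.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "\n",
        "PAD_IDX = 1  # same value as padding_idx in this notebook\n",
        "\n",
        "def pad_batch(token_id_lists, padding_value=PAD_IDX):\n",
        "    # Pad every sequence on the right to the length of the longest one.\n",
        "    max_len = max(len(ids) for ids in token_id_lists)\n",
        "    padded = np.full((len(token_id_lists), max_len), padding_value, dtype=np.int64)\n",
        "    for row, ids in enumerate(token_id_lists):\n",
        "        padded[row, :len(ids)] = ids\n",
        "    return padded\n",
        "\n",
        "padded = pad_batch([[0, 4, 5, 6, 7, 2], [0, 7, 5, 2]])\n",
        "print(padded)\n",
        "```\n",
        "\n",
        "The shorter sequence is filled with `PAD_IDX` on the right, so the result has shape `(batch_size, longest_sequence_length)`."
      ]
    },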
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lSguGR5jMUQJ"
      },
      "outputs": [],
      "source": [
        "import lightning.pytorch as pl\n",
        "\n",
        "from torch.optim import AdamW\n",
        "\n",
        "# TODO adjust this cell for Task B1\n",
        "\n",
        "class LitModel(pl.LightningModule):\n",
        "    def __init__(self, model):\n",
        "        super().__init__()\n",
        "        self.model = model\n",
        "        self.model.train()\n",
        "        self.criteria = nn.CrossEntropyLoss()\n",
        "\n",
        "    def training_step(self, batch, batch_idx):\n",
        "        input = F.to_tensor(batch[\"token_ids\"], padding_value=padding_idx).to(self.device)\n",
        "        output = self.model(input)\n",
        "        loss = self.criteria(output, F.to_tensor(batch[\"target\"]).to(self.device))\n",
        "\n",
        "        return loss\n",
        "\n",
        "    def configure_optimizers(self):\n",
        "        return AdamW(self.model.parameters(), lr=LEARNING_RATE)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Fol9Xscbxz3n"
      },
      "outputs": [],
      "source": [
        "import torchtext.functional as F\n",
        "from torch.optim import AdamW\n",
        "\n",
        "learning_rate = LEARNING_RATE\n",
        "optim = AdamW(model.parameters(), lr=learning_rate)\n",
        "criteria = nn.CrossEntropyLoss()\n",
        "\n",
        "\n",
        "def train_step(input, target):\n",
        "    model.train()\n",
        "    output = model(input)\n",
        "    loss = criteria(output, target)\n",
        "    optim.zero_grad()\n",
        "    loss.backward()\n",
        "    optim.step()\n",
        "\n",
        "\n",
        "def eval_step(input, target):\n",
        "    output = model(input)\n",
        "    loss = criteria(output, target).item()\n",
        "    return float(loss), (output.argmax(1) == target).type(torch.float).sum().item()\n",
        "\n",
        "\n",
        "def evaluate():\n",
        "    model.eval()\n",
        "    total_loss = 0\n",
        "    correct_predictions = 0\n",
        "    total_predictions = 0\n",
        "    counter = 0\n",
        "    with torch.no_grad():\n",
        "        for batch in dev_dataloader:\n",
        "            input = F.to_tensor(batch[\"token_ids\"], padding_value=padding_idx).to(DEVICE)\n",
        "            target = torch.tensor(batch[\"target\"]).to(DEVICE)\n",
        "            loss, predictions = eval_step(input, target)\n",
        "            total_loss += loss\n",
        "            correct_predictions += predictions\n",
        "            total_predictions += len(target)\n",
        "            counter += 1\n",
        "\n",
        "    return total_loss / counter, correct_predictions / total_predictions\n"
      ]
    },
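    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The accuracy returned by `evaluate` is the number of samples whose highest-scoring class equals the target, divided by the number of samples. A small numpy sketch of that computation, using made-up scores (`toy_scores` and `toy_targets` are illustrative values, not model output):\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "\n",
        "# One row per sample, one column per class (negative, positive); values are invented.\n",
        "toy_scores = np.array([[2.0, -1.0],\n",
        "                       [0.1, 0.3],\n",
        "                       [-0.5, 0.5],\n",
        "                       [1.2, 0.7]])\n",
        "toy_targets = np.array([0, 1, 0, 0])\n",
        "\n",
        "predicted = toy_scores.argmax(axis=1)               # class with the highest score per sample\n",
        "num_correct = int((predicted == toy_targets).sum()) # count matching predictions\n",
        "accuracy = num_correct / len(toy_targets)\n",
        "print(predicted, num_correct, accuracy)\n",
        "```\n",
        "\n",
        "Here three of the four predictions match, giving an accuracy of 0.75. Summing correct counts over all batches and dividing once at the end, as `evaluate` does, avoids bias from a smaller final batch."
      ]
    },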
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "gQ5JljdyKh4I"
      },
      "outputs": [],
      "source": [
        "from lightning.pytorch import Trainer\n",
        "from lightning.pytorch.loggers import TensorBoardLogger\n",
        "\n",
        "tb_logdir = \"logs-a10\"\n",
        "\n",
        "logger = TensorBoardLogger(tb_logdir, name=\"senti_model\")\n",
        "trainer = Trainer(logger=logger, max_epochs=1)\n",
        "torch.cuda.empty_cache()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "cNzEI1NtJT8U"
      },
      "outputs": [],
      "source": [
        "# Load the TensorBoard notebook extension\n",
        "%load_ext tensorboard"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "M6BSoWbmILC4"
      },
      "outputs": [],
      "source": [
        "%tensorboard --logdir {tb_logdir}"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "WKmwrpzAJNV7"
      },
      "outputs": [],
      "source": [
        "trainer.fit(LitModel(model), train_dataloader, dev_dataloader)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lZCyxEIqxz3n"
      },
      "source": [
        "### The actual task (B1): Tensorboard instrumentation (TODO)\n",
        "\n",
        "Tensorboard is a visualiation system for deep learning. The pytorch integration can now be done via [toch lightning](https://www.pytorchlightning.ai/index.html).\n",
        "\n",
        "1. take a look at how tensorboard works for tensorflow profiling and visualization in [tensorboard](https://www.tensorflow.org/)\n",
        "1. instead of instrumenting with tensorboard directly, adjust the LightningModel training to pass performance info to the `logger`\n",
        "1. launch tensorboard and inspect the log folder, i.e. run `!tensorboard --logdir {tb_logdir}` from the folder of this notebook"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mX_sTP5Vxz3o"
      },
      "source": [
        "## Train\n",
        "\n",
        "Now we have all the ingredients to train our classification model. Note that we are able to directly iterate\n",
        "on our dataset object without using DataLoader. Our pre-process dataset  shall yield batches of data already,\n",
        "thanks to the batching datapipe we have applied. For distributed training, we would need to use DataLoader to\n",
        "take care of data-sharding.\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0qTNr3qdxz3p"
      },
      "outputs": [],
      "source": [
        "# TODO Use the tensorboard + lightning setup above to log the training loss and accuracy\n",
        "\n",
        "# Produce a screenshot of the experimental performance for different training rounds of your\n",
        "# (BI-)LSTM model\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "A3YN6PERxz3p"
      },
      "source": [
        "### Task B2: Tune the model (TODO)\n",
        "\n",
        "After connecting the output of your model train and test performance with tensorboard. Change the model and training parameters above to improve the model performance. We would like to see variable plots of how validation accuracy evolves over a number of epochs for at least two different parameter choices, you can stop exploring when you exceed a model accuracy of 76%.\n",
        "\n",
        "Show a tensorboard screenshot with performance plots that combine at least 2 different tuning attempts. Store the screenshot as `tensorboard.png`. Then keep the best performing parameters set in this notebook for submission and evaluate the comparison below with your best model. "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "F7hUO8NWxz3p"
      },
      "source": [
        "## Comparison with Vader (NLTK)\n",
        "Vader is a rule-based sentiment analysis algorithm that performs quite well against more complex architectures. The test below is to see, whether LSTMs are able to beat its performance."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0Pzqx65Uxz3p"
      },
      "outputs": [],
      "source": [
        "# get text data from torchtext dataloader\n",
        "vocab_itos = vocab.get_itos()\n",
        "text_data = []\n",
        "for ba in dev_dataloader:\n",
        "    text = (\"\".join(\n",
        "            [\"\".join(\n",
        "                vocab_itos[tid]) for tokens in ba[\"token_ids\"] \n",
        "                for tid in tokens ])\n",
        "                .replace(\"▁\",\" \")\n",
        "                .replace(\"<s>\",\"\")\n",
        "                .split(\"</s>\"))\n",
        "    text_and_target = list(zip(text, ba[\"target\"]))\n",
        "    text_data.extend(text_and_target)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "WQI0SSFfxz3q"
      },
      "outputs": [],
      "source": [
        "import nltk\n",
        "nltk.download('vader_lexicon')\n",
        "from nltk.sentiment.vader import SentimentIntensityAnalyzer\n",
        "sid = SentimentIntensityAnalyzer()\n",
        "\n",
        "lab_vpred = np.zeros((len(text_data), 2))\n",
        "for k, (sentence, label) in enumerate(text_data):\n",
        "    ss = sid.polarity_scores(sentence)\n",
        "    lab_vpred[k,:] = (int(ss['compound']>0), int(label))\n",
        "\n",
        "vader_acc = 1-abs(lab_vpred[:,0]-lab_vpred[:,1]).mean()\n",
        "print('vader acc: {}'.format(vader_acc))\n",
        "logger.log_metrics({'Final/VaderAcc': vader_acc})"
      ]
    },
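    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The accuracy expression used above, `1-abs(lab_vpred[:,0]-lab_vpred[:,1]).mean()`, works because both columns hold only 0 or 1, so each absolute difference is 1 exactly when prediction and label disagree. A minimal pure-Python sketch of that identity follows; the toy `predictions` and `labels` lists are made-up illustration values, not data from this assignment.\n",
        "\n",
        "```python\n",
        "# Hypothetical toy data: binary predictions and gold labels (not from the dataset).\n",
        "predictions = [1, 0, 1, 1, 0]\n",
        "labels = [1, 0, 0, 1, 1]\n",
        "\n",
        "# Each |p - y| is 0 for a match and 1 for a mismatch, so the mean is the error rate.\n",
        "error_rate = sum(abs(p - y) for p, y in zip(predictions, labels)) / len(labels)\n",
        "accuracy_from_error = 1 - error_rate\n",
        "\n",
        "# Direct definition: fraction of positions where prediction equals label.\n",
        "accuracy_direct = sum(p == y for p, y in zip(predictions, labels)) / len(labels)\n",
        "\n",
        "print(accuracy_from_error, accuracy_direct)\n",
        "```\n",
        "\n",
        "Note that the cell above maps a Vader `compound` score of exactly 0 to the negative class, since it thresholds with `> 0`."
      ]
    },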
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rdCBzypYxz3q"
      },
      "source": [
        "Perform the model tuning and training in the previous task 2.B2 until you outperform the Vader algorithm by at least 7% in accuracy using the LSTM model."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rDEPV64BLkJK"
      },
      "source": [
        "## Task 2.C Train and use the Roberta model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6fNDGalq29hd"
      },
      "outputs": [],
      "source": [
        "# some words from https://www.fluentu.com/blog/french/french-feelings/\n",
        "\n",
        "lines = \"\"\"This is fantastic!\n",
        " Not sure.\n",
        " Probably OK.\n",
        " Very well!\n",
        " Das ist gut!\n",
        "Ich finde das bemerkenswert.\n",
        "Klasse Vorstellung.\n",
        "Lieber nicht anschauen.\n",
        "heureux\n",
        "heureuse\n",
        "content\n",
        "pas triste\n",
        "absolument pas triste\n",
        "énervé\n",
        "pressé\n",
        "fâché\n",
        "en colère \n",
        "fatigué\n",
        "s’ennuyer\n",
        "occupé\n",
        "navré\n",
        "épuisé\n",
        "malade\n",
        "inquiet\n",
        "inquiète\n",
        "ravi\"\"\"\n",
        "\n",
        "model.eval();"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yBTBb0vLMT7k"
      },
      "source": [
        "### Your task (TODO)\n",
        "\n",
        "1. Train the Roberta model with the given setup above for 1 epoch\n",
        "1. Ensure that your model test accuracy is > 75%\n",
        "1. Save the model weights\n",
        "1. Complete the SentimentPredictor class below to:\n",
        "  1. Load the model from disk to create a new model on the CPU\n",
        "  1. Use the restored model to compute Positive sentiment probabilities for each word in a list of words\n",
        "1. Show a dataframe of the word list given above and positivity probabilities sorted in order of descending probability"
      ]
    },
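    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a reminder of how class probabilities relate to raw model outputs, here is a small framework-free sketch of a softmax over two logits. The helper `softmax` and the values in `example_logits` are illustrative assumptions, not part of the assignment code; the only detail taken from the notebook is that index 1 is read as the positive class, as in `probs[:,1]` below. In PyTorch the same operation is available as `torch.softmax`.\n",
        "\n",
        "```python\n",
        "import math\n",
        "\n",
        "\n",
        "def softmax(logits):\n",
        "    # Subtract the maximum for numerical stability; the result is unchanged.\n",
        "    largest = max(logits)\n",
        "    exps = [math.exp(x - largest) for x in logits]\n",
        "    total = sum(exps)\n",
        "    return [e / total for e in exps]\n",
        "\n",
        "\n",
        "# Hypothetical logits for one input, ordered as [negative, positive].\n",
        "example_logits = [-0.5, 1.5]\n",
        "probs = softmax(example_logits)\n",
        "positive_probability = probs[1]\n",
        "print(positive_probability)\n",
        "```\n",
        "\n",
        "For two classes this equals the logistic sigmoid of the logit difference, so a larger gap between the positive and negative logit pushes the probability towards 1."
      ]
    },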
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "nhFIbdPZ5_F6"
      },
      "outputs": [],
      "source": [
        "# TODO write your code below"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Rmt4EW5H4jR1"
      },
      "outputs": [],
      "source": [
        "class SentimentPredictor:\n",
        "  def __init__(self):\n",
        "    self.model = make_roberta_classification_model(2)\n",
        "    self.model.load_state_dict(torch.load(MODEL_PATH))\n",
        "    self.model.eval()\n",
        "\n",
        "  def __call__(self, text, prob=True):\n",
        "    if isinstance(text, str):\n",
        "      text = [text]\n",
        "      unpack = True\n",
        "    else:\n",
        "      unpack = False\n",
        "    #logit = # TODO\n",
        "    if prob:\n",
        "      # TODO\n",
        "      if unpack:\n",
        "        return probs[0,1].tolist()\n",
        "      else:\n",
        "        return probs[:,1].tolist()\n",
        "    else:\n",
        "      return bool(logit.argmax())\n",
        "\n",
        "sp = SentimentPredictor()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vJGRvGPg3qf3"
      },
      "outputs": [],
      "source": [
        "# TODO use sp(lines.split(\"\\n\")) for the above task"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Abvc0OLQPAk9"
      },
      "source": [
        "Note, that the model was trained for a task in English, simply based on .tsv files with phrases and a class label.\n",
        "\n",
        "In the example above, we are using this same model to perform the trained task in other languages. Please add some words from other languages you may know. Do the model predictions make sense?"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1eYXOh_-xz3r"
      },
      "source": [
        "## Submission"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yrFzT7xXxz3r"
      },
      "source": [
        "Save [this notebook](A10.ipynb) containing all cell output and upload your submission as one `A10.ipynb` file.\n",
        "Also, include the screenshot of your tensorboard debugging session as `tensorboard.png`."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": []
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "provenance": [],
      "toc_visible": true
    },
    "gpuClass": "standard",
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.10"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
