Radi Akbar
Personal Project
The world of NLP is developing at a rapid pace. There have been so many updates to the original Transformer architecture that it's hard to keep track of all the progress. The TransformerV2 project highlights some of these recent developments by tweaking the sequence-to-sequence model and comparing its performance on an English-to-German machine translation task.
RoPE
The architecture of the model remains the same as the original [1] but with a few modifications. The first update replaces the absolute positional embedding (APE) with Rotary Positional Embedding (RoPE) [4]. Instead of adding positional information to the embedded sequence, RoPE injects it by rotating the embeddings. RoPE has been shown to capture long-term dependencies between words across large context windows.
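The rotation idea can be sketched as follows. This is a minimal illustration of RoPE, not the project's actual implementation: each consecutive pair of channels is rotated by an angle that grows with the token's position, using the per-pair frequencies from the RoFormer paper.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x (seq_len, dim) by position-dependent angles."""
    seq_len, dim = x.shape
    # One rotation frequency per channel pair, as in the RoFormer paper [4].
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    # Standard 2D rotation applied to each (x1, x2) channel pair.
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because it is a pure rotation, the transformation preserves vector norms, and the token at position 0 is left unchanged (all angles are zero there).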
RMS Normalization
The second update replaces LayerNorm with RMSNorm [2]. LayerNorm addresses internal covariate shift during training by standardizing the inputs with their mean and standard deviation. RMSNorm simplifies this by skipping the mean centering and dividing the input only by its root mean square.
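A minimal RMSNorm sketch, following the formula in [2]; the learned gain `g` and the epsilon for numerical stability mirror the paper, but this is not the project's exact code.

```python
import torch

def rms_norm(x: torch.Tensor, g: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Unlike LayerNorm there is no mean subtraction:
    # divide by the root mean square of the last dimension only.
    rms = x.pow(2).mean(dim=-1, keepdim=True).add(eps).sqrt()
    return x / rms * g
```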
SwiGLU Activation Function
The third update replaces the ReLU activation function with SwiGLU [3], which has been shown to improve performance on NLP tasks. To keep the computation cost comparable, the paper also scales the hidden size of the MLP component by 2/3.
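A sketch of a SwiGLU feed-forward block under these assumptions (module and weight names are illustrative). The 2/3 factor compensates for the extra gating matrix, keeping the parameter count close to a ReLU MLP with hidden size 4·d_model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        hidden = int(2 / 3 * 4 * d_model)  # 4*d_model scaled by 2/3, per [3]
        self.w = nn.Linear(d_model, hidden, bias=False)   # gated branch
        self.v = nn.Linear(d_model, hidden, bias=False)   # linear branch
        self.w2 = nn.Linear(hidden, d_model, bias=False)  # projection back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = (Swish(xW) * xV) W2, with Swish = SiLU
        return self.w2(F.silu(self.w(x)) * self.v(x))
```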
I follow the original paper's training setup by using an Adam optimizer with β₁ = 0.9, β₂ = 0.98, and ε = 10⁻⁹, together with its warmup-based learning rate schedule.
Most LLM papers train with huge batch sizes, but due to resource limitations I can only use a batch size of 16. To address this, I use gradient accumulation so that the model only updates its weights every 32 steps, simulating a batch size of 512.
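The accumulation loop above can be sketched as below, assuming a toy model and data loader (the project's actual training loop differs). The key detail is scaling the loss by the number of accumulation steps so the summed gradients match one large-batch gradient.

```python
import torch
import torch.nn as nn

ACCUM_STEPS = 32  # micro-batch of 16 * 32 steps = effective batch of 512

def train_accumulated(model, optimizer, batches):
    """Accumulate gradients over ACCUM_STEPS micro-batches before one update."""
    optimizer.zero_grad()
    for step, (x, y) in enumerate(batches):
        loss = nn.functional.mse_loss(model(x), y)
        # Scale so the accumulated gradient averages over the effective batch.
        (loss / ACCUM_STEPS).backward()
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```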
According to this Medium article, the original paper trains the model for 16 epochs with a batch size of 724. Since there are roughly 4.7M rows in the training dataset, that works out to approximately 6.2k steps per epoch, or about 100k steps in total. Extrapolating to my batch size of 512, my training runs for approximately 150k steps in total with 6k warmup steps.
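The warmup schedule itself comes from the original paper [1]: the learning rate rises linearly for the first `warmup` steps and then decays with the inverse square root of the step number. A sketch with the 6k warmup steps estimated above (not the project's exact script):

```python
def lrate(step: int, d_model: int = 512, warmup: int = 6000) -> float:
    """Learning rate schedule from 'Attention Is All You Need' [1]."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two branches of the `min` intersect exactly at `step == warmup`, which is where the learning rate peaks.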
To prepare the tokenizer, I train a BPE tokenizer on Hugging Face's WMT14 German-English dataset, keeping the vocabulary size consistent with the paper's (roughly 37k tokens). I trained the tokenizer and uploaded it to my Hugging Face account. If you want to recreate the project, you can access the tokenizer by running the following line of Python:
PreTrainedTokenizerFast.from_pretrained('radia/wmt14-de2en-tokenizer')
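If you would rather retrain the tokenizer yourself, a hedged sketch using Hugging Face's `tokenizers` library is shown below on a tiny in-memory corpus; the actual project trains on the full WMT14 German-English data with the ~37k vocabulary.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-pair-encoding model with a whitespace pre-tokenizer (illustrative setup).
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=37000, special_tokens=["[UNK]", "[PAD]"])

# Stand-in corpus; the real run iterates over the WMT14 dataset.
corpus = ["Ein Beispiel auf Deutsch.", "An example in English."]
tokenizer.train_from_iterator(corpus, trainer)
```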
For this experiment, I used a compute engine from Google Cloud with 1 Nvidia L4 GPU. To run the training script, use the following line:
torchrun --standalone --nproc-per-node=1 train.py
Training takes approximately 3.5 days. It could be faster with multiple GPUs, but you would have to tweak the training script to sync gradients across devices.
For simplicity, I use greedy search to generate translations and Hugging Face's sacrebleu to score the results against the test data. To reproduce the results, run the evaluate_model.py script to get the model's BLEU score. The model is tested on the same WMT14 test set used in the original paper.
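Greedy search simply takes the single most probable token at each step until an end-of-sequence token appears. A minimal sketch for a seq2seq transformer, where `model`, `bos_id`, and `eos_id` are placeholders for the project's actual objects:

```python
import torch

@torch.no_grad()
def greedy_decode(model, src: torch.Tensor, bos_id: int, eos_id: int,
                  max_len: int = 128) -> list:
    """Generate one translation by always picking the argmax token."""
    ys = torch.tensor([[bos_id]])
    for _ in range(max_len):
        logits = model(src, ys)                  # (1, tgt_len, vocab)
        next_id = logits[0, -1].argmax().item()  # single best next token
        ys = torch.cat([ys, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:
            break
    return ys[0].tolist()
```

Beam search would likely improve the BLEU score slightly (the original paper uses a beam of 4), at the cost of extra decoding time.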
| Type | Parameter Size (in Millions) | Hidden Dimension Size | Heads | Layers | Dropout | BLEU |
|---|---|---|---|---|---|---|
| Transformer V2 | 63 | 512 | 8 | 6 | 0.1 | 28.1 |
| Transformers Original (base) | 65 | 512 | 8 | 6 | 0.1 | 27.3 |
| Transformers Original (big) | 213 | 1024 | 16 | 6 | 0.3 | 28.4 |
These SOTA methods improved the original base transformer's BLEU by 0.8 points at roughly the same parameter count! They also shrank the model by 2 million parameters, and it lands within 0.3 BLEU of the big model, which has 150 million more parameters!
There is still future work to be done. For starters, I could run the experiment at the big model size, or evaluate the model on other NLP tasks. I could also adopt Meta's BART pretraining setup and compare results on the downstream tasks.
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.
[2] Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization.
[3] Noam Shazeer. 2020. GLU variants improve transformer.
[4] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2021. RoFormer: Enhanced transformer with rotary position embedding.
[5] Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2021. GLM: General Language Model Pretraining with Autoregressive Blank Infilling.
[6] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models.
[7] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. GPT-NeoX-20B: An Open-Source Autoregressive Language Model.

