Understanding transformers in one morning
Transformers are a hot topic in machine learning, and deservedly so.
If you are new to them, the resources in this article will help you understand their fundamentals and applications. Going through everything takes about one morning (four hours, give or take).
I created the list after spending much longer than one morning wading through articles and videos. I went around in circles, wasted time on superficial sources, and stumbled on articles that were too deep for my level when I first encountered them but were great once I was better prepared.
The list is organized in a logical sequence, building up knowledge from first principles and then going deeper into the details. These are the videos and articles that helped me the most. I hope they help you as well.
Hour 1 - The paper
First, read Google AI's blog post Transformer: A Novel Neural Network Architecture for Language Understanding. Don’t follow the links; just read the post. Then read the paper Attention Is All You Need. Don’t worry about understanding the details at this point; get familiar with the terminology and the pictures.
The paper has about 6,000 words, so at an average reading pace of 300 words per minute it would take twenty minutes. But it’s a scientific paper, so it will take longer. Using the three-pass approach, let’s reserve the full hour for it.
Hour 2 - Key concepts
The second hour is about understanding the key concepts in the paper with Rasa’s Algorithm Whiteboard video series.
- Rasa Algorithm Whiteboard - Transformers & Attention 1: Self Attention (14 minutes): Explains attention first with a simple time-series example, then with a text example. The video introduces word embeddings, a key concept for NLP (natural language processing) models, including transformers. With those concepts in place, it defines self-attention.
- Rasa Algorithm Whiteboard - Transformers & Attention 2: Keys, Values, Queries (13 minutes): Building on the previous video, it explains keys, queries, and values. First, it describes the operations that make up the attention layer conceptually, as a process that adds context to a value (you can think of a “value” as a “word” here). Since we are building a model, it then shows where trainable parameters (weights) need to be added. With the concepts and weights in place, it reintroduces the operations as the matrix operations that form the stackable self-attention block.
- Rasa Algorithm Whiteboard - Transformers & Attention 3: Multi Head Attention (11 minutes): Using a phrase as an example, it explains why we need more than one attention head (multi-head attention) to capture the different contexts in which words are used. The independence of the attention heads is a crucial property of transformers: the matrix operations for each head can run in parallel, significantly speeding up training.
- Rasa Algorithm Whiteboard - Transformers & Attention 4: Transformers (15 minutes): With the foundational concepts covered, this video walks through the figures in the “Attention Is All You Need” paper that make up the transformer architecture. The new concept introduced here is positional encoding. It ends by highlighting how the transformer architecture lends itself to parallelization in ways other attention architectures cannot. (The code sketch after this list ties the attention operations together.)
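To make the ideas from these videos concrete, here is a minimal NumPy sketch of scaled dot-product attention and multi-head attention. The function names, toy dimensions, and random weights are my own illustrations, not code from the videos; a real model learns the projection matrices during training.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Equation 1 in the paper)."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # pairwise compatibility scores
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V                               # weighted sum of the values

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Run num_heads independent attention heads and concatenate their outputs.

    X:          (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_model) projection weights, split across heads
    Wo:         (d_model, d_model) output projection
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Reshape to (num_heads, seq_len, d_head); the heads are independent,
    # so their matrix operations can run in parallel.
    split = lambda M: M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    heads = scaled_dot_product_attention(split(Q), split(K), split(V))
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Toy usage: 5 tokens, model width 16, 4 heads, random (untrained) weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
Wq, Wk, Wv, Wo = (rng.normal(size=(16, 16)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=4).shape)  # (5, 16)
```

Because each head is just an independent slice of the same matrix multiplications, the per-head work can be dispatched in parallel, which is exactly the property the third video emphasizes.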
That wraps up the second hour of the morning. Rasa’s videos are a great introduction, but they are informal. That’s not a bug; it’s a feature: they introduce the key concepts in simple terms, making them easy to follow.
Hour 3 - Digging into details
Now we switch to a more formal treatment with two lectures from Professor Peter Bloem of VU University Amsterdam.
- Lecture 12.1 Self-attention (23 minutes): Explains, with the help of illustrations, the matrix operations to calculate self-attention, then moves on to keys, queries, and values. With the basic concepts in place, it explains why we need multi-head attention.
- Lecture 12.2 Transformers (18 minutes): Examines the pieces that make up the transformer model in the paper. The figures from the paper are dissected with some math and code. (A minimal single-layer sketch follows this list.)
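As a companion to the lectures, here is a minimal PyTorch sketch of a single encoder layer: multi-head self-attention and a position-wise feed-forward network, each wrapped in a residual connection followed by layer normalization. The EncoderBlock class and its defaults are my own sketch (the dimensions follow the paper’s base model), leaning on PyTorch’s built-in nn.MultiheadAttention rather than reproducing Professor Bloem’s implementation; dropout is omitted for brevity.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder layer: multi-head self-attention plus a position-wise
    feed-forward network, each followed by a residual connection and LayerNorm."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention: queries, keys, and values all come from x.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)      # residual connection + LayerNorm
        # Position-wise feed-forward applied to every token independently.
        x = self.norm2(x + self.ff(x))
        return x

# Toy usage: a batch of 2 sequences, 10 tokens each, already embedded to d_model.
x = torch.randn(2, 10, 512)
print(EncoderBlock()(x).shape)  # torch.Size([2, 10, 512])
```

Stacking six of these blocks, plus embeddings and positional encoding, gives the encoder half of the architecture dissected in the lecture.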
Hour 4 - Pick your adventure
- Go wide with LSTM is dead, long live Transformers (30 minutes): This talk gives a sense of history, explaining how natural language problems were approached before transformers, the limitations of those approaches, and how transformers overcame them. It also shows how to implement the transformer calculations in Python code. If you are better at visualizing code than math (like me), this can help you understand the operations.
- Go deep with The Annotated Transformer (30 to 60 minutes to read, hours and hours to experiment): This article by the Harvard NLP team annotates the transformer paper with modern (as of 2022) PyTorch code. Each section of the paper is supplemented by the code that implements it. Part 3, “A Real World Example”, implements a fully functional German-English translation model using a smaller dataset that makes it workable on modest machines. (A quick shape check with PyTorch’s built-in transformer follows this list.)
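Before diving into The Annotated Transformer, it can be reassuring to see that the full encoder-decoder stack is one import away. This is only a hypothetical sanity check using PyTorch’s built-in nn.Transformer with arbitrary shapes and random, untrained inputs; it is not part of the article itself.

```python
import torch
import torch.nn as nn

# The full encoder-decoder stack ships with PyTorch. Running it once on random
# tensors confirms the shapes before reading the annotated implementation.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)
src = torch.randn(2, 10, 512)   # (batch, source length, d_model), already embedded
tgt = torch.randn(2, 7, 512)    # (batch, target length, d_model)
print(model(src, tgt).shape)    # torch.Size([2, 7, 512])
```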
Where to go from here
This is a good time to reread the paper; it will make more sense now.
These are other articles and videos that helped me understand transformers. Some of them overlap with the ones above, and some are complementary.
- Positional encoding (also called positional embedding) is a key concept for understanding transformers. The paper assumes the reader already knows it and only briefly explains why sine and cosine are used. This video from AI Coffee Break with Letitia covers the concept, and the reasoning behind the sine and cosine functions, in under ten minutes. (A minimal implementation appears after this list.)
- Transformers from scratch is the accompanying blog post to hour 3, “Digging into details.” Professor Bloem describes some of the concepts explored in the lectures and adds code to show how they are implemented.
- Transformers from Scratch (same title, different article) takes more time than the other articles to explain one-hot encoding, the dot product, and matrix multiplication, among other building blocks, with illustrations. By the time it gets to “attention as matrix multiplication”, the math is easier to follow. This post is a good refresher if you are rusty on the math side of machine learning.
- Transformer model for language understanding is TensorFlow’s official implementation of the paper. It is not as thoroughly annotated as the PyTorch code in The Annotated Transformer, but it is still helpful if you work in a TensorFlow shop.
- The Transformer Model in Equations is exactly what the name says: the transformer described as mathematical operations. The “Discussion” section is an insightful explanation of the equations, valuable even if you don’t have a strong math background (like me).
- The Illustrated Transformer is an often-cited source for understanding transformers. It is a good choice if you can read only one article beyond the paper.
- Andrej Karpathy’s Let’s build GPT: from scratch, in code, spelled out walks through the code to build a transformer model from scratch. At just under two hours, it’s the best investment of time at the code level I have found. Andrej is a great teacher and knows what he is talking about.
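For the positional encoding item above, here is a minimal NumPy sketch of the sinusoidal encoding defined in the paper. The helper name and the toy sequence length and model width are my own choices for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from the paper (assumes an even d_model):
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    Each position gets a unique pattern; the paper chose sinusoids so that
    relative positions can be expressed as linear functions of the encodings."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)    # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even indices: sine
    pe[:, 1::2] = np.cos(angles)                       # odd indices: cosine
    return pe

# The encoding is simply added to the word embeddings before the first layer.
pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)   # (50, 16)
```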
For a sense of history, these two papers are highly cited as works that led to the transformer architecture.
- Neural Machine Translation by Jointly Learning to Align and Translate is the paper credited with introducing the “attention” mechanism.
- Effective Approaches to Attention-based Neural Machine Translation builds on the previous paper, introducing other important concepts, including dot-product attention. This official TensorFlow notebook implements Spanish-to-English translation based on the paper.
Finally, Attention is all you need; Attentional Neural Network Models is a talk by Łukasz Kaiser, one of the paper’s authors. He builds up to the solution, starting with how natural language translation used to be done, the limitations of those approaches, and how transformers solve them. So far, that’s what you would expect from one of the authors. What makes this video interesting to me is how humble Łukasz is. He explains the trials and errors and, at one point, how the team had to ask for help to train the model they created.
Reading a scientific paper makes it look like a linear story from problem to solution (“we had an idea and implemented it”). Watching Łukasz talk helps us understand how these great solutions don’t arrive out of thin air. Researchers build on top of previous work, try many variations, make mistakes, and ask for help to complete their work. Then they write the paper…
If your interests are in computer vision, it turns out transformers work quite well for that too.