Everything is a learning process. One thing I learned from my first research post last week was that I did not do full justice to the scope of the paper I was reviewing on few-shot learning models. I am trying a new approach this week, and your feedback on it is welcome! Today, I am reviewing the paper “Longformer: The Long-Document Transformer”, which helps us understand how AI can parse, analyze and synthesize massive amounts of textual data. To appreciate the scope of this paper, though, we need to go back a couple of steps and understand why this model is needed.
A side note – all of us like to make things sound more complicated than they are, and researchers and scientists are more prone to this than engineers (The Big Bang Theory, anyone?). So, towards the end of this review, I have made a list of terms and what they mean. If you find additional terms you are not clear about, please feel free to reach out and I will gladly explain further. This glossary of terms can also be accessed here.

Towards the middle of the last decade (around 2010-15), three kinds of models were widely used to understand textual inputs: RNNs (Recurrent Neural Networks), first developed in the 1980s but popular from the early 2010s; CNNs (Convolutional Neural Networks), again a 1980s model that gained popularity for text around 2015; and LSTMs (Long Short-Term Memory networks), developed in the late 1990s and also popular from the early 2010s. All these models had some common problems, such as difficulty in capturing long-range dependencies in text and difficulty in parallelizing the computations.

The challenges that these traditional models faced (tradition changes fast in the world of AI/ML, so this is traditional for the 2010s) led to the development of something called the Transformer model in 2017.

So, what is the Transformer model?

In the older models – RNNs, CNNs, LSTMs – each sentence being analyzed was broken down into pre-weighted tokens based on the part of speech they belonged to. So, each sentence has subjects, objects, actions (verbs), etc., and each of these parts of the sentence has its own weight, and the final weight of attention for each word was calculated iteratively based on its starting weight. You could say these models came with preconceived notions.

The challenge was, they could only analyze short sentences – for example, “the cat sat on the mat”. If you gave it a longer sentence with multiple subjects, objects, actions, and descriptions, these models failed to understand the sentence. For example, if the sentence given to it was “the cat sat on the mat in front of the fire, warming itself”, these models totally gave up – they could not create the needed tokens and weights from this long sentence, and that was the end of it.

In 2017, the concept of a Transformer model was introduced. Thinking back, the idea was simple, but we all know how the simplest ideas can be the hardest to come up with. It introduced a concept called self-attention. Self-attention basically said, “To hell with breaking down a sentence into its parts, I will look at all the words in a sentence”. It even looked at words like “the”, “an”, and “a”. The model created equally weighted tokens of all words in a sentence. It then calculated the attention of each word with respect to the others, and in the process, words like “the”, “an”, and “a” became negligibly important and their attention tended to zero. Effectively, the model helped identify all the important words in a sentence, irrespective of the sentence length (within reasonable limits: a Smart Alec like me immediately went into the mode of “I will not use full stops, only commas and colons, and semicolons, and this creates a page-long sentence”. Well, the model failed when this was tried).
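For the curious, here is a minimal sketch of the "every word looks at every other word" step in plain Python. It is a toy under loud assumptions: the word vectors are random numbers standing in for what a real model learns, and the learned query/key/value projections of the actual Transformer are left out, so the numbers it prints mean nothing. The function names are made up for this post.

```python
import math
import random

def softmax(row):
    """Turn raw scores into positive weights that sum to 1."""
    biggest = max(row)
    exps = [math.exp(value - biggest) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Simplified self-attention: each token vector acts as its own
    query, key and value (an assumption; real models learn projections)."""
    dim = len(vectors[0])
    weights = []
    for query in vectors:
        # Score this token against every token in the sentence.
        scores = [dot(query, key) / math.sqrt(dim) for key in vectors]
        weights.append(softmax(scores))
    # Each output is a weighted mix of all the token vectors.
    outputs = [
        [sum(w * vec[j] for w, vec in zip(row, vectors)) for j in range(dim)]
        for row in weights
    ]
    return outputs, weights

random.seed(0)
tokens = "the cat sat on the mat".split()
# Random stand-ins for learned word vectors (illustration only).
vectors = [[random.gauss(0, 1) for _ in range(4)] for _ in tokens]
outputs, weights = self_attention(vectors)

for token, row in zip(tokens, weights):
    print(f"{token:>4}", [round(w, 2) for w in row])
```

Each printed row shows how much one word attends to each of the six words, and every row adds up to 1. Note that the table is 6 by 6: one score for every pair of words.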

However, there is no denying that the Transformer model was revolutionary because it allowed for complex sentences (within limits), allowed for parallelization, and made general NLP a lot easier.

So, then, why was Longformer needed? Don’t these scientists have anything else to do in life?

Well, apart from the fact that scientists and researchers have very little to do in life (if you are a scientist, don’t take a hit out on me, I'm just kidding), one really important part of any scientific discovery or research is finding the limitations of what you researched/created/developed, and then trying to find a solution to those limitations. So, really, I was not the first Smart Alec who thought, “I will not use full stops, only commas and colons, and semicolons, and create a page-long sentence”. The folks who developed the Transformer did it first - they had to! And no, that does not mean I am a scientist or even think like one – maybe someday, but not today.

Anyway, coming back to the challenges of the Transformer model: self-attention compares every token (roughly, every word in the text being analyzed) with every other token, so the computation grows quadratically with the length of the input. A 2-word input needs 2^2 = 4 attention scores. A 50-word sentence needs 50^2 = 2,500, and a 5,000-word document needs 25,000,000. Not only did it take a lot of time to do these calculations, but they also took up a lot of memory and computational resources, so in practice long documents had to be truncated or chopped into pieces, which hurt overall accuracy!
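If you want to see quadratic growth for yourself, a few lines of Python will do. This is just arithmetic, not a measurement of any real model, and the function name is made up for this post: with full self-attention, n tokens need n times n scores.

```python
def full_attention_scores(n_tokens):
    """Every token is scored against every token, itself included."""
    return n_tokens * n_tokens

for n_tokens in (2, 50, 5000):
    print(f"{n_tokens:>5} tokens -> {full_attention_scores(n_tokens):>12,} scores")

# Doubling the input length quadruples the work.
assert full_attention_scores(100) == 4 * full_attention_scores(50)
```

The last line is the whole problem in a nutshell: make the text twice as long and you pay four times as much.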

It was these limitations that the Longformer model was trying to remedy!

So, what exactly is the Longformer model?

For all their brilliance, scientists are simple people.

LONG sentence + transFORMER = LONGFORMER!

To put it simply, the Longformer model combines two types of attention over long-format text. The first is local: each token pays attention only to a small, fixed-size window of its neighbours, which builds contextual representations of each stretch of the long text. The second is global and motivated by the end task: a handful of chosen tokens (for example, the tokens of the question in a question-answering task) attend to every token in the text, and every token attends to them, which ties the local pieces together. If that was confusing, think of it as finding the maxima of a series of curves – first, you find the local maximum of each curve in the series, then you find the maximum of all the local maxima to get the global maximum! Simple, no? Even more simply put, to find the best team in a cricket or football tournament, you break the participating teams into groups, you find the best in each group, and then you find the best among these group-level bests. We have been doing this forever in sports championships – the real genius was the person who saw this in one context, say sports, and thought, “Why can't I do that to analyze large bodies of text?” And lo, there you have the kind of idea that helps language models like the ones behind ChatGPT work through massive amounts of text to answer our questions like “can you summarize this book for me?” (Come on, folks – if we show today’s AI how lazy we are, tomorrow’s AI, when it reaches singularity, will not even have to try hard to overthrow humans! Make AI work for it! Stop asking lazy questions!)
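Here is a rough sketch of that local-plus-global pattern as a grid of "who is allowed to look at whom". Treat it as a picture, not as the paper's implementation: the function names, the 12-token input, the window of 4 and the choice of token 0 as the global token are all assumptions made for this post, and an efficient implementation would avoid building the full grid at all.

```python
def longformer_style_mask(n_tokens, window, global_positions=()):
    """mask[i][j] is True when token i may attend to token j."""
    half = window // 2
    global_set = set(global_positions)
    # Local part: each token sees neighbours within half a window on each side.
    mask = [
        [abs(i - j) <= half for j in range(n_tokens)]
        for i in range(n_tokens)
    ]
    # Global part: chosen tokens see everything, and everything sees them.
    for i in range(n_tokens):
        for j in range(n_tokens):
            if i in global_set or j in global_set:
                mask[i][j] = True
    return mask

def count_allowed(mask):
    return sum(sum(row) for row in mask)

n_tokens = 12
mask = longformer_style_mask(n_tokens, window=4, global_positions=[0])

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

print(count_allowed(mask), "allowed pairs out of", n_tokens * n_tokens)
```

The printout shows a diagonal band (the local windows) plus one full row and one full column (the global token). The band grows in step with the length of the text instead of with its square, which is where the savings come from.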

Ha…funny man! If you are that funny, tell me, what is Longformer good for?

Other than really understanding the sheer laziness of humans, Longformer helps with three main things –

  1. Ability to handle longer documents - The Longformer model is designed to handle longer documents more efficiently than the traditional Transformer model. This is because it balances the computational cost (in time and GPU resources) with the ability to capture long-range dependencies in the text it is analyzing!
  2. Improved accuracy on long-document tasks - The Longformer model has shown improved accuracy on NLP tasks that involve long documents, such as sentiment analysis and document classification, for a simple reason: it gets to read the whole document instead of a truncated or chopped-up version of it. There is also less to compute. All AI/ML equations have an error component built into them, so the fewer equations you need to compute, the fewer errors you need to account for.
  3. Ability to process inputs of varying lengths - The Longformer model can handle inputs of varying lengths, which is a major advantage over the traditional Transformer model, which in practice is limited to a fixed maximum input length (often 512 tokens). Hey, if you have not realized how it can do this by now, we deserve to be ruled by AI overlords!

Stop kidding... it's not funny anymore! Tell me more about how this can be used!

Well, apart from understanding you-know-what about you-know-who, the Longformer model has several potential applications in NLP tasks such as sentiment analysis, document classification, and machine translation. But all is not lost! Longformer is not good at everything – the biggest challenge is that because it is more complex, it is more difficult to train. Also, because it is based on a self-attention model, it is very sensitive to hyperparameters – things like the size of the local attention window. And hopefully, if I call it a few choice expletives, maybe it will change its mind about ruling us (well, not really, it does not work that way, but I can dream!)

All jokes aside, I hope this gives you better clarity on how tools like ChatGPT are able to synthesize gargantuan amounts of data and pass the law, medical, and my favourite nightmare, MBA exams! Can you imagine a 2-month-old doing all this?

Now my AI overlord jokes look more realistic, don't they?

You can read the original paper on Longformer here and the paper on the Transformer here.

As promised, here is a quick glossary of terms used:
RNN - Recurrent Neural Networks (RNNs) are a type of neural network that have the ability to process sequential data. They were widely used for NLP tasks such as text classification and machine translation. Introduced in the 1980s, they became popular in the early 2010s as one of the models used to understand textual data.
CNN - Convolutional Neural Networks (CNNs) are a type of neural network that are well suited for processing data with a grid-like structure, such as images. They were adapted for NLP tasks such as text classification and sentiment analysis. CNNs were first introduced in the late 1980s for image classification and became popular for NLP tasks in the mid-2010s.
LSTM - Long Short-Term Memory (LSTM) networks are a type of RNN that are designed to handle the vanishing gradient problem that can occur when training RNNs. They were widely used for NLP tasks such as language modeling and machine translation.
Parallelization - It refers to the process of dividing a computational task into smaller, independent subtasks that can be executed simultaneously on multiple processors. This allows for faster processing and can reduce the overall time required to complete the task. In deep learning models, parallelizing computations can be challenging because the computations often depend on each other.
Vanishing Gradient - The vanishing gradient problem is a challenge that can occur when training recurrent neural networks (RNNs). Gradients are the signals used to update a model's weights during training. In an RNN, the gradient at any time step depends on the gradient at the step after it, so it gets multiplied by one more factor for every step it is propagated back through the sequence. If those factors are smaller than one, the gradient keeps shrinking over many steps until it is too small to change the weights for the early part of the sequence, leading to slow or ineffective learning. This is one reason these models struggle with long-range dependencies.
Attention – simplistically speaking, these are the weights of various variables (or tokens or words) in a sentence that are used to calculate the relative importance of one word against another. Generally, in attention models for understanding text, a sentence is broken down into parts as per the parts of a sentence in the English language, so you will have a subject, an object, a verb (or the action), etc. The attention model already has, to put it simplistically, “preconceived notions” as to the weights or attention between types of words. Attentions are refined from here onwards to a final value iteratively.
Self-Attention – in this model, there is no “preconceived” notion of which word is more important. Effectively, all words (or tokens) of a sentence start at the same weight, and then the attention of each word is calculated iteratively.
