Deep Dive into LSTMs & xLSTMs by Hand ✍️
Explore the wisdom of LSTMs leading into xLSTMs, a probable competitor to the present-day LLMs.

Image by author (The ancient wizard as created by my 4-year-old)

"In the enchanted realm of Serentia, where ancient forests whispered secrets of spells long forgotten, there dwelled the Enigmastrider, a venerable wizard and guardian of timeless wisdom.

One pivotal day, as Serentia faced dire peril, the Enigmastrider wove a mystical ritual using the Essence Stones, imbued with the essence of past, present, and future. Drawing upon ancient magic, he conjured the LSTM, a conduit of knowledge capable of preserving Serentia's history and foreseeing its destiny. Like a river of boundless wisdom, the LSTM flowed, transcending the present and revealing what lay beyond the horizon.

From his secluded abode, the Enigmastrider observed as Serentia was reborn, ascending to new heights. He knew that his arcane wisdom and tireless efforts had once again safeguarded a legacy in this magical realm."

And with that story we begin our expedition into the depths of one of the most appealing Recurrent Neural Networks: the Long Short-Term Memory network, popularly known as the LSTM. Why revisit this classic? Because it may once again become useful as longer context lengths grow in importance for language modeling.

Can LSTMs once again get an edge over LLMs?

A short while ago, researchers in Austria came up with a promising initiative to revive the lost glory of LSTMs by introducing a more evolved form: the Extended Long Short-Term Memory, or xLSTM. It would not be wrong to say that before Transformers, LSTMs held the throne for innumerable deep-learning successes. Now the question stands: with their abilities maximized and drawbacks minimized, can they compete with the present-day LLMs?

To find the answer, let's move back in time a bit and revisit what LSTMs were and what made them so special.

Long Short-Term Memory networks were first introduced in 1997 by Hochreiter and Schmidhuber to address the long-term dependency problem faced by RNNs. With more than 106,000 citations on the paper, it is no wonder that LSTMs are a classic.

The key idea in an LSTM is the ability to learn when to remember and when to forget relevant information over arbitrary time intervals. Just like us humans: rather than starting every idea from scratch, we rely on much older information and are able to aptly connect the dots. Of course, when talking about LSTMs, the question arises: don't RNNs do the same thing?

The short answer is yes, they do. However, there is a big difference. The RNN architecture does not support looking far into the past, only into the immediate past, and that is not very helpful.

As an example, consider these lines John Keats wrote in 'To Autumn':

"Season of mists and mellow fruitfulness,
Close bosom-friend of the maturing sun;"

As humans, we understand that the words "mists" and "mellow fruitfulness" are conceptually related to the season of autumn, evoking ideas of a specific time of year. Similarly, an LSTM can capture this notion and use it to understand the context further when the words "maturing sun" come in. Despite the separation between these words in the sequence, LSTM networks can learn to associate them and keep the earlier connections intact. This is the big contrast with the original Recurrent Neural Network framework.

And the way LSTMs do it is with the help of a gating mechanism.
If we compare the architecture of an RNN with that of an LSTM, the difference is very evident. The RNN has a very simple architecture: the past state and the present input pass through an activation function to produce the next state. An LSTM block, on the other hand, adds three gates on top of an RNN block: the input gate, the forget gate and the output gate, which together handle the past state along with the present input. This idea of gating is what makes all the difference.

To understand things further, let's dive into the details with these incredible works on LSTMs and xLSTMs by the amazing Prof. Tom Yeh. First, let's understand the mathematical cogs and wheels behind LSTMs before exploring their newer version.

(All the images below, unless otherwise noted, are by Prof. Tom Yeh from the above-mentioned LinkedIn posts, which I have edited with his permission.)

So, here we go:

How does an LSTM work?

[1] Initialize

The first step begins with randomly assigning values to the previous hidden state h0 and the memory cell C0. Keeping it in sync with the diagrams, we set

h0 → [1, 1]
C0 → [0.3, -0.5]

[2] Linear Transform

In the next step, we perform a linear transform by multiplying the four weight matrices (Wf, Wc, Wi and Wo) with the concatenation of the current input X1 and the previous hidden state that we assigned in the previous step. The resulting values are called feature values, obtained as a combination of the current input and the hidden state.

[3] Non-linear Transform

This step is crucial in the LSTM process. It is a non-linear transform with two parts: a sigmoid σ and a tanh.

The sigmoid is used to obtain gate values between 0 and 1. This layer essentially determines what information to retain and what to forget: a '0' implies completely eliminating the information, whereas a '1' implies keeping it in place.

Forget gate (f1): [-4, -6] → [0, 0]
Input gate (i1): [6, 4] → [1, 1]
Output gate (o1): [4, -5] → [1, 0]

In the second part, tanh is applied to obtain new candidate memory values that could be added on top of the previous information.

Candidate memory (C'1): [1, -6] → [0.8, -1]

[4] Update Memory

Once the above values are obtained, it is time to update the current state. The previous step made the decision on what needs to be done; in this step we implement that decision. We do so in two parts:

Forget: Multiply the current memory values (C0) element-wise with the forget-gate values. This removes from the current state the values that were marked to be forgotten. → C0 .* f1

Input: Multiply the candidate memory values (C'1) element-wise with the input-gate values to obtain the input-scaled memory values. → C'1 .* i1

Finally, we add these two terms to get the updated memory C1, i.e. C1 = C0 .* f1 + C'1 .* i1.

[5] Candidate Output

Next, we decide what the output is going to look like. We first apply tanh, as before, to the new memory C1 to obtain a candidate output o'1. This pushes the values between -1 and 1.

[6] Update Hidden State

To get the final output, we multiply the candidate output o'1 obtained in the previous step with the sigmoid output-gate value o1 obtained in Step [3]. The result is the first output of the network, the updated hidden state h1, i.e. h1 = o'1 .* o1.
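To make the arithmetic of this first step concrete, here is a minimal NumPy sketch. Since the walkthrough gives the Step [2] feature values directly rather than the underlying weight matrices, the sketch starts from those pre-activation values; the variable names are mine and purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# [1] Initialize: previous hidden state and memory cell from the walkthrough
h0 = np.array([1.0, 1.0])
C0 = np.array([0.3, -0.5])

# [2] Linear transform: the walkthrough gives these pre-activation "feature values"
# directly, so we plug them in instead of multiplying weight matrices.
f_pre = np.array([-4.0, -6.0])   # forget-gate features
i_pre = np.array([6.0, 4.0])     # input-gate features
o_pre = np.array([4.0, -5.0])    # output-gate features
c_pre = np.array([1.0, -6.0])    # candidate-memory features

# [3] Non-linear transform: sigmoid for the gates, tanh for the candidate memory
f1 = sigmoid(f_pre)              # ≈ [0.02, 0.00]; rounded to [0, 0] in the walkthrough
i1 = sigmoid(i_pre)              # ≈ [1.00, 0.98]; rounded to [1, 1]
o1 = sigmoid(o_pre)              # ≈ [0.98, 0.01]; rounded to [1, 0]
C1_candidate = np.tanh(c_pre)    # ≈ [0.76, -1.00]; rounded to [0.8, -1]

# [4] Update memory: forget part + input part
C1 = C0 * f1 + C1_candidate * i1

# [5] + [6] Candidate output and updated hidden state
h1 = np.tanh(C1) * o1

print("C1 =", np.round(C1, 2))   # ≈ [0.77, -0.98], matching the rounded [0.8, -1]
print("h1 =", np.round(h1, 2))   # ≈ [0.63, -0.01]
```

The exact numbers differ slightly from the rounded 0/1 gate values used in the diagrams, but the rounded result C1 ≈ [0.8, -1] is recovered.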
Process t = 2

We continue with the subsequent iterations below.

[7] Initialize

First, we copy over the updates from the previous iteration, i.e. the updated hidden state h1 and memory C1.

[8] Linear Transform

We repeat Step [2]: multiplying the weight matrices with the concatenated hidden state and input, and adding the biases.

[9] Update Memory (C2)

We repeat Steps [3] and [4]: the non-linear transforms using the sigmoid and tanh layers, followed by the decision on what to forget and what new information to introduce. This gives us the updated memory C2.

[10] Update Hidden State (h2)

Finally, we repeat Steps [5] and [6], which together give us the second hidden state h2.

Next, we have the final iteration.

Process t = 3

[11] Initialize

Once again we copy the hidden state and memory from the previous iteration, i.e. h2 and C2.

[12] Linear Transform

We perform the same linear transform as in Step [2].

[13] Update Memory (C3)

Next, we perform the non-linear transforms and update the memory based on the values obtained during the transform.

[14] Update Hidden State (h3)

Once done, we use those values to obtain the final hidden state h3.

Summary:

To summarize the working above, the key thing to remember is that an LSTM depends on three main gates: input, forget and output. As the names suggest, these gates control which parts of the information are relevant, how much of each is kept, and which parts can be discarded.

Very briefly, the steps are as follows:

1. Initialize the hidden state and memory values from the previous state.
2. Perform a linear transform so the network can start combining the hidden state and the current input.
3. Apply non-linear transforms (sigmoid and tanh) to determine which values to retain or discard and to obtain new candidate memory values.
4. Based on the decisions (values) obtained in Step 3, update the memory.
5. Determine what the output is going to look like based on the memory update from the previous step; this gives a candidate output.
6. Combine the candidate output with the gated output value obtained in Step 3 to reach the new hidden state.

This loop continues for as many iterations as needed.

Extended Long Short-Term Memory (xLSTM)

The need for xLSTMs

When LSTMs emerged, they definitely set the platform for doing something that had not been done previously. Recurrent Neural Networks could have memory, but it was very limited; hence the birth of the LSTM to support long-term dependencies. However, it was not enough, because analyzing inputs strictly as sequences obstructed the use of parallel computation and, moreover, led to drops in performance over long dependencies.

Thus, as a solution to it all, the Transformers were born. But the question still remained: can we once again use LSTMs, by addressing their limitations, to achieve what Transformers do? To answer that question came the xLSTM architecture.

How is xLSTM different from LSTM?

xLSTMs can be seen as a very evolved version of LSTMs. The underlying structure of the LSTM is preserved in the xLSTM; however, new elements have been introduced which help handle the drawbacks of the original form.

Exponential Gating & Scalar Memory Mixing (sLSTM)

The most crucial difference is the introduction of exponential gating. In LSTMs, when we perform Step [3], we apply sigmoid gating to all gates, while in xLSTMs the input and forget gates use exponential gating instead. For example, the input gate i1 is now computed with an exponential instead of a sigmoid:

Images by author

With the bigger range that exponential gating provides, xLSTMs are able to handle updates better than the sigmoid function, which compresses inputs to the range (0, 1). There is a catch, though: exponential values may grow to be very large, as the quick comparison below illustrates.
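As a quick illustration (a toy comparison, not code from the paper), here is how sigmoid and exponential gating treat the same pre-activation values; the pre-activations themselves are made up for demonstration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical gate pre-activations, from small to large
pre_activations = np.array([-4.0, 0.0, 4.0, 20.0, 50.0])

print("sigmoid:", sigmoid(pre_activations))
# -> squashed into (0, 1): [0.018, 0.5, 0.982, ~1.0, ~1.0]

print("exp:    ", np.exp(pre_activations))
# -> unbounded: [1.8e-02, 1.0e+00, 5.5e+01, 4.9e+08, 5.2e+21]
```

Large pre-activations overflow quickly (for instance, np.exp(800) already returns inf in float64), so exponential gates cannot be used naively.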
To mitigate that problem, xLSTMs incorporate normalization, and the logarithm function seen in the equations below plays an important role here.

Image from Reference [1]

Now, the logarithm does reverse the effect of the exponential, but their combined application, as the xLSTM paper claims, leads the way to balanced states. This exponential gating, along with memory mixing among the different gates (as in the original LSTM), forms the sLSTM block.

Matrix Memory Cell (mLSTM)

The other new aspect of the xLSTM architecture is the move from a scalar memory to a matrix memory, which allows it to process more information in parallel. It also resembles the transformer architecture by introducing key, query and value vectors, and it uses them in a normalizer state defined as the weighted sum of the key vectors, where each key vector is weighted by the input and forget gates.

Once the sLSTM and mLSTM blocks are ready, they are stacked one over the other using residual connections to yield xLSTM blocks and, finally, the xLSTM architecture.

Thus, the introduction of exponential gating (with appropriate normalization) along with the new memory structures establishes a strong pedestal for xLSTMs to achieve results similar to the transformers. A small sketch of the stabilized exponential-gating update follows below.
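Here is a minimal, hedged NumPy sketch of a single sLSTM-style cell update with exponential input and forget gates, a normalizer state, and the log-domain stabilizer, written in the spirit of the equations in Reference [1]. The shapes, the gate ordering in the parameter split, and the random parameters are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def slstm_step(x, h_prev, c_prev, n_prev, m_prev, W, R, b):
    """One sLSTM-style step with exponential input/forget gates and a
    log-domain stabilizer (parameter layout and shapes are assumed)."""
    # Pre-activations for candidate, input, forget and output gates
    pre = W @ x + R @ h_prev + b          # shape: (4 * d,)
    z_t, i_t, f_t, o_t = np.split(pre, 4)

    z = np.tanh(z_t)                       # candidate memory
    o = sigmoid(o_t)                       # output gate stays sigmoid

    # Stabilizer: for exponential gates the pre-activations are the log-gate
    # values, so track the running maximum to keep the exponentials bounded
    m = np.maximum(f_t + m_prev, i_t)
    i = np.exp(i_t - m)                    # stabilized exponential input gate
    f = np.exp(f_t + m_prev - m)           # stabilized exponential forget gate

    c = f * c_prev + i * z                 # memory cell update
    n = f * n_prev + i                     # normalizer state update
    h = o * (c / n)                        # normalized hidden state

    return h, c, n, m

# Tiny usage example with made-up sizes and random parameters
d, dx = 2, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * d, dx))
R = rng.normal(size=(4 * d, d))
b = np.zeros(4 * d)

h = np.zeros(d); c = np.zeros(d); n = np.zeros(d); m = np.zeros(d)
for t in range(3):
    x = rng.normal(size=dx)
    h, c, n, m = slstm_step(x, h, c, n, m, W, R, b)
    print(f"t={t}: h={np.round(h, 3)}")
```

Subtracting the running maximum m inside the exponentials rescales the memory cell and the normalizer by the same factor, so the ratio c / n (and hence the hidden state) matches the unstabilized version while overflow is avoided.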
To summarize:

1. An LSTM is a special Recurrent Neural Network (RNN) that allows connecting previous information to the current state, just as we humans do with the persistence of our thoughts. LSTMs became incredibly popular because of their ability to look far into the past rather than depending only on the immediate past. What made this possible was the introduction of special gating elements into the RNN architecture:

Forget Gate: Determines what information from the previous cell state should be kept or forgotten. By selectively forgetting irrelevant past information, the LSTM maintains long-term dependencies.

Input Gate: Determines what new information should be stored in the cell state. By controlling how the cell state is updated, it incorporates new information important for predicting the current output.

Output Gate: Determines what information should be output as the hidden state. By selectively exposing parts of the cell state as the output, the LSTM can provide relevant information to subsequent layers while suppressing non-pertinent details, and thus propagates only the important information over longer sequences.

2. An xLSTM is an evolved version of the LSTM that addresses the drawbacks faced by the LSTM. It is true that LSTMs are capable of handling long-term dependencies; however, the information is processed sequentially and thus does not exploit the parallelism that today's transformers capitalize on. To address that, xLSTMs bring in:

sLSTM: Exponential gating that covers a larger range than the sigmoid activation.

mLSTM: New memory structures with a matrix memory to enhance memory capacity and enable more efficient information retrieval.

Will LSTMs make their comeback?

LSTMs overall are part of the Recurrent Neural Network family and process information sequentially and recursively. The advent of Transformers all but obliterated the use of recurrence; however, the Transformers' struggle to handle extremely long sequences remains a burning problem, since their self-attention scales quadratically with context length. Thus, it does seem worthwhile to explore options that could at least illuminate a solution path, and a good starting point would be going back to LSTMs. In short, LSTMs have a good chance of making a comeback. The present xLSTM results definitely look promising. And, to round it all up, the use of recurrence by Mamba stands as good testimony that this could be a lucrative path to explore.

So, let's follow along on this journey and see it unfold, while keeping in mind the power of recurrence!

P.S. If you would like to work through this exercise on your own, here is a link to a blank template for your use.

Blank Template for hand-exercise

Now go have fun and create some Long Short-Term effect!

Image by author

References:

[1] Maximilian Beck et al., "xLSTM: Extended Long Short-Term Memory," May 2024. https://arxiv.org/abs/2405.04517
[2] Sepp Hochreiter and Jürgen Schmidhuber, "Long Short-Term Memory," Neural Computation 9, 8 (November 15, 1997), 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735