Is Attention What You Really Need In Transformers?

In recent years there has been an explosion of methods based on self-attention and in particular Transformers, first in the field of Natural Language Processing and recently also in the field of Computer Vision.

If you don’t know what Transformers are, or if you want to know more about the mechanism of self-attention, I suggest you have a look at my first article on this topic.

The success of Transformers stems from their effectiveness and their ability to solve non-trivial problems better than previous architectures such as RNNs in natural language processing or convolutional networks in computer vision. At the heart of the Transformer there is, and always has been, the attention mechanism, considered "all you need", indispensable and the true beating heart of this architecture. But not all that glitters is gold: computing self-attention carries huge computational and memory costs, since its complexity grows quadratically with the sequence length, requiring very large amounts of GPU memory and causing long training times.
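To make this cost concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention (purely illustrative, not any particular library's implementation); the T × T score matrix it materializes is precisely the quadratic term that saturates memory as sequences grow:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    The (T, T) score matrix is the term whose compute and memory
    grow quadratically with the sequence length T.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # each (T, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (T, T): quadratic in T
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # (T, d)

T, d = 1024, 64
rng = np.random.default_rng(0)
X = rng.standard_normal((T, d))
Wq, Wk, Wv = [rng.standard_normal((d, d)) for _ in range(3)]
out = self_attention(X, Wq, Wk, Wv)  # doubling T quadruples the score matrix
```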

This has not gone unnoticed by big companies like Apple and Google, who have been working hard to make Transformers that are not only able to achieve state-of-the-art results but also do so efficiently.

Efficient Transformers

Recently, Lukasz Kaiser, one of the co-creators of the Transformer and a researcher at Google, presented a series of improvements to make Transformers more efficient while maintaining the self-attention mechanism. The first, and probably one of the most important, aspects he focused on was memory efficiency.

Transformers use a large amount of memory because multiple intermediate tensors are created and kept during execution; as they accumulate, they quickly saturate GPU memory when large resources are not available.

The method proposed by Google Brain to get around this problem is to avoid keeping all the tensors in memory and instead make each step of the process reversible.

To do this, two tensors are maintained at each step: the one resulting from the application of the layer and a copy of the previous tensor. This allows the process to continue without storing the entire chain of intermediate tensors, only those of the last step, since the earlier ones can be recomputed when needed.
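As a minimal sketch of this reversible-layer idea (the same trick introduced by reversible residual networks and used in the Reformer), assuming F and G stand in for a block's attention and feed-forward sub-layers:

```python
import numpy as np

def rev_forward(x1, x2, F, G):
    """One reversible block: y1 = x1 + F(x2), y2 = x2 + G(y1).
    Inputs never need to be stored, because they can be recomputed
    exactly from the outputs during the backward pass."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2, F, G):
    """Recover the block's inputs from its outputs."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

# Tiny check with arbitrary stand-ins for the attention (F)
# and feed-forward (G) sub-layers.
F = lambda t: np.tanh(t)
G = lambda t: 0.5 * t
x1, x2 = np.ones((4, 8)), np.full((4, 8), 2.0)
y1, y2 = rev_forward(x1, x2, F, G)
r1, r2 = rev_inverse(y1, y2, F, G)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
```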

With this stratagem, memory costs are significantly reduced while achieving the same results as a normal Transformer. It may be one of the smartest methods currently known for keeping the architecture entirely based on self-attention, as it has traditionally been, but at a lower cost.

AFT: Attention Free Transformers

But why exactly does the calculation of self-attention become so hard to manage? Is there a way to eliminate its quadratic complexity? Do we really need attention as we currently compute it? These are the questions Apple's researchers asked themselves, and they form the basis of the Attention Free Transformer.

The problem lies in the dot product used to combine queries, keys and values, which treats every single input vector as a query against every other position. Aware of this, the Attention Free Transformer is designed to never compute that dot product while retaining its benefits.

As in the original Transformer, AFT first creates Q, K and V by linearly transforming the input with the query, key and value matrices.

The peculiarity is that, at this point, instead of computing the dot product to build an attention matrix, a weighted average of the values is computed for each target position. The result is then combined with the query by element-wise multiplication.

Through this mechanism, the computational and space complexity become linear in the number of output features and in the length of the sequence. Conceptually, it is simply a different, much less expensive way of making information flow within the sequence.
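To make the idea concrete, here is a rough NumPy sketch of AFT-simple, the most basic variant from the paper, which drops the learned pairwise position biases of the full model (so the weighted average is shared across target positions rather than varying per position); names and shapes are illustrative:

```python
import numpy as np

def aft_simple(X, Wq, Wk, Wv):
    """AFT-simple: token mixing without ever forming a (T, T) matrix.

    A softmax over the keys (across positions, per feature) weights
    the values; the query then gates the result element-wise, so the
    cost is linear in the sequence length T and feature dimension d.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # each (T, d)
    w = np.exp(K - K.max(axis=0, keepdims=True))  # softmax over the
    w /= w.sum(axis=0, keepdims=True)             # T positions, per feature
    context = (w * V).sum(axis=0)                 # (d,) weighted average of values
    return 1.0 / (1.0 + np.exp(-Q)) * context     # sigmoid(Q) * context -> (T, d)
```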

By testing the Attention Free Transformer on many tasks previously tackled in the literature with the original Transformer, it was possible to see that, in the case of the Vision Transformer for example, the features obtained with AFT (in this case the AFT-Conv variant) still appear meaningful, even if approximated, compared to those of the original model.

With this mechanism, not only was the cost of computing attention cut, but excellent results were obtained on all the tasks considered, an indication that this solution retains the advantages of the dot product without the cost of computing it.

FNet: Fourier Networks

But there are also those who have considered abandoning the calculation of attention altogether, going in search of a mechanism as effective as attention but not as costly to compute.

A good candidate for this task seems to be the Fourier transform, which does nothing more than take a function in one domain, e.g. time, and carry it into another domain, e.g. frequency.

The Fourier Network (FNet) proposed by Google and based on this mechanism is exactly the same as a normal Transformer, but with the attention block replaced by a layer that applies the Fourier transform.

Given input sequences of T tokens, the Fourier transform is first applied along what the authors call the "hidden domain" (the feature dimension of each token) and then along the "sequence domain" (across tokens). All of this happens without any parameters, which brings enormous advantages: the only trainable parameters are those of the other layers, so the overall number of model parameters is reduced.

The transformation is therefore linear and is applied to the input first column-wise and then row-wise. Besides its simplicity, the Fourier transform has the further advantage of being invertible, so no information is destroyed by the mixing step.

Exactly as in the case of AFT, this series of transformations makes it possible for the different parts of the sequence to influence each other, and the result is a transformed representation that contains information drawn from across the input sequence.
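A minimal NumPy sketch of this parameter-free mixing step, assuming the ordering described above (hidden dimension first, then sequence dimension) and keeping only the real part of the result:

```python
import numpy as np

def fourier_mixing(x):
    """Parameter-free token mixing in the style of FNet: a Fourier
    transform along the hidden dimension, then along the sequence
    dimension, keeping only the real part so later layers stay real."""
    return np.fft.fft(np.fft.fft(x, axis=-1), axis=-2).real

tokens = np.random.default_rng(0).standard_normal((128, 64))  # (T, d)
mixed = fourier_mixing(tokens)  # same shape; every token now carries
                                # information from the whole sequence
```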

On the surface, this seems a very interesting method that significantly reduces the cost of attention while achieving respectable results. In truth, FNet does not seem to outperform the classical Transformer, and there are probably better ways to obtain comparable results at a lower cost; but in the absence of large computational resources, FNet may be a genuinely valid choice.

What awaits us in the future?

It is now clear that the Transformer is an enormously powerful architecture capable of solving the most diverse problems, from translation to segmentation to classification, with results we have never seen before. However, it has for too long been tied to an exaggerated consumption of resources, and its arrival in the world of computer vision has highlighted this problem even more and prompted many researchers to seek solutions.
In the future, we may see Transformers based on attention but optimized to be lighter, or Transformers deprived of their attention mechanism to make room for more approximate techniques, or totally new networks, similar to Transformers, but with different input-transformation strategies such as the Fourier transform. And if you want to be ready and know more about them, I suggest you read my articles on Transformers and on DINO.

One thing is certain: if for the moment it has been impossible for most people to make full use of this architecture, it will soon be available to everyone, and its great potential combined with accessibility will make Transformers even more pervasive and central than they already are.

References and Insights

[1] Lukasz Kaiser, "Efficient Transformers"

[2] Shuangfei Zhai et al., "An Attention Free Transformer"

[3] James Lee-Thorp et al., "FNet: Mixing Tokens with Fourier Transforms"

[4] Yannic Kilcher, "FNet: Mixing Tokens with Fourier Transforms (Machine Learning Research Paper Explained)"

[5] Davide Coccomini, "On Transformers, Timesformers and Attention"

[6] Davide Coccomini, "On DINO, Self-Distillation with no labels"

This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.

Source: https://www.topbots.com/attention-in-transformers/