Visual Transformers: More Than Meets The Eye

Soner Can KALKAN
Jan 22, 2023 · 7 min read


A Vision Transformer is a type of neural network architecture used in computer vision tasks. It is based on the transformer architecture, which was originally introduced in the field of natural language processing (NLP) in the paper "Attention Is All You Need" by Vaswani et al. [1]. It is a must-read paper for every AI lover.

The key idea behind the transformer architecture (more than meets the eye 😉) is the use of self-attention, which allows the model to weigh the importance of different parts of the input when making predictions. In the case of a Vision Transformer, the input is an image, and the self-attention mechanism allows the model to focus on specific regions of the image when making predictions about the image's content.
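
To make that concrete, here is a minimal sketch of single-head scaled dot-product self-attention in PyTorch. The function and weight names are mine, chosen only for illustration; real transformer layers add multiple heads, learned projection modules, and residual connections.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # project the input to queries, keys, and values
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # similarity of every patch with every other patch, scaled for stability
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)   # attention weights per patch
    return weights @ v                    # weighted sum of the values

# toy usage: a batch of one "image" represented as 4 patch embeddings of size 8
dim = 8
x = torch.randn(1, 4, dim)
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)    # shape: (1, 4, 8)
```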

Vision Transformer General Architecture [2]

The Vision Transformer architecture consists of three main parts: a patch-embedding step, a transformer encoder, and a classification head. The input image is first split into fixed-size patches, which are linearly projected into embedding vectors and combined with position embeddings. The encoder then extracts features from these patch embeddings: it is made up of multiple layers of self-attention and feed-forward neural networks, where each layer takes in the output of the previous layer and applies self-attention and feed-forward operations to it. The self-attention operation allows the model to weigh the importance of different parts of the input image when extracting features, so it can focus on specific regions of the image while learning. After the image has passed through the encoder, the resulting representation (typically that of a special classification token) is passed to the classification head. Note that the original ViT, unlike the NLP transformer, uses only the encoder half of the architecture [2]. These models use a large amount of computational power and resources to achieve good results.

The output of the classification head is then used to make predictions about the image's content. In the case of image classification, the output is typically a vector of class scores, which is passed through a softmax function to produce a probability distribution over the classes.
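
To show the whole data flow end to end, here is a minimal, illustrative ViT-style classifier in PyTorch. The class name and all hyperparameters are mine, chosen only for the sketch, not taken from any published model:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Patch embedding -> transformer encoder -> classification head."""
    def __init__(self, img_size=32, patch=8, dim=64, depth=2, heads=4, num_classes=10):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # patch embedding as a strided convolution (a common implementation trick)
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_emb = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, img):                       # img: (B, 3, H, W)
        x = self.to_patches(img)                  # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)          # (B, n_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_emb
        x = self.encoder(x)
        return self.head(x[:, 0])                 # class scores from the class token

logits = TinyViT()(torch.randn(2, 3, 32, 32))     # (2, 10); apply softmax for probabilities
```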

Another important feature of Vision Transformers is that they can be pre-trained on a large amount of image data and then fine-tuned on a smaller, task-specific dataset, which leads to better performance.

It is important to note that Vision Transformers are computationally expensive: they require a large amount of data and significant computational resources to train. Because of that, pre-trained models are most often used for general-purpose tasks, with fine-tuning of course.
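
In practice that usually means loading a published checkpoint and swapping in a fresh classification head. A minimal sketch, assuming the third-party timm library is installed (pip install timm):

```python
import timm

# load an ImageNet-pre-trained ViT with a new 10-class head
model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)

# optionally freeze the backbone and fine-tune only the new head
for name, param in model.named_parameters():
    if 'head' not in name:
        param.requires_grad = False
```

Training then proceeds as usual on the smaller task-specific dataset.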

CNN

A Convolutional Neural Network (CNN) is a type of neural network architecture that is typically used for image and video processing tasks. It is called "convolutional" because it is based on the mathematical operation of convolution, which is used to extract features from the input data.

A CNN is composed of multiple layers, each of which performs a different operation on the input data. The basic building block of a CNN is the convolutional layer, which applies a set of learnable filters to the input. The filters are small, typically 3x3 or 5x5 in size, and are moved across the input, pixel by pixel, in order to extract features from the input.
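
In PyTorch terms, a convolutional layer with sixteen 3x3 filters looks like this (the shapes are illustrative):

```python
import torch
import torch.nn as nn

# 16 learnable 3x3 filters applied to a 3-channel (RGB) input
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

img = torch.randn(1, 3, 224, 224)   # one RGB image
features = conv(img)                # (1, 16, 224, 224): one feature map per filter
```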

General CNN Architecture [3]

For example, a filter that detects edges would be activated when it is moved over an image and encounters an edge. The convolutional layer applies many such filters to the input, each of which can extract a different feature from the input.

After the convolutional layer there are typically one or more pooling layers, which are used to reduce the dimensionality of the input. Pooling layers typically use a function such as max or average pooling to summarize the values in a small region of the input. This reduces the dimensionality of the data and also helps make the model more robust to small changes in the input.
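
For instance, a 2x2 max-pooling layer halves each spatial dimension (a small sketch with illustrative shapes):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)        # keep the maximum of each 2x2 region
features = torch.randn(1, 16, 224, 224)   # feature maps from a convolutional layer
pooled = pool(features)                   # (1, 16, 112, 112): spatial size halved
```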

Following the pooling layers, there are fully connected layers, also known as dense layers, which perform a matrix multiplication of the input with a set of weights and then apply an activation function to the result. These layers are used to learn higher-level features from the input and make predictions about it.

Finally, the output layer produces the final predictions. CNNs are widely used in various image processing tasks such as image classification, object detection, semantic segmentation, and many more.
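
Putting the pieces together, a minimal CNN classifier along these lines might look as follows (the layer sizes are illustrative, chosen for 32x32 RGB inputs):

```python
import torch
import torch.nn as nn

# conv -> pool -> conv -> pool -> dense -> output
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                           # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                           # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 64), nn.ReLU(),      # fully connected (dense) layer
    nn.Linear(64, 10),                         # output layer: 10 class scores
)

logits = model(torch.randn(1, 3, 32, 32))      # (1, 10)
```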

ViT Pros and Cons

Pros:

  • Pre-training and fine-tuning: Because of their large capacity, Vision Transformers can be pre-trained on a large dataset of images and then fine-tuned on a smaller dataset specific to a particular task.
  • Handling varied input sizes: Vision Transformers are not tied to a single input resolution; with position-embedding interpolation they can process images of varying resolution, which is useful for high-resolution image analysis tasks.
  • Fewer built-in assumptions: Vision Transformers encode fewer image-specific inductive biases than traditional CNNs, which makes them very flexible learners when enough training data is available.

Cons:

  • Computational cost: Vision Transformers are computationally expensive to train and to run inference with; they require a lot of computational resources.
  • Data requirements: Vision Transformers also require a large amount of labeled data to train well, which can be a limitation.
  • Complexity: The Vision Transformer architecture is more complex than that of traditional CNNs, which can make these models more difficult to design and train.
  • Lack of interpretability: Vision Transformers are considered black-box models; it's hard to understand how the model makes its predictions, which can make those predictions difficult to trust.

CNN Pros and Cons

Pros:

  • Local connectivity: CNNs are designed to process data that has a grid-like structure, such as an image. They do this by applying convolutional filters to small regions of the input, which allows them to learn local features of the input. This is particularly useful for image-related tasks where spatial information is crucial.
  • Fewer parameters: CNNs generally have fewer parameters than fully connected neural networks, which makes them easier to train and less prone to overfitting, particularly when working with small datasets.
  • Well-established: CNNs have been widely used in computer vision and image processing tasks, with lots of pre-trained models available, which makes it easy to use them in practical applications.
  • Interpretability: CNNs are easier to interpret than many other neural networks. Through visualization techniques, such as saliency maps, it's possible to understand which regions of the input are most important for a specific prediction.

Cons:

  • Limited context: Due to the use of pooling layers, CNNs tend to lose spatial information, which can limit the context the model can take into account when making predictions.
  • Scale-variant: CNNs are sensitive to the scale of the objects in the image, and may perform poorly on images that contain objects of different scales.
  • Requires preprocessing: CNNs require images to be preprocessed and normalized to a specific size, which can be a limitation in some cases.
  • Requires a large amount of data: CNNs require a large amount of labeled data to train, which can be a limitation.

ViT vs CNN

The main difference between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) is the way they process input data.

CNNs are designed to process data that has a grid-like structure, such as an image. They do this by applying convolutional filters to small regions of the input, which allows them to learn local features of the input. This is done by convolutional layers, in which the filters are moved over the image to extract features from different regions of the image.

CNNs also typically include pooling layers, which are used to downsample the input and reduce its dimensionality. This is done by taking the maximum or average of the values in a small region of the input, which helps to make the model more robust to small changes in the input.

On the other hand, Vision Transformers are based on the transformer architecture, which was originally introduced in the field of natural language processing (NLP). They use a self-attention mechanism to weigh the importance of different parts of the input when making predictions.

In ViT models, the input image is split into a set of non-overlapping patches, which are then passed through a transformer encoder to extract features from the image. The encoder is composed of multiple layers of self-attention and feed-forward neural networks, and its output (typically the representation of a special classification token) is passed to a classification head that makes predictions about the image's content.
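
The patch-extraction step can be written in a few lines of PyTorch (a sketch; the function name is mine):

```python
import torch

def to_patches(img, p=16):
    # split images into non-overlapping p x p patches, flattened to vectors
    B, C, H, W = img.shape
    patches = img.unfold(2, p, p).unfold(3, p, p)   # (B, C, H/p, W/p, p, p)
    patches = patches.permute(0, 2, 3, 1, 4, 5)     # group by spatial position
    return patches.reshape(B, (H // p) * (W // p), C * p * p)

x = to_patches(torch.randn(2, 3, 224, 224))         # (2, 196, 768)
```

Each 768-dimensional vector (3 x 16 x 16 pixel values) is then linearly projected before entering the encoder.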

In summary, CNNs process an image by applying convolutional filters and pooling layers to extract local features, while Vision Transformers use a self-attention mechanism to weigh the importance of different regions of the image when making predictions.

Conclusion

At the end of the day, most data science discussions end with "It depends on your needs", and this one is no different. You can always apply Occam's razor: pick the simplest solution that meets your need. See you at the next reading . . .

References

[1] Vaswani et al., "Attention Is All You Need": https://arxiv.org/abs/1706.03762

[2] Dosovitskiy et al., "An Image is Worth 16x16 Words": https://arxiv.org/abs/2010.11929v2

[3] "Basic CNN Architecture", upGrad: https://www.upgrad.com/blog/basic-cnn-architecture/
