Zero-shot Learning, Explained

[Image: Bruce Warrington via Unsplash]

Machine learning models keep getting smarter in large part because they are trained on labeled data, which teaches them to discern between similar objects.

Without labeled datasets, however, you will hit major obstacles when trying to build an effective, trustworthy machine learning model: labeled data during the training phase is essential.

Deep learning is widely used to solve tasks such as computer vision through supervised learning. However, as with many things in life, it comes with restrictions. Supervised classification requires a large quantity of high-quality labeled training data to produce a robust model, and the resulting classifier can only predict the classes it was trained on; it cannot handle unseen classes.

And we all know how much computational power, re-training, time, and money it takes to train a deep learning model.

But can a model still discern between two objects without ever having seen training examples of them? Yes, through zero-shot learning. Zero-shot learning is a model’s ability to complete a task without having received any training examples of the target classes.

Humans are naturally capable of zero-shot learning without much effort. Our brains already store a rich knowledge base that lets us differentiate objects by their physical properties. We can use this knowledge base to spot the similarities and differences between objects and find the links between them.

For example, let’s say we are trying to build a classification model for animal species. According to OurWorldInData, there were 2.13 million species recorded in 2021. An effective classifier for every animal species would therefore need 2.13 million different classes, plus a large amount of labeled data for each, and high-quantity, high-quality data at that scale is hard to come by.

So how does zero-shot learning solve this problem?

Because zero-shot learning does not require the model to have been trained on labeled examples of every class it must recognize, it lets us rely far less on labeled data.

To proceed with zero-shot learning, your data will need to consist of the following.

Seen Classes

These are the classes whose labeled examples were previously used to train the model.

Unseen Classes

These are the classes that have NOT been used to train the model, and which the zero-shot learning model must generalize to.

Auxiliary Information

As the data in the unseen classes is not labeled, zero-shot learning requires auxiliary information in order to learn and to find correlations, links, and properties. This can take the form of word embeddings, textual descriptions, or other semantic information.

Zero-shot Learning Methods

Zero-shot learning methods typically fall into two groups:

  • Classifier-based methods
  • Instance-based methods

Stages

Zero-shot learning builds models for classes that have no labeled training data, and it does so in two stages:

1. Training

The training stage is where the learning method captures as much knowledge as possible about the properties of the data. We can view this as the learning phase.

2. Inference

During the inference stage, all the knowledge learned during training is applied to classify examples into a new set of classes. We can view this as the prediction phase.

How Does it Work?

The knowledge from the seen classes is transferred to the unseen classes via a high-dimensional vector space called the semantic space. For example, in image classification, the semantic information and the image go through two steps:

1. Joint embedding space

This is the common space that both the semantic vectors and the visual feature vectors are projected into.

2. Highest similarity

This is where the image’s features are matched against those of the unseen classes, and the closest match determines the prediction.
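As a toy sketch of these two steps, suppose we already have vectors projected into the joint embedding space (the numbers below are made up purely for illustration); cosine similarity then selects the closest class description:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for projections into the shared (joint) space.
image_vec = np.array([0.9, 0.1, 0.3])
text_vecs = {
    "a photo of a cat": np.array([0.8, 0.2, 0.4]),
    "a photo of a dog": np.array([0.1, 0.9, 0.2]),
}

# Match the image against each class description; highest similarity wins.
best = max(text_vecs, key=lambda t: cosine_similarity(image_vec, text_vecs[t]))
print(best)  # the description closest to the image embedding
```

In a real system the vectors come from learned encoders, but the matching step is exactly this similarity search.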

To help understand the process with the two stages (training and inference), let’s apply them in the use of image classification.

Training

[Image: Jari Hytönen via Unsplash]

As a human being, if you were to read the text paired with the image above, you would instantly assume that there are four kittens in a brown basket. But suppose you have no idea what a ‘kitten’ is. You would then assume there is a brown basket with four things inside, which are called ‘kittens’. Once you come across more images containing something that looks like a ‘kitten’, you will be able to differentiate a ‘kitten’ from other animals.

This is what happens when you use Contrastive Language-Image Pretraining (CLIP) by OpenAI for zero-shot learning in image classification: the text paired with each image serves as the auxiliary information.

You might be thinking, ‘well, that’s just labeled data’. I understand why you would think that, but it is not. Auxiliary information is not a label for the data; it is a form of supervision that helps the model learn during the training stage.

When a zero-shot learning model has seen a sufficient number of image-text pairs, it can differentiate phrases and understand how they correlate with certain patterns in the images. Using CLIP’s ‘contrastive learning’ technique, the model accumulates a strong knowledge base for making predictions on classification tasks.

Below is a summary of the CLIP approach, in which an image encoder and a text encoder are trained jointly to predict the correct pairings within a batch of (image, text) training examples:

 

[Figure: Summary of the CLIP approach, from Learning Transferable Visual Models From Natural Language Supervision]
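The contrastive objective can be sketched in a few lines of NumPy. This is a simplified stand-in for CLIP’s symmetric cross-entropy loss, with a fixed temperature rather than the learned one used in the paper:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over an (N, N) similarity matrix, CLIP-style:
    matched (image, text) pairs sit on the diagonal and act as the targets."""
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity logits
    n = logits.shape[0]

    def xent(l):
        # Softmax cross-entropy with the diagonal as the correct class.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions.
    return (xent(logits) + xent(logits.T)) / 2

# A perfectly aligned batch (text embedding == image embedding) scores a
# much lower loss than randomly mismatched pairs.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
print(clip_contrastive_loss(img, img.copy()))
```

Minimizing this loss pulls matched (image, text) pairs together in the embedding space while pushing mismatched pairs apart.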

Inference

Once the model has gone through the training stage, it has a good knowledge base of image-text pairing and can now be used to make predictions. But before we can get right into making predictions, we need to set up the classification task by creating a list of all possible labels that the model could output. 

For example, sticking with the image classification task on animal species, we will need a list of all the animal species. Each of these labels will be encoded, T₁ through Tₙ, using the pretrained text encoder from the training stage.

Once the labels have been encoded, we can input images through the pre-trained image encoder. We will use the distance metric cosine similarity to compute the similarities between the image encoding and each text label encoding.

The image is assigned the label with the greatest similarity to its encoding. And that is how zero-shot learning is achieved, specifically in image classification.
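Putting the inference steps together: the sketch below uses a hypothetical, deterministic `encode_text` function as a stand-in for CLIP’s pretrained encoders (the real towers are neural networks; only the prompt template ‘a photo of a …’ follows the convention from the CLIP paper):

```python
import hashlib
import numpy as np

def encode_text(prompt: str) -> np.ndarray:
    """Hypothetical stand-in for CLIP's pretrained text encoder: maps a
    prompt deterministically to a unit vector. A real system would use the
    frozen text tower, which produces embeddings aligned with images."""
    seed = int.from_bytes(hashlib.sha256(prompt.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).normal(size=16)
    return v / np.linalg.norm(v)

def classify(image_emb: np.ndarray, labels: list) -> str:
    """Encode every candidate label once, then return the label whose text
    embedding has the highest cosine similarity with the image embedding."""
    text_embs = np.stack([encode_text(f"a photo of a {label}") for label in labels])
    image_emb = image_emb / np.linalg.norm(image_emb)
    sims = text_embs @ image_emb  # cosine similarities (unit vectors)
    return labels[int(np.argmax(sims))]

labels = ["zebra", "horse", "okapi"]
# Pretend the image encoder produced an embedding close to the "zebra" prompt.
image_emb = encode_text("a photo of a zebra")
image_emb = image_emb + 0.05 * np.random.default_rng(1).normal(size=16)
print(classify(image_emb, labels))
```

Note that the label list is only needed at inference time, which is why new classes can be added without retraining anything.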

Scarcity of Data

As mentioned before, high-quantity, high-quality data is hard to get your hands on. Unlike humans, who already possess the zero-shot learning ability, machines require labeled input data to learn and then adapt to variances that may naturally occur.

If we look back at the animal species example, there were 2.13 million of them. And as the number of categories continues to grow across domains, keeping up with collecting annotated data takes a lot of work.

Due to this, zero-shot learning has become more valuable to us. More and more researchers are interested in automatic attribute recognition to compensate for the lack of available data. 

Data Labeling

Another benefit of zero-shot learning is that it reduces the need for data labeling. Data labeling is labor-intensive and very tedious, which can lead to errors during the process. It also often requires experts, such as medical professionals working on a biomedical dataset, which is highly expensive and time-consuming.

Zero-shot learning is becoming more popular due to the above limitations of data. If you are interested in its abilities, the CLIP paper, Learning Transferable Visual Models From Natural Language Supervision, is a good place to start.
Nisha Arya is a Data Scientist and Freelance Technical Writer. She is particularly interested in providing Data Science career advice and tutorials, and theory-based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence can benefit the longevity of human life. A keen learner, seeking to broaden her tech knowledge and writing skills, whilst helping guide others.
 
