New algorithm from Facebook researchers ushers in new image recognition paradigm
“VICReg could be used to model the dependencies between a video clip and the frame that follows, thus learning to predict the future in a video.”
Adrien Bardes, Facebook AI Research
Humans have an innate ability to identify objects in nature, even from a blurry glimpse. We do this efficiently by remembering only the high-level features that get the job done (identification) and ignoring details unless they are necessary. In the context of deep learning algorithms that perform object detection, contrastive learning has explored the premise of learning representations that capture the big picture instead of doing the heavy lifting of devouring detail at the pixel level. But contrastive learning has its own limits.
According to Andrew Ng, pre-training methods can suffer from three common faults: generating identical representations for different input examples (which, in linear regression, amounts to always predicting the mean), generating different representations for examples that humans find similar (e.g., the same object seen from two angles), and generating redundant parts of a representation (for example, several vector components that all encode the two eyes in a photo of a face). Problems with learning representations, Ng writes, boil down to problems of variance, invariance and covariance.
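Two of these failure modes can be checked directly on a batch of embeddings. The sketch below is purely illustrative (the function name and thresholds are ours, not Ng's or the paper's): it flags dimensions that barely vary, the symptom of collapse, and dimensions that are near-duplicates of another, the symptom of redundancy.

```python
import numpy as np

def diagnose_embeddings(z, var_floor=1e-3, corr_ceiling=0.95):
    """Flag two failure modes in a batch of embeddings z (n samples x d dims).

    Returns two boolean arrays of length d: `collapsed` marks dimensions
    whose variance is near zero, and `redundant` marks dimensions almost
    perfectly correlated with some other dimension. Thresholds are
    illustrative, not taken from any paper.
    """
    collapsed = z.var(axis=0) < var_floor
    corr = np.corrcoef(z, rowvar=False)   # d x d correlation matrix
    np.fill_diagonal(corr, 0.0)           # ignore self-correlation
    redundant = np.abs(corr).max(axis=0) > corr_ceiling
    return collapsed, redundant
```

On a batch where one dimension is nearly constant and another is a copy of a third, the first comes back flagged as collapsed and the latter two as redundant.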
Andrew Ng’s observations refer to a new self-supervised algorithm called Variance-Invariance-Covariance Regularization (VICReg), published by researchers from Facebook AI, PSL Research University and New York University, among them Turing Award winner Yann LeCun, and based on LeCun’s Barlow Twins method.
The researchers designed VICReg (Variance-Invariance-Covariance Regularization) to avoid the collapse problem, which contrastive methods handle far less efficiently. They do this by introducing a simple regularization term on the variance of the embeddings along each dimension individually, and by combining this variance term with a decorrelation mechanism that regularizes the covariance to reduce redundancy. The authors state that VICReg performs on par with several state-of-the-art methods.
VICReg is a simple, self-supervised approach to image representation, and its goals are:
- Learn invariance to different views of the same image with an invariance term.
- Avoid representation collapse with a variance regularization term.
- Spread the information across the different dimensions of the representations with a covariance regularization term.
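Put together, the three terms form a single loss over two batches of embeddings, one per augmented view. The NumPy sketch below is an illustrative reimplementation, not the authors' code; the coefficients match the defaults reported in the paper, but treat them as tunable hyperparameters.

```python
import numpy as np

def vicreg_loss(z_a, z_b, lam=25.0, mu=25.0, nu=1.0, gamma=1.0, eps=1e-4):
    """VICReg objective for two (n, d) batches of embeddings, one per view."""
    n, d = z_a.shape

    # Invariance term: two views of the same sample should embed alike.
    invariance = np.mean((z_a - z_b) ** 2)

    def variance_term(z):
        # Hinge keeps the std of each dimension above gamma, so the
        # embeddings cannot all collapse onto a single constant vector.
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, gamma - std))

    def covariance_term(z):
        # Penalize squared off-diagonal covariances so that different
        # dimensions carry decorrelated, non-redundant information.
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    return (lam * invariance
            + mu * (variance_term(z_a) + variance_term(z_b))
            + nu * (covariance_term(z_a) + covariance_term(z_b)))
```

Notice that no term in this loss compares one sample against a different sample, which is exactly why no negative pairs are needed: a collapsed batch of constant vectors is penalized by the variance hinge alone.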
The results show that VICReg works on par with leading-edge methods and ushers in a new paradigm of non-contrastive self-supervised learning.
What the authors had to say
Speaking to Analytics India Magazine about the importance of VICReg, lead author Adrien Bardes, who is also a resident doctoral student at Facebook AI Research, Paris, said that self-supervised representation learning is a learning paradigm that aims to learn meaningful representations of data. Recent approaches are based on Siamese networks and maximize the similarity between two augmented views of the same input. One trivial solution is for the network to produce constant vectors, a failure known as the collapse problem. VICReg is a new algorithm, also based on Siamese networks, that avoids collapse by regularizing the variance and covariance of the network’s outputs. It achieves state-of-the-art results on multiple computer vision benchmarks while remaining a simple and interpretable approach.
Asked how VICReg addresses the shortcomings of contrastive learning methods, Bardes explained that contrastive learning methods rest on a simple principle. They pull inputs that should encode similar information closer together in the embedding space, and prevent collapse by pushing apart inputs that should encode dissimilar information. This process requires mining a massive number of negative pairs, i.e. pairs of dissimilar inputs. Recent contrastive approaches to self-supervised learning use different strategies to mine these negative pairs: they can sample them from a memory bank, as in MoCo, or from the current batch, as in SimCLR, both of which are time-consuming or memory-intensive. VICReg, on the other hand, does not require negative pairs; it implicitly prevents collapse by requiring that the representations differ from one another, without making direct comparisons between examples. It therefore needs neither MoCo’s memory bank nor SimCLR’s large batch sizes.
For Bardes, self-supervised learning is possibly the most exciting subject in machine learning research. Data annotation is a laborious process carried out by humans, who have biases and make mistakes. It is therefore impossible to annotate all the data available today, for example medical or astronomical data, or the images and videos on the Internet. Training models that harness all of this data can only be achieved through self-supervised learning. This is one of the motivations behind the development of VICReg.
Bardes believes that VICReg is applicable in any scenario where one wants to model the relationships within a dataset. It can be used with any type of data: images, videos, text, audio, or proteins. For example, it could model the dependencies between a video clip and the frame that follows, thus learning to predict the future in a video. Another example would be modeling the relationship between the graph of a molecule and its image seen under a microscope.
“We are in the early stages of developing self-supervised learning. Switching from contrastive to non-contrastive methods is the first step towards more practical algorithms. Current approaches rely on handcrafted data augmentations that can be viewed as a form of supervision. The next step will probably be to get rid of these augmentations. Another promising avenue is handling uncertainty in data modeling. Current methods are mostly deterministic and always model the same relationship between two inputs. Going back to the frame-prediction example, current methods would model only one possible future of a video clip. Future approaches will likely use latent variables that model the space of possible predictions,” Bardes concluded.