RESEARCH
Networks learn networks
Equivariant Architectures for Learning in Deep Weight Spaces
Designing machine learning architectures for processing neural networks in their raw weight matrix form is a newly introduced research direction with a wide range of intriguing applications. Unfortunately, the unique symmetry structure of deep weight spaces makes this design very challenging. We present a novel network architecture for learning in deep weight spaces, which is equivariant to the natural permutation symmetry of the MLPs. We demonstrate the effectiveness of our architecture and its advantages over natural baselines in various learning tasks.
Learning with limited data
Guided Deep Kernel Learning
Combining Gaussian processes (GPs) with the expressive power of deep neural networks is commonly done through deep kernel learning (DKL). Unfortunately, the kernel optimization process often causes these models to lose their Bayesian benefits. In this study, we present a novel approach for learning deep kernels by utilizing infinite-width neural networks. We propose using the Neural Network Gaussian Process (NNGP) model as a guide for the DKL model during optimization. Our approach harnesses the reliable uncertainty estimation of NNGPs to adapt the DKL target confidence when it encounters novel data points. As a result, we get the best of both worlds: we leverage the Bayesian behavior of the NNGP, namely its robustness to overfitting and accurate uncertainty estimation, while maintaining the generalization abilities, scalability, and flexibility of deep kernels.
Code here: https://github.com/IdanAchituve/GDKL
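As a rough illustration of the deep-kernel construction that GDKL builds on, the sketch below applies an RBF kernel to the outputs of a small feature network and trains it with the standard GP marginal likelihood. The class and function names are illustrative (not from the GDKL repository), and the NNGP guidance itself, which requires an infinite-width kernel, is not shown.

```python
import torch
import torch.nn as nn

class DeepRBFKernel(nn.Module):
    """Deep kernel: an RBF kernel on learned features, k(x, x') = exp(-||f(x) - f(x')||^2 / (2 l^2))."""
    def __init__(self, in_dim, feat_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        self.log_lengthscale = nn.Parameter(torch.zeros(()))

    def forward(self, x1, x2):
        f1, f2 = self.net(x1), self.net(x2)                  # embed both sets of inputs
        d2 = torch.cdist(f1, f2).pow(2)                      # squared distances in feature space
        return torch.exp(-0.5 * d2 / self.log_lengthscale.exp() ** 2)

def gp_nll(kernel, x, y, noise=0.1):
    """Negative log marginal likelihood of GP regression under the deep kernel (noise fixed for brevity)."""
    K = kernel(x, x) + noise * torch.eye(len(x))
    L = torch.linalg.cholesky(K)
    alpha = torch.cholesky_solve(y.unsqueeze(-1), L)
    return 0.5 * (y.unsqueeze(-1) * alpha).sum() + torch.log(torch.diagonal(L)).sum()
```

Minimizing `gp_nll` alone is the plain DKL objective that tends to overfit; GDKL additionally steers this optimization with the predictive uncertainty of an NNGP prior.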
Federated Learning
Personalized Federated Learning with Gaussian Processes
Federated learning aims to learn a global model that performs well on client devices with limited cross-client communication. Personalized federated learning (PFL) further extends this setup to handle data heterogeneity between clients by learning personalized models. A key challenge in this setting is to learn effectively across clients even though each client has unique data that is often limited in size. We developed pFedGP, a solution to PFL that is based on Gaussian processes (GPs) with deep kernel learning. We propose learning a shared kernel function across all clients, parameterized by a neural network, with a personal GP classifier for each client. Extensive experiments on standard PFL benchmarks with CIFAR-10, CIFAR-100, and CINIC-10, and on a new setup of learning under input noise, show that pFedGP achieves well-calibrated predictions while significantly outperforming baseline methods, with accuracy gains of up to 21%.
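A minimal sketch of this structure (illustrative only, not the pFedGP implementation): one feature network parameterizes the kernel and is shared and aggregated across clients, while each client fits its own GP classifier on the shared feature space. Here sklearn's GP classifier stands in for the Pólya-Gamma-based GP used in the paper.

```python
import torch
import torch.nn as nn
from sklearn.gaussian_process import GaussianProcessClassifier

# Shared across all clients: a small feature network that parameterizes the kernel.
# In federated training, only this network's parameters are communicated and aggregated;
# the personal GP classifiers below never leave their clients.
shared_net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))

def fit_personal_gp(x, y):
    """Fit a client's personal GP classifier on top of the shared feature space."""
    with torch.no_grad():
        feats = shared_net(x).numpy()
    return GaussianProcessClassifier().fit(feats, y.numpy())

# Example: two clients with different local data, one personal GP each.
clients = [(torch.randn(40, 32), torch.randint(0, 3, (40,))),
           (torch.randn(25, 32), torch.randint(0, 3, (25,)))]
personal_models = [fit_personal_gp(x, y) for x, y in clients]
```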
Incremental Learning
GP-Tree: A Gaussian Process Classifier for Few-Shot Incremental Learning
Gaussian processes (GPs) are non-parametric, flexible models that work well in many tasks. Combining GPs with deep learning methods via deep kernel learning is especially compelling due to the strong expressive power induced by the network. However, inference in GPs, whether with or without deep kernel learning, can be computationally challenging on large datasets. Here, we propose GP-Tree, a novel method for multi-class classification with Gaussian processes and deep kernel learning. We develop a tree-based hierarchical model in which each internal node of the tree fits a GP to the data using the Polya-Gamma augmentation scheme. As a result, our method scales well with both the number of classes and the data size. We demonstrate our method's effectiveness against other Gaussian process training baselines, and we show how our general GP approach is easily applied to incremental few-shot learning, where it reaches state-of-the-art performance.
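A toy version of the tree construction could look like the sketch below (hypothetical code; GP-Tree itself uses Pólya-Gamma augmented GPs on deep-kernel features at each node, and a more careful class split). Classes are split recursively into two groups, a binary GP classifier is fit at every internal node, and a test point's class probability is the product of branch probabilities along the root-to-leaf path, so inference cost grows with the tree depth rather than with the number of classes.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier

class Node:
    def __init__(self, classes):
        self.classes, self.left, self.right, self.clf = classes, None, None, None

def build_tree(X, y, classes):
    """Recursively split the class set in half and fit a binary GP classifier at each internal node."""
    node = Node(classes)
    if len(classes) == 1:
        return node
    left, right = classes[: len(classes) // 2], classes[len(classes) // 2:]
    mask = np.isin(y, classes)                            # only samples of this node's classes
    node.clf = GaussianProcessClassifier().fit(X[mask], np.isin(y[mask], right).astype(int))
    node.left, node.right = build_tree(X, y, left), build_tree(X, y, right)
    return node

def predict_proba(node, x, p=1.0):
    """Class probability = product of branch probabilities along the root-to-leaf path."""
    if len(node.classes) == 1:
        return {node.classes[0]: p}
    p_right = node.clf.predict_proba(x.reshape(1, -1))[0, 1]
    out = predict_proba(node.left, x, p * (1.0 - p_right))
    out.update(predict_proba(node.right, x, p * p_right))
    return out
```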
Multi-Task learning
Auxiliary Learning by Implicit Differentiation (ICLR 2021)
Training neural networks with auxiliary tasks is a common practice for improving the performance of the main task. Two main challenges arise in this multi-task learning setting: (i) designing useful auxiliary tasks; and (ii) combining auxiliary tasks into a single coherent loss. Here, we propose a novel framework, AuxiLearn, that targets both challenges using implicit differentiation. When useful auxiliaries are known, we propose learning a network that non-linearly combines all losses into a single coherent objective function. When no useful auxiliary task is known, we learn a network that generates a meaningful, novel auxiliary task.
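A rough sketch of the first variant (names are illustrative, not from the AuxiLearn code): a small network with non-negative weights maps the vector of per-task losses to a single training objective, which keeps the combination monotone in every loss. Its own parameters would then be updated on main-task validation loss via implicit differentiation, which is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LossCombiner(nn.Module):
    """Maps a vector of task losses to a single scalar objective.
    Softplus on the weights keeps the combination monotone in each loss."""
    def __init__(self, n_tasks, hidden=8):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(n_tasks, hidden) * 0.1)
        self.w2 = nn.Parameter(torch.randn(hidden, 1) * 0.1)

    def forward(self, losses):                              # losses: (n_tasks,)
        h = torch.relu(losses @ F.softplus(self.w1))
        return (h @ F.softplus(self.w2)).squeeze()

# Inner step: the main model is trained on the combined loss.
combiner = LossCombiner(n_tasks=3)
task_losses = torch.stack([torch.tensor(0.7), torch.tensor(1.2), torch.tensor(0.3)])
total = combiner(task_losses)   # in practice: total.backward() w.r.t. the main model's parameters
# Outer step (not shown): update the combiner's parameters by differentiating the
# main-task validation loss through the inner optimization (implicit differentiation).
```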
Multi-source domain adaptation
Teacher-Student Consistency For Multi-Source Domain Adaptation
In Multi-Source Domain Adaptation (MSDA), models are trained on samples from multiple source domains and used for inference on a different, target domain. Mainstream domain adaptation approaches learn a joint representation of the source and target domains. Unfortunately, a joint representation may emphasize features that are useful for the source domains but hurt inference on the target (negative transfer), or remove essential information about the target domain (knowledge fading).
We propose Multi-source Student-Teacher (MUST), a novel procedure designed to alleviate these issues. The key idea has two steps: First, we train a teacher network on source labels and infer pseudo labels on the target. Then, we train a student network using the pseudo labels and regularize the teacher to fit the student predictions. This regularization helps the teacher's predictions on the target data remain consistent between epochs. Evaluations of MUST on three MSDA benchmarks (digits, text sentiment analysis, and visual-object recognition) show that MUST outperforms the current SoTA, sometimes by a very large margin. We further analyze the solutions and the dynamics of the optimization, showing that the learned models follow the target distribution density, implicitly using the information in the unlabeled target data.
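One MUST-style round can be sketched as follows (a schematic, not the paper's exact losses or schedule): the teacher fits the labeled source batch, produces pseudo labels for the target batch, the student trains on those pseudo labels, and the teacher is then regularized toward the student's target predictions.

```python
import torch
import torch.nn.functional as F

def must_round(teacher, student, opt_t, opt_s, xs, ys, xt):
    """One schematic round of the teacher-student procedure.
    xs, ys: labeled source batch; xt: unlabeled target batch."""
    # 1) Teacher learns from the source labels.
    opt_t.zero_grad()
    F.cross_entropy(teacher(xs), ys).backward()
    opt_t.step()

    # 2) Teacher infers pseudo labels on the target.
    with torch.no_grad():
        pseudo = teacher(xt).argmax(dim=1)

    # 3) Student trains on the pseudo-labeled target.
    opt_s.zero_grad()
    F.cross_entropy(student(xt), pseudo).backward()
    opt_s.step()

    # 4) Consistency: pull the teacher's target predictions toward the student's.
    opt_t.zero_grad()
    with torch.no_grad():
        student_probs = F.softmax(student(xt), dim=1)
    F.kl_div(F.log_softmax(teacher(xt), dim=1), student_probs, reduction="batchmean").backward()
    opt_t.step()
```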
Self-supervised point clouds
Self-supervised learning for domain adaptation on point clouds (WACV 2021)
Self-supervised learning (SSL) is a technique for learning useful representations from unlabeled data. It has been applied effectively to domain adaptation (DA) on images and videos. It is still unknown if and how it can be leveraged for domain adaptation in 3D perception problems. Here we describe the first study of SSL for DA on point clouds. We introduce a new family of pretext tasks, Deformation Reconstruction, inspired by the deformations encountered in sim-to-real transformations. In addition, we propose a novel training procedure for labeled point cloud data, motivated by MixUp, called Point Cloud Mixup (PCM). Evaluations on domain adaptation datasets for classification and segmentation demonstrate a large improvement over existing and baseline methods.
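One plausible instantiation of a point-cloud MixUp (the exact PCM recipe in the paper may differ) is to sample a mixing ratio, take proportional random subsets of points from two clouds, and mix the one-hot labels with the same ratio:

```python
import numpy as np

def point_cloud_mixup(pc1, y1, pc2, y2, num_classes, alpha=1.0):
    """pc1, pc2: point clouds of shape (N, 3); y1, y2: integer class labels.
    Take lam*N points from the first cloud and (1-lam)*N from the second,
    and mix the one-hot labels with the same ratio."""
    n = pc1.shape[0]
    lam = np.random.beta(alpha, alpha)
    k = int(round(lam * n))
    idx1 = np.random.choice(pc1.shape[0], k, replace=False)
    idx2 = np.random.choice(pc2.shape[0], n - k, replace=False)
    mixed_pc = np.concatenate([pc1[idx1], pc2[idx2]], axis=0)
    mixed_y = lam * np.eye(num_classes)[y1] + (1 - lam) * np.eye(num_classes)[y2]
    return mixed_pc, mixed_y

# Example: mix two random 1024-point clouds from classes 0 and 3 (10 classes total).
pc, label = point_cloud_mixup(np.random.randn(1024, 3), 0, np.random.randn(1024, 3), 3, num_classes=10)
```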
Long-tail learning
Long-tail learning with attributes (ECCV 2020)
Real-world data is predominantly unbalanced and long-tailed, but deep models struggle to recognize rare classes in the presence of frequent classes. Often, classes are accompanied by side information like textual descriptions, but it is not fully clear how to use it for learning with unbalanced long-tail data. We describe DRAGON, a late-fusion architecture for long-tail learning with class descriptors. It learns to (1) correct the bias towards head classes on a sample-by-sample basis; and (2) fuse information from class descriptions to improve tail-class accuracy. DRAGON outperforms state-of-the-art models on a new benchmark and also sets a new SoTA on existing benchmarks for GFSL with class descriptors (GFSL-d) and standard (vision-only) long-tail learning.
Incremental Learning with Limited Access
Learning New Classes Without Forgetting the Original Ones (EMNLP 2019)
We address the problem of adding new classes to an existing classifier without hurting the original classes, when no access is allowed to any sample from the original classes. This problem arises frequently, since models are often shared without their training data due to privacy and data-ownership concerns. We propose an easy-to-use approach that modifies the original classifier by retraining a suitable subset of layers using a linearly-tuned knowledge-distillation regularization. The set of layers that is tuned depends on the number of newly added classes and the number of original classes.
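A hedged sketch of the core loss (illustrative, not the exact formulation in the paper): cross-entropy on the new-class samples plus a distillation term that keeps the tuned model's logits on the original classes close to those of the frozen original classifier, since no original-class samples are available.

```python
import torch
import torch.nn.functional as F

def expansion_loss(new_model, old_model, x_new, y_new, n_old, T=2.0, lam=1.0):
    """x_new, y_new: samples of the newly added classes (no original-class data is available).
    The first n_old output units of new_model correspond to the original classes."""
    logits = new_model(x_new)
    ce = F.cross_entropy(logits, y_new)                    # learn the new classes
    with torch.no_grad():
        old_logits = old_model(x_new)                      # frozen original classifier
    kd = F.kl_div(F.log_softmax(logits[:, :n_old] / T, dim=1),
                  F.softmax(old_logits / T, dim=1),
                  reduction="batchmean") * T * T           # preserve behaviour on the original classes
    return ce + lam * kd
```

Here `lam` and the temperature `T` are simply fixed hyperparameters; the choice of which layers to retrain, discussed above, is orthogonal to this loss.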
Cooperative Image Captioning
Joint Optimization of Networks for Image Captioning (ICCV 2019)
In the image captioning task, descriptions can be made more informative if they are tuned using a downstream task. The challenge is the discrete nature of natural language, which makes the optimization hard. To address this challenge, we developed a new, effective optimization method. Our method takes advantage of the cooperative game between the two networks by transmitting more information to the downstream task.
Zero-Shot learning
Probabilistic AND-OR attribute grouping for Zero-shot learning (2018)
In zero-shot learning (ZSL), classifiers are trained to recognize visual classes without any image samples. Instead, the classifier is given semantic information about the class, like a textual description or a set of attributes. We describe a probabilistic model, trained end-to-end, designed to capture natural soft AND-OR relations across groups of attributes.
Discriminative captions
Describe images in natural language, taking context into account
We introduce an inference technique to produce discriminative context-aware image captions (captions that describe differences between images or visual concepts) using only generic context-agnostic training data (captions that describe a concept or an image in isolation). For example, given images and captions of "siamese cat" and "tiger cat", we generate language that describes the "siamese cat" in a way that distinguishes it from "tiger cat".
Metric Learning (2016)
Learning Sparse Metrics, One Feature at a Time
Learning distance metrics from data amounts to optimization over the cone of positive definite (PD) matrices. This optimization is difficult because restricting the search to remain within the PD cone, or repeatedly projecting onto the cone, is prohibitively costly. We describe COMET, a block-coordinate descent procedure that efficiently keeps the search within the PD cone, avoiding both costly projections and unnecessary computation of full gradients.
Learning a measure of similarity between pairs of objects is an important generic problem in machine learning. It is particularly useful in large-scale applications like searching for an image that is similar to a given image or finding videos that are relevant to a given video...
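The coordinate-descent idea can be sketched as follows (hypothetical code; COMET itself derives the admissible step in closed form rather than using the Cholesky-based backtracking shown here): update one symmetric pair of entries of the metric at a time, shrinking the step until the matrix remains positive definite.

```python
import numpy as np

def is_pd(M):
    """Positive-definiteness check via a Cholesky factorization attempt."""
    try:
        np.linalg.cholesky(M)
        return True
    except np.linalg.LinAlgError:
        return False

def coordinate_step(W, grad, i, j, step=0.1):
    """Update only entries (i, j) and (j, i), backtracking so W stays inside the PD cone."""
    while step > 1e-8:
        W_new = W.copy()
        W_new[i, j] -= step * grad[i, j]
        W_new[j, i] = W_new[i, j]                    # keep the metric symmetric
        if is_pd(W_new):
            return W_new
        step /= 2.0                                  # shrink the step until PD is preserved
    return W

# Example: one sweep of coordinate updates over a 5x5 metric.
d = 5
W = np.eye(d)
grad = np.random.randn(d, d)
grad = (grad + grad.T) / 2
for i in range(d):
    for j in range(i, d):
        W = coordinate_step(W, grad, i, j)
```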
LORETA
Low rank similarity learning
When learning models that are represented in matrix form, enforcing a low-rank constraint can dramatically improve the memory and run-time complexity while providing a natural regularization of the model. Naive approaches for minimizing functions over the set of low-rank matrices are either prohibitively time-consuming (repeated singular value decompositions of the matrix) or numerically unstable (optimizing a factored representation of the low-rank matrix). We describe LORETA, an iterative online learning procedure consisting of a gradient step followed by a second-order retraction back to the manifold that can be computed efficiently. LORETA also showed consistent improvement over standard methods in a large multi-label image classification task.
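The general pattern is a gradient step in the ambient space followed by a retraction back onto the rank-k manifold. The sketch below is only for intuition: it retracts with a truncated SVD, which is exactly the expensive projection that LORETA's efficient second-order retraction on the low-rank factors avoids.

```python
import numpy as np

def retract_rank_k(M, k):
    """Project onto the nearest rank-k matrix (truncated SVD). LORETA replaces this step
    with a cheap retraction computed directly on the low-rank factors."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

def manifold_step(W, grad, k, lr=0.1):
    """Gradient step in the ambient space, then retract back to the manifold."""
    return retract_rank_k(W - lr * grad, k)

# Example: keep a 100x100 similarity matrix at rank 5 across updates.
W = np.zeros((100, 100))
grad = np.random.randn(100, 100)
W = manifold_step(W, grad, k=5)
```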
Generative AI with long tail data
Seed selection for text to image generation
Text-to-image diffusion models can synthesize a large variety of concepts in new compositions and scenarios. However, they still struggle with generating uncommon concepts, rare or unusual combinations, or structured concepts like hand palms. Here we characterize the effect of unbalanced training data on text-to-image models and offer a remedy. We show that rare concepts can be correctly generated by carefully selecting suitable generation seeds in the noise space, a technique that we call SeedSelect. We evaluate the benefit of SeedSelect for (1) few-shot semantic data augmentation, where we generate semantically correct images for few-shot and long-tail benchmarks, and (2) correcting images of hands, a well-known pitfall of current diffusion models.
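A highly simplified version of the idea (illustrative only: `generate` and `concept_score` are hypothetical stand-ins for a diffusion sampler and a semantic scorer, and SeedSelect itself optimizes seeds rather than merely filtering them) is to rank candidate noise seeds by how faithfully the generated image matches the rare concept:

```python
def select_seeds(prompt, generate, concept_score, n_candidates=64, n_keep=4):
    """Rank candidate noise seeds by a semantic faithfulness score and keep the best ones.

    generate(prompt, seed) -> image          (hypothetical diffusion sampler; the seed
                                              determines the sampler's initial noise)
    concept_score(image, prompt) -> float    (hypothetical scorer, e.g. image-text similarity)
    """
    scored = sorted(((concept_score(generate(prompt, s), prompt), s) for s in range(n_candidates)),
                    reverse=True)
    return [seed for _, seed in scored[:n_keep]]
```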
Multitask learning
Auxiliary Learning as an Asymmetric Bargaining Game
Auxiliary learning is an effective method for enhancing the generalization capabilities of trained models, particularly when dealing with small datasets. However, this approach presents two difficulties: (i) optimizing multiple objectives can be more challenging than optimizing a single one, and (ii) it is unclear how to balance the auxiliary tasks so that they best assist the main task.
In this work, we propose a novel approach, named AuxiNash, for balancing tasks in auxiliary learning by formalizing the problem as a generalized bargaining game with asymmetric task bargaining power. Furthermore, we describe an efficient procedure for learning the bargaining power of tasks based on their contribution to the performance of the main task, and derive theoretical guarantees for its convergence.
Federated Learning
Personalized Federated Learning using Hypernetworks
Personalized federated learning is tasked with training machine learning models for multiple clients, each with its own data distribution. The goal is to collaboratively train personalized models while accounting for the data disparity across clients and reducing communication costs.
We propose a novel approach to handle this problem using hypernetworks, termed pFedHN for personalized Federated HyperNetworks. In this approach, a central hypernetwork model is trained to generate a set of models, one for each client. This architecture provides effective parameter sharing across clients while maintaining the capacity to generate unique and diverse personal models. Furthermore, since hypernetwork parameters are never transmitted, this approach decouples the communication cost from the trainable model size. We test pFedHN empirically on several personalized federated learning challenges and find that it outperforms previous methods. Finally, we show that pFedHN generalizes better to new clients whose distributions differ from those of any client observed during training.
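A minimal sketch of the hypernetwork idea (illustrative names, not the pFedHN code): the server keeps a learnable embedding per client and a hypernetwork that maps that embedding to the weights of a small client model; gradients from each client's loss flow back into the shared hypernetwork, which is itself never transmitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClientHyperNet(nn.Module):
    """Maps a learned client embedding to the weights of a small linear classifier."""
    def __init__(self, n_clients, embed_dim, in_dim, n_classes):
        super().__init__()
        self.embeddings = nn.Embedding(n_clients, embed_dim)
        self.in_dim, self.n_classes = in_dim, n_classes
        out = in_dim * n_classes + n_classes                 # weight and bias of the target model
        self.mlp = nn.Sequential(nn.Linear(embed_dim, 100), nn.ReLU(), nn.Linear(100, out))

    def forward(self, client_id):
        theta = self.mlp(self.embeddings(client_id))
        W = theta[: self.in_dim * self.n_classes].view(self.n_classes, self.in_dim)
        b = theta[self.in_dim * self.n_classes:]
        return W, b

hyper = ClientHyperNet(n_clients=10, embed_dim=8, in_dim=32, n_classes=5)
x, y = torch.randn(16, 32), torch.randint(0, 5, (16,))
W, b = hyper(torch.tensor(3))                                # personalized weights for client 3
loss = F.cross_entropy(F.linear(x, W, b), y)
loss.backward()                                              # gradients flow into the shared hypernetwork
```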
Long-Tail Learning
Distributional Robustness Loss for Long-tail Learning
Real-world data is often unbalanced and long-tailed, but deep models struggle to recognize rare classes in the presence of frequent classes. To address unbalanced data, most studies try balancing the data, the loss, or the classifier to reduce classification bias towards head classes. Far less attention has been given to the latent representations learned with unbalanced data. We show that the feature extractor part of deep networks suffers greatly from this bias. We propose a new loss based on robustness theory, which encourages the model to learn high-quality representations for both head and tail classes. While the general form of the robustness loss may be hard to compute, we further derive an easy-to-compute upper bound that can be minimized efficiently. This procedure reduces representation bias towards head classes in the feature space and achieves new SOTA results on CIFAR100-LT, ImageNet-LT, and iNaturalist long-tail benchmarks. We find that training with robustness increases recognition accuracy of tail classes while largely maintaining the accuracy of head classes. The new robustness loss can be combined with various classifier balancing techniques and can be applied to representations at several layers of the deep model.
Compositional learning
A causal view of compositional zero-shot recognition (NeurIPS 2020)
People easily recognize new visual categories that are new combinations of known components. This compositional generalization capacity is critical for learning in real-world domains because the long tail of new combinations dominates the distribution. Unfortunately, learning systems struggle with compositional generalization because they often build on features that are correlated with class labels even if they are not “essential” for the class.
We describe an approach to compositional generalization that builds on causal ideas. We formulate compositional zero-shot learning from a causal perspective and propose a causally-inspired embedding model that learns disentangled representations of the elementary components of visual objects from correlated (confounded) training data.
Multi-Objective Optimization
Learning the Pareto Front with Hypernetworks (ICLR 2021)
Multi-objective optimization problems are prevalent in machine learning. These problems have a set of optimal solutions, called the Pareto front, where each point on the front represents a different trade-off between possibly conflicting objectives. Recent optimization algorithms can target a specific desired ray in loss space, but still face two serious limitations: (i) a separate model has to be trained for each point on the front; and (ii) the exact trade-off must be known before optimization. Here, we tackle the problem of learning the entire Pareto front, with the capability of selecting a desired operating point on the front after training. We call this new setup Pareto-Front Learning (PFL). We describe an approach to PFL implemented using HyperNetworks, which we term Pareto HyperNetworks (PHNs). A PHN learns the entire Pareto front simultaneously using a single hypernetwork, which receives a desired preference vector as input and returns a Pareto-optimal model whose loss vector lies on the desired ray. The unified model is runtime efficient compared to training multiple models and generalizes to new operating points not used during training. PHNs learn the entire Pareto front in roughly the same time as learning a single point on the front, and also reach a better solution set.
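A compact sketch of a PFL training step (schematic: `hypernet` and `task_losses` are placeholders, and simple linear scalarization is used here, whereas the paper also supports preference-aware objectives): sample a preference vector from the simplex, condition the hypernetwork on it, and minimize the correspondingly weighted loss.

```python
import torch
from torch.distributions import Dirichlet

def pfl_training_step(hypernet, optimizer, batch, task_losses, n_objectives=2):
    """One schematic PFL step. hypernet(ray) is assumed to return the parameters of a target
    model, and task_losses(params, batch) the per-objective loss vector for that model."""
    ray = Dirichlet(torch.ones(n_objectives)).sample()   # random preference vector on the simplex
    params = hypernet(ray)                               # preference-conditioned model weights
    losses = task_losses(params, batch)                  # shape: (n_objectives,)
    loss = torch.dot(ray, losses)                        # linear scalarization along the sampled ray
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, any desired trade-off is one forward pass away, e.g.:
# params = hypernet(torch.tensor([0.8, 0.2]))
```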
Reasoning in Video
Learning Object Permanence from Video (ECCV 2020)
Object Permanence (OP) allows people to reason about the location of objects even when they are not perceived directly. It is critical for building a model of the world, since objects in natural visual scenes dynamically occlude and contain each other. Here we introduce the setup of learning Object Permanence from labeled videos. We dissect the problem into four components, where objects are (1) visible, (2) occluded, (3) contained by another object, and (4) carried by a containing object. We then present a unified deep architecture that learns to predict object location under these four scenarios. We evaluate the architecture on a new dataset based on CATER and find that it outperforms previous localization methods and various baselines.
Zero-shot learning with attributes
A probabilistic approach to combine information from vision and attributes (CVPR 2020)
Generalized zero-shot learning (GZSL) is the problem of learning a classifier where some classes have samples and others are learned from side information, like semantic attributes or text description, in a zero-shot learning fashion (ZSL). Training a single model that operates in these two regimes simultaneously is challenging.
We developed a probabilistic approach that combines three modular components: a "gating" model that makes a soft decision about whether a sample comes from a "seen" class, and two experts: a ZSL expert and an expert model for the seen classes. We address two main difficulties in this approach: how to provide an accurate estimate of the gating probability without any training samples from unseen classes, and how to use an expert's predictions when it observes samples outside of its domain.
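The modular combination itself is just a soft mixture (a schematic, not the exact model): the gate's probability that a sample belongs to a seen class weights the seen-class expert, and its complement weights the ZSL expert.

```python
import numpy as np

def combine_experts(p_seen_gate, p_seen_expert, p_zsl_expert):
    """p_seen_gate:   (n,) gate probability that each sample comes from a seen class
    p_seen_expert: (n, n_seen)   expert distribution over the seen classes
    p_zsl_expert:  (n, n_unseen) ZSL expert distribution over the unseen classes
    Returns a distribution over all (seen + unseen) classes."""
    seen = p_seen_gate[:, None] * p_seen_expert
    unseen = (1.0 - p_seen_gate)[:, None] * p_zsl_expert
    return np.concatenate([seen, unseen], axis=1)

# Example: 2 samples, 3 seen classes, 2 unseen classes.
probs = combine_experts(np.array([0.9, 0.2]),
                        np.array([[0.6, 0.3, 0.1], [0.5, 0.25, 0.25]]),
                        np.array([[0.7, 0.3], [0.1, 0.9]]))
```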
IOTA - Informative Object Annotations
Information Theory Metric for Selecting Relevant Image Descriptions (CVPR 2020)
Capturing the interesting components of an image is a key aspect of image understanding. While people intuitively manage to focus on what is “informative” or “relevant”, automated classifiers can produce a large number of labels that are perhaps technically correct but often uninteresting. We present a new unsupervised approach for selecting the most informative term to describe an image.
We build on the insight that the relevance of a description depends on what a listener already knows (their prior knowledge): communicated terms should reduce the uncertainty the listener has about the semantic space. While the full estimation problem is intractable, we describe an efficient algorithm that approximates the entropy reduction using a tree-structured graphical model (a Chow-Liu tree).
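A toy version of the selection rule (brute-force over a small discrete space; the paper makes this tractable with the Chow-Liu tree approximation) scores each candidate term by the expected reduction in the listener's entropy over concepts:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def expected_information_gain(prior, likelihood):
    """prior:      (n_concepts,) listener's prior over concepts
    likelihood: (n_terms, n_concepts) probability that each (binary) term holds given the concept.
    Returns, per term, the expected drop in the listener's entropy after hearing whether the term holds."""
    gains = []
    for lik in likelihood:
        p_term = (lik * prior).sum()                       # p(term = 1)
        post_1 = lik * prior / p_term                      # posterior if the term holds
        post_0 = (1 - lik) * prior / (1 - p_term)          # posterior if it does not
        expected_h = p_term * entropy(post_1) + (1 - p_term) * entropy(post_0)
        gains.append(entropy(prior) - expected_h)
    return np.array(gains)

# Pick the term with the largest expected entropy reduction.
prior = np.array([0.5, 0.3, 0.2])
likelihood = np.array([[0.9, 0.8, 0.1],    # term 0: true for most concepts -> less informative
                       [0.9, 0.1, 0.1]])   # term 1: mostly separates concept 0 -> more informative
best_term = int(np.argmax(expected_information_gain(prior, likelihood)))
```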
Generalized Zero-Shot Learning
Adaptive Confidence Smoothing for Generalized Zero-Shot Learning (2019)
Generalized zero-shot learning (GZSL) is the problem of learning a classifier where some classes have samples and others are learned from side information, like semantic attributes or text description, in a zero-shot learning fashion (ZSL). Training a single model that operates in these two regimes simultaneously is challenging. Here we describe a probabilistic approach that breaks the model into three modular components, and then combines them in a consistent way.
Decode neural activity from MEG
Fine differences in activity timing across the brain carry significant information (2018)
We develop a method to learn neural codes from a few dozen samples, operating in an extremely high-dimensional space. We discover surprising properties of coding with timing differences.
Brain transcriptome specialization during development
The brain transcriptome changes during development, reflecting processes that determine the functional specialization of brain regions. We found that during the period in early development when the brain becomes anatomically regionalized, transcription specialization actually decreases, reaching a low, “neurotypic” point around birth. This decrease in specialization is brain-wide and is mainly due to biological processes involved in constructing brain circuitry. Regional specialization rises again during post-natal development, largely due to specialization of plasticity and neural-activity processes. Post-natal specialization is particularly significant in the cerebellum, whose expression signature becomes increasingly different from other brain regions.
Serotonin genes in adolescence
Adolescence is a period of profound neurophysiological, behavioral, cognitive, and psychological changes, but not much is known about the underlying molecular neural mechanisms. We systematically analyzed the expression of genes forming serotonergic and dopaminergic synapses during adolescence and found that two serotonin receptors, HTR1E and HTR1B, exhibit a sharp transition of expression in the prefrontal cortex during adolescence. A similar but smoother rise in expression levels is observed for HTR4 and HTR5A, and for HTR1E and HTR1B in three other published expression datasets. The expression of HTR1E and HTR1B is correlated across subjects within each age group, suggesting that they are controlled by common mechanisms.
RNA editing by ADAR
A-to-I RNA editing by adenosine deaminases acting on RNA (ADAR) is a post-transcriptional modification that is crucial for normal life and development in vertebrates. We examined the relation between the expression of ADAR genes and the expression of their target genes. Surprisingly, we found that a large fraction of the edited genes are positively correlated with ADAR, contradicting the assumption that editing would reduce expression. These findings suggest that ADAR expression does not have a genome-wide effect of reducing the expression of editing targets. It is possible, however, that RNA editing by ADAR in non-coding regions of a gene is part of a more complex expression-regulation mechanism.