From fd55c177cc2d398d56e53979f11f1de82e1860ab Mon Sep 17 00:00:00 2001
From: Sungwoo Kim
Date: Mon, 12 Aug 2024 00:08:05 -0400
Subject: [PATCH] add papers

---
 ... of the Generalization Error Across Scales | 1 +
 ...h Neural Networks for Graph Classification | 1 +
 ...ethod for Solving Vehicle Routing Problems | 1 +
 ...urity Vulnerabilities of Transfer Learning | 1 +
 ...f the Number of Shots in Few-Shot Learning | 1 +
 ..., or what we can learn from a single image | 1 +
 ...gregated Memory For Reinforcement Learning | 1 +
 ...h momentum for over-parameterized learning | 4 ++++
 ...e Effects of Actions in Multiagent Systems | 1 +
 ...ibria of Linear-Quadratic Mean-Field Games | 1 +
 ... Fingerprints for Graph Attention Networks | 1 +
 ...uniform Discretization for Neural Networks | 1 +
 .../iclr/Adjustable Real-time Style Transfer | 1 +
 ...ies: Attacking Deep Reinforcement Learning | 1 +
 ...obust Representations with Smooth Encoders | 1 +
 .../Adversarially robust transfer learning | 1 +
 ...n Extended Semi-discrete Optimal transport | 1 +
 ... Nets that Respect the Triangle Inequality | 1 +
 ...mple of Zebrafish Swim Bout Classification | 1 +
 ...but Strong Baselines for Grammar Induction | 1 +
 ...imators of sequence-to-sequence functions? | 1 +
 ...Neural Connectivity in Video Architectures | 1 +
 ... Networks for Exploring the Chemical Space | 1 +
 ...ed Kernel-Wise Neural Network Quantization | 1 +
 .../iclr/Automated Relational Meta-learning | 1 +
 ...eration through setter-solver interactions | 1 +
 ... Visual Categories with Ranking Statistics | 1 +
 ...ck with Transferable Model-based Embedding | 1 +
 ... of Descent Paths in Shallow ReLU Networks | 1 +
 ...Loss Landscapes and Adversarial Robustness | 1 +
 ...etwork Training Under Resource Constraints | 1 +
 .../iclr/CAQL: Continuous Action Q-Learning | 1 +
 ... Invariants with Continuous Logic Networks | 1 +
 ...i-stage Multi-agent Reinforcement Learning | 1 +
 ...an gradient clipping mitigate label noise? | 1 +
 ...ial Perturbations via Randomized Smoothing | 1 +
 ... Expedited Deep Neural Network Compilation | 1 +
 ...emerge in a neural iterated learning model | 1 +
 ...putation Reallocation for Object Detection | 1 +
 ...nual Learning with Adaptive Weights (CLAW) | 1 +
 ...an Neural Networks for Non-Stationary Data | 1 +
 ...odular structure of deep generative models | 1 +
 data/2020/iclr/Curvature Graph Network | 1 +
 ...ackdoor Attacks against Federated Learning | 1 +
 ...intGoal Navigators from 2.5 Billion Frames | 3 +++
 ...ta-Independent Neural Pruning via Coresets | 1 +
 ...en Embeddings for Neural Sequence Modeling | 1 +
 ... with global and local adaptive dilations" | 1 +
 ... Flexible Inference, Planning, and Control | 1 +
 ...Processes via Proper Spectral Sub-gradient | 1 +
 ...cattering and Homotopy Dictionary Learning | 1 +
 .../Deep Semi-Supervised Anomaly Detection | 1 +
 ...entiable Scale-Invariant Sparsity Measures | 1 +
 ... with Differentiable Structure from Motion | 1 +
 ...ve Receptive Fields for Object Deformation | 2 ++
 data/2020/iclr/Depth-Adaptive Transformer | 1 +
 ...tecting Extrapolation with Local Ensembles | 1 +
 ... Class-Conditional Capsule Reconstructions | 1 +
 ...versarial Network-Unseen Sample Generation | 1 +
 .../iclr/Differentially Private Meta-Learning | 1 +
 ...ing Factors of Variations Using Few Labels | 1 +
 ...ing from Errors for Confidence Calibration | 1 +
 ...casting with Determinantal Point Processes | 1 +
 ...h Noisy Labels as Semi-supervised Learning | 1 +
 ...ime Lag Regression: Predicting What & When | 1 +
 ...upervised and Unsupervised Skill Discovery | 1 +
 ... for Large-scale Knowledge Graph Reasoning | 1 +
 ...ES-MAML: Simple Hessian-Free Meta Learning | 1 +
 data/2020/iclr/Editable Neural Networks | 1 +
 ... Stiefel Manifold via the Cayley Transform | 1 +
 ...serving Future Frame Prediction and Beyond | 1 +
 ...ial Attacks with a Distribution Classifier | 1 +
 .../iclr/Ensemble Distribution Distillation | 1 +
 ...dle Points Faster with Stochastic Momentum | 1 +
 ...Search Phase of Neural Architecture Search | 1 +
 ...cement Learning with Deep Covering Options | 1 +
 ... Model-based Planning with Policy Networks | 1 +
 ...resentations with Featurewise Sort Pooling | 1 +
 ...than free: Revisiting adversarial training | 1 +
 ...for Faster Real-time Semantic Segmentation | 1 +
 ...n Systems via Neural Interaction Detection | 1 +
 .../Federated Adversarial Domain Adaptation | 1 +
 ...r-Classes based on Graph spectral Measures | 1 +
 ...ssification with Distributional Signatures | 1 +
 ...sses of Deep Reinforcement Learning Agents | 1 +
 ...al Attack against Multiple Object Tracking | 1 +
 ...Should Know to Improve Batch Normalization | 1 +
 ... Variational to Deterministic Autoencoders | 1 +
 ...s. parametric equivalence of ReLU networks | 1 +
 ...xample Detection and Robust Classification | 1 +
 ...with Object-Centric Latent Representations | 1 +
 .../iclr/GLAD: Learning Sparse Graph Recovery | 1 +
 ...Gap-Aware Mitigation of Gradient Staleness | 1 +
 ...nds for deep convolutional neural networks | 1 +
 .../iclr/Generative Ratio Matching Networks | 1 +
 ...o the Convergence of Nonlinear TD Learning | 1 +
 .../Global Relational Models of Source Code | 1 +
 ...earning for semi-supervised classification | 1 +
 ...orizon Tasks via Visual Subgoal Generation | 1 +
 ...ition for Comparing Classifiers Adaptively | 1 +
 ...lows for Recovering Latent Representations | 1 +
 ...ization Under Extreme Overparameterization | 1 +
 .../iclr/Image-guided Neural Object Rendering | 1 +
 ...rning via Off-Policy Distribution Matching | 1 +
 ...sed Adversarial Training on Separable Data | 1 +
 ...ust Classification via an All-Layer Margin | 1 +
 ...Requires Revisiting Misclassified Examples | 1 +
 ...ndly Binarized Neural Network Architecture | 1 +
 ...on with Likelihood-based Generative Models | 1 +
 ...ued Neural Networks for Privacy Protection | 1 +
 ...ation for Encouraging Synergistic Behavior | 1 +
 ...istency between Neural Networks and Beyond | 1 +
 ...ge MOdeling for Lifelong Language Learning | 1 +
 data/2020/iclr/Language GANs Falling Short | 1 +
 ...Deep Learning: Training BERT in 76 minutes | 1 +
 ...extensive games with imperfect information | 1 +
 data/2020/iclr/Learned Step Size quantization | 1 +
 ...resentations for CounterFactual Regression | 1 +
 ...nchronization Policies for Distributed SGD | 1 +
 ...rning Execution through Neural Code fusion | 1 +
 ...rdination: An Event-Based Deep RL Approach | 1 +
 ...an Formulas through Reinforcement Learning | 1 +
 ...from Demonstrations with Negative Sampling | 1 +
 ...ace Partitions for Nearest Neighbor Search | 1 +
 ...ependent embedding and Hungarian attention | 1 +
 ...ime for Problems in Reinforcement Learning | 1 +
 .../Learning to Learn by Zeroth-Order Oracle | 1 +
 data/2020/iclr/Learning to Link | 2 ++
 ...epresent Programs with Property Signatures | 1 +
 ...ing to solve the credit assignment problem | 1 +
 ...etworks for Low-precision Integer Hardware | 1 +
 ...and Compositionality in Zero-Shot Learning | 1 +
 .../Logic and the 2-Simplicial Transformer | 1 +
 ...rce Knowledge-Grounded Dialogue Generation | 1 +
 ...t Training via Maximizing Certified Radius | 1 +
 ...trolling the Estimation Bias of Q-learning | 1 +
 ...: A Comprehensive Method on Realistic Data | 1 +
 ...ts for Learning to Learn from Few Examples | 1 +
 .../iclr/MetaPix: Few-Shot Video Retargeting | 1 +
 ... to Learn Efficient Sparse Representations | 1 +
 ...une Large-scale Pretrained Language Models | 1 +
 ...oiting Mixup to Defend Adversarial Attacks | 1 +
 ...ment Learning for Networked System Control | 1 +
 ...cative Interactions and Where to Find Them | 1 +
 ...ain Adaptation on Person Re-identification | 1 +
 ... for interpretable time series forecasting | 1 +
 .../iclr/NAS evaluation is frustratingly hard | 1 +
 ...nsembles for Deep Learning on Tabular Data | 1 +
 data/2020/iclr/Neural Stored-program Memory | 1 +
 ...Text Generation With Unlikelihood Training | 1 +
 data/2020/iclr/Novelty Detection Via Blurring | 1 +
 ...onal Overfitting in Reinforcement Learning | 1 +
 ... Generative Adversarial Imitation Learning | 1 +
 .../iclr/On Identifiability in Transformers | 1 +
 ...n Maximization for Representation Learning | 1 +
 ...lity\" of generative adversarial networks" | 1 +
 ...e of the Adaptive Learning Rate and Beyond | 1 +
 ...nt Learning for Neural Machine Translation | 1 +
 ...l Networks by Jacobian Spectrum Evaluation | 1 +
 ...ion even with a Pessimistic Initialisation | 1 +
 ...Option Discovery using Deep Skill Chaining | 1 +
 ...ning and Its Application to Age Estimation | 1 +
 .../Overlearning Reveals Sensitive Attributes | 3 +++
 ...d Physical Parameter Estimation from Video | 1 +
 ...shape the loss surfaces of neural networks | 1 +
 ...e languages: lottery tickets in RL and NLP | 1 +
 ...r Fast and Accurate Multi-sentence Scoring | 1 +
 ...l Policy Search for Reinforcement Learning | 1 +
 ... for Embedding-based Large-scale Retrieval | 1 +
 ...rvised Knowledge-Pretrained Language Model | 1 +
 ...ry Banks for Incremental Domain Adaptation | 1 +
 ...works under Regularization and Constraints | 1 +
 .../iclr/Pruned Graph Scattering Transforms | 1 +
 ... 3D Object Detection in Autonomous Driving | 1 +
 ...ints: a Geometric Study of Linear Networks | 1 +
 ... Sample Efficient for Infinite-Horizon MDP | 1 +
 ...-Performance Learned Lossy Representations | 1 +
 ...ng to New Environment Dynamics via Reading | 1 +
 ...tical Training For Collaborative Filtering | 1 +
 data/2020/iclr/Ranking Policy Gradient | 1 +
 ...ds Understanding the Effectiveness of MAML | 1 +
 ...ension Dataset Requiring Logical Reasoning | 1 +
 ...bution Matching and Augmentation Anchoring | 1 +
 ...iance Reduced Temporal Difference Learning | 1 +
 ...rent neural circuits for contour detection | 1 +
 ...ced active learning for image segmentation | 1 +
 ...ence Model for Natural Question Generation | 1 +
 ...bles of Information-Constrained Primitives | 1 +
 ... Model for Stochastic Multi-Object Systems | 1 +
 ...ss-Entropy Loss for Adversarial Robustness | 1 +
 ...ia Bias-Free Convolutional Neural Networks | 1 +
 ...the Generalization of Adversarial Training | 1 +
 .../Robust training with ensemble consensus | 1 +
 ...iant of Adam for Strongly Convex Functions | 0
 ...o Filter Noisy Labels with Self-Ensembling | 1 +
 ...Reinforcement Learning with Sparse Rewards | 1 +
 ...ning of Bayesian Quantized Neural Networks | 1 +
 ...on by Entropy Penalized Reparameterization | 1 +
 ...r Reasoning With a Symbolic Knowledge Base | 1 +
 ...ning with Additive Parameter Decomposition | 1 +
 ...Efficient Data Selection for Deep Learning | 1 +
 ...arative Discrimination for Text Generation | 1 +
 ...arning for Self-Supervised Monocular Depth | 1 +
 ... in Multi-Task Deep Reinforcement Learning | 1 +
 ...parse Deconvolution - A Geometric Approach | 1 +
 ...its Are All You Need for Black-Box Attacks | 1 +
 ...ry-Efficient Hard-label Adversarial Attack | 1 +
 ...ficient Distributed SGD with Slow Momentum | 1 +
 ...AUC Maximization with Deep Neural Networks | 1 +
 ...nerative Networks with Basis Decomposition | 1 +
 ...Large-Batch Training That Generalizes Well | 1 +
 ...raph Pooling via Conditional Random Fields | 1 +
 ... Dataset for Table-based Fact Verification | 1 +
 ...Incremental Learning Drives Generalization | 1 +
 ...hastic Evaluation on an Information Budget | 1 +
 ... of the Hessian of DNN throughout training | 1 +
 ... for Learning Disentangled Representations | 1 +
 ...treet! Model Extraction of BERT-based APIs | 1 +
 ...ur Headache of Training an MRF, Take AdVIL | 1 +
 ... Algorithms in Generative Adversarial Nets | 1 +
 ...erturbations of Deep Feature Distributions | 1 +
 ...d Attention with Hierarchical Accumulation | 1 +
 ...t by Cell-based Neural Architecture Search | 1 +
 ... in Non-autoregressive Machine Translation | 1 +
 ... Variational Mutual Information Estimators | 1 +
 ...n on Real Scans using Adversarial Training | 1 +
 ...ional Disentangled Representation Learning | 1 +
 ...zation for Discrete and Continuous Control | 1 +
 ...ks for Video-level Representation Learning | 1 +
 ... Generic Visual-Linguistic Representations | 1 +
 ...Solving Partially Observable Control Tasks | 1 +
 ...haracters Extracted from Real-World Videos | 2 ++
 ...ased Model for Stochastic Video Generation | 1 +
 ...a-Learning from Demonstrations and Rewards | 1 +
 ...lustering by Exploiting Unique Class Count | 1 +
 ...ural networks cannot learn: depth vs width | 1 +
 ...mmunication-Efficient Distributed Learning | 1 +
 ...entation for Training Deep Neural Networks | 0
 ...f Self-Expressive Deep Subspace Clustering | 1 +
 .../A Design Space Study for LISTA and Beyond | 1 +
 ...t Descent Exponentially Favors Flat Minima | 1 +
 ...ative Gaussian Mixture Model with Sparsity | 1 +
 ...nal Approach to Controlled Text Generation | 1 +
 ...nerative Image Models and Its Applications | 0
 ...u Need for High-Resolution Video Synthesis | 1 +
 ...ow Framework For Analyzing Network Pruning | 1 +
 ...o Robust Regression without Correspondence | 1 +
 ...oretic Perspective on Local Explainability | 1 +
 ...anguage Models Help Solve Downstream Tasks | 1 +
 ...alization Bounds for Graph Neural Networks | 1 +
 ...aptive Multi-Exit Neural Network Inference | 1 +
 ... Learning with Continuous-time Information | 1 +
 ...regation and its Relationship to Attention | 1 +
 ...g and Boosting Adversarial Transferability | 1 +
 ...er Layer for Few-Shot Image Classification | 1 +
 ... for Group Equivariant Convolution Kernels | 1 +
 ...of cold posteriors in deep neural networks | 1 +
 ...t framework to distill future trajectories | 1 +
 ...it bias in training linear neural networks | 1 +
 ...died Environments for Interactive Learning | 1 +
 ...iators via Constrained Structural Learning | 0
 ...g Unlabeled data by REgularizing Diversity | 1 +
 ...astic Gradient MCMC via Variance Reduction | 1 +
 ...epresentations with Graph Multiset Pooling | 1 +
 ...articipation in Non-IID Federated Learning | 1 +
 ...nments with Non-Stationary Markov Policies | 1 +
 ...-level uncertainty in deep neural networks | 1 +
 ...ning of Audio-Visual Video Representations | 1 +
 ...n Network for Efficient Action Recognition | 1 +
 ...ph Convolutional Networks into Deep Models | 1 +
 ...: Adaptive Text to Speech for Custom Voice | 1 +
 ...ntum Optimizers on Scale-invariant Weights | 1 +
 ...sivity via Spectral Reinforcement Learning | 1 +
 ...Methods for Min-Max Optimization and Games | 1 +
 .../2021/iclr/Adaptive Federated Optimization | 1 +
 ...k Generation for Hard-Exploration Problems | 1 +
 ... Generalized PageRank Graph Neural Network | 1 +
 ...Adaptive and Generative Zero-Shot Learning | 1 +
 ...and improved sampling for image generation | 2 ++
 .../iclr/Adversarially Guided Actor-Critic | 1 +
 ...tter: Illustration on Image Classification | 1 +
 .../iclr/Aligning AI With Shared Human Values | 1 +
 ...ransformers for Image Recognition at Scale | 1 +
 ...ng Approach for Real-World Image Denoising | 0
 ... Neural Networks in a Spectral Perspective | 1 +
 ... Hidden Representations and Task Semantics | 1 +
 ...g Sparse Embeddings for Large Vocabularies | 1 +
 ...n Questions with Multi-Hop Dense Retrieval | 1 +
 ...regressive Models via Ordered Autoencoding | 1 +
 ...trastive Learning for Dense Text Retrieval | 1 +
 ...larity Through Differentiable Weight Masks | 1 +
 ...formed by Gradient Boosted Decision Trees? | 0
 ...etter given the same number of parameters? | 1 +
 ...e Generalization in Reinforcement Learning | 1 +
 ...chastic Method using Deep Denoising Priors | 1 +
 ...l Constellation Nets for Few-Shot Learning | 1 +
 .../Auction Learning as a Two-Player Game | 1 +
 ... Networks for Complex Dynamics Forecasting | 1 +
 ...etric Surrogates for Semantic Segmentation | 1 +
 ...hedule by Bayesian Optimization on the Fly | 1 +
 ...Offline Policy Evaluation and Optimization | 1 +
 .../2021/iclr/Autoregressive Entity Retrieval | 1 +
 ...liary Learning by Implicit Differentiation | 1 +
 ...osition: the Good, the Bad and the neutral | 1 +
 ...ion for Bilinear Games and Normal Matrices | 1 +
 ...eting Attention in Protein Language Models | 1 +
 ...epresentation Change for Few-shot Learning | 0
 ...ining Quantization by Block Reconstruction | 1 +
 ...BREEDS: Benchmarks for Subpopulation Shift | 1 +
 ...ixed-Precision Neural Network Quantization | 1 +
 ...thesis Through Learning-Guided Exploration | 1 +
 .../Bag of Tricks for Adversarial Training | 1 +
 ...raints and Rewards with Meta-Gradient D4PG | 1 +
 ...ement Learning Through Continuation Method | 1 +
 ...263lya-Gamma Augmented Gaussian Processes" | 1 +
 ...havioral Cloning from Noisy Demonstrations | 0
 ...sk bound and superiority to kernel methods | 1 +
 ...ning by Reducing Representational Collapse | 1 +
 ...l Representations for Image Classification | 1 +
 ...omplex Multiplications with 1 n Parameters | 1 +
 ...et: Binary Neural Network for Point Clouds | 1 +
 ...ence for Non-Autoregressive Text-to-Speech | 0
 ...ation for Efficient Reinforcement Learning | 1 +
 ...dient Boosting Meets Graph Neural Networks | 1 +
 ...-Shot Recognition and Novel-View Synthesis | 1 +
 ... SGD with Gradient Subspace Identification | 1 +
 ...ent Non-Convex Stochastic Gradient Descent | 1 +
 ...-Aware Cumulative Accessibility Estimation | 1 +
 ...Achieve Goals via Recursive Classification | 1 +
 ...nsupervised Visual Representation Learning | 1 +
 ...tion Regularization for Continual Learning | 1 +
 ...ural Network Training via Cyclic Precision | 1 +
 ...orization Network for Video Classification | 1 +
 ...dential and Private Collaborative Learning | 1 +
 ...libration of Neural Networks using Splines | 1 +
 .../Calibration tests beyond classification | 1 +
 .../Can a Fruit Fly Learn Word Embeddings? | 1 +
 .../Capturing Label Characteristics in VAEs | 1 +
 ...izing Flows via Continuous Transformations | 1 +
 ...for Causal Structure and Transfer Learning | 1 +
 ... Adversarial Networks for Image Generation | 1 +
 ...obustness with Compositional Architectures | 0
 ...m and Coordination via Game Decompositions | 1 +
 ...he performance gap in unnormalized ResNets | 1 +
 ...g with Heaviside Continuous Approximations | 1 +
 ...A Pipeline Toolkit for Medical Time Series | 1 +
 ...Continual)? Generalized Zero-Shot Learning | 1 +
 ...e Discrimination and Feature Decorrelation | 1 +
 ...ed Joint Mixup with Supermodular Diversity | 1 +
 ...als for Evaluating Dialogue State Trackers | 1 +
 ...ed Approach for Controlled Text Generation | 1 +
 ...ntation for Natural Language Understanding | 1 +
 ...g Interdependence in Graph Neural Networks | 1 +
 data/2021/iclr/Colorization Transformer | 1 +
 ...ata Augmentation Can Harm Your Calibration | 1 +
 ... Models out-performs Graph Neural Networks | 1 +
 ...chine Learning for Network Flow Estimation | 1 +
 ... Reinforcement Learning: Intention Sharing | 0
 ...works for Faster Multi-Platform Deployment | 1 +
 ...uery Answering with Neural Link Predictors | 1 +
 ...Convolutional and Fully-Connected Networks | 1 +
 .../Concept Learners for Few-Shot Learning | 1 +
 ...ive Modeling via Learning the Latent Space | 1 +
 ...rastive Learning of Visual Representations | 1 +
 ... in NLP Using Fewer Parameters & Less Data | 1 +
 ...sentation with Hamiltonian Neural Networks | 1 +
 ...onservative Safety Critics for Exploration | 1 +
 ...emplating Real-World Object Classification | 1 +
 ... Efficient Sample-Dependent Dropout Module | 1 +
 ...ion Networks for Online Continual Learning | 1 +
 ...nual learning in recurrent neural networks | 1 +
 ...er Estimation without Minimax Optimization | 1 +
 ...r Generalization in Reinforcement Learning | 1 +
 ...arning is a Time Reversal Adversarial Game | 1 +
 ...ent Learning via Embedded Self Predictions | 1 +
 ...turbations for Conditional Text Generation | 1 +
 ...astive Learning with Hard Negative Samples | 1 +
 .../Contrastive Syn-to-Real Generalization | 1 +
 ...ons for Model-based Reinforcement Learning | 1 +
 ... Optimal Transport and Convex Optimization | 1 +
 ...egularization behind Neural Reconstruction | 1 +
 ...t via Distributionally Robust Optimisation | 1 +
 ...l Roles of Graphs in Graph Neural Networks | 1 +
 ...ience replay for multi-agent communication | 1 +
 .../iclr/Counterfactual Generative Networks | 1 +
 ...ecture for learning long time dependencies | 1 +
 data/2021/iclr/Creative Sketch Generation | 1 +
 ... for Weakly-Supervised Action Localization | 1 +
 ... better segmentation with weak supervision | 1 +
 ...of Performance Collapse Without Indicators | 1 +
 ...hod for optimization with hard constraints | 1 +
 ...ntial Dynamic Programming Neural Optimizer | 1 +
 ...ditional Redundancy Adversarial Estimation | 1 +
 ...al Energy-Based GAN for Domain Translation | 1 +
 ...cy Multi-Agent Decomposed Policy Gradients | 1 +
 ...eration with Music via Curriculum Learning | 1 +
 ...rning with Self-Predictive Representations | 1 +
 ...ataset Condensation with Gradient Matching | 1 +
 ...: Ownership Resolution in Machine Learning | 1 +
 ...Meta-Learning from Kernel Ridge-Regression | 1 +
 ...DeLighT: Deep and Light-weight Transformer | 0
 ...-Enhanced Bert with Disentangled Attention | 1 +
 ...pt-based Explanations with Causal Analysis | 1 +
 ...ntralized Attribution of Generative Models | 1 +
 ...ti-Task Learning: a Random Matrix Approach | 0
 ...nstructing the Regularization of BatchNorm | 0
 ...sentations via Invertible Generative Flows | 1 +
 ...ing Non-autoregressive Machine Translation | 1 +
 ...hallow for ReLU Networks in Kernel Regimes | 1 +
 .../Deep Learning meets Projective Clustering | 3 +++
 ...Networks and the Multiple Manifold Problem | 1 +
 ...inting by Conferrable Adversarial Examples | 1 +
 ...rnel and Laplace Kernel Have the Same RKHS | 1 +
 ...Defenses against General Poisoning Attacks | 1 +
 ...Data Based on Order-Identity Decomposition | 0
 ...rom data via risk-seeking policy gradients | 1 +
 ...ing By Solving Derived Non-Parametric MDPs | 1 +
 ...ansformers for End-to-End Object Detection | 1 +
 ...n-Aware Training for Graph Neural Networks | 1 +
 .../iclr/Denoising Diffusion Implicit Models | 1 +
 ...rning via Model-Based Offline Optimization | 1 +
 ...-Graph Networks into Negotiation Dialogues | 1 +
 ...satile Diffusion Model for Audio Synthesis | 1 +
 .../Differentiable Segmentation of Sequences | 1 +
 ...ion Layers for Deep Reinforcement Learning | 1 +
 ... Needs Better Features (or Much More Data) | 1 +
 .../Directed Acyclic Graph Neural Networks | 1 +
 ...adient Descent with Moderate Learning Rate | 1 +
 ...Symbolic Expressions in Informal Documents | 1 +
 ...trategic Behavior via Reward Randomization | 1 +
 ...ssive Orderings with Variational Inference | 1 +
 ... set of policies for the worst case reward | 1 +
 ...rning for Forecasting Multiple Time Series | 1 +
 ...ntangled Recurrent Wasserstein Autoencoder | 1 +
 ...cal Networks for Few-Shot Concept Learning | 1 +
 ...arisation of Deep Networks for Fine-Tuning | 1 +
 ...Reader to Retriever for Question Answering | 1 +
 ...tine-resilient Stochastic Gradient Descent | 1 +
 ...in and Applications to Generative Modeling | 1 +
 ...eneration using a Gaussian Process Trigger | 1 +
 ...3D Shape Reconstruction from 2D Image GANs | 1 +
 ... Representations Vary with Width and Depth | 1 +
 ...mbedding Perturbation for Private Learning | 1 +
 ... network robustness to common corruptions? | 1 +
 .../iclr/Domain Generalization with MixStyle | 1 +
 ...arning with Mutual Information Constraints | 1 +
 ...rNAS: Dirichlet Neural Architecture Search | 1 +
 ...epresentation for Noise-Robust Exploration | 1 +
 ...e Streaming ASR with Full-context Modeling | 1 +
 ...ization in Deep Neural Network Compilation | 1 +
 .../iclr/Dynamic Tensor Rematerialization | 1 +
 ...d Regenerate Images for Continual Learning | 1 +
 ...ks: Double Descent and How to Eliminate it | 1 +
 ... Optimization with Blended Search Strategy | 1 +
 ...tract Reasoning with Dual-Contrast Network | 1 +
 ...m Features: Improved Bounds and Algorithms | 0
 ... Efficient Vote Attack on Capsule Networks | 1 +
 ...Against Patch Attacks on Image Classifiers | 1 +
 ...Cascaded Inference with Expanded Admission | 0
 ...th Modular Networks and Task-Driven Priors | 1 +
 ... Estimation for Unsupervised Stabilization | 1 +
 .../iclr/Efficient Generalized Spherical CNNs | 1 +
 ...ble Interaction in Spiking-neuron Networks | 1 +
 ...ed MDPs with Application to Constrained RL | 1 +
 ... Learning using Actor-Learner Distillation | 1 +
 ...tural Gradients for Reinforcement Learning | 1 +
 .../iclr/EigenGame: PCA as a Nash Equilibrium | 1 +
 ... Rules In Multi-Agent Driving Environments | 1 +
 ...Symbols through Binding in External Memory | 1 +
 ...Entity Problem in Named Entity Recognition | 1 +
 ...imization? A Sample Complexity Perspective | 1 +
 .../iclr/End-to-End Egospheric Spatial Memory | 1 +
 .../End-to-end Adversarial Text-to-Speech | 1 +
 ... guarantees within neural network policies | 1 +
 ... Image Editing via Latent Space Navigation | 1 +
 ...nt descent algorithms and wide flat minima | 1 +
 ...stants of monotone deep equilibrium models | 1 +
 ...ctive Uncertainty in Deep Object Detectors | 1 +
 ... of samples with Smooth Unique Information | 1 +
 ...enerative Models through Manifold Topology | 1 +
 ...s vs Cross-Entropy in Classification Tasks | 2 ++
 ...valuation of Similarity-based Explanations | 1 +
 ...or Explanation through Robustness Analysis | 1 +
 ...Evolving Reinforcement Learning Algorithms | 1 +
 ...han State-of-the-Art Feature Visualization | 0
 .../Explainable Deep One-Class Classification | 1 +
 ...r Forecasting on Temporal Knowledge Graphs | 1 +
 ...Decisions by Interpretable Policy Learning | 1 +
 ...fficacy of Counterfactually Augmented Data | 1 +
 ...Feature Spaces for Representation Learning | 1 +
 ...mplicit Priors in the Infinite-Width Limit | 1 +
 ...iant and Equivariant Graph Neural Networks | 1 +
 ...asks from Zero-Order Trajectory Optimizers | 1 +
 ...e Memorization via Scale of Initialization | 1 +
 ...etric Learning and Behavior Regularization | 0
 ...edge in Structured, Dynamical Environments | 0
 .../Fair Mixup: Fairness via Interpolation | 1 +
 ...rBatch: Batch Selection for Model Fairness | 1 +
 ...iasing Method for Pretrained Text Encoders | 1 +
 ...s on Singular Values of Convolution Layers | 0
 ...arning Of Recurrent Independent Mechanisms | 1 +
 ...ections for Local Robustness Certification | 1 +
 ...nd Massively Parallel Incomplete Verifiers | 1 +
 ...tic subgradient method under interpolation | 1 +
 ...and High-Quality End-to-End Text to Speech | 1 +
 ...eddings for Preserving Euclidean Distances | 1 +
 ... Ensemble Applicable to Federated Learning | 1 +
 ...IID Features via Local Batch Normalization | 1 +
 ...up under Mean Augmented Federated Learning | 1 +
 ...d Learning Based on Dynamic Regularization | 1 +
 ...A New Perspective and Practical Algorithms | 1 +
 ...ter-Client Consistency & Disjoint Learning | 1 +
 ...n Optimization with Deep Kernel Surrogates | 1 +
 ... via Learning the Representation, Provably | 1 +
 .../Fidelity-based Deep Adiabatic Scheduling | 1 +
 ...ction for Crosslingual Embedding Alignment | 1 +
 ...ative Network for Text-to-Speech Synthesis | 1 +
 ...Fooling a Complete Neural Network Verifier | 1 +
 ...tionality implies generalization, provably | 1 +
 ... Parametric Partial Differential Equations | 1 +
 ...ew-shot Learning: Distribution Calibration | 1 +
 ...ith Convolutional Variational Autoencoders | 1 +
 ... to Learning Sparse Representations Online | 1 +
 ...GAN \"Steerability\" without optimization" | 1 +
 ...r Blind Denoising with Single Noisy Images | 1 +
 .../iclr/GANs Can Play Lottery Tickets Too | 1 +
 ...itional Computation and Automatic Sharding | 1 +
 ...isotropic convolutions on geometric graphs | 1 +
 .../Generalization bounds via distillation | 1 +
 ...ata-driven models of primary visual cortex | 1 +
 .../2021/iclr/Generalized Energy Based Models | 1 +
 data/2021/iclr/Generalized Multimodal ELBO | 0
 ...Generalized Variational Continual Learning | 1 +
 ...uter Programs using Optimized Obfuscations | 1 +
 ...ape and Appearance across Multiple Domains | 1 +
 ...n-and-Language Navigation with Bayes' Rule | 1 +
 .../2021/iclr/Generative Scene Graph Networks | 1 +
 ...ve Time-series Modeling with Fourier Flows | 0
 ...y Evolution in Deep Reinforcement Learning | 1 +
 ... Algorithms for Neural Architecture Search | 1 +
 ...e Instance-reweighted Adversarial Training | 1 +
 ...ethod for Explaining Uncertainty Estimates | 1 +
 ...r Neural Networks in the Mean Field Regime | 1 +
 ...r neural networks in the mean-field regime | 1 +
 ...the flow: Adaptive control for Neural ODEs | 1 +
 ...ed Pre-Training for Table Semantic Parsing | 1 +
 ... Typically Occurs at the Edge of Stability | 1 +
 ...t Projection Memory for Continual Learning | 1 +
 ...imization in Massively Multilingual Models | 1 +
 .../Graph Coarsening with Neural Networks | 1 +
 ...tion with Low-rank Learnable Local Filters | 1 +
 data/2021/iclr/Graph Edit Networks | 1 +
 ...mation Bottleneck for Subgraph Recognition | 1 +
 ...ls: A Meta-Algorithm for Scalable Learning | 1 +
 data/2021/iclr/Graph-Based Continual Learning | 1 +
 ...aining Code Representations with Data Flow | 1 +
 ...nite-time Analysis and Improved Complexity | 1 +
 .../Grounded Language Learning Fast and Slow | 1 +
 ...mously-Acquired Skills via Goal Generation | 1 +
 ...nd Events Through Dynamic Visual Reasoning | 1 +
 ...p Equivariant Conditional Neural Processes | 1 +
 ...quivariant Generative Adversarial Networks | 1 +
 ...iant Stand-Alone Self-Attention For Vision | 1 +
 ...ks by Structured Continuous Sparsification | 1 +
 ...Aware Neural Architecture Search Benchmark | 1 +
 ...ory Forecasting with Hallucinative Intents | 1 +
 ...sarial scenarios and generalization bounds | 1 +
 ...derated Learning for Heterogeneous Clients | 1 +
 ...Deep Learning with Adaptive Regularization | 1 +
 ...sive Modeling for Neural Video Compression | 1 +
 ... Learning by Discovering Intrinsic Options | 1 +
 .../iclr/High-Capacity Expert Binary Networks | 1 +
 .../iclr/Hopfield Networks is All You Need | 1 +
 ...p Transformer for Spatiotemporal Reasoning | 1 +
 .../iclr/How Benign is Benign Overfitting ? | 1 +
 ...p Help With Robustness and Generalization? | 1 +
 ...Is Sufficient to Learn Deep ReLU Networks? | 1 +
 ... From Feedforward to Graph Neural Networks | 1 +
 ...aph Attention Design with Self-Supervision | 1 +
 ... No-Press Diplomacy via Equilibrium Search | 1 +
 ...ject and Agent Dynamics with Hypernetworks | 1 +
 ... Towards A Single Model for Multiple Tasks | 1 +
 data/2021/iclr/Hyperbolic Neural Networks++ | 1 +
 ...er Discrete Flows for Lossless Compression | 1 +
 ...-Level Pretext Tasks for Few-Shot Learning | 0
 ...aluating Generalization in Theorem Proving | 1 +
 ...ayer Reordering for Transformer Structures | 1 +
 ...w of Hamiltonian Systems via Meta-Learning | 1 +
 ...le time scales and long-range dependencies | 1 +
 ...ng Deep Reinforcement Learning from Pixels | 1 +
 ...hics and Interpretable 3D Neural Rendering | 1 +
 ... Representation Learning in Linear Bandits | 1 +
 ...nd Three-Layer Networks in Polynomial Time | 1 +
 .../iclr/Implicit Gradient Regularization | 1 +
 data/2021/iclr/Implicit Normalizing Flows | 1 +
 ...Data-Efficient Deep Reinforcement Learning | 1 +
 ...: Towards Accurate and Efficient Detectors | 0
 ...ssive Modeling with Distribution Smoothing | 1 +
 ...p-Norm Distance Metrics Using Half Spaces" | 1 +
 ...ss via Channel-wise Activation Suppressing | 1 +
 ... Spherical Sliced Fused Gromov Wasserstein | 1 +
 ...nce in Contrastive Representation Learning | 1 +
 ...ing VAEs' Robustness to Adversarial Attack | 1 +
 ...r via Disentangled Representation Learning | 1 +
 ...ion Framework for Semi-Supervised Learning | 1 +
 .../In Search of Lost Domain Generalization | 1 +
 ...rmation for Out-of-Distribution Robustness | 1 +
 ...ynamics Models for Improved Generalization | 1 +
 ...vector quantization in deep embedded space | 1 +
 .../iclr/Individually Fair Gradient Boosting | 1 +
 data/2021/iclr/Individually Fair Rankings | 1 +
 ...mporal Networks via Causal Anonymous Walks | 1 +
 ...mation for Generative Adversarial Networks | 1 +
 ...nce Functions in Deep Learning Are Fragile | 1 +
 ... from An Information Theoretic Perspective | 1 +
 .../Information Laundering for Model Privacy | 1 +
 ...Regularization of Factorized Neural Layers | 1 +
 ...ntics into Unsupervised Domain Translation | 1 +
 ...arning Useful Heuristics for Data Labeling | 1 +
 ...lity Using Self-explaining Neural Networks | 1 +
 ...ptimisation with Weisfeiler-Lehman Kernels | 0
 ...s for NLP With Differentiable Edge Masking | 1 +
 ...lation Representation from Word Embeddings | 0
 ...oosting Dropout from a Game-Theoretic View | 1 +
 ...udio-Visual Separation of On-Screen Sounds | 1 +
 ...cit learning ability that regularizes DNNs | 1 +
 ...ling for Learning on 3D Protein Structures | 1 +
 ...ttention Better Than Matrix Decomposition? | 1 +
 ...Knowledge Distillation: An Empirical Study | 1 +
 ...mark for High-level Mathematical Reasoning | 1 +
 ...Network for Generalized Zero-shot Learning | 1 +
 ...d Equivariant Graph Convolutional Networks | 1 +
 ...al Embedding Space: Clusters and Manifolds | 0
 ...learning for emergent systematicity in VQA | 1 +
 ...me Solving via Single Policy Best Response | 1 +
 ...ble, Locally Block Allocated Latent Memory | 1 +
 ...e Distillation as Semiparametric Inference | 1 +
 ...softmax regression representation learning | 1 +
 ...earnable Frontend for Audio Classification | 1 +
 ... long-range Interactions without Attention | 1 +
 ... of Source Code from Structure and Context | 1 +
 ...oblem in Neurobiology and Machine Learning | 1 +
 ...Simulation for Deep Reinforcement Learning | 1 +
 ...-Modulated Generative Adversarial Networks | 1 +
 ...mptotics for deep Gaussian neural networks | 1 +
 .../2021/iclr/Latent Convergent Cross Mapping | 1 +
 ...kill Planning for Exploration and Transfer | 1 +
 ...e Sparsity for the Magnitude-based Pruning | 1 +
 ...le Embedding sizes for Recommender Systems | 1 +
 ...planations for Sequential Decision-Making" | 0
 ...earning A Minimax Optimizer: A Pilot Study | 1 +
 ...ith Global Reference for Image Compression | 1 +
 ...ciative Inference Using Fast Weight Memory | 1 +
 ...ns Using Low-rank Adaptive Label Smoothing | 1 +
 ...or Control with Dynamics Cycle-Consistency | 1 +
 ...atures in Instrumental Variable Regression | 1 +
 ... via Coarse-to-Fine Expanding and Sampling | 0
 ...ed Models by Diffusion Recovery Likelihood | 1 +
 ...l Representations via Interactive Gameplay | 1 +
 ...ic Representations of Topological Features | 1 +
 ...ifferentiable Fluid Models that Generalize | 1 +
 ...nforcement Learning without Reconstruction | 1 +
 ... with Region Proposal Interaction Networks | 1 +
 ...h-Based Representations of Man-Made Shapes | 1 +
 ... Mesh-Based Simulation with Graph Networks | 1 +
 ...ctured Sparse Neural Networks From Scratch | 1 +
 ...ctions for Ordinary Differential Equations | 1 +
 ...mics for Molecular Conformation Generation | 1 +
 ...earning Parametrised Graph Shift Operators | 1 +
 ...mantic Graphs for Video-grounded Dialogues | 1 +
 ...stractions for Hidden-Parameter Block MDPs | 0
 ... Decentralized Neural Barrier Certificates | 1 +
 ...Edits via Incremental Tree Transformations | 1 +
 ...Subgoal Representations with Slow Dynamics | 1 +
 ...osition with Ordered Memory Policy Network | 1 +
 ...ns with Generative Neuro-Symbolic Modeling | 1 +
 ...p Policy Gradients using Residual Variance | 1 +
 ...Learning What To Do by Simulating the Past | 1 +
 ...ng Problems using Variational Autoencoders | 0
 ...ng a Latent Simplex in Input Sparsity Time | 1 +
 ...ed mathematical computations from examples | 1 +
 ...ntations for Deep One-Class Classification | 1 +
 ...rom sparse data with graph neural networks | 1 +
 ...earning explanations that are hard to vary | 1 +
 ...ion with Weakly Supervised Disentanglement | 1 +
 ...tructure with Geometric Vector Perceptrons | 1 +
 ...iding dataset biases without modeling them | 1 +
 ...turbation sets for robust machine learning | 1 +
 ...arning the Pareto Front with Hypernetworks | 2 ++
 ...Augmented Models via Targeted Perturbation | 1 +
 ...D Shapes with Generative Cellular Automata | 1 +
 ...ke Decisions via Submodular Regularization | 0
 ...ach Goals via Iterated Supervised Learning | 1 +
 ...mple Data For Compositional Generalization | 1 +
 ...ues as a Hypergraph on the Action Vertices | 1 +
 ...lobal Contexts in Experience Replay Buffer | 1 +
 ...
Set Waypoints for Audio-Visual Navigation | 1 + ...h separate excitatory and inhibitory units | 1 + ...o: Adversarially Motivated Intrinsic Goals | 1 + ...endent Label Noise: A Progressive Approach | 1 + ...ndent Label Noise: A Sample Sieve Approach | 1 + ...based Support Estimation in Sublinear Time | 1 + ...elong Learning of Compositional Structures | 1 + .../LiftPool: Bidirectional ConvNet Pooling | 1 + ...ecentralized Optimization with Compression | 1 + ...e in Constrained Saddle-point Optimization | 2 ++ ...tivity in Multitask and Continual Learning | 1 + ...nt Ascent with Finite Timescale Separation | 0 ...s for Rank-Constrained Convex Optimization | 1 + ...ee Weight Sharing for Network Width Search | 1 + ...ce of Winning Tickets in Lifelong Learning | 1 + ...a : A Benchmark for Efficient Transformers | 1 + .../Long-tail learning via logit adjustment | 1 + ...Routing Diverse Distribution-Aware Experts | 1 + ...n via Convergence-Simulation Driven Search | 1 + ...tructured Convolutional Models via Lifting | 1 + ...Social Media Users from Facial Recognition | 1 + ...everse accurate integrator for Neural ODEs | 1 + ...ampling for Multi-objective Drug Discovery | 1 + ...-Level Relationships for Few-Shot Learning | 0 ...ated Data Augmentation in the Latent Space | 1 + ...work for Efficient Neural Network Training | 1 + ...ale Organization of Neural Language Models | 1 + ...ing via Self-supervised Skip-tree Training | 1 + ...g Massive Multitask Language Understanding | 1 + .../Memory Optimization for Deep Networks | 1 + data/2021/iclr/Meta Back-Translation | 1 + ...aussian VAE for Unsupervised Meta-Learning | 0 ... Task Distributions in Humans and Machines | 1 + .../Meta-Learning with Neural Tangent Kernels | 1 + ...-learning Symmetries by Reparameterization | 1 + ...Meta-learning with negative learning rates | 1 + ... Normalize Few-Shot Batches Across Domains | 1 + ... 
Experts for Unsupervised Image Clustering | 1 + ...rence in Sequential Latent-Variable Models | 1 + ...ind the Pad - CNNs Can Develop Blind Spots | 1 + .../Minimum Width for Universal Approximation | 1 + ...lgorithm that directly controls perplexity | 1 + ...istillation of Large-scale Language Models | 1 + ...ed-Features Vectors and Subspace Splitting | 0 ...pervised Learning with Momentum Prototypes | 1 + ...onvolutions for Visual Counting and Beyond | 1 + ...oup Performance Gap with Data Augmentation | 1 + data/2021/iclr/Model-Based Offline Planning | 1 + ... with Self-Supervised Functional Distances | 1 + ...odel properties and which model to choose? | 1 + ...er in Distributionally Robust Optimization | 1 + ...ramework for Task-oriented Dialogue System | 1 + ...cule Optimization by Explainable Evolution | 0 .../iclr/Monotonic Kronecker-Factored Lattice | 1 + ...rning with Language Action Value Estimates | 1 + ...ild Convolutional Neural Network Ensembles | 1 + ...ual Information Maximization-based Binning | 1 + ...GD for Heterogeneous Hierarchical Networks | 1 + ...rks by Pruning A Randomly Weighted Network | 1 + ...tworks for Irregularly Sampled Time Series | 1 + ...hastic process identifies causes of cancer | 1 + ...sentation Learning in LSTM Language Models | 1 + ...ion answering over text, tables and images | 1 + data/2021/iclr/Multiplicative Filter Networks | 1 + ...Matching for Out-of-Distribution Detection | 1 + ...ecasting via Conditioned Normalizing Flows | 1 + ...Mutual Information State Intrinsic Control | 1 + ...hology in Graph-Based Incompatible Control | 1 + ...Architecture Search for Speech Recognition | 1 + .../iclr/NBDT: Neural-Backed Decision Tree | 1 + ...Search for End-to-end Learning and Control | 1 + ...ive Features for Robust 3D Pose Estimation | 1 + .../iclr/Nearest Neighbor Machine Translation | 1 + data/2021/iclr/Negative Data Augmentation | 1 + ...F: Effective Deep Modeling of Tabular Data | 1 + ...tters: A Case Study on Retraining 
Variants | 1 + ... Sufficient Statistics for Implicit Models | 1 + ...ours: A Theoretically Inspired Perspective | 1 + ...ackdoor Triggers from Deep Neural Networks | 1 + .../iclr/Neural Delay Differential Equations | 1 + ...t Continuous-Time Prediction and Filtering | 1 + ...orial Problems in Structured Output Spaces | 1 + ...onservation Laws in Deep Learning Dynamics | 1 + ...ual G-Invariances from Single Environments | 1 + data/2021/iclr/Neural ODE Processes | 1 + .../Neural Pruning via Growing Regularization | 1 + .../Neural Spatio-Temporal Point Processes | 1 + ...nthesis of Binaural Speech From Mono Audio | 0 data/2021/iclr/Neural Thompson Sampling | 1 + .../Neural Topic Model via Optimal Transport | 1 + ...al: improved quantized and sparse training | 1 + .../Neural networks with late-phase weights | 1 + ...nd generation for RNA secondary structures | 1 + data/2021/iclr/Neurally Augmented ALISTA | 1 + ...ted Mean Estimation and Variance Reduction | 1 + ...or Making Better Mistakes in Deep Networks | 1 + ...and stable training of energy-based models | 1 + ...el noise helps combat inherent label noise | 1 + ...of Image Backgrounds in Object Recognition | 1 + ...-policy Evaluation: Primal and Dual Bounds | 1 + .../Nonseparable Symplectic Neural Networks | 1 + ...ccelerating Offline Reinforcement Learning | 1 + ...ining for Transfer with Domain Classifiers | 1 + ...a Normalized Maximum Likelihood Estimation | 1 + ...Consistency-Based Semi-Supervised Learning | 1 + ...g and Mitigating Bias in Graph Connections | 1 + ...Adaptation in Model-Agnostic Meta-Learning | 1 + ...eural Networks versus Graph-Augmented MLPs | 1 + ...Retrieval, and Sparse Matrix Factorization | 2 ++ ...Universal Representations Across Languages | 1 + data/2021/iclr/On Position Embeddings in BERT | 1 + ...d Image Representations for GAN Evaluation | 0 ...In Active Learning: How and When to Fix It | 1 + ...al Networks and its Practical Implications | 1 + ...entions in Adaptive Human-AI 
Collaboration | 1 + ...s: Approximation and Optimization Analysis | 1 + ... the Dynamics of Training Attention Models | 1 + ...bal Convergence in Multi-Loss Optimization | 1 + ...ularization in Stochastic Gradient Descent | 1 + ...ptions, Explanations, and Strong Baselines | 1 + ...g: Global Convergence with Implicit Layers | 1 + ...gled Representations in Realistic Settings | 1 + ... Rotation Equivariant Point Cloud Networks | 2 ++ ...ouble Descent Peak in Ridgeless Regression | 1 + ...n and memorization in deep neural networks | 1 + ...networks and Restricted Boltzmann Machines | 1 + ...in model-based deep reinforcement learning | 1 + ...ithic Task Formulations in Neural Networks | 1 + ...fication based on Self-supervised Learning | 1 + ...en Question Answering over Tables and Text | 1 + ...Neural Networks to Spiking Neural Networks | 1 + ...Descent under Neural Tangent Kernel Regime | 1 + ...Regularization can Mitigate Double Descent | 1 + ... Generalized Linear Function Approximation | 1 + ... Evolutionary Graph Reinforcement Learning | 1 + ...olutional Layers with the Cayley Transform | 1 + ...Profit: Instance-Adaptive Data Compression | 1 + ... worst-case generalisation: friend or foe? | 1 + ...ctions for Deep Neural Network Classifiers | 1 + ...frame Reconstruction from Raw Point Clouds | 1 + .../PDE-Driven Spatiotemporal Disentanglement | 1 + ...ng: Principled masking of correlated spans | 1 + ...poral Convolution on Point Cloud Sequences | 1 + ...sformers for Video Representation Learning | 1 + .../2021/iclr/Parameter-Based Value Functions | 1 + ...havioral Priors for Reinforcement Learning | 1 + .../iclr/Partitioned Learned Bloom Filters | 1 + ...ness: Defense Against Unseen Threat Models | 1 + ...arning with First Order Model Optimization | 1 + ... order reduction with guaranteed stability | 1 + ...xed Reward Shaping for Goal-Directed Tasks | 1 + ... 
from Pixels using Inverse Dynamics Models | 1 + ...tion Benchmark with Differentiable Physics | 1 + ...points for Keypoint Based Object Detection | 0 ... Hard-label Black-box Adversarial Examples | 0 ...lo Tree Search Applied to Molecular Design | 1 + ...rrent Learning with a Sparse Approximation | 1 + ...nsformers for Concept-centric Common Sense | 1 + ...ccuracy When Adding New Unobserved Classes | 1 + ...ing Inductive Biases of Pre-Trained Models | 1 + ...fectiousness for Proactive Contact Tracing | 1 + ...sation over directed actions by grid cells | 1 + .../Primal Wasserstein Imitation Learning | 1 + ...stem Side Channels Using Generative Models | 0 data/2021/iclr/Private Post-GAN Boosting | 1 + ...stic Numeric Convolutional Neural Networks | 1 + .../iclr/Probing BERT in Hyperbolic Spaces | 1 + ... more fat from a network at initialization | 1 + ... Conditional Sampling of Normalizing Flows | 1 + ...toencoder via Invertible Mutual Dependence | 1 + ... Theft using an Ensemble of Diverse Models | 1 + ...e Learning of Unsupervised Representations | 1 + ...sentation Learning for Relation Extraction | 1 + ... Learning with Combinatorial Latent States | 1 + ...ion of adversarial examples with detection | 1 + ...able Convergence under K\305\201 Geometry" | 1 + ...itialization: Why Are We Missing the Mark? | 1 + ...ng Pseudo Labels for Semantic Segmentation | 1 + ...LEX: Duplex Dueling Multi-Agent Q-Learning | 1 + ...uantifying Differences in Reward Functions | 1 + ...-GAP: Recursive Gradient Attack on Privacy | 1 + ...prop converges with proper hyper-parameter | 1 + ...ic Rules for Reasoning on Knowledge Graphs | 1 + ...rning Roles to Decompose Multi-Agent Tasks | 1 + data/2021/iclr/Random Feature Attention | 1 + .../iclr/Randomized Automatic Differentiation | 1 + ... Q-Learning: Learning Fast Without a Model | 1 + ...ion in Procedurally-Generated Environments | 1 + ...-Through Gumbel-Softmax Gradient Estimator | 1 + ... 
Learning to Generate Graphs from Datasets | 1 + .../Rapid Task-Solving in Novel Environments | 1 + .../iclr/Recurrent Independent Mechanisms | 1 + ...erative Models with Binary Neural Networks | 1 + ...ive Models via Discriminator Gradient Flow | 1 + ...- An Empirical Study on Continuous Control | 1 + ...Regularized Inverse Reinforcement Learning | 1 + .../Reinforcement Learning with Random Delays | 1 + ...Framework for Multimodal Generative Models | 1 + ...xplanations Reduce Catastrophic Forgetting | 23 +++++++++++++++++++ ...ntributions Using Out-of-Distribution Data | 1 + ...Offline Model-based Reinforcement Learning | 1 + ...th Deep Autoencoding Predictive Components | 1 + ...n Learning via Invariant Causal Mechanisms | 1 + ...tion accuracy of clinical factors from EEG | 1 + ...l Programs with Blended Abstract Semantics | 1 + ...for Robust Out-of-domain Few-Shot Learning | 1 + ...: Neural ODEs and Their Numerical Solution | 1 + ...ifelong Learning with Skill-Space Planning | 1 + ...chitecture Selection in Differentiable NAS | 1 + .../iclr/Rethinking Attention with Performers | 1 + ...ng Coupling in Pre-trained Language Models | 1 + ...sitional Encoding in Language Pre-training | 1 + ...tion: A Bias-Variance Tradeoff Perspective | 1 + ...ibution Methods for Model Interpretability | 1 + ...tion for Code Summarization via Hybrid GNN | 1 + ...tation Learning for Reinforcement Learning | 1 + ...namic Convolution via Matrix Decomposition | 1 + .../Revisiting Few-sample BERT Fine-tuning | 1 + ... for Persistent Long-Term Video Prediction | 1 + ...ing: an Alternative to End-to-end Training | 1 + ...es by Minimizing the Maximal Expected Loss | 1 + ...Analysis of Nonlinear Feedforward Networks | 1 + ...Risk-Averse Offline Reinforcement Learning | 1 + ...re Bayesian Networks in Nearly-Linear Time | 1 + ... 
mitigated by properly learned smoothening | 1 + .../iclr/Robust Pruning at Initialization | 1 + ...bservations with Learned Optimal Adversary | 1 + ...sentation Learning via Random Convolutions | 1 + ...Hindering the memorization of noisy labels | 1 + ...Accurate and Fast Neural Network Inference | 0 ...D: Sign Agnostic Learning with Derivatives | 1 + ...ntation in Conversational Semantic Parsing | 0 ...Networks toward Greedy Block-wise Learning | 0 ...sed Distillation For Visual Representation | 1 + ...orcement Learning in Unstable Environments | 1 + ...e Orthogonal Learned and Random Embeddings | 1 + ...work for Self-Supervised Outlier Detection | 1 + ...erring When Diagnosing Poor Generalization | 1 + ...ntation Strategy for Better Regularization | 1 + ...ient Automated Deep Reinforcement Learning | 1 + ...le Bayesian Inverse Reinforcement Learning | 1 + ...Nonsymmetric Determinantal Point Processes | 1 + ...lable Transfer Learning with Expert Models | 1 + ...ing Gradients for Neural Model Explanation | 1 + ...caling the Convex Barrier with Active Sets | 0 ... 
through Stochastic Differential Equations | 1 + ...tion Can Magnify Disparities Across Groups | 1 + ...causal impact of class selectivity in DNNs | 1 + ...arning of Compressed Video Representations | 1 + ...rvised Policy Adaptation during Deployment | 1 + ...stness for the Low-label, High-data Regime | 1 + ...sed Learning from a Multi-view Perspective | 1 + ...n Learning with Relative Predictive Coding | 1 + ...arning with Object-centric Representations | 1 + ...t Transfer Across Extreme Task Differences | 1 + ...emantic Re-tuning with Contrastive Tension | 0 .../Semi-supervised Keypoint Localization | 1 + ...variance for Enforcing Individual Fairness | 1 + ...aration and Concentration in Deep Networks | 1 + ...f Sequences by Low-Rank Tensor Projections | 1 + ...taneous Optimization of Speed and Accuracy | 1 + ...tructure as Conditional Density Estimation | 1 + ...erstanding Discriminative Features in CNNs | 1 + ...e-Texture Debiased Neural Network Training | 1 + data/2021/iclr/Shapley Explanation Networks | 1 + ...hapley explainability on the data manifold | 1 + ...ific Capacity for Multilingual Translation | 1 + ...ith Gradient-dominated Objective Functions | 1 + ...n for Efficiently Improving Generalization | 1 + ...gsignature transforms, on both CPU and GPU | 1 + ...Goes a Long Way: ADRL for DNN Quantization | 1 + .../iclr/Simple Spectral Graph Convolution | 1 + .../iclr/Single-Photon Image Classification | 3 +++ ...tic Provably Finds Globally Optimal Policy | 1 + ... 
RNN with Strict Upper Computational Limit | 1 + .../iclr/Sliced Kernelized Stein Discrepancy | 1 + ...ement Learning Problems via Task Reduction | 1 + .../iclr/Sparse Quantized Spectral Clustering | 1 + ...ions in probabilistic matrix factorization | 1 + ...ers for Improved Generative Image Modeling | 1 + .../Spatially Structured Recurrent Modules | 1 + ...Spatio-Temporal Graph Scattering Transform | 1 + .../iclr/Stabilized Medical Image Attacks | 1 + ...tistical inference for individual fairness | 1 + ...g Long-Run Dynamics of Energy-Based Models | 1 + ...lation between Augmented Natural Languages | 1 + ...for Pre-trained Language Model Fine-tuning | 1 + ...cks for video-text representation learning | 1 + ...Aware Actor-Critic for 3D Molecular Design | 1 + ...alisation with group invariant predictions | 1 + ...tes on the Fly Helps Language Pre-Training | 0 .../iclr/Taming GANs with Lookahead-Minmax | 1 + ... Networks via Flipping Limited Weight Bits | 1 + .../iclr/Task-Agnostic Morphology Evolution | 1 + ...eaching Temporal Logics to Neural Networks | 1 + data/2021/iclr/Teaching with Commentaries | 1 + ...ally-Extended \316\265-Greedy Exploration" | 0 ...st-Time Adaptation by Entropy Minimization | 1 + ...Generation by Learning from Demonstrations | 1 + ...ine Learners are Good Offline Generalizers | 0 ...imism in Fixed-Dataset Policy Optimization | 1 + ...nsion of Images and Its Impact on Learning | 1 + .../iclr/The Recurrent Neural Tangent Kernel | 1 + .../The Risks of Invariant Risk Minimization | 1 + ...ce of Adaptive Polyak's Heavy-ball Methods | 1 + ...arning Through Spatial Variable Embeddings | 1 + ...ches in Deep Convolutional Kernels Methods | 1 + ...of integration in text classification RNNs | 1 + ...LU networks on orthogonally separable data | 0 ... 
role of Disentanglement in Generalisation | 0 ...ining with Deep Networks on Unlabeled Data | 1 + ...unds on estimation error for meta-learning | 1 + .../iclr/Tilted Empirical Risk Minimization | 1 + ...rvised Bayesian Recovery of Corrupted Data | 1 + ...e Segmentation Using Discrete Morse Theory | 1 + ...for High-fidelity Few-shot Image Synthesis | 1 + .../Towards Impartial Multi-task Learning | 1 + ...n Natural Data with Temporal Sparse Coding | 1 + ...ix Factorization: Greedy Low-Rank Learning | 1 + ...ust Neural Networks via Close-loop Control | 1 + ...gainst Natural Language Word Substitutions | 1 + ...s in Data Augmentation: An Empirical Study | 1 + ...xpressive Power of Random Features in CNNs | 1 + ...ugmentations via Contrastive Discriminator | 1 + ...ependent subnetworks for robust prediction | 1 + ...zation Noise for Extreme Model Compression | 1 + ...n using Equivariant Continuous Convolution | 1 + ...models are unsupervised structure learners | 1 + ...eralisation in Deep Reinforcement Learning | 1 + ...cting Linear Terms in Deep Neural Networks | 1 + .../iclr/Trusted Multi-View Classification | 1 + ...ssion for efficient recommendation systems | 1 + ...RL via Policy Decoupling with Transformers | 0 ...acher for Semi-Supervised Object Detection | 1 + ...ation with Finite-State Probabilistic RNNs | 1 + ...on in Autoregressive Structured Prediction | 1 + ...age Classifiers using Conformal Prediction | 1 + ...rtainty in Gradient Boosting via Ensembles | 1 + ...e Learning for Optimal Bayesian Classifier | 0 ...ization in Generative Adversarial Networks | 1 + ...er Fusion in Sequence-to-Sequence Learning | 1 + ...l Choice in Non-Autoregressive Translation | 1 + ...sm and sparsity on neural network training | 1 + ...odes of out-of-distribution generalization | 1 + ... 
of importance weighting for deep learning | 1 + ...A Nasty Teacher That CANNOT teach students | 1 + ...n by Pixel-to-Segment Contrastive Learning | 1 + ...ural networks via nonlinear control theory | 1 + ...amples: Making Personal Data Unexploitable | 1 + ...visual Synthesis via Exemplar Autoencoders | 1 + ...iscovery of 3D Physical Objects from Video | 1 + ...t-Space Interpolation in Generative Models | 1 + ...earning using Local Spatial Predictability | 1 + ...e Series with Temporal Neighborhood Coding | 1 + ...of Optimal Representations During Training | 1 + ...lyze and leverage compositionality in GANs | 1 + ...-RED2: Video Adaptive Redundancy Reduction | 1 + ...ional Autoencoders and Energy-based Models | 1 + ...ng Causal Effects of Continuous Treatments | 1 + ...sformer Network for Object Goal Navigation | 1 + ...eck for Effective Low-Resource Fine-Tuning | 1 + .../Variational Intrinsic Control Revisited | 1 + ...Localisation and Dense 3D Mapping in 6 DoF | 1 + ...er Networks and Polynomial-time Algorithms | 1 + ...e Models and Can Outperform Them on Images | 1 + ...s for Unsupervised Representation Learning | 1 + ...hanism for Online RL with Unknown Dynamics | 1 + ...mperceptible Warping-based Backdoor Attack | 1 + ...d: Online contextualized few-shot learning | 1 + .../Wasserstein Embedding for Graph Learning | 1 + .../iclr/Wasserstein-2 Generative Networks | 1 + ...cial Perception and Human-AI Collaboration | 1 + ...timating Gradients for Waveform Generation | 1 + ...ual Representation from Human Interactions | 1 + ...Discrimination Good for Transfer Learning? | 1 + ... Actor-Critic Methods? A Large-Scale Study | 1 + ...Not Be Contrastive in Contrastive Learning | 1 + ...ine RL with Linear Function Approximation? | 2 ++ ...dy of inductive biases in seq2seq learners | 2 ++ data/2021/iclr/When Do Curricula Work? | 1 + ...ng f-Divergence is Robust with Label Noise | 1 + ...econditioning help or hurt generalization? 
| 1 + ...ample-Efficient than Fully-Connected Nets? | 1 + ...ng sampling bias with stochastic gradients | 1 + ...nt via Semi-Markov Afterstate Actor-Critic | 1 + ...Scale Data Poisoning via Gradient Matching | 1 + ...erence with Ultra-Low-Precision Arithmetic | 1 + ...ce with Online Learning from User Feedback | 1 + ...l Supervision for Semantic Image Synthesis | 1 + .../Zero-Cost Proxies for Lightweight NAS | 1 + ...t Synthesis with Group-Supervised Learning | 1 + ...stem identification and visuomotor control | 1 + ...gy for Contrastive Representation Learning | 1 + ... Modelling with Missing not at Random Data | 1 + ...bit Optimizers via Block-wise Quantization | 1 + ...Pathways and Imaging Phenotypes of Disease | 0 ...rson Mixing Methods and Their Applications | 0 ... Representative Variable Selection Methods | 1 + ...ent Paradigm for 3D Point Cloud Completion | 1 + ...ional Approach to Clustering Survival Data | 1 + ...ine-Grained Analysis on Distribution Shift | 1 + ...e-Tuning Approach to Belief State Modeling | 1 + ... 
Representation for Reinforcement Learning | 1 + ...-Selection for Stochastic Gradient Descent | 1 + ...d for Computational Learning and Inversion | 1 + ...ss Framework for Randomly Initialized CNNs | 1 + ...ning Instabilities of Deep Learning Models | 0 ...nel Perspective of Infinite Tree Ensembles | 1 + ...l Networks Go Beyond Weisfeiler-Lehman?\"" | 1 + ...Deep RELU Network Under Noisy Observations | 1 + ...m to Build E(N)-Equivariant Steerable CNNs | 0 ...rvative Bandits and Reinforcement Learning | 1 + ...tion in Model-Based Reinforcement Learning | 1 + ...ribution Detection in Deep Neural Networks | 1 + ...Normalizing Flow Toward Energy-Based Model | 1 + ...m Inputs and Advantage over Fixed Features | 1 + .../A Theory of Tournament Representations | 1 + ...Generative Ability of Adversarial Training | 1 + ...ustness Framework for Adversarial Training | 1 + ...s Architecture-Independent Model Distances | 1 + ...mal transport: analysis and implementation | 1 + ...he randomized singular value decomposition | 1 + ...mplicit networks via over-parameterization | 1 + ...rence Applied To Pyramidal Bayesian Models | 1 + ...n Using Adversarial Extreme Value Analysis | 1 + ...-Explicit Matching and Implicit Similarity | 1 + ... Axial Shifted MLP Architecture for Vision | 1 + ...by Pairing GNNs with Neural Wave Functions | 1 + ...ng with Parallel Differentiable Simulation | 1 + ...th Alleviated Forgetting in Local Training | 1 + ...ith Stable Subgoal Representation Learning | 1 + ...n a Large-Scale Imperfect-Information Game | 1 + ...ased towards high entropy optimal policies | 1 + ...Neighbour Discovery in the Structure Space | 1 + ...stance-adaptive Data Augmentation Policies | 1 + ...-Supervised Learning and Domain Adaptation | 1 + ...o Adapt in Transfer Reinforcement Learning | 1 + ...twork for 3D Shape Representation Learning | 1 + ... 
Retriever-Ranker for Dense Text Retrieval | 1 + ...l Robustness Through the Lens of Causality | 1 + data/2022/iclr/Adversarial Support Alignment | 1 + ...ng of Backdoors via Implicit Hypergradient | 1 + .../Adversarially Robust Conformal Prediction | 1 + ...dictions against Adversarial Perturbations | 1 + ...sed Proof Cost Network to Aid Game Solving | 1 + ...iation for Stochastic Bilevel Optimization | 1 + ...lanning and Synthesizable Molecular Design | 1 + ...to Federated Learning with Class Imbalance | 1 + ...Molecular Geometry Generation from Scratch | 1 + ...tive on Model-Based Reinforcement Learning | 1 + ...xt Learning as Implicit Bayesian Inference | 1 + ...arning with Instance-Dependent Label Noise | 1 + ...retic View On Pruning Deep Neural Networks | 1 + ...ayer-Peeled Perspective on Neural Collapse | 1 + ...Variance in Diffusion Probabilistic Models | 1 + ... Landscape of Noise-Contrastive Estimation | 1 + ...Ornstein-Uhlenbeck variational autoencoder | 1 + ...ndom Feature Regression in High Dimensions | 1 + ...ar Data with Internal Contrastive Learning | 1 + ...aly Detection with Association Discrepancy | 1 + ...onfidence Bonuses For Scalable Exploration | 1 + ...r Domain Analysis: From Theory to Practice | 1 + ...ense Prediction with Confidence Adaptivity | 1 + ...Convolutional Models: a Kernel Perspective | 1 + ...ing Generalization of SGD via Disagreement | 1 + ...on that Works on CNN, RNN, and Transformer | 0 ...ally-invariant Classification in OOD Tasks | 0 ...ased adversarial black-box methods is easy | 1 + ...Interpretability with Concept Transformers | 1 + ...ightweight, Noise-Robust, and Transferable | 1 + .../Augmented Sliced Wasserstein Distances | 1 + ...ning to Route Transferable Representations | 1 + ...aling Vision Transformers without Training | 1 + ...omated Self-Supervised Learning for Graphs | 1 + ...mize Problems with Strong Ranking Property | 1 + ...ntric Abstractions for High-Level Planning | 1 + ...ement Learning: Formalism and 
Benchmarking | 1 + .../2022/iclr/Autoregressive Diffusion Models | 1 + ...lows for Predictive Uncertainty Estimation | 1 + ...Search, Retrieval, and Similarity Learning | 1 + .../2022/iclr/BAM: Bayes with Adaptive Memory | 1 + ...for Fast and High-Quality Speech Synthesis | 1 + ...T: BERT Pre-Training of Image Transformers | 1 + ... Improving Real-time Predictions in Future | 1 + ...efense via Decoupling the Training Process | 1 + ...tacks to Pre-trained NLP Foundation Models | 1 + ...gation Boosts Self-supervised Distillation | 1 + ...ack, and Self-Reinforcing User Preferences | 0 .../Bayesian Framework for Gradient Leakage | 1 + ...r Learning to Optimize: What, Why, and How | 1 + .../Bayesian Neural Network Priors Revisited | 1 + ...marking the Spectrum of Agent Capabilities | 1 + ...visory Signals by Observing Learning Paths | 1 + ...Adversarial Examples for Black-box Domains | 1 + ...orks for Multi-goal Reinforcement Learning | 0 .../BiBERT: Accurate Fully Binarized BERT | 1 + ...r Phase Retrieval of Meromorphic Functions | 1 + .../Boosted Curriculum Reinforcement Learning | 1 + ...moothing with Variance Reduced Classifiers | 1 + ...ied Robustness of L-infinity Distance Nets | 1 + data/2022/iclr/Bootstrapped Meta-Learning | 1 + ...mantic Segmentation with Regional Contrast | 1 + .../iclr/Bregman Gradient Policy Optimization | 1 + ...Marketing via Recurrent Intensity Modeling | 1 + ... 
Problems with Inscrutable Representations | 1 + ...ive Approach to Exploring Many-to-one Maps | 1 + ...ng on Heterogeneous Datasets via Bucketing | 1 + ...urriculum for Learning Goal-Reaching Tasks | 1 + ...entiable Data Augmentation for EEG Signals | 1 + ...sformer for Unsupervised Domain Adaptation | 1 + ...ous Kernel Convolution For Sequential Data | 1 + ...te Research Transparency and Comparability | 1 + ...rcement Learning against Poisoning Attacks | 1 + ...tionary Distribution Correction Estimation | 1 + ...ment Learning through Functional Smoothing | 1 + ...Classifier Suffice For Action Recognition? | 1 + ...early Classified Under All Possible Views? | 1 + ...Locality in Non-parametric Language Models | 1 + ...lization in textual reinforcement learning | 1 + ...extual Bandits with Targeted Interventions | 1 + ...rium Models via Interval Bound Propagation | 1 + ...trastive Learning via Augmentation Overlap | 1 + ...rs via Gradient-based Subword Tokenization | 1 + ...ion-Aware Molecule Representation Learning | 1 + ...ive GAN for Conditional Waveform Synthesis | 1 + .../iclr/Churn Reduction via Distillation | 1 + ... 
Inverse Task for Dynamic Scene Deblurring | 1 +
 ...e Awareness by Generating Images of Floods | 1 +
 ...ng Generative Models in Zero-shot Learning | 1 +
 ...ontrastive BERT for Reinforcement Learning | 1 +
 .../iclr/CoMPS: Continual Meta Policy Search | 1 +
 ...epresentations for Time Series Forecasting | 1 +
 ...ng an Extensible Relational Representation | 1 +
 ...ime Series for Accelerated Active Learning | 0
 ...s with Incomplete or Missing Neighborhoods | 1 +
 ...g Class-conditional GANs with Limited Data | 1 +
 ...easoning of Objects and Events from Videos | 1 +
 ...ritic Methods for Homogeneous Markov Games | 1 +
 ...ng Differences that Affect Decision Making | 1 +
 ...-Neuron Relaxation Guided Branch-and-Bound | 1 +
 ...ention: Disentangling Search and Retrieval | 1 +
 ...ining for End-to-End Deep AUC Maximization | 1 +
 ...ngle Source Cross-Domain Few-Shot Learning | 1 +
 ...ersarial Learning for Large-Batch Training | 1 +
 ...nditional Contrastive Learning with Kernel | 1 +
 ... by Conditioning Variational Auto-Encoders | 1 +
 ...itional Object-Centric Learning from Video | 1 +
 ...sequence Networks with Learned Activations | 0
 ...iable Model of Whole-Brain Neural Activity | 1 +
 ...Consistent Counterfactuals for Deep Models | 1 +
 ...mical System Identification and Prediction | 1 +
 ...icy Optimization via Bayesian World Models | 1 +
 ...ing Linear-chain CRFs to Regular Languages | 1 +
 ...hogonal Convolutions in an Explicit Manner | 1 +
 ... Transfer using Generalized Policy Updates | 1 +
 ... Manipulations with Differentiable Physics | 1 +
 ...text-Aware Sparse Deep Coordination Graphs | 1 +
 ...ation for Generative Commonsense Reasoning | 1 +
 ...ntinual Learning with Filter Atom Swapping | 1 +
 ...rning with Recursive Gradient Optimization | 1 +
 ...ormalization for Online Continual Learning | 1 +
 ...Learning with Forward Mode Differentiation | 1 +
 ...s via Reward-Switching Policy Optimization | 1 +
 ...Parallel Data for Unsupervised Translation | 0
 ...tering via Generative Adversarial Networks | 1 +
 ...ling Directions Orthogonal to a Classifier | 1 +
 ...ipschitz Constant improves Polynomial Nets | 1 +
 data/2022/iclr/Convergent Graph Solvers | 1 +
 ...nt and Efficient Deep Q Learning Algorithm | 0
 ...presentation with a Split MLP Architecture | 1 +
 ... Modules Through a Shared Global Workspace | 1 +
 ...ctual Plans under Distributional Ambiguity | 1 +
 ...raining Sets via Weak Indirect Supervision | 1 +
 ...itical Points in Quantum Generative Models | 1 +
 ...n Imitation Learning via Optimal Transport | 1 +
 ...eighted Language-Invariant Representations | 1 +
 ...earning for Zero-Shot Generalization in RL | 1 +
 ...g to Search in Bottom-Up Program Synthesis | 1 +
 ...ansformer Hinging on Cross-scale Attention | 1 +
 ... for Open-Set Single Domain Generalization | 1 +
 ... Human Demonstrations for Offline Learning | 1 +
 ...toencoder for Periodic Material Generation | 1 +
 ...o uncover learning principles in the brain | 1 +
 ...namic Scale Networks for Multi-View Stereo | 1 +
 ...MLP-like Architecture for Dense Prediction | 1 +
 ...losed-form ODEs from Observed Trajectories | 1 +
 ...c Anchor Boxes are Better Queries for DETR | 1 +
 ...entation in Offline Reinforcement Learning | 1 +
 ...ased Explanation for Graph Neural Networks | 1 +
 ...rning for Periodic Time Series Forecasting | 1 +
 ...aneous Explanations via Concept Traversals | 1 +
 ...IVA: Dataset Derivative of a Learning Task | 1 +
 ...ering Layer for Neural Network Compression | 1 +
 ... Learning Requires Explicit Regularization | 1 +
 ...nition with Optimal Transport Distillation | 1 +
 ...ing Won't Save You From Facial Recognition | 1 +
 ...ion for Architecting Hardware Accelerators | 1 +
 ... Grammar Learning for Molecular Generation | 1 +
 ...ol with a Deep Stochastic Koopman Operator | 0
 ...ity in MARL via Trust-Region Decomposition | 1 +
 ... Multi-Agent Kernel Approximation Approach | 1 +
 ...clarative nets that are equilibrium models | 1 +
 ...tive Biases of Hamiltonian Neural Networks | 1 +
 ...aptation for Cross-Domain Object Detection | 1 +
 .../iclr/Deep Attentive Variational Inference | 1 +
 data/2022/iclr/Deep AutoAugment | 1 +
 ...he All-Round Blessings of Dynamic Sparsity | 1 +
 ...haping the Kernel with Tailored Rectifiers | 1 +
 .../2022/iclr/Deep Point Cloud Reconstruction | 1 +
 ...eep ReLU Networks Preserve Expected Length | 1 +
 ...ruptions Through Adversarial Augmentations | 1 +
 ...sis for Evaluation of Data Representations | 1 +
 ...ith Supplementary Imperfect Demonstrations | 1 +
 ...ization Models and Implicit Regularization | 1 +
 ...ty in Automatic Speech Recognition Systems | 1 +
 ...or Conditional Score-based Data Generation | 1 +
 ...r: Tiny Transformer with Shared Dictionary | 1 +
 ...Deformable Object Manipulations with Tools | 1 +
 data/2022/iclr/Differentiable DAG Sampling | 1 +
 ...ximization for Set Representation Learning | 1 +
 ... Scene Reconstructions from a Single Image | 1 +
 ...d Language Models Better Few-shot Learners | 1 +
 ...Scaffolding Tree for Molecule Optimization | 1 +
 ...lly Private Fine-tuning of Language Models | 1 +
 ...ents Estimation with Polylogarithmic Space | 1 +
 ...th Fast Maximum Likelihood Sampling Scheme | 1 +
 ...overy for State Covering and Goal Reaching | 1 +
 ...riant Rationales for Graph Neural Networks | 1 +
 ...iscovering Latent Concepts Learned in BERT | 1 +
 ... Scarce Data with Physics-encoded Learning | 1 +
 ...ning the Representation Bottleneck of DNNS | 1 +
 ...ased Active Learning for Domain Adaptation | 1 +
 ...s Strengthen Vision Transformer Robustness | 1 +
 ...criminative Similarity for Data Clustering | 1 +
 ...sis with Partial Information Decomposition | 1 +
 ...lets for X2I Translation with Limited Data | 1 +
 ...stribution Compression in Near-Linear Time | 1 +
 ...nforcement Learning with Monotonic Splines | 1 +
 ...Principal Components via Geodesic Descents | 1 +
 ...t Models with Parametric Likelihood Ratios | 1 +
 ...s from Periodically Shifting Distributions | 1 +
 .../Dive Deeper Into Integral Pose Regression | 1 +
 ...e-aware Federated Self-Supervised Learning | 1 +
 ...rated Learning via Submodular Maximization | 0
 ...s Image Recognition Performance in AlexNet | 1 +
 ...al Coordinates on the Latent Space of GANs | 1 +
 ...ision? A User Study, Baseline, And Dataset | 1 +
 ...We Need Anisotropic Graph Neural Networks? | 1 +
 ...works transfer invariances across classes? | 1 +
 ...thing on graphs with tabular node features | 1 +
 ...n Adversarial Training: A Game Perspective | 1 +
 ...tematic Errors with Cross-Modal Embeddings | 1 +
 ...ne Learning Using Second-Order Information | 1 +
 ... Stimuli Induced Patterns in M EEG Signals | 1 +
 ...or Doubly Efficient Reinforcement Learning | 1 +
 data/2022/iclr/Dual Lottery Ticket Hypothesis | 1 +
 ...Normalization improves Vision Transformers | 1 +
 ...are Comparison of Learned Reward Functions | 1 +
 ...tion Neural Networks in Contextual Bandits | 1 +
 ...ion Transformers via Token Reorganizations | 0
 ...raining via Extreme Activation Compression | 1 +
 ...catastrophic forgetting in neural networks | 0
 ...cation by Scheduled Grow-and-Prune Methods | 1 +
 ...ch for Combinatorial Optimization Problems | 1 +
 ...-Width Neural Networks that Learn Features | 0
 ...g Policy via Human-AI Copilot Optimization | 1 +
 ...l Discovery without Acyclicity Constraints | 1 +
 ...n Transformers for Representation Learning | 1 +
 ...n for Improved Training of Neural Networks | 1 +
 ...ng for On-Demand and In-Situ Customization | 1 +
 ...mers via Adaptive Fourier Neural Operators | 0
 ...l Prediction with General Function Classes | 1 +
 ...ong Sequences with Structured State Spaces | 1 +
 ...en playing games is better than optimizing | 1 +
 ...c Objectives with Skewed Hessian Spectrums | 1 +
 ... Manipulations with Einstein-like Notation | 1 +
 ...from SGD with Truncated Heavy-tailed Noise | 1 +
 ...arning and explicit probabilistic modeling | 1 +
 .../2022/iclr/Emergent Communication at Scale | 1 +
 ...ation Objectives with Adaptive Tree Search | 1 +
 ...rsity for Fixed-to-Fixed Model Compression | 1 +
 ...ing of Probabilistic Hierarchies on Graphs | 1 +
 ... to Valuation Problems in Machine Learning | 1 +
 ...spired Molecular Conformation Optimization | 1 +
 ...g Cross-lingual Transfer by Manifold Mixup | 1 +
 ...ntQA: Entity Linking as Question Answering | 1 +
 ...ntropy Model for Learned Image Compression | 1 +
 ...nt Predictive Coding for Visual Navigation | 1 +
 ... Graph Mechanics Networks with Constraints | 1 +
 ...ncouraging Equivariance in Representations | 1 +
 .../Equivariant Subgraph Aggregation Networks | 1 +
 ... Neural Network based Molecular Potentials | 1 +
 ...ng for More Powerful Graph Neural Networks | 1 +
 ...ined nonconvex-nonconcave minimax problems | 1 +
 ...with Orthogonal Projected Gradient Descent | 1 +
 ...entanglement of Structured Representations | 1 +
 ...nal Distortion in Neural Language Modeling | 1 +
 ...lanner Amortization for Continuous Control | 1 +
 ...roblems, Pitfalls, and Practical Solutions | 1 +
 data/2022/iclr/Evidential Turing Processes | 1 +
 ...based Selection for Reinforcement Learning | 1 +
 ...e Multi-Task Scaling for Transfer Learning | 1 +
 ...ble GNN-Based Models over Knowledge Graphs | 1 +
 ...earning Interpretable Temporal Logic Rules | 1 +
 ... based on Directional Feature Interactions | 1 +
 ...ctivation Value for Partial-Label Learning | 1 +
 ...oring Memorization in Adversarial Training | 1 +
 ...ompression for pre-trained language models | 1 +
 ...ing the Limits of Large Scale Pre-training | 1 +
 ...d Language Models via Metropolis--Hastings | 1 +
 ...mation Properties of Graph Neural Networks | 1 +
 ...Contextual Complexity and Unpredictability | 1 +
 ...ILDS Benchmark for Unsupervised Adaptation | 1 +
 ...ly Multiplication for Network Quantization | 1 +
 ...tic descriptions, and Conceptual Relations | 1 +
 ...ed Interactive Language-Image Pre-Training | 1 +
 ...tructions in Language with Modular Methods | 1 +
 ...Transformer Advanced by Fully Pre-training | 1 +
 data/2022/iclr/Fair Normalizing Flows | 1 +
 ...Fairness Calibration for Face Verification | 1 +
 ...airness Guarantees under Demographic Shift | 1 +
 ...periments on Conditional Language Modeling | 0
 data/2022/iclr/Fast AdvProp | 1 +
 .../Fast Differentiable Matrix Square Root | 1 +
 ...for Model Interpretability and Compression | 1 +
 data/2022/iclr/Fast Model Editing at Scale | 1 +
 .../Fast Regression for Structured Inputs | 1 +
 ...gical clustering with Wasserstein distance | 1 +
 ...stSHAP: Real-Time Shapley Value Estimation | 1 +
 data/2022/iclr/Feature Kernel Distillation | 1 +
 ...ntation for Federated Image Classification | 1 +
 ...l Communication Cost in Federated Learning | 1 +
 ...Communication-Efficient Federated Learning | 1 +
 ...ata with Class-conditional-sharing Clients | 1 +
 ...Backdoor Attacks on Visual Object Tracking | 1 +
 ...arning via Dirichlet Tessellation Ensemble | 0
 ...Series Imputation by Graph Neural Networks | 1 +
 ...g of Counterfactual Physics in Pixel Space | 1 +
 ...rially Robust Features via Metameric Tasks | 1 +
 ...ter in each of your Deep Generative Models | 1 +
 ...tures and Underperform Out-of-Distribution | 1 +
 ...le Physics: A Yarn-level Model for Fabrics | 1 +
 ...ned Language Models are Zero-Shot Learners | 1 +
 ...Reinforcement Learning with Average Reward | 0
 ...ography: Train the images, not the network | 1 +
 ...volutions With Differentiable Kernel Sizes | 1 +
 ...d: Group Distributional Robustness Follows | 1 +
 .../Fooling Explanations in Text Classifiers | 1 +
 ...itous Forgetting in Connectionist Networks | 1 +
 ...r Invariant and Equivariant Network Design | 1 +
 ... Embedding Learning with Provable Benefits | 1 +
 ...vel Perspective to Optimize Recommendation | 1 +
 ...ing Any GNN with Local Structure Awareness | 1 +
 ...al Training for Simulation-Based Inference | 1 +
 ... Min-Imax Optimization via Anderson Mixing | 1 +
 ...ricks for Subgraph Representation Learning | 1 +
 ...ter? Revisiting GNN for Question Answering | 1 +
 ... Modeling based on Global Contexts via GNN | 1 +
 ... End-to-End Task-Oriented Dialogue Systems | 0
 ... Graph Neural Diffusion with A Source Term | 1 +
 .../Gaussian Mixture Convolution Networks | 1 +
 ... for Experimental Design in Drug Discovery | 1 +
 ...ement Learning through Logical Composition | 1 +
 ...on Through the Lens of Leave-One-Out Error | 1 +
 ...Through the Lens of Adversarial Robustness | 1 +
 ...for Offline Hindsight Information Matching | 1 +
 ...ized Demographic Parity for Group Fairness | 1 +
 data/2022/iclr/Generalized Kernel Thinning | 1 +
 ...ws in Hidden Convex-Concave Games and GANs | 1 +
 ...et covariance models for texture synthesis | 1 +
 ...lizing Few-Shot NAS with Gradient Matching | 1 +
 ...e Implicit Generative Adversarial Networks | 1 +
 ...ative Modeling with Optimal Transport Maps | 1 +
 ...urce for Multiview Representation Learning | 1 +
 ...ated Exploration in Reinforcement Learning | 1 +
 .../Generative Principal Component Analysis | 1 +
 .../iclr/Generative Pseudo-Inverse Memory | 1 +
 ...odel for Molecular Conformation Generation | 1 +
 ...s for Protein Interface Contact Prediction | 1 +
 ...s improve E(3) Equivariant Message Passing | 1 +
 ...entation with Implicit Displacement Fields | 1 +
 ...A Heavy-Neck Paradigm for Object Detection | 1 +
 ...ix Learning in Trainable Embedding Indexes | 1 +
 ... Policy Gradient in Markov Potential Games | 1 +
 ...d Planning via Hindsight Experience Replay | 1 +
 ...Neural Networks using Gradient Information | 1 +
 ...rmance Inference with Theoretical Insights | 1 +
 ...tance Learning for Incomplete Observations | 1 +
 ...mization by Back-propagating through Model | 1 +
 ...radient Matching for Domain Generalization | 1 +
 ...Step Denoiser for convergent Plug-and-Play | 1 +
 ...fies genomic loci regulating transcription | 1 +
 ...ia Neighborhood Wasserstein Reconstruction | 1 +
 ...aph Condensation for Graph Neural Networks | 1 +
 ...arch for the Traveling Salesperson Problem | 1 +
 ... Structural and Positional Representations | 1 +
 ... Anomaly Detection of Multiple Time Series | 1 +
 ...regularly Sampled Multivariate Time Series | 1 +
 .../iclr/Graph-Relational Domain Adaptation | 1 +
 ...arest Neighbor Search in Hyperbolic Spaces | 0
 ...ching Old MLPs New Tricks Via Distillation | 1 +
 ...s for Class-Imbalanced Node Classification | 0
 ...Testing of Networks: Algorithms and Theory | 1 +
 ...: Graph REASoning Enhanced Language Models | 1 +
 ...up equivariant neural posterior estimation | 1 +
 ...e Parallelism for Large-scale DNN Training | 1 +
 ...-Training and Prompting of Language Models | 1 +
 ...verse Gradients for Physical Deep Learning | 1 +
 ...hifts on Graphs: An Invariance Perspective | 1 +
 ...ncoder For Irregularly Sampled Time Series | 1 +
 ...nerative Models with Closed-Form Solutions | 1 +
 ...ace Models For Changing Dynamics Scenarios | 1 +
 ...hot Imitation with Skill Transition Models | 1 +
 ...emory for Few-shot Learning Across Domains | 1 +
 ...Nonconvex Algorithms with AdaGrad Stepsize | 1 +
 ...ounds with Fast Rates for Minimax Problems | 1 +
 ...Relabeling for Meta-Reinforcement Learning | 1 +
 ...aging Past Traversals to Aid 3D Perception | 1 +
 ...rievers for improved open-ended generation | 1 +
 ...ree Compatible Training in Image Retrieval | 0
 ...ow Attentive are Graph Attention Networks? | 1 +
 ...ntly Assessing Machine Learning API Shifts | 1 +
 .../iclr/How Do Vision Transformers Work? | 1 +
 ... with Self-supervised Contrastive Learning | 1 +
 ...Memory for Error in Low-Precision Training | 1 +
 ...an CLIP Benefit Vision-and-Language Tasks? | 1 +
 ... Pre-Training Perform with Streaming Data? | 1 +
 ...eep networks: a loss landscape perspective | 1 +
 ...Consistency: Logit Anchoring on Clean Data | 1 +
 ...s? A Zeroth-Order Optimization Perspective | 1 +
 ...r MAML to Excel in Few-Shot Classification | 1 +
 ... missing data in supervised deep learning? | 1 +
 ...g? A one-hidden-layer theoretical analysis | 1 +
 ...ls for Non-stationary Time Series Analysis | 1 +
 ... Learning via Hybrid Action Representation | 1 +
 ...Learning with Heterogeneous Communications | 1 +
 ...rence at the Discrete-Continuous Interface | 1 +
 data/2022/iclr/Hybrid Random Features | 1 +
 ...ion Method for Deep Reinforcement Learning | 1 +
 ...ter Tuning with Renyi Differential Privacy | 1 +
 ...nctional Relationships in 3D Indoor Scenes | 1 +
 ...U: Efficient GCN Training via Lazy Updates | 1 +
 ... Approach to Out-of-Distribution Detection | 1 +
 .../iclr/Illiterate DALL-E Learns to Compose | 1 +
 ...ge BERT Pre-training with Online Tokenizer | 0
 data/2022/iclr/Imbedding Deep Neural Networks | 1 +
 ...itation Learning by Reinforcement Learning | 1 +
 ...ervations under Transition Model Disparity | 1 +
 ...ersarial Training for Deep Neural Networks | 0
 ...tion in Underparameterized Neural Networks | 1 +
 ...covery of Subspaces of Unknown Codimension | 1 +
 ...ic l2 robustness on CIFAR-10 and CIFAR-100 | 1 +
 ... Recognition via Privacy-Agnostic Clusters | 1 +
 ...tion with Annealed and Energy-Based Bounds | 1 +
 ...ve Translation Models Without Distillation | 1 +
 ...ample Weights for Imbalance Classification | 1 +
 ...oals for Following Temporal Specifications | 1 +
 ...l Extraction with Calibrated Proof of Work | 1 +
 ...egative Detection for Contrastive Learning | 1 +
 ...odels for End-to-End Rigid Protein Docking | 1 +
 ...ediction Using Analogy Subgraph Embeddings | 1 +
 ...AN: Towards Infinite-Pixel Image Synthesis | 1 +
 ...ct Analysis of (Quantized) Neural Networks | 1 +
 ... to Graph Active Learning with Soft Labels | 1 +
 ...rough Empowerment in Visual Model-based RL | 1 +
 ...ne Memory Selection for Continual Learning | 1 +
 ...atless Compression of Stochastic Gradients | 1 +
 ...tour Stochastic Gradient Langevin Dynamics | 1 +
 ...d Diversity Denoising and Artefact Removal | 1 +
 ...ing for Out-of-Distribution Generalization | 1 +
 ...ng Non-Stationary and Reactionary Policies | 1 +
 ...sing Subgroup Gaps in Deep Metric Learning | 1 +
 ... in RL? A Case Study in Continuous Control | 1 +
 ...ily a Necessity for Graph Neural Networks? | 1 +
 ...compatible with Interpolating Classifiers? | 1 +
 ...f Play for Automatic Curriculum Generation | 1 +
 ...o to Tango: Mixup for Deep Metric Learning | 1 +
 ...rative and Byzantine Decentralized Teaming | 1 +
 ... for Antibody Sequence-Structure Co-design | 1 +
 ...ues: a measure of joint feature importance | 1 +
 data/2022/iclr/KL Guided Domain Adaptation | 0
 ...l Control Policies Through Robot-Awareness | 1 +
 ...ction Relations for Reinforcement Learning | 1 +
 data/2022/iclr/Knowledge Infused Decoding | 1 +
 ...moval in Sampling-based Bayesian Inference | 1 +
 .../L0-Sparse Canonical Correlation Analysis | 0
 ...uage Learning Based on Prompt Tuning of T5 | 1 +
 ...eration Selection for Multi-Agent Learning | 1 +
 ...ure in Neural Rough Differential Equations | 1 +
 .../Label Encoding for Regression Networks | 1 +
 ...and Protection in Two-party Split Learning | 1 +
 ...emantic Segmentation with Diffusion Models | 1 +
 ...ssion with weighted low-rank factorization | 1 +
 ...Language modeling via stochastic processes | 1 +
 ...aluation based on semantic representations | 1 +
 .../Language-driven Semantic Segmentation | 1 +
 ... Be Strong Differentially Private Learners | 1 +
 ...ogeneity: Convergence and Balancing Effect | 1 +
 ...ation Learning on Graphs via Bootstrapping | 1 +
 ...Animate Images via Latent Space Navigation | 1 +
 ...rs for Joint Multi-Agent Motion Prediction | 1 +
 ...gorithm for Training Graph Neural Networks | 1 +
 ...ugh Adversarial Invertible Transformations | 1 +
 ...input via mixed and anisotropic smoothness | 1 +
 .../iclr/Learned Simulators for Turbulence | 0
 ...hirality with Invariance to Bond Rotations | 1 +
 ...orcement Learning without External Rewards | 1 +
 ...on by Masked Multimodal Cluster Prediction | 1 +
 ...oment Restrictions by Importance Weighting | 1 +
 ... Environment Fields via Implicit Functions | 1 +
 ...gression with Power-Law Priors and Targets | 1 +
 ...ning Curves for SGD on Structured Features | 1 +
 ...Encoder using Natural Evolution Strategies | 1 +
 ...rative Models: A Contrastive Learning View | 1 +
 ...Models at Scale via Composite Optimization | 1 +
 ...Networks via Structure-Regularized Pruning | 0
 ...Bin Packing on Packing Configuration Trees | 1 +
 ... by Differentiating Through Sample Quality | 1 +
 ...hod based on Complementary Learning System | 1 +
 ...arning Features with Parameter-Free Layers | 1 +
 ...ve Meta-learner of Behavioral Similarities | 1 +
 ...ield Games and Approximate Nash Equilibria | 1 +
 ...nal Networks on the Stochastic Block Model | 1 +
 ...ith Differentiable Nondeterministic Stacks | 1 +
 ...bution via Randomized Return Decomposition | 1 +
 ...Multimodal VAEs through Mutual Supervision | 1 +
 ...ntextual Bandits through Perturbed Rewards | 1 +
 ...t-Oriented Dynamics for Planning from Text | 1 +
 .../Learning Optimal Conformal Classifiers | 1 +
 ...nted Set Representations for Meta-Learning | 1 +
 ... One-Shot, Any-Sparsity, And No Retraining | 1 +
 ... Fisher Kernel with Low-rank Approximation | 1 +
 ...ving Two-stage Stochastic Integer Programs | 1 +
 ...ns via Retracing in Reinforcement Learning | 1 +
 ...g Strides in Convolutional Neural Networks | 1 +
 ...earning Super-Features for Image Retrieval | 1 +
 ...Reward Networks for Reinforcement Learning | 1 +
 ...atent Processes from General Temporal Data | 1 +
 .../iclr/Learning Towards The Largest Margins | 1 +
 ...Object Localization with Policy Adaptation | 1 +
 ...ions from Undirected State-only Experience | 1 +
 ...Architectures by Propagating Network Codes | 1 +
 ...n End-to-End with Cross-Modal Transformers | 1 +
 ...kly-supervised Contrastive Representations | 1 +
 ...nline adaptation in Reinforcement Learning | 1 +
 .../Learning by Directional Gradient Descent | 1 +
 ...ks: Self-knowledge transfer and forgetting | 1 +
 .../iclr/Learning meta-features for AutoML | 1 +
 ...more skills through optimistic exploration | 1 +
 ... Observations with Finite Element Networks | 1 +
 ...e Part Segmentation with Gradient Matching | 1 +
 .../Learning to Complete Code with Sketches | 1 +
 ...earning to Dequantise with Truncated Flows | 1 +
 ...gmentation of Ultra-High Resolution Images | 1 +
 ...Molecular Scaffolds with Structural Motifs | 1 +
 ...lize across Domains on Single Test Samples | 1 +
 ...be Guided in the Architect-Builder Problem | 1 +
 ...to Map for Active Semantic Goal Navigation | 1 +
 ...ng Memory Networks for Traffic Forecasting | 1 +
 ...e Learning rate with Graph Neural Networks | 1 +
 ... with hierarchical latent mixture policies | 1 +
 ...A Study Using Real-World Human Annotations | 1 +
 .../Learning-Augmented $k$-means Clustering | 1 +
 ...it Tests for Unsupervised Code Translation | 1 +
 ...to predict out-of-distribution performance | 1 +
 ...Bridge using Forward-Backward SDEs Theory" | 1 +
 ... and Natural Languages via Corpus Transfer | 1 +
 ...z-constrained Unsupervised Skill Discovery | 1 +
 ...w-Rank Adaptation of Large Language Models | 1 +
 ...r Generalization in Reinforcement Learning | 1 +
 ...ng Expressive Memory for Sequence Modeling | 1 +
 ...iences For Class task Incremental Learning | 1 +
 ...ss Compression with Probabilistic Circuits | 1 +
 ...t as Entropy Constrained Optimal Transport | 1 +
 ... Distance: An Integer Programming Approach | 1 +
 ...oisy Contrastive Learner in Classification | 1 +
 ...el with Neural Transport Latent Space MCMC | 1 +
 ...ical Performance via Hierarchical Modeling | 1 +
 ... Multi-Task Multitrack Music Transcription | 1 +
 ...ative Network Manifolds Without Retraining | 1 +
 ... Neural Scaling Law and Minimax Optimality | 1 +
 ...fficient exploration in novel environments | 1 +
 ...guage Models to Grounded Conceptual Spaces | 1 +
 ... adaptation under generalized target shift | 1 +
 ...oved Data-Augmented Reinforcement Learning | 1 +
 ...e Diversity in Deep Reinforcement Learning | 1 +
 ... (Provably) Solves Some Robust RL Problems | 1 +
 ...aximum n-times Coverage for Vaccine Design | 1 +
 ...ack-box Testing of Visual Reasoning Models | 1 +
 ...esentations via Quantized Reversed Probing | 1 +
 data/2022/iclr/Memorizing Transformers | 1 +
 ...ory Augmented Optimizers for Deep Learning | 1 +
 ...th Data Compression for Continual Learning | 1 +
 ...nsformers through entity mention attention | 1 +
 .../iclr/Message Passing Neural PDE Solvers | 1 +
 ...over Novel Classes given Very Limited Data | 1 +
 ...for Energy Based Deterministic Uncertainty | 1 +
 ... Learning by Watching Video Demonstrations | 1 +
 ...ith Fewer Tasks through Task Interpolation | 1 +
 ...ng Universal Controllers with Transformers | 1 +
 ...Distribution Shifts and Training Conflicts | 1 +
 ...tation for Generative Adversarial Networks | 1 +
 ...fling: Tight Convergence Bounds and Beyond | 1 +
 ...esn't Imply Distribution Learning for GANs | 1 +
 ...zation with Smooth Algorithmic Adversaries | 1 +
 .../iclr/Mirror Descent Policy Optimization | 1 +
 .../iclr/Missingness Bias in Model Debugging | 1 +
 .../MoReL: Multi-omics Relational Learning | 1 +
 ...se, and Mobile-friendly Vision Transformer | 1 +
 ...pretability for Multiple Instance Learning | 1 +
 ...o: A Growing Brain That Learns Continually | 1 +
 ...Reinforcement Learning with Regularization | 1 +
 ...el-augmented Prioritized Experience Replay | 1 +
 ...-label Classification using Box Embeddings | 1 +
 ...nforcement Learning via Neural Composition | 1 +
 ...Features for Monocular 3D Object Detection | 1 +
 .../Monotonic Differentiable Sorting Networks | 1 +
 .../iclr/Multi-Agent MDP Homomorphic Networks | 1 +
 ...ng: Teaching RL Policies to Act with Style | 1 +
 ...-Mode Deep Matrix and Tensor Factorization | 1 +
 ...ol for Strategic Exploration in Text Games | 1 +
 data/2022/iclr/Multi-Task Processes | 1 +
 ...e Optimization by Learning Space Partition | 1 +
 .../iclr/Multimeasurement Generative Models | 1 +
 ... with Approximate Implicit Differentiation | 1 +
 ...ning Enables Zero-Shot Task Generalization | 1 +
 ... NAS Evaluation is (Now) Surprisingly Easy | 1 +
 ...ural Architecture Search at Initialization | 1 +
 ...ction of Automated Machine Learning Models | 1 +
 ... Gradient Conflict aware Supernet Training | 1 +
 ...tive Model for Interpretable Deep Learning | 1 +
 ...guage Descriptions of Deep Visual Features | 1 +
 ...ainty for Exponential Family Distributions | 1 +
 ...or Linear Mixture MDPs with Plug-in Solver | 1 +
 ...raging Variance Information with Pessimism | 1 +
 ...etwork Augmentation for Tiny Deep Learning | 1 +
 ...Noise via Parameter Attack During Training | 0
 .../iclr/NeuPL: Neural Population Learning | 1 +
 ...ximity to and Dynamics on the Central Path | 1 +
 ...eep Representation and Shallow Exploration | 1 +
 .../2022/iclr/Neural Deep Equilibrium Solvers | 1 +
 .../Neural Link Prediction with Walk Pooling | 1 +
 ...stic Optimization for Continuous-Time Data | 1 +
 ...or Logical Reasoning over Knowledge Graphs | 1 +
 ...Space Invariance in Combinatorial Problems | 1 +
 ...n Hausdorff distance of Tropical Zonotopes | 1 +
 ...rnel Learners: The Silent Alignment Effect | 1 +
 .../iclr/Neural Parameter Allocation Search | 1 +
 ...ying more attention to the context dataset | 1 +
 .../iclr/Neural Program Synthesis with Query | 1 +
 ...l Inference with Node-Specific Information | 1 +
 ...ast and Accurate Numerical Optimal Control | 1 +
 .../Neural Spectral Marked Point Processes | 1 +
 ...Neural Stochastic Dual Dynamic Programming | 1 +
 ...ediction for Inductive Node Classification | 1 +
 .../iclr/Neural Variational Dropout Processes | 1 +
 ...ime: consistency guarantees and algorithms | 1 +
 ...tation Change in Online Continual Learning | 1 +
 ...: Overlapping Features of Training Methods | 1 +
 ...Rate for Training Large Transformer Models | 1 +
 ...rvised Multi-scale Neighborhood Prediction | 1 +
 ... Representations of Large Knowledge Graphs | 1 +
 data/2022/iclr/Noisy Feature Mixup | 1 +
 ... Approximations for Initial Value Problems | 1 +
 ...le Transfer with Self-Parallel Supervision | 1 +
 ...rification and Applicability Authorization | 1 +
 ...CA Using Volume-Preserving Transformations | 1 +
 ...age Embeddings for Cross-Lingual Alignment | 1 +
 ...for Scene Decomposition and Representation | 1 +
 ...jects via Discriminative Weight Generation | 1 +
 data/2022/iclr/Objects in Semantic Topology | 1 +
 ...Pessimism, Optimization and Generalization | 1 +
 ...orcement Learning with Implicit Q-Learning | 1 +
 ... Learning with Value-based Episodic Memory | 1 +
 .../iclr/Omni-Dimensional Dynamic Convolution | 1 +
 ...nfiguration for time series classification | 1 +
 ...ederated Learning for Image Classification | 0
 ...rs in Imitation and Reinforcement Learning | 1 +
 ...ive Optimization with Gradient Compression | 1 +
 ...uation Metrics for Graph Generative Models | 1 +
 ...ial Transferability of Vision Transformers | 1 +
 ...n Incorporating Inductive Biases into VAEs | 1 +
 ...esentations in Deep Reinforcement Learning | 1 +
 ...Missing Labels in Semi-Supervised Learning | 1 +
 .../On Predicting Generalization using GANs | 1 +
 ...y in Cell-based Neural Architecture Search | 1 +
 ...bust Prefix-Tuning for Text Classification | 1 +
 ...etworks with global convergence guarantees | 1 +
 ... Robustness for Ensemble Models and Beyond | 1 +
 ...tention and Dynamic Depth-wise Convolution | 1 +
 ...t Training with Interval Bound Propagation | 1 +
 ...GD and AdaGrad for Stochastic Optimization | 1 +
 ...tarts Algorithm for Reinforcement Learning | 1 +
 ...the Existence of Universal Lottery Tickets | 1 +
 ...ormation-Theoretic Bounds and Implications | 1 +
 ...alibration in Membership Inference Attacks | 1 +
 ... Bias Reduction in Few-Shot Classification | 1 +
 ... Learning and Learnability of Quasimetrics | 1 +
 .../On the Limitations of Multimodal VAEs | 1 +
 ...Memorization Power of ReLU Neural Networks | 1 +
 ...zing Individual Neurons in Language Models | 1 +
 ...imation with Probabilistic Neural Networks | 1 +
 ...le of Neural Collapse in Transfer Learning | 1 +
 ... Functions in Energy-Based Sequence Models | 1 +
 ...of recurrent encoder-decoder architectures | 1 +
 ... estimation for Regression and Forecasting | 1 +
 ...tistical learning and perceptual distances | 1 +
 ...on heterogeneity in emergent communication | 1 +
 ...icy Model Errors in Reinforcement Learning | 1 +
 ...ng Incremental Skills for a Changing World | 1 +
 ...d Hoc Teamwork under Partial Observability | 0
 data/2022/iclr/Online Adversarial Attacks | 1 +
 ... Task Configuration with Anytime Inference | 1 +
 ...ion for Rehearsal-based Continual Learning | 1 +
 .../Online Facility Location with Predictions | 1 +
 ...a-Learning with Hypergradient Distillation | 1 +
 ...finding the Optimal Policy for Linear MDPs | 1 +
 ...n Pretraining With Gene Ontology Embedding | 1 +
 ...Good Closed-Set Classifier is All You Need | 1 +
 .../iclr/Open-World Semi-Supervised Learning | 1 +
 ...Vision and Language Knowledge Distillation | 1 +
 ... Ultra-low-latency Spiking Neural Networks | 1 +
 ...ptimal Representations for Covariate Shift | 1 +
 .../Optimal Transport for Causal Discovery | 1 +
 ...led Recognition with Learnable Cost Matrix | 1 +
 ...eralization of Three layer Neural Networks | 0
 ...n inspired Multi-Branch Equilibrium Models | 1 +
 data/2022/iclr/Optimizer Amalgamation | 1 +
 ... Networks with Gradient Lexicase Selection | 1 +
 ...d Value Mapping for Reinforcement Learning | 1 +
 ... of Nuisance-Induced Spurious Correlations | 1 +
 ...pectral Bias of Neural Value Approximation | 1 +
 ... from Language Models with Diverse Prompts | 1 +
 .../PAC Prediction Sets Under Covariate Shift | 1 +
 .../iclr/PAC-Bayes Information Bottleneck | 1 +
 ...gs and Adversarial Reconstruction Learning | 1 +
 ...phatic Temporal Difference Learning Method | 1 +
 ...imation of universal graph representations | 1 +
 ...ction Intervals from Three Neural Networks | 1 +
 ...licy Learning with Adaptive Decision Trees | 1 +
 ...f Attention GANs for Synthetic Time Series | 1 +
 ...ith a Multi-Grid Solver for Long Sequences | 1 +
 data/2022/iclr/Pareto Policy Adaptation | 1 +
 ...Model-based Offline Reinforcement Learning | 1 +
 ...Multi-Objective Combinatorial Optimization | 1 +
 ...twork for Non-rigid Point Set Registration | 1 +
 ...for mean field neural network optimization | 1 +
 ... Robust Against Adversarial Perturbations? | 1 +
 ...iliary Proposal for MCMC in Discrete Space | 1 +
 ...A Stochastic Control Approach For Sampling | 1 +
 ...al Network, and How to Find It Efficiently | 1 +
 ...chitecture for Structured Inputs & Outputs | 1 +
 ... Faster Distributed Nonconvex Optimization | 1 +
 .../Permutation-Based SGD: Is Random Optimal? | 1 +
 ...inty-Driven Offline Reinforcement Learning | 1 +
 ...nforcement Learning under Partial Coverage | 1 +
 .../iclr/Phase Collapse in Neural Networks | 1 +
 ...le Descent in Finite-Width Neural Networks | 1 +
 ... Disambiguation for Partial Label Learning | 1 +
 ...works with Pipelined Feature Communication | 1 +
 ...ge Modeling Framework for Object Detection | 1 +
 ... Sparse training for Neural Network Models | 1 +
 ...ochastic Environments with a Learned Model | 1 +
 ...'n' Seek: Can You Find the Winning Ticket? | 1 +
 ...r Efficient Token Mixing in Long Sequences | 1 +
 ...oning and Backdooring Contrastive Learning | 1 +
 .../Policy Gradients Incorporating the Future | 1 +
 ...for Provably Robust Reinforcement Learning | 1 +
 ...Policy improvement by planning with Gumbel | 1 +
 ...rspective of Classification Loss Functions | 1 +
 ...earning And Using Hierarchical Affordances | 1 +
 ...for Detecting Unknown Spurious Correlation | 1 +
 ...s for Two-Class and Multi-Attack Scenarios | 1 +
 ...rocess Via Tractable Dependent Predictions | 1 +
 ...tegration via Separable Bijective Networks | 1 +
 ...ular Graph Representation with 3D Geometry | 1 +
 ...Mesh-reduced Space with Temporal Attention | 1 +
 ...in Continual Learning: A Comparative Study | 1 +
 ...rial Mixture of Training Signal Generators | 1 +
 ... Models with Data-Dependent Adaptive Prior | 1 +
 .../iclr/Privacy Implications of Shuffling | 1 +
 .../Probabilistic Implicit Scene Completion | 1 +
 ...planning with self-supervised world models | 1 +
 ...tic Reinforcement Learning without Oracles | 1 +
 ...tion for Fast Sampling of Diffusion Models | 1 +
 ...Deep Unsupervised RGB-D Saliency Detection | 1 +
 ...g for Theorem Proving with Language Models | 1 +
 ...ve on identifiable representation learning | 1 +
 ...hts at Initialization using Meta-Gradients | 1 +
 ...e Authoring via Learned Inverse Kinematics | 1 +
 ...n mechanisms for few shot image generation | 1 +
 ...Prototypical Contrastive Predictive Coding | 0
 ...ltiway Domains via Representation Learning | 1 +
 ...arning-based Algorithm For Sparse Recovery | 1 +
 ...stractors using Multistep Inverse Dynamics | 0
 .../iclr/Provably Robust Adversarial Examples | 1 +
 ...s for mean-field two-player zero-sum games | 1 +
 ...pothesis for Convolutional Neural Networks | 0
 ... Methods for Diffusion Models on Manifolds | 1 +
 ... for Semi-Supervised Keypoint Localization | 1 +
 ...Range Time Series Modeling and Forecasting | 1 +
 ...tremely Low-bit Post-Training Quantization | 1 +
 ...Quadtree Attention for Vision Transformers | 1 +
 ... Units via Topological Entropy Calculation | 1 +
 ...cks Against Black-Box Deep Learning Models | 1 +
 ...dding on Hyper-Relational Knowledge Graphs | 1 +
 ...Objects for Long-Range Distance Estimation | 1 +
 ...nforced and Recurrent Relational Reasoning | 1 +
 ...ring for Cross-Domain Parameter Estimation | 1 +
 ...y random features with no performance loss | 1 +
 .../iclr/Real-Time Neural Voice Camouflage | 1 +
 .../iclr/Recursive Disentanglement Network | 1 +
 ...Learning: Are Gradient Subspaces Low-Rank? | 1 +
 ...a Better Accuracy vs. Robustness Trade-off | 0
 ...to-Local Attention for Vision Transformers | 1 +
 ...ders for Isometric Representation Learning | 0
 ...ce of Discrete Markovian Context Evolution | 1 +
 ...te Representation Model: Method and Theory | 0
 ... using Guidance from Offline Demonstration | 1 +
 ...ransformer for Visual Relational Reasoning | 1 +
 ...presentations of the hippocampal formation | 1 +
 ...Relational Learning with Variational Bayes | 1 +
 ... Modeling Relations between Data and Tasks | 1 +
 .../iclr/Relational Surrogate Loss Learning | 1 +
 ...p Inference Attacks without Losing Utility | 1 +
 ...rial Distillation with Unreliable Teachers | 1 +
 ...for Online and Offline RL in Low-rank MDPs | 1 +
 .../iclr/Representation-Agnostic Shape Fields | 1 +
 ...inuity for Unsupervised Continual Learning | 1 +
 ...beddings with Mixtures of Topic Embeddings | 1 +
 ...Biases via Influence-based Data Relabeling | 1 +
 ... Can Drive Divergence of SGD with Momentum | 1 +
 ...ative Models Using Scalable Fingerprinting | 1 +
 ...ility from a Data Distribution Perspective | 1 +
 ...Estimation for Positive-Unlabeled Learning | 1 +
 ... Learning and Its Connection to Offline RL | 1 +
 ...int Cloud: A Simple Residual MLP Framework | 1 +
 ...raining for Better Downstream Transferring | 1 +
 ...sentation as a Token-Level Bipartite Graph | 1 +
 ...erceptible Adversarial Image Perturbations | 1 +
 ...ies Forecasting against Distribution Shift | 1 +
 ...ith Lottery Regulated Grouped Convolutions | 1 +
 ...Offline Model Based Reinforcement Learning | 1 +
 ...hing in BERT from the Perspective of Graph | 1 +
 ...e models for Out-of-distribution detection | 0
 ...in Preference-based Reinforcement Learning | 1 +
 ...in Federated Learning with Modified Models | 1 +
 ...tributions Improve Adversarial Robustness? | 1 +
 ... Data Privacy Against Adversarial Learning | 0
 ...ble SDE Learning: A Functional Perspective | 1 +
 ...dient Homogenization in Multitask Learning | 1 +
 ...al for Offline RL via Supervised Learning? | 1 +
 ...ing with Stochastic Differential Equations | 1 +
 .../iclr/SGD Can Converge to Local Maxima | 0
 ... bi-level optimization and implicit models | 1 +
 ...lations by Second-Order Structured Pruning | 1 +
 ...sentation Learning for Speech Pre-Training | 1 +
 ...ization via Diagonal Hessian Approximation | 1 +
 ...ta-Features for Neural Architecture Search | 1 +
 ...nt Preference-based Reinforcement Learning | 1 +
 ...ing with Differentiable Symbolic Execution | 1 +
 ...scover spurious features in Deep Learning? | 1 +
 ...cement Learning via Uncertainty Estimation | 1 +
 ...radient Algorithm for Zero-Sum Markov Game | 1 +
 ...y of Losses for Learning with Noisy Labels | 1 +
 ...edistribution for Efficient Face Detection | 1 +
 .../Sampling with Mirrored Stein Operators | 1 +
 ...yperparameters by Implicit Differentiation | 1 +
 ...Nonsymmetric Determinantal Point Processes | 1 +
 ...om Pretraining and Finetuning Transformers | 0
 ...tures of Neural Network Gaussian Processes | 1 +
 ...caling Laws for Neural Machine Translation | 1 +
 ...e Learning using Random Feature Corruption | 1 +
 ...nd Rotationally Equivariant Spherical CNNs | 1 +
 ...ing future trajectories of multiple agents | 0
 ... with Critically-Damped Langevin Diffusion | 1 +
 ...ctive Ensembles for Consistent Predictions | 1 +
 data/2022/iclr/Self-Joint Supervised Learning | 1 +
 ...d Electroencephalographic Seizure Analysis | 1 +
 ...Supervised Inference in State-Space Models | 1 +
 ...ed Feature Selection with Correlated Gates | 1 +
 ...versarial Training for Improved Robustness | 1 +
 ...arning is More Robust to Dataset Imbalance | 1 +
 ...tein divergence and applications on graphs | 0
 ... Learning: Theory and Optimization Methods | 1 +
 ...adient Alignment for Multilingual Learning | 1 +
 ...Optimal Approximators of Korobov Functions | 1 +
 ...nforcement Learning or Behavioral Cloning? | 1 +
 ... End-task Aware Training as an Alternative | 1 +
 ...fle Private Stochastic Convex Optimization | 1 +
 .../Signing the Supermask: Keep, Hide, Invert | 1 +
 ...ge Model Pretraining with Weak Supervision | 1 +
 ...D Molecular Property Prediction and Beyond | 0
 ...l sketch representation in continuous time | 1 +
 .../Skill-based Meta-Reinforcement Learning | 1 +
 ...Imaging with Score-Based Generative Models | 1 +
 .../Sound Adversarial Audio-Visual Navigation | 1 +
 ...ir with Minimality and Locality Guarantees | 1 +
 ...nt Shift via Bottom-Up Feature Restoration | 1 +
 .../iclr/Space-Time Graph Neural Networks | 1 +
 ...
Tree-based Graph Generation for Molecules | 1 + .../Sparse Attention with Learning to Hash | 1 + ...arse Communication via Mixed Distributions | 1 + ...d Object Detection with Learnable Sparsity | 1 + ...eneralization from More Efficient Training | 1 + ...driven Policy for Antiviral Drug Discovery | 1 + ... is All You Need for Deep Face Recognition | 1 + ...al Message Passing for 3D Molecular Graphs | 1 + ...ast and accurate recurrent neural networks | 1 + ...ccuracy with Spurious Attribute Estimation | 1 + ...mension Dependence of Langevin Monte Carlo | 1 + ...ation for Discrete Representation Learning | 1 + ... Operators for Equivariant Neural Networks | 1 + ...zation for Generative Adversarial Networks | 1 + ...Denoising Autoencoders for Text Generation | 1 + ...l network for learning Hamiltonian systems | 1 + ...aining is Not Necessary for Generalization | 1 + .../iclr/Strength of Minibatch Noise in SGD | 1 + ...ogeneous Multi-Task Reinforcement Learning | 1 + ...nd Applications of Aligned StyleGAN Models | 1 + ...erator for High-resolution Image Synthesis | 1 + ...rs for Few-Shot Class Incremental Learning | 1 + ...Model For Learning Fine-Grained Embeddings | 1 + ...stive Language-Image Pre-training Paradigm | 1 + ...rogeneous disease-related imaging patterns | 1 + ...mization Improves Sharpness-Aware Training | 1 + ...ed Search Spaces of Tabular NAS Benchmarks | 1 + ...g for Cross-Domain Few-Shot Classification | 1 + ...: Towards Interpretability and Scalability | 1 + ...eneration from Pre-trained Language Models | 1 + ...al Network for Time Series Signal Analysis | 1 + ...ional Networks for Time-Series Forecasting | 1 + ...raining via Learning a Neural SQL Executor | 1 + ...ptive Convolutions for Video Understanding | 1 + ...p Output with learned Multi-Agent Sampling | 1 + ...herence from dynamic point cloud sequences | 1 + ...al Imitation Learning with Suboptimal Data | 1 + ...Gradient Projection for Continual Learning | 1 + ...ing Trilemma with Denoising 
Diffusion GANs | 1 + ...ivated Transformer with Stochastic Experts | 1 + ...tation for Sequence to Sequence Generation | 1 + ...um Bipartite Matching in Few-Shot Learning | 1 + ...ed Generalization Bounds for Meta Learning | 1 + .../iclr/Task-Induced Representation Learning | 1 + ...rning and Few-Shot Sequence Classification | 1 + ...g Neural Network via Gradient Re-weighting | 1 + ...r Systematic Suboptimality in Human Models | 1 + ...een Contrastive Learning and Meta-Learning | 1 + ... Extreme Points of the Dual Convex Program | 1 + ...ty of Encoders in Variational Autoencoders | 1 + ...: Mapping and Mitigating Misaligned Models | 1 + data/2022/iclr/The Efficiency Misnomer | 1 + ...lution of Uncertainty of Learning in Games | 1 + ...cy Optimization in Infinite-Horizon POMDPs | 1 + ...xact Characterization of Optimal Solutions | 1 + ...ing: Rethinking Pretraining Example Design | 1 + ...try of Unsupervised Reinforcement Learning | 1 + ...BERT Reproductions for Robustness Analysis | 1 + ...formers Improves Systematic Generalization | 1 + ...sparate Impact of Semi-Supervised Learning | 1 + ...inear Mode Connectivity of Neural Networks | 1 + ...ns for the OOD Generalization of RL Agents | 1 + ...pectral Bias of Polynomial Neural Networks | 1 + ...ynamics in High-dimensional Kernel Methods | 1 + ...Uncanny Similarity of Recurrence and Depth | 1 + ...he Most Naive Baseline for Sparse Training | 1 + ...roximation Bounds for ReLU Neural Networks | 1 + ...cation and Cooperation with Theory of Mind | 1 + ...d Graph Generation without Exchangeability | 1 + ...ration and multiclass-to-binary reductions | 1 + data/2022/iclr/Topological Experience Replay | 1 + .../iclr/Topological Graph Neural Networks | 1 + .../Topologically Regularized Data Embeddings | 1 + ...t Optimization and Hysteresis Quantization | 1 + ...types in a Nearest Neighbor-friendly Space | 0 ...Histology Images with Contrastive Learning | 1 + ...d Representation Disentanglement Framework | 1 + ...nual 
Knowledge Learning of Language Models | 1 + ...rks: A GNTK-based Optimization Perspective | 1 + ...ement Learning: Lower Bound and Optimality | 1 + ...ich Bounds on the Rate-Distortion Function | 1 + ...of Neural Networks Learned by Transduction | 1 + ...ion Approximation in Zero-Sum Markov Games | 1 + ...ated Learning Using Knowledge Distillation | 1 + ...aph Neural Networks for Atomic Simulations | 1 + ...ation via Decomposing Excess Risk Dynamics | 1 + ...he Data Dependency of Mixup-style Training | 1 + ...Against Evasion Attack on Categorical Data | 1 + ...w of Parameter-Efficient Transfer Learning | 1 + ... and detecting harmful distribution shifts | 1 + ... Biases Enables Input Length Extrapolation | 1 + ...e Reconstruction via Bi-level Optimization | 1 + ...fold Identification and Variance Reduction | 1 + ...ia Distribution Matching for Complex Tasks | 1 + ...ow-rank phenomenon: beyond linear networks | 1 + ...ing through self- and mutual-distillations | 1 + ...ture Spaces via Model-Based Regularization | 1 + ...arial Attack based on Integrated Gradients | 1 + ...-Control Policy for Efficient Agent Design | 1 + ...larly Spaced Events and Their Participants | 1 + .../iclr/Transformer-based Transform Coding | 1 + .../Transformers Can Do Bayesian Inference | 1 + ...merging Property of Assembling Weak Models | 1 + ...Counting with Predictions in Graph Streams | 1 + ...h a Topological Prior for Trojan Detection | 1 + ...model differences (on ImageNet and beyond) | 1 + ...tion in Multi-Agent Reinforcement Learning | 1 + ... 
for Improved Generalization or Efficiency | 1 + ...ing for Out-of-Distribution Generalization | 1 + ...se in Contrastive Self-supervised Learning | 1 + ...ain Randomization for Sim-to-real Transfer | 1 + ...trinsic Robustness Using Label Uncertainty | 1 + ...upervision: An Identifiability Perspective | 1 + ...ection Attack by Promoting Unnoticeability | 1 + ...meterization in Recursive Value Estimation | 1 + ...ng Capacity Loss in Reinforcement Learning | 1 + ...d dictionary learning for pattern recovery | 1 + ...ng and bottlenecks on graphs via curvature | 1 + ...Attention for Efficient Speech Recognition | 1 + ...riance Collapse of SVGD in High Dimensions | 1 + ...t Spatial-Temporal Representation Learning | 1 + .../Unified Visual Transformer Compression | 1 + ...nce with Black-box Optimization and Beyond | 1 + ... Constraints is Possible with Transformers | 1 + .../2022/iclr/Universalizing Weak Supervision | 1 + ...-Learning via The Adaptation Learning Rate | 0 ...LM for Sparse Semi-Blind Source Separation | 1 + ...rvised Discovery of Object Radiance Fields | 1 + ...ensor Product Representations on the Torus | 1 + ...nd Partial Differential Equation in a Loop | 1 + ...tion by Distilling Feature Correspondences | 1 + ...r Induction with Shared Structure Modeling | 1 + ...easure the Severity of Depressive Symptoms | 1 + ...ation Error: ELBO and Exponential Families | 1 + ...ls for Manipulating 3D ARTiculated Objects | 1 + ...al networks in the overparametrized regime | 1 + ...egularization for Self-Supervised Learning | 1 + ...ou Don't Know by Virtual Outlier Synthesis | 1 + ...te Abstractions for Long-Horizon Reasoning | 1 + ...eighted Model-Based Reinforcement Learning | 1 + ...enerative Modeling of Feature Incompletion | 1 + .../iclr/Variational Neural Cellular Automata | 1 + ... Routing with Nested Subjective Timescales | 1 + ...ensional data: landscape and implicit bias | 1 + ...nal methods for simulation-based inference | 1 + ... 
oracle guiding for reinforcement learning | 1 + ...antized Image Modeling with Improved VQGAN | 1 + ...ve Fully Transformer-based Object Detector | 1 + ...AN: Training GANs with Vision Transformers | 1 + ...pulators Need to Also See from Their Hands | 1 + .../iclr/Visual Correspondence Hallucination | 1 + ...Generalize Strongly Within the Same Domain | 1 + ...epresentation Learning over Latent Domains | 0 ...g sensor and recurrent neural computations | 1 + ...enerative Model of Parametric CAD Sketches | 1 + ...mporal Classification Loss with Wild Cards | 1 + ...y Supervised Monocular 3D Object Detection | 1 + ...n by Generalization in Federated Learning? | 1 + ...ches Zero Loss? --A Mathematical Framework | 1 + ...s? Augment Difficult but Not too Different | 0 ...Tree Search for Combinatorial Optimization | 1 + ...arge Number of Players Sample-Efficiently? | 1 + ... Pre-training or Strong Data Augmentations | 1 + data/2022/iclr/When should agents explore? | 1 + ...Why, and Which Pretrained GANs Are Useful? | 1 + ...Study from the Parameter-Space Perspective | 1 + ...Partner in Positive and Unlabeled Learning | 0 ...l and Efficient Evasion Attacks in Deep RL | 1 + ...allel Use of Labels and Features on Graphs | 1 + ...Needed to Produce a Primate Ventral Stream | 1 + ...pproach To Faster and More Accurate Models | 1 + ...on for long-horizon dexterous manipulation | 1 + ...ency in Deep Learning with A Minimax Model | 1 + ...ature Attribution in Trajectory Prediction | 1 + ...n Framework for Hypergraph Neural Networks | 1 + ...l Directional Boundary by Vector Transform | 1 + ...gative-free symmetric contrastive learning | 1 + ...Supervised Learning for MRI Reconstruction | 1 + ...for Federated Learning with Local Sparsity | 1 + ...cosFormer: Rethinking Softmax In Attention | 1 + ...iFlood: A Stable and Effective Regularizer | 1 + ... 
dynamics with applications to neural data | 1 + ...mark for formal Olympiad-level mathematics | 1 + ...achine Translation Via Code-Switch Decoder | 1 + ...r Single Multi-Labeled Text Classification | 1 + ...ified Framework for Soft Threshold Pruning | 1 + ...eural Networks for Universal Approximation | 1 + ...llation for Multi-View 3D Object Detection | 1 + ...n for Faster and Better Visual Pretraining | 1 + ...for Geometry-Sequence Modeling in Proteins | 1 + ...hanced Explainer for Graph Neural Networks | 1 + .../Delving into Semantic Scale Imbalance | 1 + ...nd Rectifying Vision Models using Language | 1 + ...f-Distribution Robustness via Disagreement | 1 + ... Tensors for Memory-Efficient DNN Training | 1 + ...l Affordance for Dual-gripper Manipulation | 1 + ...afe Exploration with Weakest Preconditions | 1 + ...All You Need for Oriented Object Detection | 1 + ... Examples via Augmenting Content and Style | 1 + ...t Data-Free Learning from Black-Box Models | 1 + ...ostic Representation for Disease Diagnosis | 1 + ...ge-Graphs for Differentiable Rule Learning | 1 + ...E(3)-Invariant Denoising Distance Matching | 1 + ...ng convex conjugates for optimal transport | 1 + ... 
Dense Contrastive Representation Learning | 1 + ...ly Detection in Industry Vision: Graphcore | 1 + ...ning for Low-rank General-sum Markov Games | 0 ...-Sample Matching for Domain Generalization | 1 + ...eature Extractor for Few-shot Segmentation | 1 + ...Improves Adaptation to Distribution Shifts | 1 + ...sk Prompting for Dense Scene Understanding | 1 + ...asses by Extrapolating from a Single Image | 1 + .../Trainability Preserving Neural Pruning | 1 + 2209 files changed, 2144 insertions(+) create mode 100644 data/2020/iclr/A Constructive Prediction of the Generalization Error Across Scales create mode 100644 data/2020/iclr/A Fair Comparison of Graph Neural Networks for Graph Classification create mode 100644 data/2020/iclr/A Learning-based Iterative Method for Solving Vehicle Routing Problems create mode 100644 data/2020/iclr/A Target-Agnostic Attack on Deep Models: Exploiting Security Vulnerabilities of Transfer Learning create mode 100644 data/2020/iclr/A Theoretical Analysis of the Number of Shots in Few-Shot Learning create mode 100644 data/2020/iclr/A critical analysis of self-supervision, or what we can learn from a single image create mode 100644 data/2020/iclr/AMRL: Aggregated Memory For Reinforcement Learning create mode 100644 data/2020/iclr/Accelerating SGD with momentum for over-parameterized learning create mode 100644 data/2020/iclr/Action Semantics Network: Considering the Effects of Actions in Multiagent Systems create mode 100644 data/2020/iclr/Actor-Critic Provably Finds Nash Equilibria of Linear-Quadratic Mean-Field Games create mode 100644 data/2020/iclr/Adaptive Structural Fingerprints for Graph Attention Networks create mode 100644 data/2020/iclr/Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks create mode 100644 data/2020/iclr/Adjustable Real-time Style Transfer create mode 100644 data/2020/iclr/Adversarial Policies: Attacking Deep Reinforcement Learning create mode 100644 
data/2020/iclr/Adversarially Robust Representations with Smooth Encoders create mode 100644 data/2020/iclr/Adversarially robust transfer learning create mode 100644 data/2020/iclr/Ae-OT: a New Generative Model based on Extended Semi-discrete Optimal transport create mode 100644 data/2020/iclr/An Inductive Bias for Distances: Neural Nets that Respect the Triangle Inequality create mode 100644 data/2020/iclr/Analysis of Video Feature Learning in Two-Stream CNNs on the Example of Zebrafish Swim Bout Classification create mode 100644 data/2020/iclr/Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction create mode 100644 data/2020/iclr/Are Transformers universal approximators of sequence-to-sequence functions? create mode 100644 data/2020/iclr/AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures create mode 100644 data/2020/iclr/Augmenting Genetic Algorithms with Deep Neural Networks for Exploring the Chemical Space create mode 100644 data/2020/iclr/AutoQ: Automated Kernel-Wise Neural Network Quantization create mode 100644 data/2020/iclr/Automated Relational Meta-learning create mode 100644 data/2020/iclr/Automated curriculum generation through setter-solver interactions create mode 100644 data/2020/iclr/Automatically Discovering and Learning New Visual Categories with Ranking Statistics create mode 100644 data/2020/iclr/Black-Box Adversarial Attack with Transferable Model-based Embedding create mode 100644 data/2020/iclr/Bounds on Over-Parameterization for Guaranteed Existence of Descent Paths in Shallow ReLU Networks create mode 100644 data/2020/iclr/Bridging Mode Connectivity in Loss Landscapes and Adversarial Robustness create mode 100644 data/2020/iclr/Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints create mode 100644 data/2020/iclr/CAQL: Continuous Action Q-Learning create mode 100644 data/2020/iclr/CLN2INV: Learning Loop Invariants with Continuous 
Logic Networks create mode 100644 data/2020/iclr/CM3: Cooperative Multi-goal Multi-stage Multi-agent Reinforcement Learning create mode 100644 data/2020/iclr/Can gradient clipping mitigate label noise? create mode 100644 data/2020/iclr/Certified Robustness for Top-k Predictions against Adversarial Perturbations via Randomized Smoothing create mode 100644 data/2020/iclr/Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation create mode 100644 data/2020/iclr/Compositional languages emerge in a neural iterated learning model create mode 100644 data/2020/iclr/Computation Reallocation for Object Detection create mode 100644 data/2020/iclr/Continual Learning with Adaptive Weights (CLAW) create mode 100644 data/2020/iclr/Continual Learning with Bayesian Neural Networks for Non-Stationary Data create mode 100644 data/2020/iclr/Counterfactuals uncover the modular structure of deep generative models create mode 100644 data/2020/iclr/Curvature Graph Network create mode 100644 data/2020/iclr/DBA: Distributed Backdoor Attacks against Federated Learning create mode 100644 data/2020/iclr/DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames create mode 100644 data/2020/iclr/Data-Independent Neural Pruning via Coresets create mode 100644 data/2020/iclr/DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling create mode 100644 "data/2020/iclr/Deep 3D Pan via local adaptive \"t-shaped\" convolutions with global and local adaptive dilations" create mode 100644 data/2020/iclr/Deep Imitative Models for Flexible Inference, Planning, and Control create mode 100644 data/2020/iclr/Deep Learning of Determinantal Point Processes via Proper Spectral Sub-gradient create mode 100644 data/2020/iclr/Deep Network Classification by Scattering and Homotopy Dictionary Learning create mode 100644 data/2020/iclr/Deep Semi-Supervised Anomaly Detection create mode 100644 data/2020/iclr/DeepHoyer: Learning Sparser Neural Network with 
Differentiable Scale-Invariant Sparsity Measures create mode 100644 data/2020/iclr/DeepV2D: Video to Depth with Differentiable Structure from Motion create mode 100644 data/2020/iclr/Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation create mode 100644 data/2020/iclr/Depth-Adaptive Transformer create mode 100644 data/2020/iclr/Detecting Extrapolation with Local Ensembles create mode 100644 data/2020/iclr/Detecting and Diagnosing Adversarial Images with Class-Conditional Capsule Reconstructions create mode 100644 data/2020/iclr/Difference-Seeking Generative Adversarial Network-Unseen Sample Generation create mode 100644 data/2020/iclr/Differentially Private Meta-Learning create mode 100644 data/2020/iclr/Disentangling Factors of Variations Using Few Labels create mode 100644 data/2020/iclr/Distance-Based Learning from Errors for Confidence Calibration create mode 100644 data/2020/iclr/Diverse Trajectory Forecasting with Determinantal Point Processes create mode 100644 data/2020/iclr/DivideMix: Learning with Noisy Labels as Semi-supervised Learning create mode 100644 data/2020/iclr/Dynamic Time Lag Regression: Predicting What & When create mode 100644 data/2020/iclr/Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery create mode 100644 data/2020/iclr/Dynamically Pruned Message Passing Networks for Large-scale Knowledge Graph Reasoning create mode 100644 data/2020/iclr/ES-MAML: Simple Hessian-Free Meta Learning create mode 100644 data/2020/iclr/Editable Neural Networks create mode 100644 data/2020/iclr/Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform create mode 100644 data/2020/iclr/Efficient and Information-Preserving Future Frame Prediction and Beyond create mode 100644 data/2020/iclr/Enhancing Transformation-Based Defenses Against Adversarial Attacks with a Distribution Classifier create mode 100644 data/2020/iclr/Ensemble Distribution Distillation create mode 100644 
data/2020/iclr/Escaping Saddle Points Faster with Stochastic Momentum create mode 100644 data/2020/iclr/Evaluating The Search Phase of Neural Architecture Search create mode 100644 data/2020/iclr/Exploration in Reinforcement Learning with Deep Covering Options create mode 100644 data/2020/iclr/Exploring Model-based Planning with Policy Networks create mode 100644 data/2020/iclr/FSPool: Learning Set Representations with Featurewise Sort Pooling create mode 100644 data/2020/iclr/Fast is better than free: Revisiting adversarial training create mode 100644 data/2020/iclr/FasterSeg: Searching for Faster Real-time Semantic Segmentation create mode 100644 data/2020/iclr/Feature Interaction Interpretability: A Case for Explaining Ad-Recommendation Systems via Neural Interaction Detection create mode 100644 data/2020/iclr/Federated Adversarial Domain Adaptation create mode 100644 data/2020/iclr/Few-Shot Learning on graphs via super-Classes based on Graph spectral Measures create mode 100644 data/2020/iclr/Few-shot Text Classification with Distributional Signatures create mode 100644 data/2020/iclr/Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents create mode 100644 data/2020/iclr/Fooling Detection Alone is Not Enough: Adversarial Attack against Multiple Object Tracking create mode 100644 data/2020/iclr/Four Things Everyone Should Know to Improve Batch Normalization create mode 100644 data/2020/iclr/From Variational to Deterministic Autoencoders create mode 100644 data/2020/iclr/Functional vs. 
parametric equivalence of ReLU networks create mode 100644 data/2020/iclr/GAT: Generative Adversarial Training for Adversarial Example Detection and Robust Classification create mode 100644 data/2020/iclr/GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations create mode 100644 data/2020/iclr/GLAD: Learning Sparse Graph Recovery create mode 100644 data/2020/iclr/Gap-Aware Mitigation of Gradient Staleness create mode 100644 data/2020/iclr/Generalization bounds for deep convolutional neural networks create mode 100644 data/2020/iclr/Generative Ratio Matching Networks create mode 100644 data/2020/iclr/Geometric Insights into the Convergence of Nonlinear TD Learning create mode 100644 data/2020/iclr/Global Relational Models of Source Code create mode 100644 data/2020/iclr/Graph inference learning for semi-supervised classification create mode 100644 data/2020/iclr/Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation create mode 100644 data/2020/iclr/I Am Going MAD: Maximum Discrepancy Competition for Comparing Classifiers Adaptively create mode 100644 data/2020/iclr/Identifying through Flows for Recovering Latent Representations create mode 100644 data/2020/iclr/Identity Crisis: Memorization and Generalization Under Extreme Overparameterization create mode 100644 data/2020/iclr/Image-guided Neural Object Rendering create mode 100644 data/2020/iclr/Imitation Learning via Off-Policy Distribution Matching create mode 100644 data/2020/iclr/Implicit Bias of Gradient Descent based Adversarial Training on Separable Data create mode 100644 data/2020/iclr/Improved Sample Complexities for Deep Neural Networks and Robust Classification via an All-Layer Margin create mode 100644 data/2020/iclr/Improving Adversarial Robustness Requires Revisiting Misclassified Examples create mode 100644 data/2020/iclr/In Search for a SAT-friendly Binarized Neural Network Architecture create mode 100644 
data/2020/iclr/Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models create mode 100644 data/2020/iclr/Interpretable Complex-Valued Neural Networks for Privacy Protection create mode 100644 data/2020/iclr/Intrinsic Motivation for Encouraging Synergistic Behavior create mode 100644 data/2020/iclr/Knowledge Consistency between Neural Networks and Beyond create mode 100644 data/2020/iclr/LAMOL: LAnguage MOdeling for Lifelong Language Learning create mode 100644 data/2020/iclr/Language GANs Falling Short create mode 100644 data/2020/iclr/Large Batch Optimization for Deep Learning: Training BERT in 76 minutes create mode 100644 data/2020/iclr/Lazy-CFR: fast and near-optimal regret minimization for extensive games with imperfect information create mode 100644 data/2020/iclr/Learned Step Size quantization create mode 100644 data/2020/iclr/Learning Disentangled Representations for CounterFactual Regression create mode 100644 data/2020/iclr/Learning Efficient Parameter Server Synchronization Policies for Distributed SGD create mode 100644 data/2020/iclr/Learning Execution through Neural Code fusion create mode 100644 data/2020/iclr/Learning Expensive Coordination: An Event-Based Deep RL Approach create mode 100644 data/2020/iclr/Learning Heuristics for Quantified Boolean Formulas through Reinforcement Learning create mode 100644 data/2020/iclr/Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling create mode 100644 data/2020/iclr/Learning Space Partitions for Nearest Neighbor Search create mode 100644 data/2020/iclr/Learning deep graph matching with channel-independent embedding and Hungarian attention create mode 100644 data/2020/iclr/Learning the Arrow of Time for Problems in Reinforcement Learning create mode 100644 data/2020/iclr/Learning to Learn by Zeroth-Order Oracle create mode 100644 data/2020/iclr/Learning to Link create mode 100644 data/2020/iclr/Learning to Represent Programs with 
Property Signatures create mode 100644 data/2020/iclr/Learning to solve the credit assignment problem create mode 100644 data/2020/iclr/Linear Symmetric Quantization of Neural Networks for Low-precision Integer Hardware create mode 100644 data/2020/iclr/Locality and Compositionality in Zero-Shot Learning create mode 100644 data/2020/iclr/Logic and the 2-Simplicial Transformer create mode 100644 data/2020/iclr/Low-Resource Knowledge-Grounded Dialogue Generation create mode 100644 data/2020/iclr/MACER: Attack-free and Scalable Robust Training via Maximizing Certified Radius create mode 100644 data/2020/iclr/Maxmin Q-learning: Controlling the Estimation Bias of Q-learning create mode 100644 data/2020/iclr/Measuring Compositional Generalization: A Comprehensive Method on Realistic Data create mode 100644 data/2020/iclr/Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples create mode 100644 data/2020/iclr/MetaPix: Few-Shot Video Retargeting create mode 100644 data/2020/iclr/Minimizing FLOPs to Learn Efficient Sparse Representations create mode 100644 data/2020/iclr/Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models create mode 100644 data/2020/iclr/Mixup Inference: Better Exploiting Mixup to Defend Adversarial Attacks create mode 100644 data/2020/iclr/Multi-agent Reinforcement Learning for Networked System Control create mode 100644 data/2020/iclr/Multiplicative Interactions and Where to Find Them create mode 100644 data/2020/iclr/Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification create mode 100644 data/2020/iclr/N-BEATS: Neural basis expansion analysis for interpretable time series forecasting create mode 100644 data/2020/iclr/NAS evaluation is frustratingly hard create mode 100644 data/2020/iclr/Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data create mode 100644 data/2020/iclr/Neural Stored-program Memory create mode 100644 
data/2020/iclr/Neural Text Generation With Unlikelihood Training
 create mode 100644 data/2020/iclr/Novelty Detection Via Blurring
 create mode 100644 data/2020/iclr/Observational Overfitting in Reinforcement Learning
 create mode 100644 data/2020/iclr/On Computation and Generalization of Generative Adversarial Imitation Learning
 create mode 100644 data/2020/iclr/On Identifiability in Transformers
 create mode 100644 data/2020/iclr/On Mutual Information Maximization for Representation Learning
 create mode 100644 "data/2020/iclr/On the \"steerability\" of generative adversarial networks"
 create mode 100644 data/2020/iclr/On the Variance of the Adaptive Learning Rate and Beyond
 create mode 100644 data/2020/iclr/On the Weaknesses of Reinforcement Learning for Neural Machine Translation
 create mode 100644 data/2020/iclr/One-Shot Pruning of Recurrent Neural Networks by Jacobian Spectrum Evaluation
 create mode 100644 data/2020/iclr/Optimistic Exploration even with a Pessimistic Initialisation
 create mode 100644 data/2020/iclr/Option Discovery using Deep Skill Chaining
 create mode 100644 data/2020/iclr/Order Learning and Its Application to Age Estimation
 create mode 100644 data/2020/iclr/Overlearning Reveals Sensitive Attributes
 create mode 100644 data/2020/iclr/Physics-as-Inverse-Graphics: Unsupervised Physical Parameter Estimation from Video
 create mode 100644 data/2020/iclr/Piecewise linear activations substantially shape the loss surfaces of neural networks
 create mode 100644 data/2020/iclr/Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP
 create mode 100644 data/2020/iclr/Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring
 create mode 100644 data/2020/iclr/Population-Guided Parallel Policy Search for Reinforcement Learning
 create mode 100644 data/2020/iclr/Pre-training Tasks for Embedding-based Large-scale Retrieval
 create mode 100644 data/2020/iclr/Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model
 create mode 100644 data/2020/iclr/Progressive Memory Banks for Incremental Domain Adaptation
 create mode 100644 data/2020/iclr/ProxSGD: Training Structured Neural Networks under Regularization and Constraints
 create mode 100644 data/2020/iclr/Pruned Graph Scattering Transforms
 create mode 100644 data/2020/iclr/Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving
 create mode 100644 data/2020/iclr/Pure and Spurious Critical Points: a Geometric Study of Linear Networks
 create mode 100644 data/2020/iclr/Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP
 create mode 100644 data/2020/iclr/Quantifying the Cost of Reliable Photo Authentication via High-Performance Learned Lossy Representations
 create mode 100644 data/2020/iclr/RTFM: Generalising to New Environment Dynamics via Reading
 create mode 100644 data/2020/iclr/RaCT: Toward Amortized Ranking-Critical Training For Collaborative Filtering
 create mode 100644 data/2020/iclr/Ranking Policy Gradient
 create mode 100644 data/2020/iclr/Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML
 create mode 100644 data/2020/iclr/ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning
 create mode 100644 data/2020/iclr/ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring
 create mode 100644 data/2020/iclr/Reanalysis of Variance Reduced Temporal Difference Learning
 create mode 100644 data/2020/iclr/Recurrent neural circuits for contour detection
 create mode 100644 data/2020/iclr/Reinforced active learning for image segmentation
 create mode 100644 data/2020/iclr/Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation
 create mode 100644 data/2020/iclr/Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives
 create mode 100644 data/2020/iclr/Relational State-Space Model for Stochastic Multi-Object Systems
 create mode 100644 data/2020/iclr/Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness
 create mode 100644 data/2020/iclr/Robust And Interpretable Blind Image Denoising Via Bias-Free Convolutional Neural Networks
 create mode 100644 data/2020/iclr/Robust Local Features for Improving the Generalization of Adversarial Training
 create mode 100644 data/2020/iclr/Robust training with ensemble consensus
 create mode 100644 data/2020/iclr/SAdam: A Variant of Adam for Strongly Convex Functions
 create mode 100644 data/2020/iclr/SELF: Learning to Filter Noisy Labels with Self-Ensembling
 create mode 100644 data/2020/iclr/SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards
 create mode 100644 data/2020/iclr/Sampling-Free Learning of Bayesian Quantized Neural Networks
 create mode 100644 data/2020/iclr/Scalable Model Compression by Entropy Penalized Reparameterization
 create mode 100644 data/2020/iclr/Scalable Neural Methods for Reasoning With a Symbolic Knowledge Base
 create mode 100644 data/2020/iclr/Scalable and Order-robust Continual Learning with Additive Parameter Decomposition
 create mode 100644 data/2020/iclr/Selection via Proxy: Efficient Data Selection for Deep Learning
 create mode 100644 data/2020/iclr/Self-Adversarial Learning with Comparative Discrimination for Text Generation
 create mode 100644 data/2020/iclr/Semantically-Guided Representation Learning for Self-Supervised Monocular Depth
 create mode 100644 data/2020/iclr/Sharing Knowledge in Multi-Task Deep Reinforcement Learning
 create mode 100644 data/2020/iclr/Short and Sparse Deconvolution - A Geometric Approach
 create mode 100644 data/2020/iclr/Sign Bits Are All You Need for Black-Box Attacks
 create mode 100644 data/2020/iclr/Sign-OPT: A Query-Efficient Hard-label Adversarial Attack
 create mode 100644 data/2020/iclr/SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum
 create mode 100644 data/2020/iclr/Stochastic AUC Maximization with Deep Neural Networks
 create mode 100644 data/2020/iclr/Stochastic Conditional Generative Networks with Basis Decomposition
 create mode 100644 data/2020/iclr/Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well
 create mode 100644 data/2020/iclr/StructPool: Structured Graph Pooling via Conditional Random Fields
 create mode 100644 data/2020/iclr/TabFact: A Large-scale Dataset for Table-based Fact Verification
 create mode 100644 data/2020/iclr/The Implicit Bias of Depth: How Incremental Learning Drives Generalization
 create mode 100644 data/2020/iclr/The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget
 create mode 100644 data/2020/iclr/The asymptotic spectrum of the Hessian of DNN throughout training
 create mode 100644 data/2020/iclr/Theory and Evaluation Metrics for Learning Disentangled Representations
 create mode 100644 data/2020/iclr/Thieves on Sesame Street! Model Extraction of BERT-based APIs
 create mode 100644 data/2020/iclr/To Relieve Your Headache of Training an MRF, Take AdVIL
 create mode 100644 data/2020/iclr/Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets
 create mode 100644 data/2020/iclr/Transferable Perturbations of Deep Feature Distributions
 create mode 100644 data/2020/iclr/Tree-Structured Attention with Hierarchical Accumulation
 create mode 100644 data/2020/iclr/Understanding Architectures Learnt by Cell-based Neural Architecture Search
 create mode 100644 data/2020/iclr/Understanding Knowledge Distillation in Non-autoregressive Machine Translation
 create mode 100644 data/2020/iclr/Understanding the Limitations of Variational Mutual Information Estimators
 create mode 100644 data/2020/iclr/Unpaired Point Cloud Completion on Real Scans using Adversarial Training
 create mode 100644 data/2020/iclr/Unsupervised Model Selection for Variational Disentangled Representation Learning
 create mode 100644 data/2020/iclr/V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control
 create mode 100644 data/2020/iclr/V4D: 4D Convolutional Neural Networks for Video-level Representation Learning
 create mode 100644 data/2020/iclr/VL-BERT: Pre-training of Generic Visual-Linguistic Representations
 create mode 100644 data/2020/iclr/Variational Recurrent Models for Solving Partially Observable Control Tasks
 create mode 100644 data/2020/iclr/Vid2Game: Controllable Characters Extracted from Real-World Videos
 create mode 100644 data/2020/iclr/VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation
 create mode 100644 data/2020/iclr/Watch, Try, Learn: Meta-Learning from Demonstrations and Rewards
 create mode 100644 data/2020/iclr/Weakly Supervised Clustering by Exploiting Unique Class Count
 create mode 100644 data/2020/iclr/What graph neural networks cannot learn: depth vs width
 create mode 100644 data/2021/iclr/A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning
 create mode 100644 data/2021/iclr/A Block Minifloat Representation for Training Deep Neural Networks
 create mode 100644 data/2021/iclr/A Critique of Self-Expressive Deep Subspace Clustering
 create mode 100644 data/2021/iclr/A Design Space Study for LISTA and Beyond
 create mode 100644 data/2021/iclr/A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima
 create mode 100644 data/2021/iclr/A Discriminative Gaussian Mixture Model with Sparsity
 create mode 100644 data/2021/iclr/A Distributional Approach to Controlled Text Generation
 create mode 100644 data/2021/iclr/A Geometric Analysis of Deep Generative Image Models and Its Applications
 create mode 100644 data/2021/iclr/A Good Image Generator Is What You Need for High-Resolution Video Synthesis
 create mode 100644 data/2021/iclr/A Gradient Flow Framework For Analyzing Network Pruning
 create mode 100644 data/2021/iclr/A Hypergradient Approach to Robust Regression without Correspondence
 create mode 100644 data/2021/iclr/A Learning Theoretic Perspective on Local Explainability
 create mode 100644 data/2021/iclr/A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks
 create mode 100644 data/2021/iclr/A PAC-Bayesian Approach to Generalization Bounds for Graph Neural Networks
 create mode 100644 data/2021/iclr/A Panda? No, It's a Sloth: Slowdown Attacks on Adaptive Multi-Exit Neural Network Inference
 create mode 100644 data/2021/iclr/A Temporal Kernel Approach for Deep Learning with Continuous-time Information
 create mode 100644 data/2021/iclr/A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention
 create mode 100644 data/2021/iclr/A Unified Approach to Interpreting and Boosting Adversarial Transferability
 create mode 100644 data/2021/iclr/A Universal Representation Transformer Layer for Few-Shot Image Classification
 create mode 100644 data/2021/iclr/A Wigner-Eckart Theorem for Group Equivariant Convolution Kernels
 create mode 100644 data/2021/iclr/A statistical theory of cold posteriors in deep neural networks
 create mode 100644 data/2021/iclr/A teacher-student framework to distill future trajectories
 create mode 100644 data/2021/iclr/A unifying view on implicit bias in training linear neural networks
 create mode 100644 data/2021/iclr/ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
 create mode 100644 data/2021/iclr/ANOCE: Analysis of Causal Effects with Multiple Mediators via Constrained Structural Learning
 create mode 100644 data/2021/iclr/ARMOURED: Adversarially Robust MOdels using Unlabeled data by REgularizing Diversity
 create mode 100644 data/2021/iclr/Accelerating Convergence of Replica Exchange Stochastic Gradient MCMC via Variance Reduction
 create mode 100644 data/2021/iclr/Accurate Learning of Graph Representations with Graph Multiset Pooling
 create mode 100644 data/2021/iclr/Achieving Linear Speedup with Partial Worker Participation in Non-IID Federated Learning
 create mode 100644 data/2021/iclr/Acting in Delayed Environments with Non-Stationary Markov Policies
 create mode 100644 data/2021/iclr/Activation-level uncertainty in deep neural networks
 create mode 100644 data/2021/iclr/Active Contrastive Learning of Audio-Visual Video Representations
 create mode 100644 data/2021/iclr/AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition
 create mode 100644 data/2021/iclr/AdaGCN: Adaboosting Graph Convolutional Networks into Deep Models
 create mode 100644 data/2021/iclr/AdaSpeech: Adaptive Text to Speech for Custom Voice
 create mode 100644 data/2021/iclr/AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights
 create mode 100644 data/2021/iclr/Adapting to Reward Progressivity via Spectral Reinforcement Learning
 create mode 100644 data/2021/iclr/Adaptive Extra-Gradient Methods for Min-Max Optimization and Games
 create mode 100644 data/2021/iclr/Adaptive Federated Optimization
 create mode 100644 data/2021/iclr/Adaptive Procedural Task Generation for Hard-Exploration Problems
 create mode 100644 data/2021/iclr/Adaptive Universal Generalized PageRank Graph Neural Network
 create mode 100644 data/2021/iclr/Adaptive and Generative Zero-Shot Learning
 create mode 100644 data/2021/iclr/Adversarial score matching and improved sampling for image generation
 create mode 100644 data/2021/iclr/Adversarially Guided Actor-Critic
 create mode 100644 data/2021/iclr/Adversarially-Trained Deep Nets Transfer Better: Illustration on Image Classification
 create mode 100644 data/2021/iclr/Aligning AI With Shared Human Values
 create mode 100644 data/2021/iclr/An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
 create mode 100644 data/2021/iclr/An Unsupervised Deep Learning Approach for Real-World Image Denoising
 create mode 100644 data/2021/iclr/Analyzing the Expressive Power of Graph Neural Networks in a Spectral Perspective
 create mode 100644 data/2021/iclr/Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics
 create mode 100644 data/2021/iclr/Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies
 create mode 100644 data/2021/iclr/Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval
 create mode 100644 data/2021/iclr/Anytime Sampling for Autoregressive Models via Ordered Autoencoding
 create mode 100644 data/2021/iclr/Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
 create mode 100644 data/2021/iclr/Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks
 create mode 100644 data/2021/iclr/Are Neural Rankers still Outperformed by Gradient Boosted Decision Trees?
 create mode 100644 data/2021/iclr/Are wider nets better given the same number of parameters?
 create mode 100644 data/2021/iclr/Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning
 create mode 100644 data/2021/iclr/Async-RED: A Provably Convergent Asynchronous Block Parallel Stochastic Method using Deep Denoising Priors
 create mode 100644 data/2021/iclr/Attentional Constellation Nets for Few-Shot Learning
 create mode 100644 data/2021/iclr/Auction Learning as a Two-Player Game
 create mode 100644 data/2021/iclr/Augmenting Physical Models with Deep Networks for Complex Dynamics Forecasting
 create mode 100644 data/2021/iclr/Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation
 create mode 100644 data/2021/iclr/AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly
 create mode 100644 data/2021/iclr/Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization
 create mode 100644 data/2021/iclr/Autoregressive Entity Retrieval
 create mode 100644 data/2021/iclr/Auxiliary Learning by Implicit Differentiation
 create mode 100644 data/2021/iclr/Auxiliary Task Update Decomposition: the Good, the Bad and the neutral
 create mode 100644 data/2021/iclr/Average-case Acceleration for Bilinear Games and Normal Matrices
 create mode 100644 data/2021/iclr/BERTology Meets Biology: Interpreting Attention in Protein Language Models
 create mode 100644 data/2021/iclr/BOIL: Towards Representation Change for Few-shot Learning
 create mode 100644 data/2021/iclr/BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction
 create mode 100644 data/2021/iclr/BREEDS: Benchmarks for Subpopulation Shift
 create mode 100644 data/2021/iclr/BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization
 create mode 100644 data/2021/iclr/BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration
 create mode 100644 data/2021/iclr/Bag of Tricks for Adversarial Training
 create mode 100644 data/2021/iclr/Balancing Constraints and Rewards with Meta-Gradient D4PG
 create mode 100644 data/2021/iclr/Batch Reinforcement Learning Through Continuation Method
 create mode 100644 "data/2021/iclr/Bayesian Few-Shot Classification with One-vs-Each P\303\263lya-Gamma Augmented Gaussian Processes"
 create mode 100644 data/2021/iclr/Behavioral Cloning from Noisy Demonstrations
 create mode 100644 data/2021/iclr/Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods
 create mode 100644 data/2021/iclr/Better Fine-Tuning by Reducing Representational Collapse
 create mode 100644 data/2021/iclr/Beyond Categorical Label Representations for Image Classification
 create mode 100644 data/2021/iclr/Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with 1 n Parameters
 create mode 100644 data/2021/iclr/BiPointNet: Binary Neural Network for Point Clouds
 create mode 100644 data/2021/iclr/Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech
 create mode 100644 data/2021/iclr/Blending MPC & Value Function Approximation for Efficient Reinforcement Learning
 create mode 100644 data/2021/iclr/Boost then Convolve: Gradient Boosting Meets Graph Neural Networks
 create mode 100644 data/2021/iclr/Bowtie Networks: Generative Modeling for Joint Few-Shot Recognition and Novel-View Synthesis
 create mode 100644 data/2021/iclr/Bypassing the Ambient Dimension: Private SGD with Gradient Subspace Identification
 create mode 100644 data/2021/iclr/Byzantine-Resilient Non-Convex Stochastic Gradient Descent
 create mode 100644 data/2021/iclr/C-Learning: Horizon-Aware Cumulative Accessibility Estimation
 create mode 100644 data/2021/iclr/C-Learning: Learning to Achieve Goals via Recursive Classification
 create mode 100644 data/2021/iclr/CO2: Consistent Contrast for Unsupervised Visual Representation Learning
 create mode 100644 data/2021/iclr/CPR: Classifier-Projection Regularization for Continual Learning
 create mode 100644 data/2021/iclr/CPT: Efficient Deep Neural Network Training via Cyclic Precision
 create mode 100644 data/2021/iclr/CT-Net: Channel Tensorization Network for Video Classification
 create mode 100644 data/2021/iclr/CaPC Learning: Confidential and Private Collaborative Learning
 create mode 100644 data/2021/iclr/Calibration of Neural Networks using Splines
 create mode 100644 data/2021/iclr/Calibration tests beyond classification
 create mode 100644 data/2021/iclr/Can a Fruit Fly Learn Word Embeddings?
 create mode 100644 data/2021/iclr/Capturing Label Characteristics in VAEs
 create mode 100644 data/2021/iclr/Categorical Normalizing Flows via Continuous Transformations
 create mode 100644 data/2021/iclr/CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning
 create mode 100644 data/2021/iclr/CcGAN: Continuous Conditional Generative Adversarial Networks for Image Generation
 create mode 100644 data/2021/iclr/Certify or Predict: Boosting Certified Robustness with Compositional Architectures
 create mode 100644 data/2021/iclr/Chaos of Learning Beyond Zero-sum and Coordination via Game Decompositions
 create mode 100644 data/2021/iclr/Characterizing signal propagation to close the performance gap in unnormalized ResNets
 create mode 100644 data/2021/iclr/ChipNet: Budget-Aware Pruning with Heaviside Continuous Approximations
 create mode 100644 data/2021/iclr/Clairvoyance: A Pipeline Toolkit for Medical Time Series
 create mode 100644 data/2021/iclr/Class Normalization for (Continual)? Generalized Zero-Shot Learning
 create mode 100644 data/2021/iclr/Clustering-friendly Representation Learning via Instance Discrimination and Feature Decorrelation
 create mode 100644 data/2021/iclr/Co-Mixup: Saliency Guided Joint Mixup with Supermodular Diversity
 create mode 100644 data/2021/iclr/CoCo: Controllable Counterfactuals for Evaluating Dialogue State Trackers
 create mode 100644 data/2021/iclr/CoCon: A Self-Supervised Approach for Controlled Text Generation
 create mode 100644 data/2021/iclr/CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding
 create mode 100644 data/2021/iclr/Collective Robustness Certificates: Exploiting Interdependence in Graph Neural Networks
 create mode 100644 data/2021/iclr/Colorization Transformer
 create mode 100644 data/2021/iclr/Combining Ensembles and Data Augmentation Can Harm Your Calibration
 create mode 100644 data/2021/iclr/Combining Label Propagation and Simple Models out-performs Graph Neural Networks
 create mode 100644 data/2021/iclr/Combining Physics and Machine Learning for Network Flow Estimation
 create mode 100644 data/2021/iclr/Communication in Multi-Agent Reinforcement Learning: Intention Sharing
 create mode 100644 data/2021/iclr/CompOFA - Compound Once-For-All Networks for Faster Multi-Platform Deployment
 create mode 100644 data/2021/iclr/Complex Query Answering with Neural Link Predictors
 create mode 100644 data/2021/iclr/Computational Separation Between Convolutional and Fully-Connected Networks
 create mode 100644 data/2021/iclr/Concept Learners for Few-Shot Learning
 create mode 100644 data/2021/iclr/Conditional Generative Modeling via Learning the Latent Space
 create mode 100644 data/2021/iclr/Conditional Negative Sampling for Contrastive Learning of Visual Representations
 create mode 100644 data/2021/iclr/Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data
 create mode 100644 data/2021/iclr/Conformation-Guided Molecular Representation with Hamiltonian Neural Networks
 create mode 100644 data/2021/iclr/Conservative Safety Critics for Exploration
 create mode 100644 data/2021/iclr/Contemplating Real-World Object Classification
 create mode 100644 data/2021/iclr/Contextual Dropout: An Efficient Sample-Dependent Dropout Module
 create mode 100644 data/2021/iclr/Contextual Transformation Networks for Online Continual Learning
 create mode 100644 data/2021/iclr/Continual learning in recurrent neural networks
 create mode 100644 data/2021/iclr/Continuous Wasserstein-2 Barycenter Estimation without Minimax Optimization
 create mode 100644 data/2021/iclr/Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning
 create mode 100644 data/2021/iclr/Contrastive Divergence Learning is a Time Reversal Adversarial Game
 create mode 100644 data/2021/iclr/Contrastive Explanations for Reinforcement Learning via Embedded Self Predictions
 create mode 100644 data/2021/iclr/Contrastive Learning with Adversarial Perturbations for Conditional Text Generation
 create mode 100644 data/2021/iclr/Contrastive Learning with Hard Negative Samples
 create mode 100644 data/2021/iclr/Contrastive Syn-to-Real Generalization
 create mode 100644 data/2021/iclr/Control-Aware Representations for Model-based Reinforcement Learning
 create mode 100644 data/2021/iclr/Convex Potential Flows: Universal Probability Distributions with Optimal Transport and Convex Optimization
 create mode 100644 data/2021/iclr/Convex Regularization behind Neural Reconstruction
 create mode 100644 data/2021/iclr/Coping with Label Shift via Distributionally Robust Optimisation
 create mode 100644 data/2021/iclr/CopulaGNN: Towards Integrating Representational and Correlational Roles of Graphs in Graph Neural Networks
 create mode 100644 data/2021/iclr/Correcting experience replay for multi-agent communication
 create mode 100644 data/2021/iclr/Counterfactual Generative Networks
 create mode 100644 data/2021/iclr/Coupled Oscillatory Recurrent Neural Network (coRNN): An accurate and (gradient) stable architecture for learning long time dependencies
 create mode 100644 data/2021/iclr/Creative Sketch Generation
 create mode 100644 data/2021/iclr/Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization
 create mode 100644 data/2021/iclr/Cut out the annotator, keep the cutout: better segmentation with weak supervision
 create mode 100644 data/2021/iclr/DARTS-: Robustly Stepping out of Performance Collapse Without Indicators
 create mode 100644 data/2021/iclr/DC3: A learning method for optimization with hard constraints
 create mode 100644 data/2021/iclr/DDPNOpt: Differential Dynamic Programming Neural Optimizer
 create mode 100644 data/2021/iclr/DICE: Diversity in Deep Ensembles via Conditional Redundancy Adversarial Estimation
 create mode 100644 data/2021/iclr/DINO: A Conditional Energy-Based GAN for Domain Translation
 create mode 100644 data/2021/iclr/DOP: Off-Policy Multi-Agent Decomposed Policy Gradients
 create mode 100644 data/2021/iclr/Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning
 create mode 100644 data/2021/iclr/Data-Efficient Reinforcement Learning with Self-Predictive Representations
 create mode 100644 data/2021/iclr/Dataset Condensation with Gradient Matching
 create mode 100644 data/2021/iclr/Dataset Inference: Ownership Resolution in Machine Learning
 create mode 100644 data/2021/iclr/Dataset Meta-Learning from Kernel Ridge-Regression
 create mode 100644 data/2021/iclr/DeLighT: Deep and Light-weight Transformer
 create mode 100644 data/2021/iclr/Deberta: decoding-Enhanced Bert with Disentangled Attention
 create mode 100644 data/2021/iclr/Debiasing Concept-based Explanations with Causal Analysis
 create mode 100644 data/2021/iclr/Decentralized Attribution of Generative Models
 create mode 100644 data/2021/iclr/Deciphering and Optimizing Multi-Task Learning: a Random Matrix Approach
 create mode 100644 data/2021/iclr/Deconstructing the Regularization of BatchNorm
 create mode 100644 data/2021/iclr/Decoupling Global and Local Representations via Invertible Generative Flows
 create mode 100644 data/2021/iclr/Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation
 create mode 100644 data/2021/iclr/Deep Equals Shallow for ReLU Networks in Kernel Regimes
 create mode 100644 data/2021/iclr/Deep Learning meets Projective Clustering
 create mode 100644 data/2021/iclr/Deep Networks and the Multiple Manifold Problem
 create mode 100644 data/2021/iclr/Deep Neural Network Fingerprinting by Conferrable Adversarial Examples
 create mode 100644 data/2021/iclr/Deep Neural Tangent Kernel and Laplace Kernel Have the Same RKHS
 create mode 100644 data/2021/iclr/Deep Partition Aggregation: Provable Defenses against General Poisoning Attacks
 create mode 100644 data/2021/iclr/Deep Repulsive Clustering of Ordered Data Based on Order-Identity Decomposition
 create mode 100644 data/2021/iclr/Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients
 create mode 100644 data/2021/iclr/DeepAveragers: Offline Reinforcement Learning By Solving Derived Non-Parametric MDPs
 create mode 100644 data/2021/iclr/Deformable DETR: Deformable Transformers for End-to-End Object Detection
 create mode 100644 data/2021/iclr/Degree-Quant: Quantization-Aware Training for Graph Neural Networks
 create mode 100644 data/2021/iclr/Denoising Diffusion Implicit Models
 create mode 100644 data/2021/iclr/Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization
 create mode 100644 data/2021/iclr/DialoGraph: Incorporating Interpretable Strategy-Graph Networks into Negotiation Dialogues
 create mode 100644 data/2021/iclr/DiffWave: A Versatile Diffusion Model for Audio Synthesis
 create mode 100644 data/2021/iclr/Differentiable Segmentation of Sequences
 create mode 100644 data/2021/iclr/Differentiable Trust Region Layers for Deep Reinforcement Learning
 create mode 100644 data/2021/iclr/Differentially Private Learning Needs Better Features (or Much More Data)
 create mode 100644 data/2021/iclr/Directed Acyclic Graph Neural Networks
 create mode 100644 data/2021/iclr/Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate
 create mode 100644 data/2021/iclr/Disambiguating Symbolic Expressions in Informal Documents
 create mode 100644 data/2021/iclr/Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization
 create mode 100644 data/2021/iclr/Discovering Non-monotonic Autoregressive Orderings with Variational Inference
 create mode 100644 data/2021/iclr/Discovering a set of policies for the worst case reward
 create mode 100644 data/2021/iclr/Discrete Graph Structure Learning for Forecasting Multiple Time Series
 create mode 100644 data/2021/iclr/Disentangled Recurrent Wasserstein Autoencoder
 create mode 100644 data/2021/iclr/Disentangling 3D Prototypical Networks for Few-Shot Concept Learning
 create mode 100644 data/2021/iclr/Distance-Based Regularisation of Deep Networks for Fine-Tuning
 create mode 100644 data/2021/iclr/Distilling Knowledge from Reader to Retriever for Question Answering
 create mode 100644 data/2021/iclr/Distributed Momentum for Byzantine-resilient Stochastic Gradient Descent
 create mode 100644 data/2021/iclr/Distributional Sliced-Wasserstein and Applications to Generative Modeling
 create mode 100644 data/2021/iclr/Diverse Video Generation using a Gaussian Process Trigger
 create mode 100644 data/2021/iclr/Do 2D GANs Know 3D Shape? Unsupervised 3D Shape Reconstruction from 2D Image GANs
 create mode 100644 data/2021/iclr/Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth
 create mode 100644 data/2021/iclr/Do not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning
 create mode 100644 data/2021/iclr/Does enhanced shape bias improve neural network robustness to common corruptions?
 create mode 100644 data/2021/iclr/Domain Generalization with MixStyle
 create mode 100644 data/2021/iclr/Domain-Robust Visual Imitation Learning with Mutual Information Constraints
 create mode 100644 data/2021/iclr/DrNAS: Dirichlet Neural Architecture Search
 create mode 100644 data/2021/iclr/Drop-Bottleneck: Learning Discrete Compressed Representation for Noise-Robust Exploration
 create mode 100644 data/2021/iclr/Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling
 create mode 100644 data/2021/iclr/DynaTune: Dynamic Tensor Program Optimization in Deep Neural Network Compilation
 create mode 100644 data/2021/iclr/Dynamic Tensor Rematerialization
 create mode 100644 data/2021/iclr/EEC: Learning to Encode and Regenerate Images for Continual Learning
 create mode 100644 data/2021/iclr/Early Stopping in Deep Networks: Double Descent and How to Eliminate it
 create mode 100644 data/2021/iclr/Economic Hyperparameter Optimization with Blended Search Strategy
 create mode 100644 data/2021/iclr/Effective Abstract Reasoning with Dual-Contrast Network
 create mode 100644 data/2021/iclr/Effective Distributed Learning with Random Features: Improved Bounds and Algorithms
 create mode 100644 data/2021/iclr/Effective and Efficient Vote Attack on Capsule Networks
 create mode 100644 data/2021/iclr/Efficient Certified Defenses Against Patch Attacks on Image Classifiers
 create mode 100644 data/2021/iclr/Efficient Conformal Prediction via Cascaded Inference with Expanded Admission
 create mode 100644 data/2021/iclr/Efficient Continual Learning with Modular Networks and Task-Driven Priors
 create mode 100644 data/2021/iclr/Efficient Empowerment Estimation for Unsupervised Stabilization
 create mode 100644 data/2021/iclr/Efficient Generalized Spherical CNNs
 create mode 100644 data/2021/iclr/Efficient Inference of Flexible Interaction in Spiking-neuron Networks
 create mode 100644 data/2021/iclr/Efficient Reinforcement Learning in Factored MDPs with Application to Constrained RL
 create mode 100644 data/2021/iclr/Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation
 create mode 100644 data/2021/iclr/Efficient Wasserstein Natural Gradients for Reinforcement Learning
 create mode 100644 data/2021/iclr/EigenGame: PCA as a Nash Equilibrium
 create mode 100644 data/2021/iclr/Emergent Road Rules In Multi-Agent Driving Environments
 create mode 100644 data/2021/iclr/Emergent Symbols through Binding in External Memory
 create mode 100644 data/2021/iclr/Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition
 create mode 100644 data/2021/iclr/Empirical or Invariant Risk Minimization? A Sample Complexity Perspective
 create mode 100644 data/2021/iclr/End-to-End Egospheric Spatial Memory
 create mode 100644 data/2021/iclr/End-to-end Adversarial Text-to-Speech
 create mode 100644 data/2021/iclr/Enforcing robust control guarantees within neural network policies
 create mode 100644 data/2021/iclr/Enjoy Your Editing: Controllable GANs for Image Editing via Latent Space Navigation
 create mode 100644 data/2021/iclr/Entropic gradient descent algorithms and wide flat minima
 create mode 100644 data/2021/iclr/Estimating Lipschitz constants of monotone deep equilibrium models
 create mode 100644 data/2021/iclr/Estimating and Evaluating Regression Predictive Uncertainty in Deep Object Detectors
 create mode 100644 data/2021/iclr/Estimating informativeness of samples with Smooth Unique Information
 create mode 100644 data/2021/iclr/Evaluating the Disentanglement of Deep Generative Models through Manifold Topology
 create mode 100644 data/2021/iclr/Evaluation of Neural Architectures trained with square Loss vs Cross-Entropy in Classification Tasks
 create mode 100644 data/2021/iclr/Evaluation of Similarity-based Explanations
 create mode 100644 data/2021/iclr/Evaluations and Methods for Explanation through Robustness Analysis
 create mode 100644 data/2021/iclr/Evolving Reinforcement Learning Algorithms
 create mode 100644 data/2021/iclr/Exemplary Natural Images Explain CNN Activations Better than State-of-the-Art Feature Visualization
 create mode 100644 data/2021/iclr/Explainable Deep One-Class Classification
 create mode 100644 data/2021/iclr/Explainable Subgraph Reasoning for Forecasting on Temporal Knowledge Graphs
 create mode 100644 data/2021/iclr/Explaining by Imitating: Understanding Decisions by Interpretable Policy Learning
 create mode 100644 data/2021/iclr/Explaining the Efficacy of Counterfactually Augmented Data
 create mode 100644 data/2021/iclr/Exploring Balanced Feature Spaces for Representation Learning
 create mode 100644 data/2021/iclr/Exploring the Uncertainty Properties of Neural Networks' Implicit Priors in the Infinite-Width Limit
 create mode 100644 data/2021/iclr/Expressive Power of Invariant and Equivariant Graph Neural Networks
 create mode 100644 data/2021/iclr/Extracting Strong Policies for Robotics Tasks from Zero-Order Trajectory Optimizers
 create mode 100644 data/2021/iclr/Extreme Memorization via Scale of Initialization
 create mode 100644 data/2021/iclr/FOCAL: Efficient Fully-Offline Meta-Reinforcement Learning via Distance Metric Learning and Behavior Regularization
 create mode 100644 data/2021/iclr/Factorizing Declarative and Procedural Knowledge in Structured, Dynamical Environments
 create mode 100644 data/2021/iclr/Fair Mixup: Fairness via Interpolation
 create mode 100644 data/2021/iclr/FairBatch: Batch Selection for Model Fairness
 create mode 100644 data/2021/iclr/FairFil: Contrastive Neural Debiasing Method for Pretrained Text Encoders
 create mode 100644 data/2021/iclr/Fantastic Four: Differentiable and Efficient Bounds on Singular Values of Convolution Layers
 create mode 100644 data/2021/iclr/Fast And Slow Learning Of Recurrent Independent Mechanisms
 create mode 100644 data/2021/iclr/Fast Geometric Projections for Local Robustness Certification
 create mode 100644 data/2021/iclr/Fast and Complete: Enabling Complete Neural Network Verification with Rapid and Massively Parallel Incomplete Verifiers
 create mode 100644 data/2021/iclr/Fast convergence of stochastic subgradient method under interpolation
 create mode 100644 data/2021/iclr/FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
 create mode 100644 data/2021/iclr/Faster Binary Embeddings for Preserving Euclidean Distances
 create mode 100644 data/2021/iclr/FedBE: Making Bayesian Model Ensemble Applicable to Federated Learning
 create mode 100644 data/2021/iclr/FedBN: Federated Learning on Non-IID Features via Local Batch Normalization
 create mode 100644 data/2021/iclr/FedMix: Approximation of Mixup under Mean Augmented Federated Learning
 create mode 100644 data/2021/iclr/Federated Learning Based on Dynamic Regularization
 create mode 100644 data/2021/iclr/Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms
 create mode 100644 data/2021/iclr/Federated Semi-Supervised Learning with Inter-Client Consistency & Disjoint Learning
 create mode 100644 data/2021/iclr/Few-Shot Bayesian Optimization with Deep Kernel Surrogates
 create mode 100644 data/2021/iclr/Few-Shot Learning via Learning the Representation, Provably
 create mode 100644 data/2021/iclr/Fidelity-based Deep Adiabatic Scheduling
 create mode 100644 data/2021/iclr/Filtered Inner Product Projection for Crosslingual Embedding Alignment
 create mode 100644 data/2021/iclr/Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis
 create mode 100644 data/2021/iclr/Fooling a Complete Neural Network Verifier
 create mode 100644 data/2021/iclr/For self-supervised learning, Rationality implies generalization, provably
 create mode 100644 data/2021/iclr/Fourier Neural Operator for Parametric Partial Differential Equations
 create mode 100644 data/2021/iclr/Free Lunch for Few-shot Learning: Distribution Calibration
 create mode 100644 data/2021/iclr/Fully Unsupervised Diversity Denoising with Convolutional Variational Autoencoders
 create mode 100644 data/2021/iclr/Fuzzy Tiling Activations: A Simple Approach to Learning Sparse Representations Online
 create mode 100644 "data/2021/iclr/GAN \"Steerability\" without optimization"
 create mode 100644 data/2021/iclr/GAN2GAN: Generative Noise Learning for Blind Denoising with Single Noisy Images
 create mode 100644 data/2021/iclr/GANs Can Play Lottery Tickets Too
 create mode 100644 data/2021/iclr/GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
 create mode 100644 data/2021/iclr/Gauge Equivariant Mesh CNNs: Anisotropic convolutions on geometric graphs
 create mode 100644 data/2021/iclr/Generalization bounds via distillation
 create mode 100644 data/2021/iclr/Generalization in data-driven models of primary visual cortex
 create mode 100644 data/2021/iclr/Generalized Energy Based Models
 create mode 100644 data/2021/iclr/Generalized Multimodal ELBO
 create mode 100644 data/2021/iclr/Generalized Variational Continual Learning
 create mode 100644 data/2021/iclr/Generating Adversarial Computer Programs using Optimized Obfuscations
 create mode 100644 data/2021/iclr/Generating Furry Cars: Disentangling Object Shape and Appearance across Multiple Domains
 create mode 100644 data/2021/iclr/Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule
 create mode 100644 data/2021/iclr/Generative Scene Graph Networks
 create mode 100644 data/2021/iclr/Generative Time-series Modeling with Fourier Flows
 create mode 100644 data/2021/iclr/Genetic Soft Updates for Policy Evolution in Deep Reinforcement Learning
 create mode 100644 data/2021/iclr/Geometry-Aware Gradient Algorithms for Neural Architecture Search
 create mode 100644 
data/2021/iclr/Geometry-aware Instance-reweighted Adversarial Training create mode 100644 data/2021/iclr/Getting a CLUE: A Method for Explaining Uncertainty Estimates create mode 100644 data/2021/iclr/Global Convergence of Three-layer Neural Networks in the Mean Field Regime create mode 100644 data/2021/iclr/Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime create mode 100644 data/2021/iclr/Go with the flow: Adaptive control for Neural ODEs create mode 100644 data/2021/iclr/GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing create mode 100644 data/2021/iclr/Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability create mode 100644 data/2021/iclr/Gradient Projection Memory for Continual Learning create mode 100644 data/2021/iclr/Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models create mode 100644 data/2021/iclr/Graph Coarsening with Neural Networks create mode 100644 data/2021/iclr/Graph Convolution with Low-rank Learnable Local Filters create mode 100644 data/2021/iclr/Graph Edit Networks create mode 100644 data/2021/iclr/Graph Information Bottleneck for Subgraph Recognition create mode 100644 data/2021/iclr/Graph Traversal with Tensor Functionals: A Meta-Algorithm for Scalable Learning create mode 100644 data/2021/iclr/Graph-Based Continual Learning create mode 100644 data/2021/iclr/GraphCodeBERT: Pre-training Code Representations with Data Flow create mode 100644 data/2021/iclr/Greedy-GQ with Variance Reduction: Finite-time Analysis and Improved Complexity create mode 100644 data/2021/iclr/Grounded Language Learning Fast and Slow create mode 100644 data/2021/iclr/Grounding Language to Autonomously-Acquired Skills via Goal Generation create mode 100644 data/2021/iclr/Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning create mode 100644 data/2021/iclr/Group Equivariant Conditional 
Neural Processes create mode 100644 data/2021/iclr/Group Equivariant Generative Adversarial Networks create mode 100644 data/2021/iclr/Group Equivariant Stand-Alone Self-Attention For Vision create mode 100644 data/2021/iclr/Growing Efficient Deep Networks by Structured Continuous Sparsification create mode 100644 data/2021/iclr/HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark create mode 100644 data/2021/iclr/HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents create mode 100644 data/2021/iclr/Heating up decision boundaries: isocapacitory saturation, adversarial scenarios and generalization bounds create mode 100644 data/2021/iclr/HeteroFL: Computation and Communication Efficient Federated Learning for Heterogeneous Clients create mode 100644 data/2021/iclr/Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization create mode 100644 data/2021/iclr/Hierarchical Autoregressive Modeling for Neural Video Compression create mode 100644 data/2021/iclr/Hierarchical Reinforcement Learning by Discovering Intrinsic Options create mode 100644 data/2021/iclr/High-Capacity Expert Binary Networks create mode 100644 data/2021/iclr/Hopfield Networks is All You Need create mode 100644 data/2021/iclr/Hopper: Multi-hop Transformer for Spatiotemporal Reasoning create mode 100644 data/2021/iclr/How Benign is Benign Overfitting ? create mode 100644 data/2021/iclr/How Does Mixup Help With Robustness and Generalization? create mode 100644 data/2021/iclr/How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks? 
create mode 100644 data/2021/iclr/How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks create mode 100644 data/2021/iclr/How to Find Your Friendly Neighborhood: Graph Attention Design with Self-Supervision create mode 100644 data/2021/iclr/Human-Level Performance in No-Press Diplomacy via Equilibrium Search create mode 100644 data/2021/iclr/HyperDynamics: Meta-Learning Object and Agent Dynamics with Hypernetworks create mode 100644 data/2021/iclr/HyperGrid Transformers: Towards A Single Model for Multiple Tasks create mode 100644 data/2021/iclr/Hyperbolic Neural Networks++ create mode 100644 data/2021/iclr/IDF++: Analyzing and Improving Integer Discrete Flows for Lossless Compression create mode 100644 data/2021/iclr/IEPT: Instance-Level and Episode-Level Pretext Tasks for Few-Shot Learning create mode 100644 data/2021/iclr/INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving create mode 100644 data/2021/iclr/IOT: Instance-wise Layer Reordering for Transformer Structures create mode 100644 data/2021/iclr/Identifying Physical Law of Hamiltonian Systems via Meta-Learning create mode 100644 data/2021/iclr/Identifying nonlinear dynamical systems with multiple time scales and long-range dependencies create mode 100644 data/2021/iclr/Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels create mode 100644 data/2021/iclr/Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering create mode 100644 data/2021/iclr/Impact of Representation Learning in Linear Bandits create mode 100644 data/2021/iclr/Implicit Convex Regularizers of CNN Architectures: Convex Optimization of Two- and Three-Layer Networks in Polynomial Time create mode 100644 data/2021/iclr/Implicit Gradient Regularization create mode 100644 data/2021/iclr/Implicit Normalizing Flows create mode 100644 data/2021/iclr/Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement 
Learning create mode 100644 data/2021/iclr/Improve Object Detection with Feature-based Knowledge Distillation: Towards Accurate and Efficient Detectors create mode 100644 data/2021/iclr/Improved Autoregressive Modeling with Distribution Smoothing create mode 100644 "data/2021/iclr/Improved Estimation of Concentration Under \342\204\223p-Norm Distance Metrics Using Half Spaces" create mode 100644 data/2021/iclr/Improving Adversarial Robustness via Channel-wise Activation Suppressing create mode 100644 data/2021/iclr/Improving Relational Regularized Autoencoders with Spherical Sliced Fused Gromov Wasserstein create mode 100644 data/2021/iclr/Improving Transformation Invariance in Contrastive Representation Learning create mode 100644 data/2021/iclr/Improving VAEs' Robustness to Adversarial Attack create mode 100644 data/2021/iclr/Improving Zero-Shot Voice Style Transfer via Disentangled Representation Learning create mode 100644 data/2021/iclr/In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning create mode 100644 data/2021/iclr/In Search of Lost Domain Generalization create mode 100644 data/2021/iclr/In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness create mode 100644 data/2021/iclr/Incorporating Symmetry into Deep Dynamics Models for Improved Generalization create mode 100644 data/2021/iclr/Incremental few-shot learning via vector quantization in deep embedded space create mode 100644 data/2021/iclr/Individually Fair Gradient Boosting create mode 100644 data/2021/iclr/Individually Fair Rankings create mode 100644 data/2021/iclr/Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks create mode 100644 data/2021/iclr/Influence Estimation for Generative Adversarial Networks create mode 100644 data/2021/iclr/Influence Functions in Deep Learning Are Fragile create mode 100644 data/2021/iclr/InfoBERT: Improving Robustness of 
Language Models from An Information Theoretic Perspective create mode 100644 data/2021/iclr/Information Laundering for Model Privacy create mode 100644 data/2021/iclr/Initialization and Regularization of Factorized Neural Layers create mode 100644 data/2021/iclr/Integrating Categorical Semantics into Unsupervised Domain Translation create mode 100644 data/2021/iclr/Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling create mode 100644 data/2021/iclr/Interpretable Models for Granger Causality Using Self-explaining Neural Networks create mode 100644 data/2021/iclr/Interpretable Neural Architecture Search via Bayesian Optimisation with Weisfeiler-Lehman Kernels create mode 100644 data/2021/iclr/Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking create mode 100644 data/2021/iclr/Interpreting Knowledge Graph Relation Representation from Word Embeddings create mode 100644 data/2021/iclr/Interpreting and Boosting Dropout from a Game-Theoretic View create mode 100644 data/2021/iclr/Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds create mode 100644 data/2021/iclr/Intraclass clustering: an implicit learning ability that regularizes DNNs create mode 100644 data/2021/iclr/Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures create mode 100644 data/2021/iclr/Is Attention Better Than Matrix Decomposition? 
create mode 100644 data/2021/iclr/Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study create mode 100644 data/2021/iclr/IsarStep: a Benchmark for High-level Mathematical Reasoning create mode 100644 data/2021/iclr/Isometric Propagation Network for Generalized Zero-shot Learning create mode 100644 data/2021/iclr/Isometric Transformation Invariant and Equivariant Graph Convolutional Networks create mode 100644 data/2021/iclr/Isotropy in the Contextual Embedding Space: Clusters and Manifolds create mode 100644 data/2021/iclr/Iterated learning for emergent systematicity in VQA create mode 100644 data/2021/iclr/Iterative Empirical Game Solving via Single Policy Best Response create mode 100644 data/2021/iclr/Kanerva++: Extending the Kanerva Machine With Differentiable, Locally Block Allocated Latent Memory create mode 100644 data/2021/iclr/Knowledge Distillation as Semiparametric Inference create mode 100644 data/2021/iclr/Knowledge distillation via softmax regression representation learning create mode 100644 data/2021/iclr/LEAF: A Learnable Frontend for Audio Classification create mode 100644 data/2021/iclr/LambdaNetworks: Modeling long-range Interactions without Attention create mode 100644 data/2021/iclr/Language-Agnostic Representation Learning of Source Code from Structure and Context create mode 100644 data/2021/iclr/Large Associative Memory Problem in Neurobiology and Machine Learning create mode 100644 data/2021/iclr/Large Batch Simulation for Deep Reinforcement Learning create mode 100644 data/2021/iclr/Large Scale Image Completion via Co-Modulated Generative Adversarial Networks create mode 100644 data/2021/iclr/Large-width functional asymptotics for deep Gaussian neural networks create mode 100644 data/2021/iclr/Latent Convergent Cross Mapping create mode 100644 data/2021/iclr/Latent Skill Planning for Exploration and Transfer create mode 100644 data/2021/iclr/Layer-adaptive Sparsity for the Magnitude-based Pruning create 
mode 100644 data/2021/iclr/Learnable Embedding sizes for Recommender Systems create mode 100644 "data/2021/iclr/Learning \"What-if\" Explanations for Sequential Decision-Making" create mode 100644 data/2021/iclr/Learning A Minimax Optimizer: A Pilot Study create mode 100644 data/2021/iclr/Learning Accurate Entropy Model with Global Reference for Image Compression create mode 100644 data/2021/iclr/Learning Associative Inference Using Fast Weight Memory create mode 100644 data/2021/iclr/Learning Better Structured Representations Using Low-rank Adaptive Label Smoothing create mode 100644 data/2021/iclr/Learning Cross-Domain Correspondence for Control with Dynamics Cycle-Consistency create mode 100644 data/2021/iclr/Learning Deep Features in Instrumental Variable Regression create mode 100644 data/2021/iclr/Learning Energy-Based Generative Models via Coarse-to-Fine Expanding and Sampling create mode 100644 data/2021/iclr/Learning Energy-Based Models by Diffusion Recovery Likelihood create mode 100644 data/2021/iclr/Learning Generalizable Visual Representations via Interactive Gameplay create mode 100644 data/2021/iclr/Learning Hyperbolic Representations of Topological Features create mode 100644 data/2021/iclr/Learning Incompressible Fluid Dynamics from Scratch - Towards Fast, Differentiable Fluid Models that Generalize create mode 100644 data/2021/iclr/Learning Invariant Representations for Reinforcement Learning without Reconstruction create mode 100644 data/2021/iclr/Learning Long-term Visual Dynamics with Region Proposal Interaction Networks create mode 100644 data/2021/iclr/Learning Manifold Patch-Based Representations of Man-Made Shapes create mode 100644 data/2021/iclr/Learning Mesh-Based Simulation with Graph Networks create mode 100644 data/2021/iclr/Learning N: M Fine-grained Structured Sparse Neural Networks From Scratch create mode 100644 data/2021/iclr/Learning Neural Event Functions for Ordinary Differential Equations create mode 100644 
data/2021/iclr/Learning Neural Generative Dynamics for Molecular Conformation Generation create mode 100644 data/2021/iclr/Learning Parametrised Graph Shift Operators create mode 100644 data/2021/iclr/Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues create mode 100644 data/2021/iclr/Learning Robust State Abstractions for Hidden-Parameter Block MDPs create mode 100644 data/2021/iclr/Learning Safe Multi-agent Control with Decentralized Neural Barrier Certificates create mode 100644 data/2021/iclr/Learning Structural Edits via Incremental Tree Transformations create mode 100644 data/2021/iclr/Learning Subgoal Representations with Slow Dynamics create mode 100644 data/2021/iclr/Learning Task Decomposition with Ordered Memory Policy Network create mode 100644 data/2021/iclr/Learning Task-General Representations with Generative Neuro-Symbolic Modeling create mode 100644 data/2021/iclr/Learning Value Functions in Deep Policy Gradients using Residual Variance create mode 100644 data/2021/iclr/Learning What To Do by Simulating the Past create mode 100644 data/2021/iclr/Learning a Latent Search Space for Routing Problems using Variational Autoencoders create mode 100644 data/2021/iclr/Learning a Latent Simplex in Input Sparsity Time create mode 100644 data/2021/iclr/Learning advanced mathematical computations from examples create mode 100644 data/2021/iclr/Learning and Evaluating Representations for Deep One-Class Classification create mode 100644 data/2021/iclr/Learning continuous-time PDEs from sparse data with graph neural networks create mode 100644 data/2021/iclr/Learning explanations that are hard to vary create mode 100644 data/2021/iclr/Learning from Demonstration with Weakly Supervised Disentanglement create mode 100644 data/2021/iclr/Learning from Protein Structure with Geometric Vector Perceptrons create mode 100644 data/2021/iclr/Learning from others' mistakes: Avoiding dataset biases without modeling them create mode 100644 
data/2021/iclr/Learning perturbation sets for robust machine learning create mode 100644 data/2021/iclr/Learning the Pareto Front with Hypernetworks create mode 100644 data/2021/iclr/Learning to Deceive Knowledge Graph Augmented Models via Targeted Perturbation create mode 100644 data/2021/iclr/Learning to Generate 3D Shapes with Generative Cellular Automata create mode 100644 data/2021/iclr/Learning to Make Decisions via Submodular Regularization create mode 100644 data/2021/iclr/Learning to Reach Goals via Iterated Supervised Learning create mode 100644 data/2021/iclr/Learning to Recombine and Resample Data For Compositional Generalization create mode 100644 data/2021/iclr/Learning to Represent Action Values as a Hypergraph on the Action Vertices create mode 100644 data/2021/iclr/Learning to Sample with Local and Global Contexts in Experience Replay Buffer create mode 100644 data/2021/iclr/Learning to Set Waypoints for Audio-Visual Navigation create mode 100644 data/2021/iclr/Learning to live with Dale's principle: ANNs with separate excitatory and inhibitory units create mode 100644 data/2021/iclr/Learning with AMIGo: Adversarially Motivated Intrinsic Goals create mode 100644 data/2021/iclr/Learning with Feature-Dependent Label Noise: A Progressive Approach create mode 100644 data/2021/iclr/Learning with Instance-Dependent Label Noise: A Sample Sieve Approach create mode 100644 data/2021/iclr/Learning-based Support Estimation in Sublinear Time create mode 100644 data/2021/iclr/Lifelong Learning of Compositional Structures create mode 100644 data/2021/iclr/LiftPool: Bidirectional ConvNet Pooling create mode 100644 data/2021/iclr/Linear Convergent Decentralized Optimization with Compression create mode 100644 data/2021/iclr/Linear Last-iterate Convergence in Constrained Saddle-point Optimization create mode 100644 data/2021/iclr/Linear Mode Connectivity in Multitask and Continual Learning create mode 100644 data/2021/iclr/Local Convergence Analysis of Gradient 
Descent Ascent with Finite Timescale Separation create mode 100644 data/2021/iclr/Local Search Algorithms for Rank-Constrained Convex Optimization create mode 100644 data/2021/iclr/Locally Free Weight Sharing for Network Width Search create mode 100644 data/2021/iclr/Long Live the Lottery: The Existence of Winning Tickets in Lifelong Learning create mode 100644 data/2021/iclr/Long Range Arena : A Benchmark for Efficient Transformers create mode 100644 data/2021/iclr/Long-tail learning via logit adjustment create mode 100644 data/2021/iclr/Long-tailed Recognition by Routing Diverse Distribution-Aware Experts create mode 100644 data/2021/iclr/Loss Function Discovery for Object Detection via Convergence-Simulation Driven Search create mode 100644 data/2021/iclr/Lossless Compression of Structured Convolutional Models via Lifting create mode 100644 data/2021/iclr/LowKey: Leveraging Adversarial Attacks to Protect Social Media Users from Facial Recognition create mode 100644 data/2021/iclr/MALI: A memory efficient and reverse accurate integrator for Neural ODEs create mode 100644 data/2021/iclr/MARS: Markov Molecular Sampling for Multi-objective Drug Discovery create mode 100644 data/2021/iclr/MELR: Meta-Learning via Modeling Episode-Level Relationships for Few-Shot Learning create mode 100644 data/2021/iclr/MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space create mode 100644 data/2021/iclr/MONGOOSE: A Learnable LSH Framework for Efficient Neural Network Training create mode 100644 data/2021/iclr/Mapping the Timescale Organization of Neural Language Models create mode 100644 data/2021/iclr/Mathematical Reasoning via Self-supervised Skip-tree Training create mode 100644 data/2021/iclr/Measuring Massive Multitask Language Understanding create mode 100644 data/2021/iclr/Memory Optimization for Deep Networks create mode 100644 data/2021/iclr/Meta Back-Translation create mode 100644 data/2021/iclr/Meta-GMVAE: Mixture of Gaussian VAE for Unsupervised 
Meta-Learning create mode 100644 data/2021/iclr/Meta-Learning of Structured Task Distributions in Humans and Machines create mode 100644 data/2021/iclr/Meta-Learning with Neural Tangent Kernels create mode 100644 data/2021/iclr/Meta-learning Symmetries by Reparameterization create mode 100644 data/2021/iclr/Meta-learning with negative learning rates create mode 100644 data/2021/iclr/MetaNorm: Learning to Normalize Few-Shot Batches Across Domains create mode 100644 data/2021/iclr/MiCE: Mixture of Contrastive Experts for Unsupervised Image Clustering create mode 100644 data/2021/iclr/Mind the Gap when Conditioning Amortised Inference in Sequential Latent-Variable Models create mode 100644 data/2021/iclr/Mind the Pad - CNNs Can Develop Blind Spots create mode 100644 data/2021/iclr/Minimum Width for Universal Approximation create mode 100644 data/2021/iclr/Mirostat: a Neural Text decoding Algorithm that directly controls perplexity create mode 100644 data/2021/iclr/MixKD: Towards Efficient Distillation of Large-scale Language Models create mode 100644 data/2021/iclr/Mixed-Features Vectors and Subspace Splitting create mode 100644 data/2021/iclr/MoPro: Webly Supervised Learning with Momentum Prototypes create mode 100644 data/2021/iclr/MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond create mode 100644 data/2021/iclr/Model Patching: Closing the Subgroup Performance Gap with Data Augmentation create mode 100644 data/2021/iclr/Model-Based Offline Planning create mode 100644 data/2021/iclr/Model-Based Visual Planning with Self-Supervised Functional Distances create mode 100644 data/2021/iclr/Model-based micro-data reinforcement learning: what are the crucial model properties and which model to choose? 
create mode 100644 data/2021/iclr/Modeling the Second Player in Distributionally Robust Optimization create mode 100644 data/2021/iclr/Modelling Hierarchical Structure between Dialogue Policy and Natural Language Generator with Option Framework for Task-oriented Dialogue System create mode 100644 data/2021/iclr/Molecule Optimization by Explainable Evolution create mode 100644 data/2021/iclr/Monotonic Kronecker-Factored Lattice create mode 100644 data/2021/iclr/Monte-Carlo Planning and Learning with Language Action Value Estimates create mode 100644 data/2021/iclr/More or Less: When and How to Build Convolutional Neural Network Ensembles create mode 100644 data/2021/iclr/Multi-Class Uncertainty Calibration via Mutual Information Maximization-based Binning create mode 100644 data/2021/iclr/Multi-Level Local SGD: Distributed SGD for Heterogeneous Hierarchical Networks create mode 100644 data/2021/iclr/Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning A Randomly Weighted Network create mode 100644 data/2021/iclr/Multi-Time Attention Networks for Irregularly Sampled Time Series create mode 100644 data/2021/iclr/Multi-resolution modeling of a discrete stochastic process identifies causes of cancer create mode 100644 data/2021/iclr/Multi-timescale Representation Learning in LSTM Language Models create mode 100644 data/2021/iclr/MultiModalQA: complex question answering over text, tables and images create mode 100644 data/2021/iclr/Multiplicative Filter Networks create mode 100644 data/2021/iclr/Multiscale Score Matching for Out-of-Distribution Detection create mode 100644 data/2021/iclr/Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows create mode 100644 data/2021/iclr/Mutual Information State Intrinsic Control create mode 100644 data/2021/iclr/My Body is a Cage: the Role of Morphology in Graph-Based Incompatible Control create mode 100644 data/2021/iclr/NAS-Bench-ASR: Reproducible Neural 
Architecture Search for Speech Recognition create mode 100644 data/2021/iclr/NBDT: Neural-Backed Decision Tree create mode 100644 data/2021/iclr/NOVAS: Non-convex Optimization via Adaptive Stochastic Search for End-to-end Learning and Control create mode 100644 data/2021/iclr/NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation create mode 100644 data/2021/iclr/Nearest Neighbor Machine Translation create mode 100644 data/2021/iclr/Negative Data Augmentation create mode 100644 data/2021/iclr/Net-DNF: Effective Deep Modeling of Tabular Data create mode 100644 data/2021/iclr/Network Pruning That Matters: A Case Study on Retraining Variants create mode 100644 data/2021/iclr/Neural Approximate Sufficient Statistics for Implicit Models create mode 100644 data/2021/iclr/Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective create mode 100644 data/2021/iclr/Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks create mode 100644 data/2021/iclr/Neural Delay Differential Equations create mode 100644 data/2021/iclr/Neural Jump Ordinary Differential Equations: Consistent Continuous-Time Prediction and Filtering create mode 100644 data/2021/iclr/Neural Learning of One-of-Many Solutions for Combinatorial Problems in Structured Output Spaces create mode 100644 data/2021/iclr/Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics create mode 100644 data/2021/iclr/Neural Networks for Learning Counterfactual G-Invariances from Single Environments create mode 100644 data/2021/iclr/Neural ODE Processes create mode 100644 data/2021/iclr/Neural Pruning via Growing Regularization create mode 100644 data/2021/iclr/Neural Spatio-Temporal Point Processes create mode 100644 data/2021/iclr/Neural Synthesis of Binaural Speech From Mono Audio create mode 100644 data/2021/iclr/Neural Thompson Sampling create mode 100644 data/2021/iclr/Neural Topic Model via Optimal Transport 
 create mode 100644 data/2021/iclr/Neural gradients are near-lognormal: improved quantized and sparse training
 create mode 100644 data/2021/iclr/Neural networks with late-phase weights
 create mode 100644 data/2021/iclr/Neural representation and generation for RNA secondary structures
 create mode 100644 data/2021/iclr/Neurally Augmented ALISTA
 create mode 100644 data/2021/iclr/New Bounds For Distributed Mean Estimation and Variance Reduction
 create mode 100644 data/2021/iclr/No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks
 create mode 100644 data/2021/iclr/No MCMC for me: Amortized sampling for fast and stable training of energy-based models
 create mode 100644 data/2021/iclr/Noise against noise: stochastic label noise helps combat inherent label noise
 create mode 100644 data/2021/iclr/Noise or Signal: The Role of Image Backgrounds in Object Recognition
 create mode 100644 data/2021/iclr/Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds
 create mode 100644 data/2021/iclr/Nonseparable Symplectic Neural Networks
 create mode 100644 data/2021/iclr/OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning
 create mode 100644 data/2021/iclr/Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers
 create mode 100644 data/2021/iclr/Offline Model-Based Optimization via Normalized Maximum Likelihood Estimation
 create mode 100644 data/2021/iclr/On Data-Augmentation and Consistency-Based Semi-Supervised Learning
 create mode 100644 data/2021/iclr/On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections
 create mode 100644 data/2021/iclr/On Fast Adversarial Robustness Adaptation in Model-Agnostic Meta-Learning
 create mode 100644 data/2021/iclr/On Graph Neural Networks versus Graph-Augmented MLPs
 create mode 100644 data/2021/iclr/On InstaHide, Phase Retrieval, and Sparse Matrix Factorization
 create mode 100644 data/2021/iclr/On Learning Universal Representations Across Languages
 create mode 100644 data/2021/iclr/On Position Embeddings in BERT
 create mode 100644 data/2021/iclr/On Self-Supervised Image Representations for GAN Evaluation
 create mode 100644 data/2021/iclr/On Statistical Bias In Active Learning: How and When to Fix It
 create mode 100644 data/2021/iclr/On the Bottleneck of Graph Neural Networks and its Practical Implications
 create mode 100644 data/2021/iclr/On the Critical Role of Conventions in Adaptive Human-AI Collaboration
 create mode 100644 data/2021/iclr/On the Curse of Memory in Recurrent Neural Networks: Approximation and Optimization Analysis
 create mode 100644 data/2021/iclr/On the Dynamics of Training Attention Models
 create mode 100644 data/2021/iclr/On the Impossibility of Global Convergence in Multi-Loss Optimization
 create mode 100644 data/2021/iclr/On the Origin of Implicit Regularization in Stochastic Gradient Descent
 create mode 100644 data/2021/iclr/On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
 create mode 100644 data/2021/iclr/On the Theory of Implicit Deep Learning: Global Convergence with Implicit Layers
 create mode 100644 data/2021/iclr/On the Transfer of Disentangled Representations in Realistic Settings
 create mode 100644 data/2021/iclr/On the Universality of Rotation Equivariant Point Cloud Networks
 create mode 100644 data/2021/iclr/On the Universality of the Double Descent Peak in Ridgeless Regression
 create mode 100644 data/2021/iclr/On the geometry of generalization and memorization in deep neural networks
 create mode 100644 data/2021/iclr/On the mapping between Hopfield networks and Restricted Boltzmann Machines
 create mode 100644 data/2021/iclr/On the role of planning in model-based deep reinforcement learning
 create mode 100644 data/2021/iclr/One Network Fits All? Modular versus Monolithic Task Formulations in Neural Networks
 create mode 100644 data/2021/iclr/Online Adversarial Purification based on Self-supervised Learning
 create mode 100644 data/2021/iclr/Open Question Answering over Tables and Text
 create mode 100644 data/2021/iclr/Optimal Conversion of Conventional Artificial Neural Networks to Spiking Neural Networks
 create mode 100644 data/2021/iclr/Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime
 create mode 100644 data/2021/iclr/Optimal Regularization can Mitigate Double Descent
 create mode 100644 data/2021/iclr/Optimism in Reinforcement Learning with Generalized Linear Function Approximation
 create mode 100644 data/2021/iclr/Optimizing Memory Placement using Evolutionary Graph Reinforcement Learning
 create mode 100644 data/2021/iclr/Orthogonalizing Convolutional Layers with the Cayley Transform
 create mode 100644 data/2021/iclr/Overfitting for Fun and Profit: Instance-Adaptive Data Compression
 create mode 100644 data/2021/iclr/Overparameterisation and worst-case generalisation: friend or foe?
 create mode 100644 data/2021/iclr/PAC Confidence Predictions for Deep Neural Network Classifiers
 create mode 100644 data/2021/iclr/PC2WF: 3D Wireframe Reconstruction from Raw Point Clouds
 create mode 100644 data/2021/iclr/PDE-Driven Spatiotemporal Disentanglement
 create mode 100644 data/2021/iclr/PMI-Masking: Principled masking of correlated spans
 create mode 100644 data/2021/iclr/PSTNet: Point Spatio-Temporal Convolution on Point Cloud Sequences
 create mode 100644 data/2021/iclr/Parameter Efficient Multimodal Transformers for Video Representation Learning
 create mode 100644 data/2021/iclr/Parameter-Based Value Functions
 create mode 100644 data/2021/iclr/Parrot: Data-Driven Behavioral Priors for Reinforcement Learning
 create mode 100644 data/2021/iclr/Partitioned Learned Bloom Filters
 create mode 100644 data/2021/iclr/Perceptual Adversarial Robustness: Defense Against Unseen Threat Models
 create mode 100644 data/2021/iclr/Personalized Federated Learning with First Order Model Optimization
 create mode 100644 data/2021/iclr/Physics-aware, probabilistic model order reduction with guaranteed stability
 create mode 100644 data/2021/iclr/Plan-Based Relaxed Reward Shaping for Goal-Directed Tasks
 create mode 100644 data/2021/iclr/Planning from Pixels using Inverse Dynamics Models
 create mode 100644 data/2021/iclr/PlasticineLab: A Soft-Body Manipulation Benchmark with Differentiable Physics
 create mode 100644 data/2021/iclr/PolarNet: Learning to Optimize Polar Keypoints for Keypoint Based Object Detection
 create mode 100644 data/2021/iclr/Policy-Driven Attack: Learning to Query for Hard-label Black-box Adversarial Examples
 create mode 100644 data/2021/iclr/Practical Massively Parallel Monte-Carlo Tree Search Applied to Molecular Design
 create mode 100644 data/2021/iclr/Practical Real Time Recurrent Learning with a Sparse Approximation
 create mode 100644 data/2021/iclr/Pre-training Text-to-Text Transformers for Concept-centric Common Sense
 create mode 100644 data/2021/iclr/Predicting Classification Accuracy When Adding New Unobserved Classes
 create mode 100644 data/2021/iclr/Predicting Inductive Biases of Pre-Trained Models
 create mode 100644 data/2021/iclr/Predicting Infectiousness for Proactive Contact Tracing
 create mode 100644 data/2021/iclr/Prediction and generalisation over directed actions by grid cells
 create mode 100644 data/2021/iclr/Primal Wasserstein Imitation Learning
 create mode 100644 data/2021/iclr/Private Image Reconstruction from System Side Channels Using Generative Models
 create mode 100644 data/2021/iclr/Private Post-GAN Boosting
 create mode 100644 data/2021/iclr/Probabilistic Numeric Convolutional Neural Networks
 create mode 100644 data/2021/iclr/Probing BERT in Hyperbolic Spaces
 create mode 100644 data/2021/iclr/Progressive Skeletonization: Trimming more fat from a network at initialization
 create mode 100644 data/2021/iclr/Projected Latent Markov Chain Monte Carlo: Conditional Sampling of Normalizing Flows
 create mode 100644 data/2021/iclr/Property Controllable Variational Autoencoder via Invertible Mutual Dependence
 create mode 100644 data/2021/iclr/Protecting DNNs from Theft using an Ensemble of Diverse Models
 create mode 100644 data/2021/iclr/Prototypical Contrastive Learning of Unsupervised Representations
 create mode 100644 data/2021/iclr/Prototypical Representation Learning for Relation Extraction
 create mode 100644 data/2021/iclr/Provable Rich Observation Reinforcement Learning with Combinatorial Latent States
 create mode 100644 data/2021/iclr/Provably robust classification of adversarial examples with detection
 create mode 100644 "data/2021/iclr/Proximal Gradient Descent-Ascent: Variable Convergence under K\305\201 Geometry"
 create mode 100644 data/2021/iclr/Pruning Neural Networks at Initialization: Why Are We Missing the Mark?
 create mode 100644 data/2021/iclr/PseudoSeg: Designing Pseudo Labels for Semantic Segmentation
 create mode 100644 data/2021/iclr/QPLEX: Duplex Dueling Multi-Agent Q-Learning
 create mode 100644 data/2021/iclr/Quantifying Differences in Reward Functions
 create mode 100644 data/2021/iclr/R-GAP: Recursive Gradient Attack on Privacy
 create mode 100644 data/2021/iclr/RMSprop converges with proper hyper-parameter
 create mode 100644 data/2021/iclr/RNNLogic: Learning Logic Rules for Reasoning on Knowledge Graphs
 create mode 100644 data/2021/iclr/RODE: Learning Roles to Decompose Multi-Agent Tasks
 create mode 100644 data/2021/iclr/Random Feature Attention
 create mode 100644 data/2021/iclr/Randomized Automatic Differentiation
 create mode 100644 data/2021/iclr/Randomized Ensembled Double Q-Learning: Learning Fast Without a Model
 create mode 100644 data/2021/iclr/Rank the Episodes: A Simple Approach for Exploration in Procedurally-Generated Environments
 create mode 100644 data/2021/iclr/Rao-Blackwellizing the Straight-Through Gumbel-Softmax Gradient Estimator
 create mode 100644 data/2021/iclr/Rapid Neural Architecture Search by Learning to Generate Graphs from Datasets
 create mode 100644 data/2021/iclr/Rapid Task-Solving in Novel Environments
 create mode 100644 data/2021/iclr/Recurrent Independent Mechanisms
 create mode 100644 data/2021/iclr/Reducing the Computational Cost of Deep Generative Models with Binary Neural Networks
 create mode 100644 data/2021/iclr/Refining Deep Generative Models via Discriminator Gradient Flow
 create mode 100644 data/2021/iclr/Regularization Matters in Policy Optimization - An Empirical Study on Continuous Control
 create mode 100644 data/2021/iclr/Regularized Inverse Reinforcement Learning
 create mode 100644 data/2021/iclr/Reinforcement Learning with Random Delays
 create mode 100644 data/2021/iclr/Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models
 create mode 100644 data/2021/iclr/Remembering for the Right Reasons: Explanations Reduce Catastrophic Forgetting
 create mode 100644 data/2021/iclr/Removing Undesirable Feature Contributions Using Out-of-Distribution Data
 create mode 100644 data/2021/iclr/Representation Balancing Offline Model-based Reinforcement Learning
 create mode 100644 data/2021/iclr/Representation Learning for Sequence Data with Deep Autoencoding Predictive Components
 create mode 100644 data/2021/iclr/Representation Learning via Invariant Causal Mechanisms
 create mode 100644 data/2021/iclr/Representation learning for improved interpretability and classification accuracy of clinical factors from EEG
 create mode 100644 data/2021/iclr/Representing Partial Programs with Blended Abstract Semantics
 create mode 100644 data/2021/iclr/Repurposing Pretrained Models for Robust Out-of-domain Few-Shot Learning
 create mode 100644 data/2021/iclr/ResNet After All: Neural ODEs and Their Numerical Solution
 create mode 100644 data/2021/iclr/Reset-Free Lifelong Learning with Skill-Space Planning
 create mode 100644 data/2021/iclr/Rethinking Architecture Selection in Differentiable NAS
 create mode 100644 data/2021/iclr/Rethinking Attention with Performers
 create mode 100644 data/2021/iclr/Rethinking Embedding Coupling in Pre-trained Language Models
 create mode 100644 data/2021/iclr/Rethinking Positional Encoding in Language Pre-training
 create mode 100644 data/2021/iclr/Rethinking Soft Labels for Knowledge Distillation: A Bias-Variance Tradeoff Perspective
 create mode 100644 data/2021/iclr/Rethinking the Role of Gradient-based Attribution Methods for Model Interpretability
 create mode 100644 data/2021/iclr/Retrieval-Augmented Generation for Code Summarization via Hybrid GNN
 create mode 100644 data/2021/iclr/Return-Based Contrastive Representation Learning for Reinforcement Learning
 create mode 100644 data/2021/iclr/Revisiting Dynamic Convolution via Matrix Decomposition
 create mode 100644 data/2021/iclr/Revisiting Few-sample BERT Fine-tuning
 create mode 100644 data/2021/iclr/Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction
 create mode 100644 data/2021/iclr/Revisiting Locally Supervised Learning: an Alternative to End-to-end Training
 create mode 100644 data/2021/iclr/Reweighting Augmented Samples by Minimizing the Maximal Expected Loss
 create mode 100644 data/2021/iclr/Ringing ReLUs: Harmonic Distortion Analysis of Nonlinear Feedforward Networks
 create mode 100644 data/2021/iclr/Risk-Averse Offline Reinforcement Learning
 create mode 100644 data/2021/iclr/Robust Learning of Fixed-Structure Bayesian Networks in Nearly-Linear Time
 create mode 100644 data/2021/iclr/Robust Overfitting may be mitigated by properly learned smoothening
 create mode 100644 data/2021/iclr/Robust Pruning at Initialization
 create mode 100644 data/2021/iclr/Robust Reinforcement Learning on State Observations with Learned Optimal Adversary
 create mode 100644 data/2021/iclr/Robust and Generalizable Visual Representation Learning via Random Convolutions
 create mode 100644 data/2021/iclr/Robust early-learning: Hindering the memorization of noisy labels
 create mode 100644 data/2021/iclr/SAFENet: A Secure, Accurate and Fast Neural Network Inference
 create mode 100644 data/2021/iclr/SALD: Sign Agnostic Learning with Derivatives
 create mode 100644 data/2021/iclr/SCoRe: Pre-Training for Context Representation in Conversational Semantic Parsing
 create mode 100644 data/2021/iclr/SEDONA: Search for Decoupled Neural Networks toward Greedy Block-wise Learning
 create mode 100644 data/2021/iclr/SEED: Self-supervised Distillation For Visual Representation
 create mode 100644 data/2021/iclr/SMiRL: Surprise Minimizing Reinforcement Learning in Unstable Environments
 create mode 100644 data/2021/iclr/SOLAR: Sparse Orthogonal Learned and Random Embeddings
 create mode 100644 data/2021/iclr/SSD: A Unified Framework for Self-Supervised Outlier Detection
 create mode 100644 data/2021/iclr/Saliency is a Possible Red Herring When Diagnosing Poor Generalization
 create mode 100644 data/2021/iclr/SaliencyMix: A Saliency Guided Data Augmentation Strategy for Better Regularization
 create mode 100644 data/2021/iclr/Sample-Efficient Automated Deep Reinforcement Learning
 create mode 100644 data/2021/iclr/Scalable Bayesian Inverse Reinforcement Learning
 create mode 100644 data/2021/iclr/Scalable Learning and MAP Inference for Nonsymmetric Determinantal Point Processes
 create mode 100644 data/2021/iclr/Scalable Transfer Learning with Expert Models
 create mode 100644 data/2021/iclr/Scaling Symbolic Methods using Gradients for Neural Model Explanation
 create mode 100644 data/2021/iclr/Scaling the Convex Barrier with Active Sets
 create mode 100644 data/2021/iclr/Score-Based Generative Modeling through Stochastic Differential Equations
 create mode 100644 data/2021/iclr/Selective Classification Can Magnify Disparities Across Groups
 create mode 100644 data/2021/iclr/Selectivity considered harmful: evaluating the causal impact of class selectivity in DNNs
 create mode 100644 data/2021/iclr/Self-Supervised Learning of Compressed Video Representations
 create mode 100644 data/2021/iclr/Self-Supervised Policy Adaptation during Deployment
 create mode 100644 data/2021/iclr/Self-supervised Adversarial Robustness for the Low-label, High-data Regime
 create mode 100644 data/2021/iclr/Self-supervised Learning from a Multi-view Perspective
 create mode 100644 data/2021/iclr/Self-supervised Representation Learning with Relative Predictive Coding
 create mode 100644 data/2021/iclr/Self-supervised Visual Reinforcement Learning with Object-centric Representations
 create mode 100644 data/2021/iclr/Self-training For Few-shot Transfer Across Extreme Task Differences
 create mode 100644 data/2021/iclr/Semantic Re-tuning with Contrastive Tension
 create mode 100644 data/2021/iclr/Semi-supervised Keypoint Localization
 create mode 100644 data/2021/iclr/SenSeI: Sensitive Set Invariance for Enforcing Individual Fairness
 create mode 100644 data/2021/iclr/Separation and Concentration in Deep Networks
 create mode 100644 data/2021/iclr/Seq2Tens: An Efficient Representation of Sequences by Low-Rank Tensor Projections
 create mode 100644 data/2021/iclr/Sequential Density Ratio Estimation for Simultaneous Optimization of Speed and Accuracy
 create mode 100644 data/2021/iclr/Set Prediction without Imposing Structure as Conditional Density Estimation
 create mode 100644 data/2021/iclr/Shape or Texture: Understanding Discriminative Features in CNNs
 create mode 100644 data/2021/iclr/Shape-Texture Debiased Neural Network Training
 create mode 100644 data/2021/iclr/Shapley Explanation Networks
 create mode 100644 data/2021/iclr/Shapley explainability on the data manifold
 create mode 100644 data/2021/iclr/Share or Not? Learning to Schedule Language-Specific Capacity for Multilingual Translation
 create mode 100644 data/2021/iclr/Sharper Generalization Bounds for Learning with Gradient-dominated Objective Functions
 create mode 100644 data/2021/iclr/Sharpness-aware Minimization for Efficiently Improving Generalization
 create mode 100644 data/2021/iclr/Signatory: differentiable computations of the signature and logsignature transforms, on both CPU and GPU
 create mode 100644 data/2021/iclr/Simple Augmentation Goes a Long Way: ADRL for DNN Quantization
 create mode 100644 data/2021/iclr/Simple Spectral Graph Convolution
 create mode 100644 data/2021/iclr/Single-Photon Image Classification
 create mode 100644 data/2021/iclr/Single-Timescale Actor-Critic Provably Finds Globally Optimal Policy
 create mode 100644 data/2021/iclr/SkipW: Resource Adaptable RNN with Strict Upper Computational Limit
 create mode 100644 data/2021/iclr/Sliced Kernelized Stein Discrepancy
 create mode 100644 data/2021/iclr/Solving Compositional Reinforcement Learning Problems via Task Reduction
 create mode 100644 data/2021/iclr/Sparse Quantized Spectral Clustering
 create mode 100644 data/2021/iclr/Sparse encoding for more-interpretable feature-selecting representations in probabilistic matrix factorization
 create mode 100644 data/2021/iclr/Spatial Dependency Networks: Neural Layers for Improved Generative Image Modeling
 create mode 100644 data/2021/iclr/Spatially Structured Recurrent Modules
 create mode 100644 data/2021/iclr/Spatio-Temporal Graph Scattering Transform
 create mode 100644 data/2021/iclr/Stabilized Medical Image Attacks
 create mode 100644 data/2021/iclr/Statistical inference for individual fairness
 create mode 100644 data/2021/iclr/Stochastic Security: Adversarial Defense Using Long-Run Dynamics of Energy-Based Models
 create mode 100644 data/2021/iclr/Structured Prediction as Translation between Augmented Natural Languages
 create mode 100644 data/2021/iclr/Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning
 create mode 100644 data/2021/iclr/Support-set bottlenecks for video-text representation learning
 create mode 100644 data/2021/iclr/Symmetry-Aware Actor-Critic for 3D Molecular Design
 create mode 100644 data/2021/iclr/Systematic generalisation with group invariant predictions
 create mode 100644 data/2021/iclr/Taking Notes on the Fly Helps Language Pre-Training
 create mode 100644 data/2021/iclr/Taming GANs with Lookahead-Minmax
 create mode 100644 data/2021/iclr/Targeted Attack against Deep Neural Networks via Flipping Limited Weight Bits
 create mode 100644 data/2021/iclr/Task-Agnostic Morphology Evolution
 create mode 100644 data/2021/iclr/Teaching Temporal Logics to Neural Networks
 create mode 100644 data/2021/iclr/Teaching with Commentaries
 create mode 100644 "data/2021/iclr/Temporally-Extended \316\265-Greedy Exploration"
 create mode 100644 data/2021/iclr/Tent: Fully Test-Time Adaptation by Entropy Minimization
 create mode 100644 data/2021/iclr/Text Generation by Learning from Demonstrations
 create mode 100644 data/2021/iclr/The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers
 create mode 100644 data/2021/iclr/The Importance of Pessimism in Fixed-Dataset Policy Optimization
 create mode 100644 data/2021/iclr/The Intrinsic Dimension of Images and Its Impact on Learning
 create mode 100644 data/2021/iclr/The Recurrent Neural Tangent Kernel
 create mode 100644 data/2021/iclr/The Risks of Invariant Risk Minimization
 create mode 100644 data/2021/iclr/The Role of Momentum Parameters in the Optimal Convergence of Adaptive Polyak's Heavy-ball Methods
 create mode 100644 data/2021/iclr/The Traveling Observer Model: Multi-task Learning Through Spatial Variable Embeddings
 create mode 100644 data/2021/iclr/The Unreasonable Effectiveness of Patches in Deep Convolutional Kernels Methods
 create mode 100644 data/2021/iclr/The geometry of integration in text classification RNNs
 create mode 100644 data/2021/iclr/The inductive bias of ReLU networks on orthogonally separable data
 create mode 100644 data/2021/iclr/The role of Disentanglement in Generalisation
 create mode 100644 data/2021/iclr/Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data
 create mode 100644 data/2021/iclr/Theoretical bounds on estimation error for meta-learning
 create mode 100644 data/2021/iclr/Tilted Empirical Risk Minimization
 create mode 100644 data/2021/iclr/Tomographic Auto-Encoder: Unsupervised Bayesian Recovery of Corrupted Data
 create mode 100644 data/2021/iclr/Topology-Aware Segmentation Using Discrete Morse Theory
 create mode 100644 data/2021/iclr/Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis
 create mode 100644 data/2021/iclr/Towards Impartial Multi-task Learning
 create mode 100644 data/2021/iclr/Towards Nonlinear Disentanglement in Natural Data with Temporal Sparse Coding
 create mode 100644 data/2021/iclr/Towards Resolving the Implicit Bias of Gradient Descent for Matrix Factorization: Greedy Low-Rank Learning
 create mode 100644 data/2021/iclr/Towards Robust Neural Networks via Close-loop Control
 create mode 100644 data/2021/iclr/Towards Robustness Against Natural Language Word Substitutions
 create mode 100644 data/2021/iclr/Tradeoffs in Data Augmentation: An Empirical Study
 create mode 100644 data/2021/iclr/Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs
 create mode 100644 data/2021/iclr/Training GANs with Stronger Augmentations via Contrastive Discriminator
 create mode 100644 data/2021/iclr/Training independent subnetworks for robust prediction
 create mode 100644 data/2021/iclr/Training with Quantization Noise for Extreme Model Compression
 create mode 100644 data/2021/iclr/Trajectory Prediction using Equivariant Continuous Convolution
 create mode 100644 data/2021/iclr/Transformer protein language models are unsupervised structure learners
 create mode 100644 data/2021/iclr/Transient Non-stationarity and Generalisation in Deep Reinforcement Learning
 create mode 100644 data/2021/iclr/TropEx: An Algorithm for Extracting Linear Terms in Deep Neural Networks
 create mode 100644 data/2021/iclr/Trusted Multi-View Classification
 create mode 100644 data/2021/iclr/UMEC: Unified model and embedding compression for efficient recommendation systems
 create mode 100644 data/2021/iclr/UPDeT: Universal Multi-agent RL via Policy Decoupling with Transformers
 create mode 100644 data/2021/iclr/Unbiased Teacher for Semi-Supervised Object Detection
 create mode 100644 data/2021/iclr/Uncertainty Estimation and Calibration with Finite-State Probabilistic RNNs
 create mode 100644 data/2021/iclr/Uncertainty Estimation in Autoregressive Structured Prediction
 create mode 100644 data/2021/iclr/Uncertainty Sets for Image Classifiers using Conformal Prediction
 create mode 100644 data/2021/iclr/Uncertainty in Gradient Boosting via Ensembles
 create mode 100644 data/2021/iclr/Uncertainty-aware Active Learning for Optimal Bayesian Classifier
 create mode 100644 data/2021/iclr/Understanding Over-parameterization in Generative Adversarial Networks
 create mode 100644 data/2021/iclr/Understanding and Improving Encoder Layer Fusion in Sequence-to-Sequence Learning
 create mode 100644 data/2021/iclr/Understanding and Improving Lexical Choice in Non-Autoregressive Translation
 create mode 100644 data/2021/iclr/Understanding the effects of data parallelism and sparsity on neural network training
 create mode 100644 data/2021/iclr/Understanding the failure modes of out-of-distribution generalization
 create mode 100644 data/2021/iclr/Understanding the role of importance weighting for deep learning
 create mode 100644 data/2021/iclr/Undistillable: Making A Nasty Teacher That CANNOT teach students
 create mode 100644 data/2021/iclr/Universal Weakly Supervised Segmentation by Pixel-to-Segment Contrastive Learning
 create mode 100644 data/2021/iclr/Universal approximation power of deep residual neural networks via nonlinear control theory
 create mode 100644 data/2021/iclr/Unlearnable Examples: Making Personal Data Unexploitable
 create mode 100644 data/2021/iclr/Unsupervised Audiovisual Synthesis via Exemplar Autoencoders
 create mode 100644 data/2021/iclr/Unsupervised Discovery of 3D Physical Objects from Video
 create mode 100644 data/2021/iclr/Unsupervised Meta-Learning through Latent-Space Interpolation in Generative Models
 create mode 100644 data/2021/iclr/Unsupervised Object Keypoint Learning using Local Spatial Predictability
 create mode 100644 data/2021/iclr/Unsupervised Representation Learning for Time Series with Temporal Neighborhood Coding
 create mode 100644 data/2021/iclr/Usable Information and Evolution of Optimal Representations During Training
 create mode 100644 data/2021/iclr/Using latent space regression to analyze and leverage compositionality in GANs
 create mode 100644 data/2021/iclr/VA-RED2: Video Adaptive Redundancy Reduction
 create mode 100644 data/2021/iclr/VAEBM: A Symbiosis between Variational Autoencoders and Energy-based Models
 create mode 100644 data/2021/iclr/VCNet and Functional Targeted Regularization For Learning Causal Effects of Continuous Treatments
 create mode 100644 data/2021/iclr/VTNet: Visual Transformer Network for Object Goal Navigation
 create mode 100644 data/2021/iclr/Variational Information Bottleneck for Effective Low-Resource Fine-Tuning
 create mode 100644 data/2021/iclr/Variational Intrinsic Control Revisited
 create mode 100644 data/2021/iclr/Variational State-Space Models for Localisation and Dense 3D Mapping in 6 DoF
 create mode 100644 data/2021/iclr/Vector-output ReLU Neural Network Problems are Copositive Programs: Convex Analysis of Two Layer Networks and Polynomial-time Algorithms
 create mode 100644 data/2021/iclr/Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images
 create mode 100644 data/2021/iclr/Viewmaker Networks: Learning Views for Unsupervised Representation Learning
 create mode 100644 data/2021/iclr/Vulnerability-Aware Poisoning Mechanism for Online RL with Unknown Dynamics
 create mode 100644 data/2021/iclr/WaNet - Imperceptible Warping-based Backdoor Attack
 create mode 100644 data/2021/iclr/Wandering within a world: Online contextualized few-shot learning
 create mode 100644 data/2021/iclr/Wasserstein Embedding for Graph Learning
 create mode 100644 data/2021/iclr/Wasserstein-2 Generative Networks
 create mode 100644 data/2021/iclr/Watch-And-Help: A Challenge for Social Perception and Human-AI Collaboration
 create mode 100644 data/2021/iclr/WaveGrad: Estimating Gradients for Waveform Generation
 create mode 100644 data/2021/iclr/What Can You Learn From Your Muscles? Learning Visual Representation from Human Interactions
 create mode 100644 data/2021/iclr/What Makes Instance Discrimination Good for Transfer Learning?
 create mode 100644 data/2021/iclr/What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study
 create mode 100644 data/2021/iclr/What Should Not Be Contrastive in Contrastive Learning
 create mode 100644 data/2021/iclr/What are the Statistical Limits of Offline RL with Linear Function Approximation?
 create mode 100644 data/2021/iclr/What they do when in doubt: a study of inductive biases in seq2seq learners
 create mode 100644 data/2021/iclr/When Do Curricula Work?
 create mode 100644 data/2021/iclr/When Optimizing f-Divergence is Robust with Label Noise
 create mode 100644 data/2021/iclr/When does preconditioning help or hurt generalization?
 create mode 100644 data/2021/iclr/Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets?
 create mode 100644 data/2021/iclr/Why resampling outperforms reweighting for correcting sampling bias with stochastic gradients
 create mode 100644 data/2021/iclr/Winning the L2RPN Challenge: Power Grid Management via Semi-Markov Afterstate Actor-Critic
 create mode 100644 data/2021/iclr/Witches' Brew: Industrial Scale Data Poisoning via Gradient Matching
 create mode 100644 data/2021/iclr/WrapNet: Neural Net Inference with Ultra-Low-Precision Arithmetic
 create mode 100644 data/2021/iclr/X2T: Training an X-to-Text Typing Interface with Online Learning from User Feedback
 create mode 100644 data/2021/iclr/You Only Need Adversarial Supervision for Semantic Image Synthesis
 create mode 100644 data/2021/iclr/Zero-Cost Proxies for Lightweight NAS
 create mode 100644 data/2021/iclr/Zero-shot Synthesis with Group-Supervised Learning
 create mode 100644 data/2021/iclr/gradSim: Differentiable simulation for system identification and visuomotor control
 create mode 100644 data/2021/iclr/i-Mix: A Domain-Agnostic Strategy for Contrastive Representation Learning
 create mode 100644 data/2021/iclr/not-MIWAE: Deep Generative Modelling with Missing not at Random Data
 create mode 100644 data/2022/iclr/8-bit Optimizers via Block-wise Quantization
 create mode 100644 data/2022/iclr/A Biologically Interpretable Graph Convolutional Network to Link Genetic Risk Pathways and Imaging Phenotypes of Disease
 create mode 100644 data/2022/iclr/A Class of Short-term Recurrence Anderson Mixing Methods and Their Applications
 create mode 100644 data/2022/iclr/A Comparison of Hamming Errors of Representative Variable Selection Methods
 create mode 100644 data/2022/iclr/A Conditional Point Diffusion-Refinement Paradigm for 3D Point Cloud Completion
 create mode 100644 data/2022/iclr/A Deep Variational Approach to Clustering Survival Data
 create mode 100644 data/2022/iclr/A Fine-Grained Analysis on Distribution Shift
 create mode 100644 data/2022/iclr/A Fine-Tuning Approach to Belief State Modeling
 create mode 100644 data/2022/iclr/A First-Occupancy Representation for Reinforcement Learning
 create mode 100644 data/2022/iclr/A General Analysis of Example-Selection for Stochastic Gradient Descent
 create mode 100644 data/2022/iclr/A Generalized Weighted Optimization Method for Computational Learning and Inversion
 create mode 100644 data/2022/iclr/A Johnson-Lindenstrauss Framework for Randomly Initialized CNNs
 create mode 100644 data/2022/iclr/A Loss Curvature Perspective on Training Instabilities of Deep Learning Models
 create mode 100644 data/2022/iclr/A Neural Tangent Kernel Perspective of Infinite Tree Ensembles
 create mode 100644 "data/2022/iclr/A New Perspective on \"How Graph Neural Networks Go Beyond Weisfeiler-Lehman?\""
 create mode 100644 data/2022/iclr/A Non-Parametric Regression Viewpoint : Generalization of Overparametrized Deep RELU Network Under Noisy Observations
 create mode 100644 data/2022/iclr/A Program to Build E(N)-Equivariant Steerable CNNs
 create mode 100644 data/2022/iclr/A Reduction-Based Framework for Conservative Bandits and Reinforcement Learning
 create mode 100644 data/2022/iclr/A Relational Intervention Approach for Unsupervised Dynamics Generalization in Model-Based Reinforcement Learning
 create mode 100644 data/2022/iclr/A Statistical Framework for Efficient Out of Distribution Detection in Deep Neural Networks
 create mode 100644 data/2022/iclr/A Tale of Two Flows: Cooperative Learning of Langevin Flow and Normalizing Flow Toward Energy-Based Model
 create mode 100644 data/2022/iclr/A Theoretical Analysis on Feature Learning in Neural Networks: Emergence from Inputs and Advantage over Fixed Features
 create mode 100644 data/2022/iclr/A Theory of Tournament Representations
 create mode 100644 data/2022/iclr/A Unified Contrastive Energy-based Model for Understanding the Generative Ability of Adversarial Training
 create mode 100644 data/2022/iclr/A Unified Wasserstein Distributional Robustness Framework for Adversarial Training
 create mode 100644 data/2022/iclr/A Zest of LIME: Towards Architecture-Independent Model Distances
 create mode 100644 data/2022/iclr/A fast and accurate splitting method for optimal transport: analysis and implementation
 create mode 100644 data/2022/iclr/A generalization of the randomized singular value decomposition
 create mode 100644 data/2022/iclr/A global convergence theory for deep ReLU implicit networks via over-parameterization
 create mode 100644 data/2022/iclr/ADAVI: Automatic Dual Amortized Variational Inference Applied To Pyramidal Bayesian Models
 create mode 100644 data/2022/iclr/AEVA: Black-box Backdoor Detection Using Adversarial Extreme Value Analysis
 create mode 100644 data/2022/iclr/ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity
 create mode 100644 data/2022/iclr/AS-MLP: An Axial Shifted MLP Architecture for Vision
 create mode 100644 data/2022/iclr/Ab-Initio Potential Energy Surfaces by Pairing GNNs with Neural Wave Functions
 create mode 100644 data/2022/iclr/Accelerated Policy Learning with Parallel Differentiable Simulation
 create mode 100644 data/2022/iclr/Acceleration of Federated Learning with Alleviated Forgetting in Local Training
 create mode 100644 data/2022/iclr/Active Hierarchical Exploration with Stable Subgoal Representation Learning
 create mode 100644 data/2022/iclr/Actor-Critic Policy Optimization in a Large-Scale Imperfect-Information Game
 create mode 100644 data/2022/iclr/Actor-critic is implicitly biased towards high entropy optimal policies
 create mode 100644 data/2022/iclr/Ada-NETS: Face Clustering via Adaptive Neighbour Discovery in the Structure Space
 create mode 100644 data/2022/iclr/AdaAug: Learning Class- and Instance-adaptive Data Augmentation Policies
 create mode 100644 data/2022/iclr/AdaMatch: A Unified Approach to Semi-Supervised Learning and Domain Adaptation
 create mode 100644 data/2022/iclr/AdaRL: What, Where, and How to Adapt in Transfer Reinforcement Learning
 create mode 100644 data/2022/iclr/Adaptive Wavelet Transformer Network for 3D Shape Representation Learning
 create mode 100644 data/2022/iclr/Adversarial Retriever-Ranker for Dense Text Retrieval
 create mode 100644 data/2022/iclr/Adversarial Robustness Through the Lens of Causality
 create mode 100644 data/2022/iclr/Adversarial Support Alignment
 create mode 100644 data/2022/iclr/Adversarial Unlearning of Backdoors via Implicit Hypergradient
 create mode 100644 data/2022/iclr/Adversarially Robust Conformal Prediction
 create mode 100644 data/2022/iclr/Almost Tight L0-norm Certified Robustness of Top-k Predictions against Adversarial Perturbations
 create mode 100644 data/2022/iclr/AlphaZero-based Proof Cost Network to Aid Game Solving
 create mode 100644 data/2022/iclr/Amortized Implicit Differentiation for Stochastic Bilevel Optimization
 create mode 100644 data/2022/iclr/Amortized Tree Generation for Bottom-up Synthesis Planning and Synthesizable Molecular Design
 create mode 100644 data/2022/iclr/An Agnostic Approach to Federated Learning with Class Imbalance
 create mode 100644 data/2022/iclr/An Autoregressive Flow Model for 3D Molecular Geometry Generation from Scratch
 create mode 100644 data/2022/iclr/An Experimental Design Perspective on Model-Based Reinforcement Learning
 create mode 100644 data/2022/iclr/An Explanation of In-context Learning as Implicit Bayesian Inference
 create mode 100644 data/2022/iclr/An Information Fusion Approach to Learning with Instance-Dependent Label Noise
 create mode 100644 data/2022/iclr/An Operator Theoretic View On Pruning Deep Neural Networks
 create mode 100644 data/2022/iclr/An Unconstrained Layer-Peeled Perspective on Neural Collapse
 create mode 100644 data/2022/iclr/Analytic-DPM: an Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models
 create mode 100644 data/2022/iclr/Analyzing and Improving the Optimization Landscape of Noise-Contrastive Estimation
 create mode 100644 data/2022/iclr/Ancestral protein sequence reconstruction using a tree-structured Ornstein-Uhlenbeck variational autoencoder
 create mode 100644 data/2022/iclr/Anisotropic Random Feature Regression in High Dimensions
 create mode 100644 data/2022/iclr/Anomaly Detection for Tabular Data with Internal Contrastive Learning
 create mode 100644 data/2022/iclr/Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy
 create mode 100644 data/2022/iclr/Anti-Concentrated Confidence Bonuses For Scalable Exploration
 create mode 100644 data/2022/iclr/Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice
 create mode 100644 data/2022/iclr/Anytime Dense Prediction with Confidence Adaptivity
 create mode 100644 data/2022/iclr/Approximation and Learning with Deep Convolutional Models: a Kernel Perspective
 create mode 100644 data/2022/iclr/Assessing Generalization of SGD via Disagreement
 create mode 100644 data/2022/iclr/Associated Learning: an Alternative to End-to-End Backpropagation that Works on CNN, RNN, and Transformer
 create mode 100644 data/2022/iclr/Asymmetry Learning for Counterfactually-invariant Classification in OOD Tasks
 create mode 100644 data/2022/iclr/Attacking deep networks with surrogate-based adversarial black-box methods is easy
 create mode 100644 data/2022/iclr/Attention-based Interpretability with Concept Transformers
 create mode 100644 data/2022/iclr/Audio Lottery: Speech Recognition Made Ultra-Lightweight, Noise-Robust, and Transferable
 create mode 100644 data/2022/iclr/Augmented Sliced Wasserstein Distances
 create mode 100644 data/2022/iclr/Auto-Transfer: Learning to Route Transferable Representations
 create mode 100644 data/2022/iclr/Auto-scaling Vision Transformers without Training
 create mode 100644 data/2022/iclr/Automated Self-Supervised Learning for Graphs
 create mode 100644 data/2022/iclr/Automatic Loss Function Search for Predict-Then-Optimize Problems with Strong Ranking Property
 create mode 100644 data/2022/iclr/Autonomous Learning of Object-Centric Abstractions for High-Level Planning
 create mode 100644 data/2022/iclr/Autonomous Reinforcement Learning: Formalism and Benchmarking
 create mode 100644 data/2022/iclr/Autoregressive Diffusion Models
 create mode 100644 data/2022/iclr/Autoregressive Quantile Flows for Predictive Uncertainty Estimation
 create mode 100644 data/2022/iclr/Axiomatic Explanations for Visual Search, Retrieval, and Similarity Learning
 create mode 100644 data/2022/iclr/BAM: Bayes with Adaptive Memory
 create mode 100644 data/2022/iclr/BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis
 create mode 100644 data/2022/iclr/BEiT: BERT Pre-Training of Image Transformers
 create mode 100644 data/2022/iclr/Back2Future: Leveraging Backfill Dynamics for Improving Real-time Predictions in Future
 create mode 100644 data/2022/iclr/Backdoor Defense via Decoupling the Training Process
 create mode 100644 data/2022/iclr/BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Models
 create mode 100644 data/2022/iclr/Bag of Instances Aggregation Boosts Self-supervised Distillation
 create mode 100644 data/2022/iclr/Bandit Learning with Joint Effect of Incentivized Sampling, Delayed Sampling Feedback, and Self-Reinforcing User Preferences
 create mode 100644 data/2022/iclr/Bayesian Framework for Gradient Leakage
 create mode 100644 data/2022/iclr/Bayesian Modeling and Uncertainty Quantification for Learning to Optimize: What, Why, and How
 create mode 100644 data/2022/iclr/Bayesian Neural Network Priors
Revisited create mode 100644 data/2022/iclr/Benchmarking the Spectrum of Agent Capabilities create mode 100644 data/2022/iclr/Better Supervisory Signals by Observing Learning Paths create mode 100644 data/2022/iclr/Beyond ImageNet Attack: Towards Crafting Adversarial Examples for Black-box Domains create mode 100644 data/2022/iclr/Bi-linear Value Networks for Multi-goal Reinforcement Learning create mode 100644 data/2022/iclr/BiBERT: Accurate Fully Binarized BERT create mode 100644 data/2022/iclr/Blaschke Product Neural Networks (BPNN): A Physics-Infused Neural Network for Phase Retrieval of Meromorphic Functions create mode 100644 data/2022/iclr/Boosted Curriculum Reinforcement Learning create mode 100644 data/2022/iclr/Boosting Randomized Smoothing with Variance Reduced Classifiers create mode 100644 data/2022/iclr/Boosting the Certified Robustness of L-infinity Distance Nets create mode 100644 data/2022/iclr/Bootstrapped Meta-Learning create mode 100644 data/2022/iclr/Bootstrapping Semantic Segmentation with Regional Contrast create mode 100644 data/2022/iclr/Bregman Gradient Policy Optimization create mode 100644 data/2022/iclr/Bridging Recommendation and Marketing via Recurrent Intensity Modeling create mode 100644 data/2022/iclr/Bridging the Gap: Providing Post-Hoc Symbolic Explanations for Sequential Decision-Making Problems with Inscrutable Representations create mode 100644 data/2022/iclr/Bundle Networks: Fiber Bundles, Local Trivializations, and a Generative Approach to Exploring Many-to-one Maps create mode 100644 data/2022/iclr/Byzantine-Robust Learning on Heterogeneous Datasets via Bucketing create mode 100644 data/2022/iclr/C-Planning: An Automatic Curriculum for Learning Goal-Reaching Tasks create mode 100644 data/2022/iclr/CADDA: Class-wise Automatic Differentiable Data Augmentation for EEG Signals create mode 100644 data/2022/iclr/CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation create mode 100644 data/2022/iclr/CKConv: 
Continuous Kernel Convolution For Sequential Data create mode 100644 data/2022/iclr/CLEVA-Compass: A Continual Learning Evaluation Assessment Compass to Promote Research Transparency and Comparability create mode 100644 data/2022/iclr/COPA: Certifying Robust Policies for Offline Reinforcement Learning against Poisoning Attacks create mode 100644 data/2022/iclr/COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation create mode 100644 data/2022/iclr/CROP: Certifying Robust Policies for Reinforcement Learning through Functional Smoothing create mode 100644 data/2022/iclr/Can an Image Classifier Suffice For Action Recognition? create mode 100644 data/2022/iclr/Capacity of Group-invariant Linear Readouts from Equivariant Representations: How Many Objects can be Linearly Classified Under All Possible Views? create mode 100644 data/2022/iclr/Capturing Structural Locality in Non-parametric Language Models create mode 100644 data/2022/iclr/Case-based reasoning for better generalization in textual reinforcement learning create mode 100644 data/2022/iclr/Causal Contextual Bandits with Targeted Interventions create mode 100644 data/2022/iclr/Certified Robustness for Deep Equilibrium Models via Interval Bound Propagation create mode 100644 data/2022/iclr/Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via Augmentation Overlap create mode 100644 data/2022/iclr/Charformer: Fast Character Transformers via Gradient-based Subword Tokenization create mode 100644 data/2022/iclr/Chemical-Reaction-Aware Molecule Representation Learning create mode 100644 data/2022/iclr/Chunked Autoregressive GAN for Conditional Waveform Synthesis create mode 100644 data/2022/iclr/Churn Reduction via Distillation create mode 100644 data/2022/iclr/Clean Images are Hard to Reblur: Exploiting the Ill-Posed Inverse Task for Dynamic Scene Deblurring create mode 100644 data/2022/iclr/ClimateGAN: Raising Climate Change Awareness by 
Generating Images of Floods create mode 100644 data/2022/iclr/Closed-form Sample Probing for Learning Generative Models in Zero-shot Learning create mode 100644 data/2022/iclr/CoBERL: Contrastive BERT for Reinforcement Learning create mode 100644 data/2022/iclr/CoMPS: Continual Meta Policy Search create mode 100644 data/2022/iclr/CoST: Contrastive Learning of Disentangled Seasonal-Trend Representations for Time Series Forecasting create mode 100644 data/2022/iclr/CodeTrek: Flexible Modeling of Code using an Extensible Relational Representation create mode 100644 data/2022/iclr/Coherence-based Label Propagation over Time Series for Accelerated Active Learning create mode 100644 data/2022/iclr/Cold Brew: Distilling Graph Node Representations with Incomplete or Missing Neighborhoods create mode 100644 data/2022/iclr/Collapse by Conditioning: Training Class-conditional GANs with Limited Data create mode 100644 data/2022/iclr/ComPhy: Compositional Physical Reasoning of Objects and Events from Videos create mode 100644 data/2022/iclr/Communication-Efficient Actor-Critic Methods for Homogeneous Markov Games create mode 100644 data/2022/iclr/Comparing Distributions by Measuring Differences that Affect Decision Making create mode 100644 data/2022/iclr/Complete Verification via Multi-Neuron Relaxation Guided Branch-and-Bound create mode 100644 data/2022/iclr/Compositional Attention: Disentangling Search and Retrieval create mode 100644 data/2022/iclr/Compositional Training for End-to-End Deep AUC Maximization create mode 100644 data/2022/iclr/ConFeSS: A Framework for Single Source Cross-Domain Few-Shot Learning create mode 100644 data/2022/iclr/Concurrent Adversarial Learning for Large-Batch Training create mode 100644 data/2022/iclr/Conditional Contrastive Learning with Kernel create mode 100644 data/2022/iclr/Conditional Image Generation by Conditioning Variational Auto-Encoders create mode 100644 data/2022/iclr/Conditional Object-Centric Learning from Video create mode 
100644 data/2022/iclr/Conditioning Sequence-to-sequence Networks with Learned Activations create mode 100644 data/2022/iclr/Connectome-constrained Latent Variable Model of Whole-Brain Neural Activity create mode 100644 data/2022/iclr/Consistent Counterfactuals for Deep Models create mode 100644 data/2022/iclr/Constrained Physical-Statistics Models for Dynamical System Identification and Prediction create mode 100644 data/2022/iclr/Constrained Policy Optimization via Bayesian World Models create mode 100644 data/2022/iclr/Constraining Linear-chain CRFs to Regular Languages create mode 100644 data/2022/iclr/Constructing Orthogonal Convolutions in an Explicit Manner create mode 100644 data/2022/iclr/Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates create mode 100644 data/2022/iclr/Contact Points Discovery for Soft-Body Manipulations with Differentiable Physics create mode 100644 data/2022/iclr/Context-Aware Sparse Deep Coordination Graphs create mode 100644 data/2022/iclr/Contextualized Scene Imagination for Generative Commonsense Reasoning create mode 100644 data/2022/iclr/Continual Learning with Filter Atom Swapping create mode 100644 data/2022/iclr/Continual Learning with Recursive Gradient Optimization create mode 100644 data/2022/iclr/Continual Normalization: Rethinking Batch Normalization for Online Continual Learning create mode 100644 data/2022/iclr/Continuous-Time Meta-Learning with Forward Mode Differentiation create mode 100644 data/2022/iclr/Continuously Discovering Novel Strategies via Reward-Switching Policy Optimization create mode 100644 data/2022/iclr/Contrastive Clustering to Mine Pseudo Parallel Data for Unsupervised Translation create mode 100644 data/2022/iclr/Contrastive Fine-grained Class Clustering via Generative Adversarial Networks create mode 100644 data/2022/iclr/Controlling Directions Orthogonal to a Classifier create mode 100644 data/2022/iclr/Controlling the Complexity and Lipschitz Constant improves 
Polynomial Nets create mode 100644 data/2022/iclr/Convergent Graph Solvers create mode 100644 data/2022/iclr/Convergent and Efficient Deep Q Learning Algorithm create mode 100644 data/2022/iclr/CoordX: Accelerating Implicit Neural Representation with a Split MLP Architecture create mode 100644 data/2022/iclr/Coordination Among Neural Modules Through a Shared Global Workspace create mode 100644 data/2022/iclr/Counterfactual Plans under Distributional Ambiguity create mode 100644 data/2022/iclr/Creating Training Sets via Weak Indirect Supervision create mode 100644 data/2022/iclr/Critical Points in Quantum Generative Models create mode 100644 data/2022/iclr/Cross-Domain Imitation Learning via Optimal Transport create mode 100644 data/2022/iclr/Cross-Lingual Transfer with Class-Weighted Language-Invariant Representations create mode 100644 data/2022/iclr/Cross-Trajectory Representation Learning for Zero-Shot Generalization in RL create mode 100644 data/2022/iclr/CrossBeam: Learning to Search in Bottom-Up Program Synthesis create mode 100644 data/2022/iclr/CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention create mode 100644 data/2022/iclr/CrossMatch: Cross-Classifier Consistency Regularization for Open-Set Single Domain Generalization create mode 100644 data/2022/iclr/CrowdPlay: Crowdsourcing Human Demonstrations for Offline Learning create mode 100644 data/2022/iclr/Crystal Diffusion Variational Autoencoder for Periodic Material Generation create mode 100644 data/2022/iclr/Curriculum learning as a tool to uncover learning principles in the brain create mode 100644 data/2022/iclr/Curvature-Guided Dynamic Scale Networks for Multi-View Stereo create mode 100644 data/2022/iclr/CycleMLP: A MLP-like Architecture for Dense Prediction create mode 100644 data/2022/iclr/D-CODE: Discovering Closed-form ODEs from Observed Trajectories create mode 100644 data/2022/iclr/DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR create mode 100644 
data/2022/iclr/DARA: Dynamics-Aware Reward Augmentation in Offline Reinforcement Learning create mode 100644 data/2022/iclr/DEGREE: Decomposition Based Explanation for Graph Neural Networks create mode 100644 data/2022/iclr/DEPTS: Deep Expansion Learning for Periodic Time Series Forecasting create mode 100644 data/2022/iclr/DISSECT: Disentangled Simultaneous Explanations via Concept Traversals create mode 100644 data/2022/iclr/DIVA: Dataset Derivative of a Learning Task create mode 100644 data/2022/iclr/DKM: Differentiable k-Means Clustering Layer for Neural Network Compression create mode 100644 data/2022/iclr/DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization create mode 100644 data/2022/iclr/Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation create mode 100644 data/2022/iclr/Data Poisoning Won't Save You From Facial Recognition create mode 100644 data/2022/iclr/Data-Driven Offline Optimization for Architecting Hardware Accelerators create mode 100644 data/2022/iclr/Data-Efficient Graph Grammar Learning for Molecular Generation create mode 100644 data/2022/iclr/DeSKO: Stability-Assured Robust Control with a Deep Stochastic Koopman Operator create mode 100644 data/2022/iclr/Dealing with Non-Stationarity in MARL via Trust-Region Decomposition create mode 100644 data/2022/iclr/Decentralized Learning for Overparameterized Problems: A Multi-Agent Kernel Approximation Approach create mode 100644 data/2022/iclr/Declarative nets that are equilibrium models create mode 100644 data/2022/iclr/Deconstructing the Inductive Biases of Hamiltonian Neural Networks create mode 100644 data/2022/iclr/Decoupled Adaptation for Cross-Domain Object Detection create mode 100644 data/2022/iclr/Deep Attentive Variational Inference create mode 100644 data/2022/iclr/Deep AutoAugment create mode 100644 data/2022/iclr/Deep Ensembling with No Overhead for either Training or Testing: The All-Round Blessings of Dynamic Sparsity 
create mode 100644 data/2022/iclr/Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers create mode 100644 data/2022/iclr/Deep Point Cloud Reconstruction create mode 100644 data/2022/iclr/Deep ReLU Networks Preserve Expected Length create mode 100644 data/2022/iclr/Defending Against Image Corruptions Through Adversarial Augmentations create mode 100644 data/2022/iclr/Delaunay Component Analysis for Evaluation of Data Representations create mode 100644 data/2022/iclr/DemoDICE: Offline Imitation Learning with Supplementary Imperfect Demonstrations create mode 100644 data/2022/iclr/Demystifying Batch Normalization in ReLU Networks: Equivalent Convex Optimization Models and Implicit Regularization create mode 100644 data/2022/iclr/Demystifying Limited Adversarial Transferability in Automatic Speech Recognition Systems create mode 100644 data/2022/iclr/Denoising Likelihood Score Matching for Conditional Score-based Data Generation create mode 100644 data/2022/iclr/DictFormer: Tiny Transformer with Shared Dictionary create mode 100644 data/2022/iclr/DiffSkill: Skill Abstraction from Differentiable Physics for Deformable Object Manipulations with Tools create mode 100644 data/2022/iclr/Differentiable DAG Sampling create mode 100644 data/2022/iclr/Differentiable Expectation-Maximization for Set Representation Learning create mode 100644 data/2022/iclr/Differentiable Gradient Sampling for Learning Implicit 3D Scene Reconstructions from a Single Image create mode 100644 data/2022/iclr/Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners create mode 100644 data/2022/iclr/Differentiable Scaffolding Tree for Molecule Optimization create mode 100644 data/2022/iclr/Differentially Private Fine-tuning of Language Models create mode 100644 data/2022/iclr/Differentially Private Fractional Frequency Moments Estimation with Polylogarithmic Space create mode 100644 data/2022/iclr/Diffusion-Based Voice Conversion with Fast Maximum 
Likelihood Sampling Scheme create mode 100644 data/2022/iclr/Direct then Diffuse: Incremental Unsupervised Skill Discovery for State Covering and Goal Reaching create mode 100644 data/2022/iclr/Discovering Invariant Rationales for Graph Neural Networks create mode 100644 data/2022/iclr/Discovering Latent Concepts Learned in BERT create mode 100644 data/2022/iclr/Discovering Nonlinear PDEs from Scarce Data with Physics-encoded Learning create mode 100644 data/2022/iclr/Discovering and Explaining the Representation Bottleneck of DNNS create mode 100644 data/2022/iclr/Discrepancy-Based Active Learning for Domain Adaptation create mode 100644 data/2022/iclr/Discrete Representations Strengthen Vision Transformer Robustness create mode 100644 data/2022/iclr/Discriminative Similarity for Data Clustering create mode 100644 data/2022/iclr/Disentanglement Analysis with Partial Information Decomposition create mode 100644 data/2022/iclr/Distilling GANs with Style-Mixed Triplets for X2I Translation with Limited Data create mode 100644 data/2022/iclr/Distribution Compression in Near-Linear Time create mode 100644 data/2022/iclr/Distributional Reinforcement Learning with Monotonic Splines create mode 100644 data/2022/iclr/Distributionally Robust Fair Principal Components via Geodesic Descents create mode 100644 data/2022/iclr/Distributionally Robust Models with Parametric Likelihood Ratios create mode 100644 data/2022/iclr/Diurnal or Nocturnal? 
Federated Learning of Multi-branch Networks from Periodically Shifting Distributions create mode 100644 data/2022/iclr/Dive Deeper Into Integral Pose Regression create mode 100644 data/2022/iclr/Divergence-aware Federated Self-Supervised Learning create mode 100644 data/2022/iclr/Diverse Client Selection for Federated Learning via Submodular Maximization create mode 100644 data/2022/iclr/Divisive Feature Normalization Improves Image Recognition Performance in AlexNet create mode 100644 data/2022/iclr/Do Not Escape From the Manifold: Discovering the Local Coordinates on the Latent Space of GANs create mode 100644 data/2022/iclr/Do Users Benefit From Interpretable Vision? A User Study, Baseline, And Dataset create mode 100644 data/2022/iclr/Do We Need Anisotropic Graph Neural Networks? create mode 100644 data/2022/iclr/Do deep networks transfer invariances across classes? create mode 100644 data/2022/iclr/Does your graph need a confidence boost? Convergent boosted smoothing on graphs with tabular node features create mode 100644 data/2022/iclr/Domain Adversarial Training: A Game Perspective create mode 100644 data/2022/iclr/Domino: Discovering Systematic Errors with Cross-Modal Embeddings create mode 100644 data/2022/iclr/Doubly Adaptive Scaled Algorithm for Machine Learning Using Second-Order Information create mode 100644 data/2022/iclr/DriPP: Driven Point Processes to Model Stimuli Induced Patterns in M EEG Signals create mode 100644 data/2022/iclr/Dropout Q-Functions for Doubly Efficient Reinforcement Learning create mode 100644 data/2022/iclr/Dual Lottery Ticket Hypothesis create mode 100644 data/2022/iclr/Dynamic Token Normalization improves Vision Transformers create mode 100644 data/2022/iclr/Dynamics-Aware Comparison of Learned Reward Functions create mode 100644 data/2022/iclr/EE-Net: Exploitation-Exploration Neural Networks in Contextual Bandits create mode 100644 data/2022/iclr/EViT: Expediting Vision Transformers via Token Reorganizations create mode 
100644 data/2022/iclr/EXACT: Scalable Graph Neural Networks Training via Extreme Activation Compression create mode 100644 data/2022/iclr/Effect of scale on catastrophic forgetting in neural networks create mode 100644 data/2022/iclr/Effective Model Sparsification by Scheduled Grow-and-Prune Methods create mode 100644 data/2022/iclr/Efficient Active Search for Combinatorial Optimization Problems create mode 100644 data/2022/iclr/Efficient Computation of Deep Nonlinear Infinite-Width Neural Networks that Learn Features create mode 100644 data/2022/iclr/Efficient Learning of Safe Driving Policy via Human-AI Copilot Optimization create mode 100644 data/2022/iclr/Efficient Neural Causal Discovery without Acyclicity Constraints create mode 100644 data/2022/iclr/Efficient Self-supervised Vision Transformers for Representation Learning create mode 100644 data/2022/iclr/Efficient Sharpness-aware Minimization for Improved Training of Neural Networks create mode 100644 data/2022/iclr/Efficient Split-Mix Federated Learning for On-Demand and In-Situ Customization create mode 100644 data/2022/iclr/Efficient Token Mixing for Transformers via Adaptive Fourier Neural Operators create mode 100644 data/2022/iclr/Efficient and Differentiable Conformal Prediction with General Function Classes create mode 100644 data/2022/iclr/Efficiently Modeling Long Sequences with Structured State Spaces create mode 100644 data/2022/iclr/EigenGame Unloaded: When playing games is better than optimizing create mode 100644 data/2022/iclr/Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic Objectives with Skewed Hessian Spectrums create mode 100644 data/2022/iclr/Einops: Clear and Reliable Tensor Manipulations with Einstein-like Notation create mode 100644 data/2022/iclr/Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise create mode 100644 data/2022/iclr/Embedded-model flows: Combining the inductive biases of model-free deep learning and explicit probabilistic modeling 
create mode 100644 data/2022/iclr/Emergent Communication at Scale create mode 100644 data/2022/iclr/Enabling Arbitrary Translation Objectives with Adaptive Tree Search create mode 100644 data/2022/iclr/Encoding Weights of Irregular Sparsity for Fixed-to-Fixed Model Compression create mode 100644 data/2022/iclr/End-to-End Learning of Probabilistic Hierarchies on Graphs create mode 100644 data/2022/iclr/Energy-Based Learning for Cooperative Games, with Applications to Valuation Problems in Machine Learning create mode 100644 data/2022/iclr/Energy-Inspired Molecular Conformation Optimization create mode 100644 data/2022/iclr/Enhancing Cross-lingual Transfer by Manifold Mixup create mode 100644 data/2022/iclr/EntQA: Entity Linking as Question Answering create mode 100644 data/2022/iclr/Entroformer: A Transformer-based Entropy Model for Learned Image Compression create mode 100644 data/2022/iclr/Environment Predictive Coding for Visual Navigation create mode 100644 data/2022/iclr/Equivariant Graph Mechanics Networks with Constraints create mode 100644 data/2022/iclr/Equivariant Self-Supervised Learning: Encouraging Equivariance in Representations create mode 100644 data/2022/iclr/Equivariant Subgraph Aggregation Networks create mode 100644 data/2022/iclr/Equivariant Transformers for Neural Network based Molecular Potentials create mode 100644 data/2022/iclr/Equivariant and Stable Positional Encoding for More Powerful Graph Neural Networks create mode 100644 data/2022/iclr/Escaping limit cycles: Global convergence for constrained nonconvex-nonconcave minimax problems create mode 100644 data/2022/iclr/Evading Adversarial Example Detection Defenses with Orthogonal Projected Gradient Descent create mode 100644 data/2022/iclr/Evaluating Disentanglement of Structured Representations create mode 100644 data/2022/iclr/Evaluating Distributional Distortion in Neural Language Modeling create mode 100644 data/2022/iclr/Evaluating Model-Based Planning and Planner Amortization for 
Continuous Control create mode 100644 data/2022/iclr/Evaluation Metrics for Graph Generative Models: Problems, Pitfalls, and Practical Solutions create mode 100644 data/2022/iclr/Evidential Turing Processes create mode 100644 data/2022/iclr/Evolutionary Diversity Optimization with Clustering-based Selection for Reinforcement Learning create mode 100644 data/2022/iclr/ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning create mode 100644 data/2022/iclr/Explainable GNN-Based Models over Knowledge Graphs create mode 100644 data/2022/iclr/Explaining Point Processes by Learning Interpretable Temporal Logic Rules create mode 100644 data/2022/iclr/Explanations of Black-Box Models based on Directional Feature Interactions create mode 100644 data/2022/iclr/Exploiting Class Activation Value for Partial-Label Learning create mode 100644 data/2022/iclr/Exploring Memorization in Adversarial Training create mode 100644 data/2022/iclr/Exploring extreme parameter compression for pre-trained language models create mode 100644 data/2022/iclr/Exploring the Limits of Large Scale Pre-training create mode 100644 data/2022/iclr/Exposing the Implicit Energy Networks behind Masked Language Models via Metropolis--Hastings create mode 100644 data/2022/iclr/Expressiveness and Approximation Properties of Graph Neural Networks create mode 100644 data/2022/iclr/Expressivity of Emergent Languages is a Trade-off between Contextual Complexity and Unpredictability create mode 100644 data/2022/iclr/Extending the WILDS Benchmark for Unsupervised Adaptation create mode 100644 data/2022/iclr/F8Net: Fixed-Point 8-bit Only Multiplication for Network Quantization create mode 100644 data/2022/iclr/FALCON: Fast Visual Concept Learning by Integrating Images, Linguistic descriptions, and Conceptual Relations create mode 100644 data/2022/iclr/FILIP: Fine-grained Interactive Language-Image Pre-Training create mode 100644 data/2022/iclr/FILM: Following Instructions in Language with Modular Methods 
create mode 100644 data/2022/iclr/FP-DETR: Detection Transformer Advanced by Fully Pre-training create mode 100644 data/2022/iclr/Fair Normalizing Flows create mode 100644 data/2022/iclr/FairCal: Fairness Calibration for Face Verification create mode 100644 data/2022/iclr/Fairness Guarantees under Demographic Shift create mode 100644 data/2022/iclr/Fairness in Representation for Multilingual NLP: Insights from Controlled Experiments on Conditional Language Modeling create mode 100644 data/2022/iclr/Fast AdvProp create mode 100644 data/2022/iclr/Fast Differentiable Matrix Square Root create mode 100644 data/2022/iclr/Fast Generic Interaction Detection for Model Interpretability and Compression create mode 100644 data/2022/iclr/Fast Model Editing at Scale create mode 100644 data/2022/iclr/Fast Regression for Structured Inputs create mode 100644 data/2022/iclr/Fast topological clustering with Wasserstein distance create mode 100644 data/2022/iclr/FastSHAP: Real-Time Shapley Value Estimation create mode 100644 data/2022/iclr/Feature Kernel Distillation create mode 100644 data/2022/iclr/FedBABU: Toward Enhanced Representation for Federated Image Classification create mode 100644 data/2022/iclr/FedChain: Chained Algorithms for Near-optimal Communication Cost in Federated Learning create mode 100644 data/2022/iclr/FedPara: Low-rank Hadamard Product for Communication-Efficient Federated Learning create mode 100644 data/2022/iclr/Federated Learning from Only Unlabeled Data with Class-conditional-sharing Clients create mode 100644 data/2022/iclr/Few-Shot Backdoor Attacks on Visual Object Tracking create mode 100644 data/2022/iclr/Few-shot Learning via Dirichlet Tessellation Ensemble create mode 100644 data/2022/iclr/Filling the G_ap_s: Multivariate Time Series Imputation by Graph Neural Networks create mode 100644 data/2022/iclr/Filtered-CoPhy: Unsupervised Learning of Counterfactual Physics in Pixel Space create mode 100644 data/2022/iclr/Finding Biological Plausibility for 
Adversarially Robust Features via Metameric Tasks create mode 100644 data/2022/iclr/Finding an Unsupervised Image Segmenter in each of your Deep Generative Models create mode 100644 data/2022/iclr/Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution create mode 100644 data/2022/iclr/Fine-grained Differentiable Physics: A Yarn-level Model for Fabrics create mode 100644 data/2022/iclr/Finetuned Language Models are Zero-Shot Learners create mode 100644 data/2022/iclr/Finite-Time Convergence and Sample Complexity of Multi-Agent Actor-Critic Reinforcement Learning with Average Reward create mode 100644 data/2022/iclr/Fixed Neural Network Steganography: Train the images, not the network create mode 100644 data/2022/iclr/FlexConv: Continuous Kernel Convolutions With Differentiable Kernel Sizes create mode 100644 data/2022/iclr/Focus on the Common Good: Group Distributional Robustness Follows create mode 100644 data/2022/iclr/Fooling Explanations in Text Classifiers create mode 100644 data/2022/iclr/Fortuitous Forgetting in Connectionist Networks create mode 100644 data/2022/iclr/Frame Averaging for Invariant and Equivariant Network Design create mode 100644 data/2022/iclr/Frequency-aware SGD for Efficient Embedding Learning with Provable Benefits create mode 100644 data/2022/iclr/From Intervention to Domain Transportation: A Novel Perspective to Optimize Recommendation create mode 100644 data/2022/iclr/From Stars to Subgraphs: Uplifting Any GNN with Local Structure Awareness create mode 100644 data/2022/iclr/GATSBI: Generative Adversarial Training for Simulation-Based Inference create mode 100644 data/2022/iclr/GDA-AM: On the Effectiveness of Solving Min-Imax Optimization via Anderson Mixing create mode 100644 data/2022/iclr/GLASS: GNN with Labeling Tricks for Subgraph Representation Learning create mode 100644 data/2022/iclr/GNN is a Counter? 
Revisiting GNN for Question Answering create mode 100644 data/2022/iclr/GNN-LM: Language Modeling based on Global Contexts via GNN create mode 100644 data/2022/iclr/GPT-Critic: Offline Reinforcement Learning for End-to-End Task-Oriented Dialogue Systems create mode 100644 data/2022/iclr/GRAND++: Graph Neural Diffusion with A Source Term create mode 100644 data/2022/iclr/Gaussian Mixture Convolution Networks create mode 100644 data/2022/iclr/GeneDisco: A Benchmark for Experimental Design in Drug Discovery create mode 100644 data/2022/iclr/Generalisation in Lifelong Reinforcement Learning through Logical Composition create mode 100644 data/2022/iclr/Generalization Through the Lens of Leave-One-Out Error create mode 100644 data/2022/iclr/Generalization of Neural Combinatorial Solvers Through the Lens of Adversarial Robustness create mode 100644 data/2022/iclr/Generalized Decision Transformer for Offline Hindsight Information Matching create mode 100644 data/2022/iclr/Generalized Demographic Parity for Group Fairness create mode 100644 data/2022/iclr/Generalized Kernel Thinning create mode 100644 data/2022/iclr/Generalized Natural Gradient Flows in Hidden Convex-Concave Games and GANs create mode 100644 data/2022/iclr/Generalized rectifier wavelet covariance models for texture synthesis create mode 100644 data/2022/iclr/Generalizing Few-Shot NAS with Gradient Matching create mode 100644 data/2022/iclr/Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks create mode 100644 data/2022/iclr/Generative Modeling with Optimal Transport Maps create mode 100644 data/2022/iclr/Generative Models as a Data Source for Multiview Representation Learning create mode 100644 data/2022/iclr/Generative Planning for Temporally Coordinated Exploration in Reinforcement Learning create mode 100644 data/2022/iclr/Generative Principal Component Analysis create mode 100644 data/2022/iclr/Generative Pseudo-Inverse Memory create mode 100644 data/2022/iclr/GeoDiff: A 
Geometric Diffusion Model for Molecular Conformation Generation create mode 100644 data/2022/iclr/Geometric Transformers for Protein Interface Contact Prediction create mode 100644 data/2022/iclr/Geometric and Physical Quantities improve E(3) Equivariant Message Passing create mode 100644 data/2022/iclr/Geometry-Consistent Neural Shape Representation with Implicit Displacement Fields create mode 100644 data/2022/iclr/GiraffeDet: A Heavy-Neck Paradigm for Object Detection create mode 100644 data/2022/iclr/Givens Coordinate Descent Methods for Rotation Matrix Learning in Trainable Embedding Indexes create mode 100644 data/2022/iclr/Global Convergence of Multi-Agent Policy Gradient in Markov Potential Games create mode 100644 data/2022/iclr/Goal-Directed Planning via Hindsight Experience Replay create mode 100644 data/2022/iclr/GradMax: Growing Neural Networks using Gradient Information create mode 100644 data/2022/iclr/GradSign: Model Performance Inference with Theoretical Insights create mode 100644 data/2022/iclr/Gradient Importance Learning for Incomplete Observations create mode 100644 data/2022/iclr/Gradient Information Matters in Policy Optimization by Back-propagating through Model create mode 100644 data/2022/iclr/Gradient Matching for Domain Generalization create mode 100644 data/2022/iclr/Gradient Step Denoiser for convergent Plug-and-Play create mode 100644 data/2022/iclr/Granger causal inference on DAGs identifies genomic loci regulating transcription create mode 100644 data/2022/iclr/Graph Auto-Encoder via Neighborhood Wasserstein Reconstruction create mode 100644 data/2022/iclr/Graph Condensation for Graph Neural Networks create mode 100644 data/2022/iclr/Graph Neural Network Guided Local Search for the Traveling Salesperson Problem create mode 100644 data/2022/iclr/Graph Neural Networks with Learnable Structural and Positional Representations create mode 100644 data/2022/iclr/Graph-Augmented Normalizing Flows for Anomaly Detection of Multiple Time 
Series create mode 100644 data/2022/iclr/Graph-Guided Network for Irregularly Sampled Multivariate Time Series create mode 100644 data/2022/iclr/Graph-Relational Domain Adaptation create mode 100644 data/2022/iclr/Graph-based Nearest Neighbor Search in Hyperbolic Spaces create mode 100644 data/2022/iclr/Graph-less Neural Networks: Teaching Old MLPs New Tricks Via Distillation create mode 100644 data/2022/iclr/GraphENS: Neighbor-Aware Ego Network Synthesis for Class-Imbalanced Node Classification create mode 100644 data/2022/iclr/Graphon based Clustering and Testing of Networks: Algorithms and Theory create mode 100644 data/2022/iclr/GreaseLM: Graph REASoning Enhanced Language Models create mode 100644 data/2022/iclr/Group equivariant neural posterior estimation create mode 100644 data/2022/iclr/Group-based Interleaved Pipeline Parallelism for Large-scale DNN Training create mode 100644 data/2022/iclr/HTLM: Hyper-Text Pre-Training and Prompting of Language Models create mode 100644 data/2022/iclr/Half-Inverse Gradients for Physical Deep Learning create mode 100644 data/2022/iclr/Handling Distribution Shifts on Graphs: An Invariance Perspective create mode 100644 data/2022/iclr/Heteroscedastic Temporal Variational Autoencoder For Irregularly Sampled Time Series create mode 100644 data/2022/iclr/Hidden Convexity of Wasserstein GANs: Interpretable Generative Models with Closed-Form Solutions create mode 100644 data/2022/iclr/Hidden Parameter Recurrent State Space Models For Changing Dynamics Scenarios create mode 100644 data/2022/iclr/Hierarchical Few-Shot Imitation with Skill Transition Models create mode 100644 data/2022/iclr/Hierarchical Variational Memory for Few-shot Learning Across Domains create mode 100644 data/2022/iclr/High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad Stepsize create mode 100644 data/2022/iclr/High Probability Generalization Bounds with Fast Rates for Minimax Problems create mode 100644 data/2022/iclr/Hindsight 
Foresight Relabeling for Meta-Reinforcement Learning create mode 100644 data/2022/iclr/Hindsight is 20 20: Leveraging Past Traversals to Aid 3D Perception create mode 100644 data/2022/iclr/Hindsight: Posterior-guided training of retrievers for improved open-ended generation create mode 100644 data/2022/iclr/Hot-Refresh Model Upgrades with Regression-Free Compatible Training in Image Retrieval create mode 100644 data/2022/iclr/How Attentive are Graph Attention Networks? create mode 100644 data/2022/iclr/How Did the Model Change? Efficiently Assessing Machine Learning API Shifts create mode 100644 data/2022/iclr/How Do Vision Transformers Work? create mode 100644 data/2022/iclr/How Does SimSiam Avoid Collapse Without Negative Samples? A Unified Understanding with Self-supervised Contrastive Learning create mode 100644 data/2022/iclr/How Low Can We Go: Trading Memory for Error in Low-Precision Training create mode 100644 data/2022/iclr/How Much Can CLIP Benefit Vision-and-Language Tasks? create mode 100644 data/2022/iclr/How Well Does Self-Supervised Pre-Training Perform with Streaming Data? create mode 100644 data/2022/iclr/How many degrees of freedom do we need to train deep networks: a loss landscape perspective create mode 100644 data/2022/iclr/How to Inject Backdoors with Better Consistency: Logit Anchoring on Clean Data create mode 100644 data/2022/iclr/How to Robustify Black-Box ML Models? A Zeroth-Order Optimization Perspective create mode 100644 data/2022/iclr/How to Train Your MAML to Excel in Few-Shot Classification create mode 100644 data/2022/iclr/How to deal with missing data in supervised deep learning? create mode 100644 data/2022/iclr/How unlabeled data improve generalization in self-training? 
A one-hidden-layer theoretical analysis create mode 100644 data/2022/iclr/Huber Additive Models for Non-stationary Time Series Analysis create mode 100644 data/2022/iclr/HyAR: Addressing Discrete-Continuous Action Reinforcement Learning via Hybrid Action Representation create mode 100644 data/2022/iclr/Hybrid Local SGD for Federated Learning with Heterogeneous Communications create mode 100644 data/2022/iclr/Hybrid Memoised Wake-Sleep: Approximate Inference at the Discrete-Continuous Interface create mode 100644 data/2022/iclr/Hybrid Random Features create mode 100644 data/2022/iclr/HyperDQN: A Randomized Exploration Method for Deep Reinforcement Learning create mode 100644 data/2022/iclr/Hyperparameter Tuning with Renyi Differential Privacy create mode 100644 data/2022/iclr/IFR-Explore: Learning Inter-object Functional Relationships in 3D Indoor Scenes create mode 100644 data/2022/iclr/IGLU: Efficient GCN Training via Lazy Updates create mode 100644 data/2022/iclr/Igeood: An Information Geometry Approach to Out-of-Distribution Detection create mode 100644 data/2022/iclr/Illiterate DALL-E Learns to Compose create mode 100644 data/2022/iclr/Image BERT Pre-training with Online Tokenizer create mode 100644 data/2022/iclr/Imbedding Deep Neural Networks create mode 100644 data/2022/iclr/Imitation Learning by Reinforcement Learning create mode 100644 data/2022/iclr/Imitation Learning from Observations under Transition Model Disparity create mode 100644 data/2022/iclr/Implicit Bias of Adversarial Training for Deep Neural Networks create mode 100644 data/2022/iclr/Implicit Bias of MSE Gradient Optimization in Underparameterized Neural Networks create mode 100644 data/2022/iclr/Implicit Bias of Projected Subgradient Method Gives Provable Robust Recovery of Subspaces of Unknown Codimension create mode 100644 data/2022/iclr/Improved deterministic l2 robustness on CIFAR-10 and CIFAR-100 create mode 100644 data/2022/iclr/Improving Federated Learning Face Recognition via 
Privacy-Agnostic Clusters create mode 100644 data/2022/iclr/Improving Mutual Information Estimation with Annealed and Energy-Based Bounds create mode 100644 data/2022/iclr/Improving Non-Autoregressive Translation Models Without Distillation create mode 100644 data/2022/iclr/Improving the Accuracy of Learning Example Weights for Imbalance Classification create mode 100644 data/2022/iclr/In a Nutshell, the Human Asked for This: Latent Goals for Following Temporal Specifications create mode 100644 data/2022/iclr/Increasing the Cost of Model Extraction with Calibrated Proof of Work create mode 100644 data/2022/iclr/Incremental False Negative Detection for Contrastive Learning create mode 100644 data/2022/iclr/Independent SE(3)-Equivariant Models for End-to-End Rigid Protein Docking create mode 100644 data/2022/iclr/Inductive Relation Prediction Using Analogy Subgraph Embeddings create mode 100644 data/2022/iclr/InfinityGAN: Towards Infinite-Pixel Image Synthesis create mode 100644 data/2022/iclr/Information Bottleneck: Exact Analysis of (Quantized) Neural Networks create mode 100644 data/2022/iclr/Information Gain Propagation: a New Way to Graph Active Learning with Soft Labels create mode 100644 data/2022/iclr/Information Prioritization through Empowerment in Visual Model-based RL create mode 100644 data/2022/iclr/Information-theoretic Online Memory Selection for Continual Learning create mode 100644 data/2022/iclr/IntSGD: Adaptive Floatless Compression of Stochastic Gradients create mode 100644 data/2022/iclr/Interacting Contour Stochastic Gradient Langevin Dynamics create mode 100644 data/2022/iclr/Interpretable Unsupervised Diversity Denoising and Artefact Removal create mode 100644 data/2022/iclr/Invariant Causal Representation Learning for Out-of-Distribution Generalization create mode 100644 data/2022/iclr/Inverse Online Learning: Understanding Non-Stationary and Reactionary Policies create mode 100644 data/2022/iclr/Is Fairness Only Metric Deep? 
Evaluating and Addressing Subgroup Gaps in Deep Metric Learning create mode 100644 data/2022/iclr/Is High Variance Unavoidable in RL? A Case Study in Continuous Control create mode 100644 data/2022/iclr/Is Homophily a Necessity for Graph Neural Networks? create mode 100644 data/2022/iclr/Is Importance Weighting Incompatible with Interpolating Classifiers? create mode 100644 data/2022/iclr/It Takes Four to Tango: Multiagent Self Play for Automatic Curriculum Generation create mode 100644 data/2022/iclr/It Takes Two to Tango: Mixup for Deep Metric Learning create mode 100644 data/2022/iclr/Iterated Reasoning with Mutual Information in Cooperative and Byzantine Decentralized Teaming create mode 100644 data/2022/iclr/Iterative Refinement Graph Neural Network for Antibody Sequence-Structure Co-design create mode 100644 data/2022/iclr/Joint Shapley values: a measure of joint feature importance create mode 100644 data/2022/iclr/KL Guided Domain Adaptation create mode 100644 data/2022/iclr/Know Thyself: Transferable Visual Control Policies Through Robot-Awareness create mode 100644 data/2022/iclr/Know Your Action Set: Learning Action Relations for Reinforcement Learning create mode 100644 data/2022/iclr/Knowledge Infused Decoding create mode 100644 data/2022/iclr/Knowledge Removal in Sampling-based Bayesian Inference create mode 100644 data/2022/iclr/L0-Sparse Canonical Correlation Analysis create mode 100644 data/2022/iclr/LFPT5: A Unified Framework for Lifelong Few-shot Language Learning Based on Prompt Tuning of T5 create mode 100644 data/2022/iclr/LIGS: Learnable Intrinsic-Reward Generation Selection for Multi-Agent Learning create mode 100644 data/2022/iclr/LORD: Lower-Dimensional Embedding of Log-Signature in Neural Rough Differential Equations create mode 100644 data/2022/iclr/Label Encoding for Regression Networks create mode 100644 data/2022/iclr/Label Leakage and Protection in Two-party Split Learning create mode 100644 data/2022/iclr/Label-Efficient Semantic 
Segmentation with Diffusion Models create mode 100644 data/2022/iclr/Language model compression with weighted low-rank factorization create mode 100644 data/2022/iclr/Language modeling via stochastic processes create mode 100644 data/2022/iclr/Language-biased image classification: evaluation based on semantic representations create mode 100644 data/2022/iclr/Language-driven Semantic Segmentation create mode 100644 data/2022/iclr/Large Language Models Can Be Strong Differentially Private Learners create mode 100644 data/2022/iclr/Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect create mode 100644 data/2022/iclr/Large-Scale Representation Learning on Graphs via Bootstrapping create mode 100644 data/2022/iclr/Latent Image Animator: Learning to Animate Images via Latent Space Navigation create mode 100644 data/2022/iclr/Latent Variable Sequential Set Transformers for Joint Multi-Agent Motion Prediction create mode 100644 data/2022/iclr/Learn Locally, Correct Globally: A Distributed Algorithm for Training Graph Neural Networks create mode 100644 data/2022/iclr/Learnability Lock: Authorized Learnability Control Through Adversarial Invertible Transformations create mode 100644 data/2022/iclr/Learnability of convolutional neural networks for infinite dimensional input via mixed and anisotropic smoothness create mode 100644 data/2022/iclr/Learned Simulators for Turbulence create mode 100644 data/2022/iclr/Learning 3D Representations of Molecular Chirality with Invariance to Bond Rotations create mode 100644 data/2022/iclr/Learning Altruistic Behaviours in Reinforcement Learning without External Rewards create mode 100644 data/2022/iclr/Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction create mode 100644 data/2022/iclr/Learning Causal Models from Conditional Moment Restrictions by Importance Weighting create mode 100644 data/2022/iclr/Learning Continuous Environment Fields via Implicit Functions create mode 100644 
data/2022/iclr/Learning Curves for Gaussian Process Regression with Power-Law Priors and Targets create mode 100644 data/2022/iclr/Learning Curves for SGD on Structured Features create mode 100644 data/2022/iclr/Learning Discrete Structured Variational Auto-Encoder using Natural Evolution Strategies create mode 100644 data/2022/iclr/Learning Disentangled Representation by Exploiting Pretrained Generative Models: A Contrastive Learning View create mode 100644 data/2022/iclr/Learning Distributionally Robust Models at Scale via Composite Optimization create mode 100644 data/2022/iclr/Learning Efficient Image Super-Resolution Networks via Structure-Regularized Pruning create mode 100644 data/2022/iclr/Learning Efficient Online 3D Bin Packing on Packing Configuration Trees create mode 100644 data/2022/iclr/Learning Fast Samplers for Diffusion Models by Differentiating Through Sample Quality create mode 100644 data/2022/iclr/Learning Fast, Learning Slow: A General Continual Learning Method based on Complementary Learning System create mode 100644 data/2022/iclr/Learning Features with Parameter-Free Layers create mode 100644 data/2022/iclr/Learning Generalizable Representations for Reinforcement Learning via Adaptive Meta-learner of Behavioral Similarities create mode 100644 data/2022/iclr/Learning Graphon Mean Field Games and Approximate Nash Equilibria create mode 100644 data/2022/iclr/Learning Guarantees for Graph Convolutional Networks on the Stochastic Block Model create mode 100644 data/2022/iclr/Learning Hierarchical Structures with Differentiable Nondeterministic Stacks create mode 100644 data/2022/iclr/Learning Long-Term Reward Redistribution via Randomized Return Decomposition create mode 100644 data/2022/iclr/Learning Multimodal VAEs through Mutual Supervision create mode 100644 data/2022/iclr/Learning Neural Contextual Bandits through Perturbed Rewards create mode 100644 data/2022/iclr/Learning Object-Oriented Dynamics for Planning from Text create mode 100644 
data/2022/iclr/Learning Optimal Conformal Classifiers create mode 100644 data/2022/iclr/Learning Prototype-oriented Set Representations for Meta-Learning create mode 100644 data/2022/iclr/Learning Pruning-Friendly Networks via Frank-Wolfe: One-Shot, Any-Sparsity, And No Retraining create mode 100644 data/2022/iclr/Learning Representation from Neural Fisher Kernel with Low-rank Approximation create mode 100644 data/2022/iclr/Learning Scenario Representation for Solving Two-stage Stochastic Integer Programs create mode 100644 data/2022/iclr/Learning State Representations via Retracing in Reinforcement Learning create mode 100644 data/2022/iclr/Learning Strides in Convolutional Neural Networks create mode 100644 data/2022/iclr/Learning Super-Features for Image Retrieval create mode 100644 data/2022/iclr/Learning Synthetic Environments and Reward Networks for Reinforcement Learning create mode 100644 data/2022/iclr/Learning Temporally Causal Latent Processes from General Temporal Data create mode 100644 data/2022/iclr/Learning Towards The Largest Margins create mode 100644 data/2022/iclr/Learning Transferable Reward for Query Object Localization with Policy Adaptation create mode 100644 data/2022/iclr/Learning Value Functions from Undirected State-only Experience create mode 100644 data/2022/iclr/Learning Versatile Neural Architectures by Propagating Network Codes create mode 100644 data/2022/iclr/Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers create mode 100644 data/2022/iclr/Learning Weakly-supervised Contrastive Representations create mode 100644 data/2022/iclr/Learning a subspace of policies for online adaptation in Reinforcement Learning create mode 100644 data/2022/iclr/Learning by Directional Gradient Descent create mode 100644 data/2022/iclr/Learning curves for continual learning in neural networks: Self-knowledge transfer and forgetting create mode 100644 data/2022/iclr/Learning meta-features for AutoML create mode 
100644 data/2022/iclr/Learning more skills through optimistic exploration create mode 100644 data/2022/iclr/Learning the Dynamics of Physical Systems from Sparse Observations with Finite Element Networks create mode 100644 data/2022/iclr/Learning to Annotate Part Segmentation with Gradient Matching create mode 100644 data/2022/iclr/Learning to Complete Code with Sketches create mode 100644 data/2022/iclr/Learning to Dequantise with Truncated Flows create mode 100644 data/2022/iclr/Learning to Downsample for Segmentation of Ultra-High Resolution Images create mode 100644 data/2022/iclr/Learning to Extend Molecular Scaffolds with Structural Motifs create mode 100644 data/2022/iclr/Learning to Generalize across Domains on Single Test Samples create mode 100644 data/2022/iclr/Learning to Guide and to be Guided in the Architect-Builder Problem create mode 100644 data/2022/iclr/Learning to Map for Active Semantic Goal Navigation create mode 100644 data/2022/iclr/Learning to Remember Patterns: Pattern Matching Memory Networks for Traffic Forecasting create mode 100644 data/2022/iclr/Learning to Schedule Learning rate with Graph Neural Networks create mode 100644 data/2022/iclr/Learning transferable motor skills with hierarchical latent mixture policies create mode 100644 data/2022/iclr/Learning with Noisy Labels Revisited: A Study Using Real-World Human Annotations create mode 100644 data/2022/iclr/Learning-Augmented $k$-means Clustering create mode 100644 data/2022/iclr/Leveraging Automated Unit Tests for Unsupervised Code Translation create mode 100644 data/2022/iclr/Leveraging unlabeled data to predict out-of-distribution performance create mode 100644 "data/2022/iclr/Likelihood Training of Schr\303\266dinger Bridge using Forward-Backward SDEs Theory" create mode 100644 data/2022/iclr/Linking Emergent and Natural Languages via Corpus Transfer create mode 100644 data/2022/iclr/Lipschitz-constrained Unsupervised Skill Discovery create mode 100644 data/2022/iclr/LoRA: 
Low-Rank Adaptation of Large Language Models create mode 100644 data/2022/iclr/Local Feature Swapping for Generalization in Reinforcement Learning create mode 100644 data/2022/iclr/Long Expressive Memory for Sequence Modeling create mode 100644 data/2022/iclr/Looking Back on Learned Experiences For Class task Incremental Learning create mode 100644 data/2022/iclr/Lossless Compression with Probabilistic Circuits create mode 100644 data/2022/iclr/Lossy Compression with Distribution Shift as Entropy Constrained Optimal Transport create mode 100644 data/2022/iclr/Low-Budget Active Learning via Wasserstein Distance: An Integer Programming Approach create mode 100644 data/2022/iclr/MAML is a Noisy Contrastive Learner in Classification create mode 100644 data/2022/iclr/MCMC Should Mix: Learning Energy-Based Model with Neural Transport Latent Space MCMC create mode 100644 data/2022/iclr/MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling create mode 100644 data/2022/iclr/MT3: Multi-Task Multitrack Music Transcription create mode 100644 data/2022/iclr/MaGNET: Uniform Sampling from Deep Generative Network Manifolds Without Retraining create mode 100644 data/2022/iclr/Machine Learning For Elliptic PDEs: Fast Rate Generalization Bound, Neural Scaling Law and Minimax Optimality create mode 100644 data/2022/iclr/Map Induction: Compositional spatial submap learning for efficient exploration in novel environments create mode 100644 data/2022/iclr/Mapping Language Models to Grounded Conceptual Spaces create mode 100644 data/2022/iclr/Mapping conditional distributions for domain adaptation under generalized target shift create mode 100644 data/2022/iclr/Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning create mode 100644 data/2022/iclr/Maximizing Ensemble Diversity in Deep Reinforcement Learning create mode 100644 data/2022/iclr/Maximum Entropy RL (Provably) Solves Some Robust RL Problems create mode 100644 
data/2022/iclr/Maximum n-times Coverage for Vaccine Design create mode 100644 data/2022/iclr/Measuring CLEVRness: Black-box Testing of Visual Reasoning Models create mode 100644 data/2022/iclr/Measuring the Interpretability of Unsupervised Representations via Quantized Reversed Probing create mode 100644 data/2022/iclr/Memorizing Transformers create mode 100644 data/2022/iclr/Memory Augmented Optimizers for Deep Learning create mode 100644 data/2022/iclr/Memory Replay with Data Compression for Continual Learning create mode 100644 data/2022/iclr/Mention Memory: incorporating textual knowledge into Transformers through entity mention attention create mode 100644 data/2022/iclr/Message Passing Neural PDE Solvers create mode 100644 data/2022/iclr/Meta Discovery: Learning to Discover Novel Classes given Very Limited Data create mode 100644 data/2022/iclr/Meta Learning Low Rank Covariance Factors for Energy Based Deterministic Uncertainty create mode 100644 data/2022/iclr/Meta-Imitation Learning by Watching Video Demonstrations create mode 100644 data/2022/iclr/Meta-Learning with Fewer Tasks through Task Interpolation create mode 100644 data/2022/iclr/MetaMorph: Learning Universal Controllers with Transformers create mode 100644 data/2022/iclr/MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts and Training Conflicts create mode 100644 data/2022/iclr/Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks create mode 100644 data/2022/iclr/Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond create mode 100644 data/2022/iclr/Minimax Optimality (Probably) Doesn't Imply Distribution Learning for GANs create mode 100644 data/2022/iclr/Minimax Optimization with Smooth Algorithmic Adversaries create mode 100644 data/2022/iclr/Mirror Descent Policy Optimization create mode 100644 data/2022/iclr/Missingness Bias in Model Debugging create mode 100644 data/2022/iclr/MoReL: Multi-omics 
Relational Learning create mode 100644 data/2022/iclr/MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer create mode 100644 data/2022/iclr/Model Agnostic Interpretability for Multiple Instance Learning create mode 100644 data/2022/iclr/Model Zoo: A Growing Brain That Learns Continually create mode 100644 data/2022/iclr/Model-Based Offline Meta-Reinforcement Learning with Regularization create mode 100644 data/2022/iclr/Model-augmented Prioritized Experience Replay create mode 100644 data/2022/iclr/Modeling Label Space Interactions in Multi-label Classification using Box Embeddings create mode 100644 data/2022/iclr/Modular Lifelong Reinforcement Learning via Neural Composition create mode 100644 data/2022/iclr/MonoDistill: Learning Spatial Features for Monocular 3D Object Detection create mode 100644 data/2022/iclr/Monotonic Differentiable Sorting Networks create mode 100644 data/2022/iclr/Multi-Agent MDP Homomorphic Networks create mode 100644 data/2022/iclr/Multi-Critic Actor Learning: Teaching RL Policies to Act with Style create mode 100644 data/2022/iclr/Multi-Mode Deep Matrix and Tensor Factorization create mode 100644 data/2022/iclr/Multi-Stage Episodic Control for Strategic Exploration in Text Games create mode 100644 data/2022/iclr/Multi-Task Processes create mode 100644 data/2022/iclr/Multi-objective Optimization by Learning Space Partition create mode 100644 data/2022/iclr/Multimeasurement Generative Models create mode 100644 data/2022/iclr/Multiset-Equivariant Set Prediction with Approximate Implicit Differentiation create mode 100644 data/2022/iclr/Multitask Prompted Training Enables Zero-Shot Task Generalization create mode 100644 data/2022/iclr/NAS-Bench-Suite: NAS Evaluation is (Now) Surprisingly Easy create mode 100644 data/2022/iclr/NASI: Label- and Data-agnostic Neural Architecture Search at Initialization create mode 100644 data/2022/iclr/NASPY: Automated Extraction of Automated Machine Learning Models create mode 
100644 data/2022/iclr/NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training create mode 100644 data/2022/iclr/NODE-GAM: Neural Generalized Additive Model for Interpretable Deep Learning create mode 100644 data/2022/iclr/Natural Language Descriptions of Deep Visual Features create mode 100644 data/2022/iclr/Natural Posterior Network: Deep Bayesian Predictive Uncertainty for Exponential Family Distributions create mode 100644 data/2022/iclr/Near-Optimal Reward-Free Exploration for Linear Mixture MDPs with Plug-in Solver create mode 100644 data/2022/iclr/Near-optimal Offline Reinforcement Learning with Linear Representation: Leveraging Variance Information with Pessimism create mode 100644 data/2022/iclr/Network Augmentation for Tiny Deep Learning create mode 100644 data/2022/iclr/Network Insensitivity to Parameter Noise via Parameter Attack During Training create mode 100644 data/2022/iclr/NeuPL: Neural Population Learning create mode 100644 data/2022/iclr/Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path create mode 100644 data/2022/iclr/Neural Contextual Bandits with Deep Representation and Shallow Exploration create mode 100644 data/2022/iclr/Neural Deep Equilibrium Solvers create mode 100644 data/2022/iclr/Neural Link Prediction with Walk Pooling create mode 100644 data/2022/iclr/Neural Markov Controlled SDE: Stochastic Optimization for Continuous-Time Data create mode 100644 data/2022/iclr/Neural Methods for Logical Reasoning over Knowledge Graphs create mode 100644 data/2022/iclr/Neural Models for Output-Space Invariance in Combinatorial Problems create mode 100644 data/2022/iclr/Neural Network Approximation based on Hausdorff distance of Tropical Zonotopes create mode 100644 data/2022/iclr/Neural Networks as Kernel Learners: The Silent Alignment Effect create mode 100644 data/2022/iclr/Neural Parameter Allocation Search create mode 100644 data/2022/iclr/Neural Processes 
with Stochastic Attention: Paying more attention to the context dataset create mode 100644 data/2022/iclr/Neural Program Synthesis with Query create mode 100644 data/2022/iclr/Neural Relational Inference with Node-Specific Information create mode 100644 data/2022/iclr/Neural Solvers for Fast and Accurate Numerical Optimal Control create mode 100644 data/2022/iclr/Neural Spectral Marked Point Processes create mode 100644 data/2022/iclr/Neural Stochastic Dual Dynamic Programming create mode 100644 data/2022/iclr/Neural Structured Prediction for Inductive Node Classification create mode 100644 data/2022/iclr/Neural Variational Dropout Processes create mode 100644 data/2022/iclr/Neural graphical modelling in continuous-time: consistency guarantees and algorithms create mode 100644 data/2022/iclr/New Insights on Reducing Abrupt Representation Change in Online Continual Learning create mode 100644 data/2022/iclr/No One Representation to Rule Them All: Overlapping Features of Training Methods create mode 100644 data/2022/iclr/No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models create mode 100644 data/2022/iclr/Node Feature Extraction by Self-Supervised Multi-scale Neighborhood Prediction create mode 100644 data/2022/iclr/NodePiece: Compositional and Parameter-Efficient Representations of Large Knowledge Graphs create mode 100644 data/2022/iclr/Noisy Feature Mixup create mode 100644 data/2022/iclr/Non-Linear Operator Approximations for Initial Value Problems create mode 100644 data/2022/iclr/Non-Parallel Text Style Transfer with Self-Parallel Supervision create mode 100644 data/2022/iclr/Non-Transferable Learning: A New Approach for Model Ownership Verification and Applicability Authorization create mode 100644 data/2022/iclr/Nonlinear ICA Using Volume-Preserving Transformations create mode 100644 data/2022/iclr/Normalization of Language Embeddings for Cross-Lingual Alignment create mode 100644 data/2022/iclr/Object 
Dynamics Distillation for Scene Decomposition and Representation create mode 100644 data/2022/iclr/Object Pursuit: Building a Space of Objects via Discriminative Weight Generation create mode 100644 data/2022/iclr/Objects in Semantic Topology create mode 100644 data/2022/iclr/Offline Neural Contextual Bandits: Pessimism, Optimization and Generalization create mode 100644 data/2022/iclr/Offline Reinforcement Learning with Implicit Q-Learning create mode 100644 data/2022/iclr/Offline Reinforcement Learning with Value-based Episodic Memory create mode 100644 data/2022/iclr/Omni-Dimensional Dynamic Convolution create mode 100644 data/2022/iclr/Omni-Scale CNNs: a simple and effective kernel size configuration for time series classification create mode 100644 data/2022/iclr/On Bridging Generic and Personalized Federated Learning for Image Classification create mode 100644 data/2022/iclr/On Covariate Shift of Latent Confounders in Imitation and Reinforcement Learning create mode 100644 data/2022/iclr/On Distributed Adaptive Optimization with Gradient Compression create mode 100644 data/2022/iclr/On Evaluation Metrics for Graph Generative Models create mode 100644 data/2022/iclr/On Improving Adversarial Transferability of Vision Transformers create mode 100644 data/2022/iclr/On Incorporating Inductive Biases into VAEs create mode 100644 data/2022/iclr/On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning create mode 100644 data/2022/iclr/On Non-Random Missing Labels in Semi-Supervised Learning create mode 100644 data/2022/iclr/On Predicting Generalization using GANs create mode 100644 data/2022/iclr/On Redundancy and Diversity in Cell-based Neural Architecture Search create mode 100644 data/2022/iclr/On Robust Prefix-Tuning for Text Classification create mode 100644 data/2022/iclr/On feature learning in neural networks with global convergence guarantees create mode 100644 data/2022/iclr/On the Certified Robustness for Ensemble Models and Beyond 
create mode 100644 data/2022/iclr/On the Connection between Local Attention and Dynamic Depth-wise Convolution
create mode 100644 data/2022/iclr/On the Convergence of Certified Robust Training with Interval Bound Propagation
create mode 100644 data/2022/iclr/On the Convergence of mSGD and AdaGrad for Stochastic Optimization
create mode 100644 data/2022/iclr/On the Convergence of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning
create mode 100644 data/2022/iclr/On the Existence of Universal Lottery Tickets
create mode 100644 data/2022/iclr/On the Generalization of Models Trained with SGD: Information-Theoretic Bounds and Implications
create mode 100644 data/2022/iclr/On the Importance of Difficulty Calibration in Membership Inference Attacks
create mode 100644 data/2022/iclr/On the Importance of Firth Bias Reduction in Few-Shot Classification
create mode 100644 data/2022/iclr/On the Learning and Learnability of Quasimetrics
create mode 100644 data/2022/iclr/On the Limitations of Multimodal VAEs
create mode 100644 data/2022/iclr/On the Optimal Memorization Power of ReLU Neural Networks
create mode 100644 data/2022/iclr/On the Pitfalls of Analyzing Individual Neurons in Language Models
create mode 100644 data/2022/iclr/On the Pitfalls of Heteroscedastic Uncertainty Estimation with Probabilistic Neural Networks
create mode 100644 data/2022/iclr/On the Role of Neural Collapse in Transfer Learning
create mode 100644 data/2022/iclr/On the Uncomputability of Partition Functions in Energy-Based Sequence Models
create mode 100644 data/2022/iclr/On the approximation properties of recurrent encoder-decoder architectures
create mode 100644 data/2022/iclr/On the benefits of maximum likelihood estimation for Regression and Forecasting
create mode 100644 data/2022/iclr/On the relation between statistical learning and perceptual distances
create mode 100644 data/2022/iclr/On the role of population heterogeneity in emergent communication
create mode 100644 data/2022/iclr/On-Policy Model Errors in Reinforcement Learning
create mode 100644 data/2022/iclr/One After Another: Learning Incremental Skills for a Changing World
create mode 100644 data/2022/iclr/Online Ad Hoc Teamwork under Partial Observability
create mode 100644 data/2022/iclr/Online Adversarial Attacks
create mode 100644 data/2022/iclr/Online Continual Learning on Class Incremental Blurry Task Configuration with Anytime Inference
create mode 100644 data/2022/iclr/Online Coreset Selection for Rehearsal-based Continual Learning
create mode 100644 data/2022/iclr/Online Facility Location with Predictions
create mode 100644 data/2022/iclr/Online Hyperparameter Meta-Learning with Hypergradient Distillation
create mode 100644 data/2022/iclr/Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs
create mode 100644 data/2022/iclr/OntoProtein: Protein Pretraining With Gene Ontology Embedding
create mode 100644 data/2022/iclr/Open-Set Recognition: A Good Closed-Set Classifier is All You Need
create mode 100644 data/2022/iclr/Open-World Semi-Supervised Learning
create mode 100644 data/2022/iclr/Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
create mode 100644 data/2022/iclr/Optimal ANN-SNN Conversion for High-accuracy and Ultra-low-latency Spiking Neural Networks
create mode 100644 data/2022/iclr/Optimal Representations for Covariate Shift
create mode 100644 data/2022/iclr/Optimal Transport for Causal Discovery
create mode 100644 data/2022/iclr/Optimal Transport for Long-Tailed Recognition with Learnable Cost Matrix
create mode 100644 data/2022/iclr/Optimization and Adaptive Generalization of Three layer Neural Networks
create mode 100644 data/2022/iclr/Optimization inspired Multi-Branch Equilibrium Models
create mode 100644 data/2022/iclr/Optimizer Amalgamation
create mode 100644 data/2022/iclr/Optimizing Neural Networks with Gradient Lexicase Selection
create mode 100644 data/2022/iclr/Orchestrated Value Mapping for Reinforcement Learning
create mode 100644 data/2022/iclr/Out-of-distribution Generalization in the Presence of Nuisance-Induced Spurious Correlations
create mode 100644 data/2022/iclr/Overcoming The Spectral Bias of Neural Value Approximation
create mode 100644 data/2022/iclr/P-Adapters: Robustly Extracting Factual Information from Language Models with Diverse Prompts
create mode 100644 data/2022/iclr/PAC Prediction Sets Under Covariate Shift
create mode 100644 data/2022/iclr/PAC-Bayes Information Bottleneck
create mode 100644 data/2022/iclr/PEARL: Data Synthesis via Private Embeddings and Adversarial Reconstruction Learning
create mode 100644 data/2022/iclr/PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method
create mode 100644 data/2022/iclr/PF-GNN: Differentiable particle filtering based approximation of universal graph representations
create mode 100644 data/2022/iclr/PI3NN: Out-of-distribution-aware Prediction Intervals from Three Neural Networks
create mode 100644 data/2022/iclr/POETREE: Interpretable Policy Learning with Adaptive Decision Trees
create mode 100644 data/2022/iclr/PSA-GAN: Progressive Self Attention GANs for Synthetic Time Series
create mode 100644 data/2022/iclr/Parallel Training of GRU Networks with a Multi-Grid Solver for Long Sequences
create mode 100644 data/2022/iclr/Pareto Policy Adaptation
create mode 100644 data/2022/iclr/Pareto Policy Pool for Model-based Offline Reinforcement Learning
create mode 100644 data/2022/iclr/Pareto Set Learning for Neural Multi-Objective Combinatorial Optimization
create mode 100644 data/2022/iclr/Partial Wasserstein Adversarial Network for Non-rigid Point Set Registration
create mode 100644 data/2022/iclr/Particle Stochastic Dual Coordinate Ascent: Exponential convergent algorithm for mean field neural network optimization
create mode 100644 data/2022/iclr/Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations?
create mode 100644 data/2022/iclr/Path Auxiliary Proposal for MCMC in Discrete Space
create mode 100644 data/2022/iclr/Path Integral Sampler: A Stochastic Control Approach For Sampling
create mode 100644 data/2022/iclr/Peek-a-Boo: What (More) is Disguised in a Randomly Weighted Neural Network, and How to Find It Efficiently
create mode 100644 data/2022/iclr/Perceiver IO: A General Architecture for Structured Inputs & Outputs
create mode 100644 data/2022/iclr/Permutation Compressors for Provably Faster Distributed Nonconvex Optimization
create mode 100644 data/2022/iclr/Permutation-Based SGD: Is Random Optimal?
create mode 100644 data/2022/iclr/Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning
create mode 100644 data/2022/iclr/Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage
create mode 100644 data/2022/iclr/Phase Collapse in Neural Networks
create mode 100644 data/2022/iclr/Phenomenology of Double Descent in Finite-Width Neural Networks
create mode 100644 data/2022/iclr/PiCO: Contrastive Label Disambiguation for Partial Label Learning
create mode 100644 data/2022/iclr/PipeGCN: Efficient Full-Graph Training of Graph Convolutional Networks with Pipelined Feature Communication
create mode 100644 data/2022/iclr/Pix2seq: A Language Modeling Framework for Object Detection
create mode 100644 data/2022/iclr/Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models
create mode 100644 data/2022/iclr/Planning in Stochastic Environments with a Learned Model
create mode 100644 data/2022/iclr/Plant 'n' Seek: Can You Find the Winning Ticket?
create mode 100644 data/2022/iclr/PoNet: Pooling Network for Efficient Token Mixing in Long Sequences
create mode 100644 data/2022/iclr/Poisoning and Backdooring Contrastive Learning
create mode 100644 data/2022/iclr/Policy Gradients Incorporating the Future
create mode 100644 data/2022/iclr/Policy Smoothing for Provably Robust Reinforcement Learning
create mode 100644 data/2022/iclr/Policy improvement by planning with Gumbel
create mode 100644 data/2022/iclr/PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions
create mode 100644 data/2022/iclr/Possibility Before Utility: Learning And Using Hierarchical Affordances
create mode 100644 data/2022/iclr/Post hoc Explanations may be Ineffective for Detecting Unknown Spurious Correlation
create mode 100644 data/2022/iclr/Post-Training Detection of Backdoor Attacks for Two-Class and Multi-Attack Scenarios
create mode 100644 data/2022/iclr/Practical Conditional Neural Process Via Tractable Dependent Predictions
create mode 100644 data/2022/iclr/Practical Integration via Separable Bijective Networks
create mode 100644 data/2022/iclr/Pre-training Molecular Graph Representation with 3D Geometry
create mode 100644 data/2022/iclr/Predicting Physics in Mesh-reduced Space with Temporal Attention
create mode 100644 data/2022/iclr/Pretrained Language Model in Continual Learning: A Comparative Study
create mode 100644 data/2022/iclr/Pretraining Text Encoders with Adversarial Mixture of Training Signal Generators
create mode 100644 data/2022/iclr/PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior
create mode 100644 data/2022/iclr/Privacy Implications of Shuffling
create mode 100644 data/2022/iclr/Probabilistic Implicit Scene Completion
create mode 100644 data/2022/iclr/Procedural generalization by planning with self-supervised world models
create mode 100644 data/2022/iclr/Programmatic Reinforcement Learning without Oracles
create mode 100644 data/2022/iclr/Progressive Distillation for Fast Sampling of Diffusion Models
create mode 100644 data/2022/iclr/Promoting Saliency From Depth: Deep Unsupervised RGB-D Saliency Detection
create mode 100644 data/2022/iclr/Proof Artifact Co-Training for Theorem Proving with Language Models
create mode 100644 data/2022/iclr/Properties from mechanisms: an equivariance perspective on identifiable representation learning
create mode 100644 data/2022/iclr/Prospect Pruning: Finding Trainable Weights at Initialization using Meta-Gradients
create mode 100644 data/2022/iclr/ProtoRes: Proto-Residual Network for Pose Authoring via Learned Inverse Kinematics
create mode 100644 data/2022/iclr/Prototype memory and attention mechanisms for few shot image generation
create mode 100644 data/2022/iclr/Prototypical Contrastive Predictive Coding
create mode 100644 data/2022/iclr/Provable Adaptation across Multiway Domains via Representation Learning
create mode 100644 data/2022/iclr/Provable Learning-based Algorithm For Sparse Recovery
create mode 100644 data/2022/iclr/Provably Filtering Exogenous Distractors using Multistep Inverse Dynamics
create mode 100644 data/2022/iclr/Provably Robust Adversarial Examples
create mode 100644 data/2022/iclr/Provably convergent quasistatic dynamics for mean-field two-player zero-sum games
create mode 100644 data/2022/iclr/Proving the Lottery Ticket Hypothesis for Convolutional Neural Networks
create mode 100644 data/2022/iclr/Pseudo Numerical Methods for Diffusion Models on Manifolds
create mode 100644 data/2022/iclr/Pseudo-Labeled Auto-Curriculum Learning for Semi-Supervised Keypoint Localization
create mode 100644 data/2022/iclr/Pyraformer: Low-Complexity Pyramidal Attention for Long-Range Time Series Modeling and Forecasting
create mode 100644 data/2022/iclr/QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization
create mode 100644 data/2022/iclr/Quadtree Attention for Vision Transformers
create mode 100644 data/2022/iclr/Quantitative Performance Assessment of CNN Units via Topological Entropy Calculation
create mode 100644 data/2022/iclr/Query Efficient Decision Based Sparse Attacks Against Black-Box Deep Learning Models
create mode 100644 data/2022/iclr/Query Embedding on Hyper-Relational Knowledge Graphs
create mode 100644 data/2022/iclr/R4D: Utilizing Reference Objects for Long-Range Distance Estimation
create mode 100644 data/2022/iclr/R5: Rule Discovery with Reinforced and Recurrent Relational Reasoning
create mode 100644 data/2022/iclr/RISP: Rendering-Invariant State Predictor with Differentiable Simulation and Rendering for Cross-Domain Parameter Estimation
create mode 100644 data/2022/iclr/Random matrices in service of ML footprint: ternary random features with no performance loss
create mode 100644 data/2022/iclr/Real-Time Neural Voice Camouflage
create mode 100644 data/2022/iclr/Recursive Disentanglement Network
create mode 100644 data/2022/iclr/Recycling Model Updates in Federated Learning: Are Gradient Subspaces Low-Rank?
create mode 100644 data/2022/iclr/Reducing Excessive Margin to Achieve a Better Accuracy vs. Robustness Trade-off
create mode 100644 data/2022/iclr/RegionViT: Regional-to-Local Attention for Vision Transformers
create mode 100644 data/2022/iclr/Regularized Autoencoders for Isometric Representation Learning
create mode 100644 data/2022/iclr/Reinforcement Learning in Presence of Discrete Markovian Context Evolution
create mode 100644 data/2022/iclr/Reinforcement Learning under a Multi-agent Predictive State Representation Model: Method and Theory
create mode 100644 data/2022/iclr/Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration
create mode 100644 data/2022/iclr/RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning
create mode 100644 data/2022/iclr/Relating transformers to models and neural representations of the hippocampal formation
create mode 100644 data/2022/iclr/Relational Learning with Variational Bayes
create mode 100644 data/2022/iclr/Relational Multi-Task Learning: Modeling Relations between Data and Tasks
create mode 100644 data/2022/iclr/Relational Surrogate Loss Learning
create mode 100644 data/2022/iclr/RelaxLoss: Defending Membership Inference Attacks without Losing Utility
create mode 100644 data/2022/iclr/Reliable Adversarial Distillation with Unreliable Teachers
create mode 100644 data/2022/iclr/Representation Learning for Online and Offline RL in Low-rank MDPs
create mode 100644 data/2022/iclr/Representation-Agnostic Shape Fields
create mode 100644 data/2022/iclr/Representational Continuity for Unsupervised Continual Learning
create mode 100644 data/2022/iclr/Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings
create mode 100644 data/2022/iclr/Resolving Training Biases via Influence-based Data Relabeling
create mode 100644 data/2022/iclr/Resonance in Weight Space: Covariate Shift Can Drive Divergence of SGD with Momentum
create mode 100644 data/2022/iclr/Responsible Disclosure of Generative Models Using Scalable Fingerprinting
create mode 100644 data/2022/iclr/Rethinking Adversarial Transferability from a Data Distribution Perspective
create mode 100644 data/2022/iclr/Rethinking Class-Prior Estimation for Positive-Unlabeled Learning
create mode 100644 data/2022/iclr/Rethinking Goal-Conditioned Supervised Learning and Its Connection to Offline RL
create mode 100644 data/2022/iclr/Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual MLP Framework
create mode 100644 data/2022/iclr/Rethinking Supervised Pre-Training for Better Downstream Transferring
create mode 100644 data/2022/iclr/Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph
create mode 100644 data/2022/iclr/Reverse Engineering of Imperceptible Adversarial Image Perturbations
create mode 100644 data/2022/iclr/Reversible Instance Normalization for Accurate Time-Series Forecasting against Distribution Shift
create mode 100644 data/2022/iclr/Revisit Kernel Pruning with Lottery Regulated Grouped Convolutions
create mode 100644 data/2022/iclr/Revisiting Design Choices in Offline Model Based Reinforcement Learning
create mode 100644 data/2022/iclr/Revisiting Over-smoothing in BERT from the Perspective of Graph
create mode 100644 data/2022/iclr/Revisiting flow generative models for Out-of-distribution detection
create mode 100644 data/2022/iclr/Reward Uncertainty for Exploration in Preference-based Reinforcement Learning
create mode 100644 data/2022/iclr/Robbing the Fed: Directly Obtaining Private Data in Federated Learning with Modified Models
create mode 100644 data/2022/iclr/Robust Learning Meets Generative Models: Can Proxy Distributions Improve Adversarial Robustness?
create mode 100644 data/2022/iclr/Robust Unlearnable Examples: Protecting Data Privacy Against Adversarial Learning
create mode 100644 data/2022/iclr/Robust and Scalable SDE Learning: A Functional Perspective
create mode 100644 data/2022/iclr/RotoGrad: Gradient Homogenization in Multitask Learning
create mode 100644 data/2022/iclr/RvS: What is Essential for Offline RL via Supervised Learning?
create mode 100644 data/2022/iclr/SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
create mode 100644 data/2022/iclr/SGD Can Converge to Local Maxima
create mode 100644 data/2022/iclr/SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models
create mode 100644 data/2022/iclr/SOSP: Efficiently Capturing Global Correlations by Second-Order Structured Pruning
create mode 100644 data/2022/iclr/SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training
create mode 100644 data/2022/iclr/SQuant: On-the-Fly Data-Free Quantization via Diagonal Hessian Approximation
create mode 100644 data/2022/iclr/SUMNAS: Supernet with Unbiased Meta-Features for Neural Architecture Search
create mode 100644 data/2022/iclr/SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning
create mode 100644 data/2022/iclr/Safe Neurosymbolic Learning with Differentiable Symbolic Execution
create mode 100644 data/2022/iclr/Salient ImageNet: How to discover spurious features in Deep Learning?
create mode 100644 data/2022/iclr/Sample Efficient Deep Reinforcement Learning via Uncertainty Estimation
create mode 100644 data/2022/iclr/Sample Efficient Stochastic Policy Extragradient Algorithm for Zero-Sum Markov Game
create mode 100644 data/2022/iclr/Sample Selection with Uncertainty of Losses for Learning with Noisy Labels
create mode 100644 data/2022/iclr/Sample and Computation Redistribution for Efficient Face Detection
create mode 100644 data/2022/iclr/Sampling with Mirrored Stein Operators
create mode 100644 data/2022/iclr/Scalable One-Pass Optimisation of High-Dimensional Weight-Update Hyperparameters by Implicit Differentiation
create mode 100644 data/2022/iclr/Scalable Sampling for Nonsymmetric Determinantal Point Processes
create mode 100644 data/2022/iclr/Scale Efficiently: Insights from Pretraining and Finetuning Transformers
create mode 100644 data/2022/iclr/Scale Mixtures of Neural Network Gaussian Processes
create mode 100644 data/2022/iclr/Scaling Laws for Neural Machine Translation
create mode 100644 data/2022/iclr/Scarf: Self-Supervised Contrastive Learning using Random Feature Corruption
create mode 100644 data/2022/iclr/Scattering Networks on the Sphere for Scalable and Rotationally Equivariant Spherical CNNs
create mode 100644 data/2022/iclr/Scene Transformer: A unified architecture for predicting future trajectories of multiple agents
create mode 100644 data/2022/iclr/Score-Based Generative Modeling with Critically-Damped Langevin Diffusion
create mode 100644 data/2022/iclr/Selective Ensembles for Consistent Predictions
create mode 100644 data/2022/iclr/Self-Joint Supervised Learning
create mode 100644 data/2022/iclr/Self-Supervised Graph Neural Networks for Improved Electroencephalographic Seizure Analysis
create mode 100644 data/2022/iclr/Self-Supervised Inference in State-Space Models
create mode 100644 data/2022/iclr/Self-Supervision Enhanced Feature Selection with Correlated Gates
create mode 100644 data/2022/iclr/Self-ensemble Adversarial Training for Improved Robustness
create mode 100644 data/2022/iclr/Self-supervised Learning is More Robust to Dataset Imbalance
create mode 100644 data/2022/iclr/Semi-relaxed Gromov-Wasserstein divergence and applications on graphs
create mode 100644 data/2022/iclr/Sequence Approximation using Feedforward Spiking Neural Network for Spatiotemporal Learning: Theory and Optimization Methods
create mode 100644 data/2022/iclr/Sequential Reptile: Inter-Task Gradient Alignment for Multilingual Learning
create mode 100644 data/2022/iclr/Shallow and Deep Networks are Near-Optimal Approximators of Korobov Functions
create mode 100644 data/2022/iclr/Should I Run Offline Reinforcement Learning or Behavioral Cloning?
create mode 100644 data/2022/iclr/Should We Be Pre-training? An Argument for End-task Aware Training as an Alternative
create mode 100644 data/2022/iclr/Shuffle Private Stochastic Convex Optimization
create mode 100644 data/2022/iclr/Signing the Supermask: Keep, Hide, Invert
create mode 100644 data/2022/iclr/SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
create mode 100644 data/2022/iclr/Simple GNN Regularisation for 3D Molecular Property Prediction and Beyond
create mode 100644 data/2022/iclr/SketchODE: Learning neural sketch representation in continuous time
create mode 100644 data/2022/iclr/Skill-based Meta-Reinforcement Learning
create mode 100644 data/2022/iclr/Solving Inverse Problems in Medical Imaging with Score-Based Generative Models
create mode 100644 data/2022/iclr/Sound Adversarial Audio-Visual Navigation
create mode 100644 data/2022/iclr/Sound and Complete Neural Network Repair with Minimality and Locality Guarantees
create mode 100644 data/2022/iclr/Source-Free Adaptation to Measurement Shift via Bottom-Up Feature Restoration
create mode 100644 data/2022/iclr/Space-Time Graph Neural Networks
create mode 100644 data/2022/iclr/Spanning Tree-based Graph Generation for Molecules
create mode 100644 data/2022/iclr/Sparse Attention with Learning to Hash
create mode 100644 data/2022/iclr/Sparse Communication via Mixed Distributions
create mode 100644 data/2022/iclr/Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity
create mode 100644 data/2022/iclr/Sparsity Winning Twice: Better Robust Generalization from More Efficient Training
create mode 100644 data/2022/iclr/Spatial Graph Attention and Curiosity-driven Policy for Antiviral Drug Discovery
create mode 100644 data/2022/iclr/SphereFace2: Binary Classification is All You Need for Deep Face Recognition
create mode 100644 data/2022/iclr/Spherical Message Passing for 3D Molecular Graphs
create mode 100644 data/2022/iclr/Spike-inspired rank coding for fast and accurate recurrent neural networks
create mode 100644 data/2022/iclr/Spread Spurious Attribute: Improving Worst-group Accuracy with Spurious Attribute Estimation
create mode 100644 data/2022/iclr/Sqrt(d) Dimension Dependence of Langevin Monte Carlo
create mode 100644 data/2022/iclr/Stability Regularization for Discrete Representation Learning
create mode 100644 data/2022/iclr/Steerable Partial Differential Operators for Equivariant Neural Networks
create mode 100644 data/2022/iclr/Stein Latent Optimization for Generative Adversarial Networks
create mode 100644 data/2022/iclr/Step-unrolled Denoising Autoencoders for Text Generation
create mode 100644 data/2022/iclr/Stiffness-aware neural network for learning Hamiltonian systems
create mode 100644 data/2022/iclr/Stochastic Training is Not Necessary for Generalization
create mode 100644 data/2022/iclr/Strength of Minibatch Noise in SGD
create mode 100644 data/2022/iclr/Structure-Aware Transformer Policy for Inhomogeneous Multi-Task Reinforcement Learning
create mode 100644 data/2022/iclr/StyleAlign: Analysis and Applications of Aligned StyleGAN Models
create mode 100644 data/2022/iclr/StyleNeRF: A Style-based 3D Aware Generator for High-resolution Image Synthesis
create mode 100644 data/2022/iclr/Subspace Regularizers for Few-Shot Class Incremental Learning
create mode 100644 data/2022/iclr/Superclass-Conditional Gaussian Mixture Model For Learning Fine-Grained Embeddings
create mode 100644 data/2022/iclr/Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
create mode 100644 data/2022/iclr/Surreal-GAN: Semi-Supervised Representation Learning via GAN for uncovering heterogeneous disease-related imaging patterns
create mode 100644 data/2022/iclr/Surrogate Gap Minimization Improves Sharpness-Aware Training
create mode 100644 data/2022/iclr/Surrogate NAS Benchmarks: Going Beyond the Limited Search Spaces of Tabular NAS Benchmarks
create mode 100644 data/2022/iclr/Switch to Generalize: Domain-Switch Learning for Cross-Domain Few-Shot Classification
create mode 100644 data/2022/iclr/Symbolic Learning to Optimize: Towards Interpretability and Scalability
create mode 100644 data/2022/iclr/Synchromesh: Reliable Code Generation from Pre-trained Language Models
create mode 100644 data/2022/iclr/T-WaveNet: A Tree-Structured Wavelet Neural Network for Time Series Signal Analysis
create mode 100644 data/2022/iclr/TAMP-S2GCNets: Coupling Time-Aware Multipersistence Knowledge Representation with Spatio-Supra Graph Convolutional Networks for Time-Series Forecasting
create mode 100644 data/2022/iclr/TAPEX: Table Pre-training via Learning a Neural SQL Executor
create mode 100644 data/2022/iclr/TAda! Temporally-Adaptive Convolutions for Video Understanding
create mode 100644 data/2022/iclr/THOMAS: Trajectory Heatmap Output with learned Multi-Agent Sampling
create mode 100644 data/2022/iclr/TPU-GAN: Learning temporal coherence from dynamic point cloud sequences
create mode 100644 data/2022/iclr/TRAIL: Near-Optimal Imitation Learning with Suboptimal Data
create mode 100644 data/2022/iclr/TRGP: Trust Region Gradient Projection for Continual Learning
create mode 100644 data/2022/iclr/Tackling the Generative Learning Trilemma with Denoising Diffusion GANs
create mode 100644 data/2022/iclr/Taming Sparsely Activated Transformer with Stochastic Experts
create mode 100644 data/2022/iclr/Target-Side Input Augmentation for Sequence to Sequence Generation
create mode 100644 data/2022/iclr/Task Affinity with Maximum Bipartite Matching in Few-Shot Learning
create mode 100644 data/2022/iclr/Task Relatedness-Based Generalization Bounds for Meta Learning
create mode 100644 data/2022/iclr/Task-Induced Representation Learning
create mode 100644 data/2022/iclr/Temporal Alignment Prediction for Supervised Representation Learning and Few-Shot Sequence Classification
create mode 100644 data/2022/iclr/Temporal Efficient Training of Spiking Neural Network via Gradient Re-weighting
create mode 100644 data/2022/iclr/The Boltzmann Policy Distribution: Accounting for Systematic Suboptimality in Human Models
create mode 100644 data/2022/iclr/The Close Relationship Between Contrastive Learning and Meta-Learning
create mode 100644 data/2022/iclr/The Convex Geometry of Backpropagation: Neural Network Gradient Flows Converge to Extreme Points of the Dual Convex Program
create mode 100644 data/2022/iclr/The Effects of Invertibility on the Representational Complexity of Encoders in Variational Autoencoders
create mode 100644 data/2022/iclr/The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
create mode 100644 data/2022/iclr/The Efficiency Misnomer
create mode 100644 data/2022/iclr/The Evolution of Uncertainty of Learning in Games
create mode 100644 data/2022/iclr/The Geometry of Memoryless Stochastic Policy Optimization in Infinite-Horizon POMDPs
create mode 100644 data/2022/iclr/The Hidden Convex Optimization Landscape of Regularized Two-Layer ReLU Networks: an Exact Characterization of Optimal Solutions
create mode 100644 data/2022/iclr/The Inductive Bias of In-Context Learning: Rethinking Pretraining Example Design
create mode 100644 data/2022/iclr/The Information Geometry of Unsupervised Reinforcement Learning
create mode 100644 data/2022/iclr/The MultiBERTs: BERT Reproductions for Robustness Analysis
create mode 100644 data/2022/iclr/The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization
create mode 100644 data/2022/iclr/The Rich Get Richer: Disparate Impact of Semi-Supervised Learning
create mode 100644 data/2022/iclr/The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks
create mode 100644 data/2022/iclr/The Role of Pretrained Representations for the OOD Generalization of RL Agents
create mode 100644 data/2022/iclr/The Spectral Bias of Polynomial Neural Networks
create mode 100644 data/2022/iclr/The Three Stages of Learning Dynamics in High-dimensional Kernel Methods
create mode 100644 data/2022/iclr/The Uncanny Similarity of Recurrence and Depth
create mode 100644 data/2022/iclr/The Unreasonable Effectiveness of Random Pruning: Return of the Most Naive Baseline for Sparse Training
create mode 100644 data/2022/iclr/Tighter Sparse Approximation Bounds for ReLU Neural Networks
create mode 100644 data/2022/iclr/ToM2C: Target-oriented Multi-agent Communication and Cooperation with Theory of Mind
create mode 100644 data/2022/iclr/Top-N: Equivariant Set and Graph Generation without Exchangeability
create mode 100644 data/2022/iclr/Top-label calibration and multiclass-to-binary reductions
create mode 100644 data/2022/iclr/Topological Experience Replay
create mode 100644 data/2022/iclr/Topological Graph Neural Networks
create mode 100644 data/2022/iclr/Topologically Regularized Data Embeddings
create mode 100644 data/2022/iclr/Toward Efficient Low-Precision Training: Data Format Optimization and Hysteresis Quantization
create mode 100644 data/2022/iclr/Toward Faithful Case-based Reasoning through Learning Prototypes in a Nearest Neighbor-friendly Space
create mode 100644 data/2022/iclr/Towards Better Understanding and Better Generalization of Low-shot Classification in Histology Images with Contrastive Learning
create mode 100644 data/2022/iclr/Towards Building A Group-based Unsupervised Representation Disentanglement Framework
create mode 100644 data/2022/iclr/Towards Continual Knowledge Learning of Language Models
create mode 100644 data/2022/iclr/Towards Deepening Graph Neural Networks: A GNTK-based Optimization Perspective
create mode 100644 data/2022/iclr/Towards Deployment-Efficient Reinforcement Learning: Lower Bound and Optimality
create mode 100644 data/2022/iclr/Towards Empirical Sandwich Bounds on the Rate-Distortion Function
create mode 100644 data/2022/iclr/Towards Evaluating the Robustness of Neural Networks Learned by Transduction
create mode 100644 data/2022/iclr/Towards General Function Approximation in Zero-Sum Markov Games
create mode 100644 data/2022/iclr/Towards Model Agnostic Federated Learning Using Knowledge Distillation
create mode 100644 data/2022/iclr/Towards Training Billion Parameter Graph Neural Networks for Atomic Simulations
create mode 100644 data/2022/iclr/Towards Understanding Generalization via Decomposing Excess Risk Dynamics
create mode 100644 data/2022/iclr/Towards Understanding the Data Dependency of Mixup-style Training
create mode 100644 data/2022/iclr/Towards Understanding the Robustness Against Evasion Attack on Categorical Data
create mode 100644 data/2022/iclr/Towards a Unified View of Parameter-Efficient Transfer Learning
create mode 100644 data/2022/iclr/Tracking the risk of a deployed model and detecting harmful distribution shifts
create mode 100644 data/2022/iclr/Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
create mode 100644 data/2022/iclr/Training Data Generating Networks: Shape Reconstruction via Bi-level Optimization
create mode 100644 data/2022/iclr/Training Structured Neural Networks Through Manifold Identification and Variance Reduction
create mode 100644 data/2022/iclr/Training Transition Policies via Distribution Matching for Complex Tasks
create mode 100644 data/2022/iclr/Training invariances and the low-rank phenomenon: beyond linear networks
create mode 100644 data/2022/iclr/Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations
create mode 100644 data/2022/iclr/Transfer RL across Observation Feature Spaces via Model-Based Regularization
create mode 100644 data/2022/iclr/Transferable Adversarial Attack based on Integrated Gradients
create mode 100644 data/2022/iclr/Transform2Act: Learning a Transform-and-Control Policy for Efficient Agent Design
create mode 100644 data/2022/iclr/Transformer Embeddings of Irregularly Spaced Events and Their Participants
create mode 100644 data/2022/iclr/Transformer-based Transform Coding
create mode 100644 data/2022/iclr/Transformers Can Do Bayesian Inference
create mode 100644 data/2022/iclr/Transition to Linearity of Wide Neural Networks is an Emerging Property of Assembling Weak Models
create mode 100644 data/2022/iclr/Triangle and Four Cycle Counting with Predictions in Graph Streams
create mode 100644 data/2022/iclr/Trigger Hunting with a Topological Prior for Trojan Detection
create mode 100644 data/2022/iclr/Trivial or Impossible --- dichotomous data difficulty masks model differences (on ImageNet and beyond)
create mode 100644 data/2022/iclr/Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning
create mode 100644 data/2022/iclr/Tuformer: Data-driven Design of Transformers for Improved Generalization or Efficiency
create mode 100644 data/2022/iclr/Uncertainty Modeling for Out-of-Distribution Generalization
create mode 100644 data/2022/iclr/Understanding Dimensional Collapse in Contrastive Self-supervised Learning
create mode 100644 data/2022/iclr/Understanding Domain Randomization for Sim-to-real Transfer
create mode 100644 data/2022/iclr/Understanding Intrinsic Robustness Using Label Uncertainty
create mode 100644 data/2022/iclr/Understanding Latent Correlation-Based Multiview Learning and Self-Supervision: An Identifiability Perspective
create mode 100644 data/2022/iclr/Understanding and Improving Graph Injection Attack by Promoting Unnoticeability
create mode 100644 data/2022/iclr/Understanding and Leveraging Overparameterization in Recursive Value Estimation
create mode 100644 data/2022/iclr/Understanding and Preventing Capacity Loss in Reinforcement Learning
create mode 100644 data/2022/iclr/Understanding approximate and unrolled dictionary learning for pattern recovery
create mode 100644 data/2022/iclr/Understanding over-squashing and bottlenecks on graphs via curvature
create mode 100644 data/2022/iclr/Understanding the Role of Self Attention for Efficient Speech Recognition
create mode 100644 data/2022/iclr/Understanding the Variance Collapse of SVGD in High Dimensions
create mode 100644 data/2022/iclr/UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning
create mode 100644 data/2022/iclr/Unified Visual Transformer Compression
create mode 100644 data/2022/iclr/Unifying Likelihood-free Inference with Black-box Optimization and Beyond
create mode 100644 data/2022/iclr/Universal Approximation Under Constraints is Possible with Transformers
create mode 100644 data/2022/iclr/Universalizing Weak Supervision
create mode 100644 data/2022/iclr/Unraveling Model-Agnostic Meta-Learning via The Adaptation Learning Rate
create mode 100644 data/2022/iclr/Unrolling PALM for Sparse Semi-Blind Source Separation
create mode 100644 data/2022/iclr/Unsupervised Discovery of Object Radiance Fields create mode 100644 data/2022/iclr/Unsupervised Disentanglement with Tensor Product Representations on the Torus create mode 100644 data/2022/iclr/Unsupervised Learning of Full-Waveform Inversion: Connecting CNN and Partial Differential Equation in a Loop create mode 100644 data/2022/iclr/Unsupervised Semantic Segmentation by Distilling Feature Correspondences create mode 100644 data/2022/iclr/Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling create mode 100644 data/2022/iclr/Using Graph Representation Learning with Schema Encoders to Measure the Severity of Depressive Symptoms create mode 100644 data/2022/iclr/VAE Approximation Error: ELBO and Exponential Families create mode 100644 data/2022/iclr/VAT-Mart: Learning Visual Action Trajectory Proposals for Manipulating 3D ARTiculated Objects create mode 100644 data/2022/iclr/VC dimension of partially quantized neural networks in the overparametrized regime create mode 100644 data/2022/iclr/VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning create mode 100644 data/2022/iclr/VOS: Learning What You Don't Know by Virtual Outlier Synthesis create mode 100644 data/2022/iclr/Value Function Spaces: Skill-Centric State Abstractions for Long-Horizon Reasoning create mode 100644 data/2022/iclr/Value Gradient weighted Model-Based Reinforcement Learning create mode 100644 data/2022/iclr/Variational Inference for Discriminative Learning with Generative Modeling of Feature Incompletion create mode 100644 data/2022/iclr/Variational Neural Cellular Automata create mode 100644 data/2022/iclr/Variational Predictive Routing with Nested Subjective Timescales create mode 100644 data/2022/iclr/Variational autoencoders in the presence of low-dimensional data: landscape and implicit bias create mode 100644 data/2022/iclr/Variational methods for simulation-based inference create mode 100644 
data/2022/iclr/Variational oracle guiding for reinforcement learning create mode 100644 data/2022/iclr/Vector-quantized Image Modeling with Improved VQGAN create mode 100644 data/2022/iclr/ViDT: An Efficient and Effective Fully Transformer-based Object Detector create mode 100644 data/2022/iclr/ViTGAN: Training GANs with Vision Transformers create mode 100644 data/2022/iclr/Vision-Based Manipulators Need to Also See from Their Hands create mode 100644 data/2022/iclr/Visual Correspondence Hallucination create mode 100644 data/2022/iclr/Visual Representation Learning Does Not Generalize Strongly Within the Same Domain create mode 100644 data/2022/iclr/Visual Representation Learning over Latent Domains create mode 100644 data/2022/iclr/Visual hyperacuity with moving sensor and recurrent neural computations create mode 100644 data/2022/iclr/Vitruvion: A Generative Model of Parametric CAD Sketches create mode 100644 data/2022/iclr/W-CTC: a Connectionist Temporal Classification Loss with Wild Cards create mode 100644 data/2022/iclr/WeakM3D: Towards Weakly Supervised Monocular 3D Object Detection create mode 100644 data/2022/iclr/What Do We Mean by Generalization in Federated Learning? create mode 100644 data/2022/iclr/What Happens after SGD Reaches Zero Loss? --A Mathematical Framework create mode 100644 data/2022/iclr/What Makes Better Augmentation Strategies? Augment Difficult but Not too Different create mode 100644 data/2022/iclr/What's Wrong with Deep Learning in Tree Search for Combinatorial Optimization create mode 100644 data/2022/iclr/When Can We Learn General-Sum Markov Games with a Large Number of Players Sample-Efficiently? create mode 100644 data/2022/iclr/When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations create mode 100644 data/2022/iclr/When should agents explore? create mode 100644 data/2022/iclr/When, Why, and Which Pretrained GANs Are Useful? 
create mode 100644 data/2022/iclr/Which Shortcut Cues Will DNNs Choose? A Study from the Parameter-Space Perspective create mode 100644 data/2022/iclr/Who Is Your Right Mixup Partner in Positive and Unlabeled Learning create mode 100644 data/2022/iclr/Who Is the Strongest Enemy? Towards Optimal and Efficient Evasion Attacks in Deep RL create mode 100644 data/2022/iclr/Why Propagate Alone? Parallel Use of Labels and Features on Graphs create mode 100644 data/2022/iclr/Wiring Up Vision: Minimizing Supervised Synaptic Updates Needed to Produce a Primate Ventral Stream create mode 100644 data/2022/iclr/Wisdom of Committees: An Overlooked Approach To Faster and More Accurate Models create mode 100644 data/2022/iclr/Wish you were here: Hindsight Goal Selection for long-horizon dexterous manipulation create mode 100644 data/2022/iclr/X-model: Improving Data Efficiency in Deep Learning with A Minimax Model create mode 100644 data/2022/iclr/You Mostly Walk Alone: Analyzing Feature Attribution in Trajectory Prediction create mode 100644 data/2022/iclr/You are AllSet: A Multiset Function Framework for Hypergraph Neural Networks create mode 100644 data/2022/iclr/Zero Pixel Directional Boundary by Vector Transform create mode 100644 data/2022/iclr/Zero-CL: Instance and Feature decorrelation for negative-free symmetric contrastive learning create mode 100644 data/2022/iclr/Zero-Shot Self-Supervised Learning for MRI Reconstruction create mode 100644 data/2022/iclr/ZeroFL: Efficient On-Device Training for Federated Learning with Local Sparsity create mode 100644 data/2022/iclr/cosFormer: Rethinking Softmax In Attention create mode 100644 data/2022/iclr/iFlood: A Stable and Effective Regularizer create mode 100644 data/2022/iclr/iLQR-VAE : control-based learning of input-driven dynamics with applications to neural data create mode 100644 data/2022/iclr/miniF2F: a cross-system benchmark for formal Olympiad-level mathematics create mode 100644 data/2022/iclr/switch-GLAT: Multilingual 
Parallel Machine Translation Via Code-Switch Decoder create mode 100644 data/2023/iclr/A Multi-Grained Self-Interpretable Symbolic-Neural Model For Single Multi-Labeled Text Classification create mode 100644 data/2023/iclr/A Unified Framework for Soft Threshold Pruning create mode 100644 data/2023/iclr/Achieve the Minimum Width of Neural Networks for Universal Approximation create mode 100644 data/2023/iclr/BEVDistill: Cross-Modal BEV Distillation for Multi-View 3D Object Detection create mode 100644 data/2023/iclr/Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining create mode 100644 data/2023/iclr/Continuous-Discrete Convolution for Geometry-Sequence Modeling in Proteins create mode 100644 data/2023/iclr/DAG Matters! GFlowNets Enhanced Explainer for Graph Neural Networks create mode 100644 data/2023/iclr/Delving into Semantic Scale Imbalance create mode 100644 data/2023/iclr/Diagnosing and Rectifying Vision Models using Language create mode 100644 data/2023/iclr/Diversify and Disambiguate: Out-of-Distribution Robustness via Disagreement create mode 100644 data/2023/iclr/DropIT: Dropping Intermediate Tensors for Memory-Efficient DNN Training create mode 100644 data/2023/iclr/DualAfford: Learning Collaborative Visual Affordance for Dual-gripper Manipulation create mode 100644 data/2023/iclr/Guiding Safe Exploration with Weakest Preconditions create mode 100644 data/2023/iclr/H2RBox: Horizontal Box Annotation is All You Need for Oriented Object Detection create mode 100644 data/2023/iclr/Harnessing Out-Of-Distribution Examples via Augmenting Content and Style create mode 100644 data/2023/iclr/IDEAL: Query-Efficient Data-Free Learning from Black-Box Models create mode 100644 data/2023/iclr/Learning Domain-Agnostic Representation for Disease Diagnosis create mode 100644 data/2023/iclr/Logical Entity Representation in Knowledge-Graphs for Differentiable Rule Learning create mode 100644 
data/2023/iclr/Molecular Geometry Pretraining with SE(3)-Invariant Denoising Distance Matching create mode 100644 data/2023/iclr/On amortizing convex conjugates for optimal transport create mode 100644 data/2023/iclr/Patch-Level Contrasting without Patch Correspondence for Accurate and Dense Contrastive Representation Learning create mode 100644 data/2023/iclr/Pushing the Limits of Fewshot Anomaly Detection in Industry Vision: Graphcore create mode 100644 data/2023/iclr/Representation Learning for Low-rank General-sum Markov Games create mode 100644 data/2023/iclr/SIMPLE: Specialized Model-Sample Matching for Domain Generalization create mode 100644 data/2023/iclr/Suppressing the Heterogeneity: A Strong Feature Extractor for Few-shot Segmentation create mode 100644 data/2023/iclr/Surgical Fine-Tuning Improves Adaptation to Distribution Shifts create mode 100644 data/2023/iclr/TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding create mode 100644 data/2023/iclr/The Augmented Image Prior: Distilling 1000 Classes by Extrapolating from a Single Image create mode 100644 data/2023/iclr/Trainability Preserving Neural Pruning diff --git a/data/2020/iclr/A Constructive Prediction of the Generalization Error Across Scales b/data/2020/iclr/A Constructive Prediction of the Generalization Error Across Scales new file mode 100644 index 0000000000..80d725bd5c --- /dev/null +++ b/data/2020/iclr/A Constructive Prediction of the Generalization Error Across Scales @@ -0,0 +1 @@ +The dependency of the generalization error of neural networks on model and dataset size is of critical importance both in practice and for understanding the theory of neural networks. Nevertheless, the functional form of this dependency remains elusive. In this work, we present a functional form which approximates well the generalization error in practice. 
Capitalizing on the successful concept of model scaling (e.g., width, depth), we are able to simultaneously construct such a form and specify the exact models which can attain it across model/data scales. Our construction follows insights obtained from observations conducted over a range of model/data scales, in various model types and datasets, in vision and language tasks. We show that the form both fits the observations well across scales, and provides accurate predictions from small- to large-scale models and data. \ No newline at end of file diff --git a/data/2020/iclr/A Fair Comparison of Graph Neural Networks for Graph Classification b/data/2020/iclr/A Fair Comparison of Graph Neural Networks for Graph Classification new file mode 100644 index 0000000000..9e2ebcee28 --- /dev/null +++ b/data/2020/iclr/A Fair Comparison of Graph Neural Networks for Graph Classification @@ -0,0 +1 @@ +Experimental reproducibility and replicability are critical topics in machine learning. Authors have often raised concerns about the lack of both in scientific publications, in an effort to improve the quality of the field. Recently, the graph representation learning field has attracted the attention of a wide research community, which resulted in a large stream of works. As such, several Graph Neural Network models have been developed to effectively tackle graph classification. However, experimental procedures often lack rigor and are hardly reproducible. Motivated by this, we provide an overview of common practices that should be avoided to fairly compare with the state of the art. To counter this troubling trend, we ran more than 47000 experiments in a controlled and uniform framework to re-evaluate five popular models across nine common benchmarks. Moreover, by comparing GNNs with structure-agnostic baselines we provide convincing evidence that, on some datasets, structural information has not been exploited yet. 
We believe that this work can contribute to the development of the graph learning field, by providing a much needed grounding for rigorous evaluations of graph classification models. \ No newline at end of file diff --git a/data/2020/iclr/A Learning-based Iterative Method for Solving Vehicle Routing Problems b/data/2020/iclr/A Learning-based Iterative Method for Solving Vehicle Routing Problems new file mode 100644 index 0000000000..916201a647 --- /dev/null +++ b/data/2020/iclr/A Learning-based Iterative Method for Solving Vehicle Routing Problems @@ -0,0 +1 @@ +This paper is concerned with solving combinatorial optimization problems, in particular, the capacitated vehicle routing problems (CVRP). Classical Operations Research (OR) algorithms such as LKH3 (Helsgaun, 2017) are extremely inefficient (e.g., 13 hours on CVRP of only size 100) and difficult to scale to larger-size problems. Machine learning based approaches have recently been shown to be promising, partly because of their efficiency (once trained, they can perform solving within minutes or even seconds). However, there is still a considerable gap between the quality of a machine learned solution and what OR methods can offer (e.g., on CVRP-100, the best result of learned solutions is between 16.10-16.80, significantly worse than LKH3's 15.65). In this paper, we present the first learning based approach for CVRP that is efficient in solving speed and at the same time outperforms OR methods. Starting with a random initial solution, our algorithm learns to iteratively refine the solution with an improvement operator, selected by a reinforcement learning based controller. The improvement operator is selected from a pool of powerful operators that are customized for routing problems. By combining the strengths of the two worlds, our approach achieves new state-of-the-art results on CVRP, e.g., an average cost of 15.57 on CVRP-100. 
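The learn-to-improve scheme described in the CVRP abstract above (start from a random solution, then repeatedly apply an improvement operator) can be illustrated in a few lines. The sketch below is purely illustrative and not the paper's method: a classical 2-opt reversal operator with a greedy acceptance rule stands in for the learned, RL-selected operator pool, and a plain TSP-style tour stands in for a capacitated routing solution.

```python
import math
import random

def tour_length(points, tour):
    """Total length of a closed tour over 2-D points."""
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def two_opt_step(points, tour):
    """Apply the first improving 2-opt segment reversal; return (tour, improved?)."""
    best = tour_length(points, tour)
    n = len(tour)
    for i in range(n - 1):
        for j in range(i + 2, n):
            # Reverse the segment between positions i+1 and j (a 2-opt move).
            cand = tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
            if tour_length(points, cand) < best:
                return cand, True
    return tour, False

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(12)]
tour = list(range(12))           # random initial solution
improved = True
while improved:                  # iterative refinement loop
    tour, improved = two_opt_step(pts, tour)
```

In the paper, the choice of which operator to apply would be made by a learned controller rather than this fixed greedy rule.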
\ No newline at end of file diff --git a/data/2020/iclr/A Target-Agnostic Attack on Deep Models: Exploiting Security Vulnerabilities of Transfer Learning b/data/2020/iclr/A Target-Agnostic Attack on Deep Models: Exploiting Security Vulnerabilities of Transfer Learning new file mode 100644 index 0000000000..2790a87f00 --- /dev/null +++ b/data/2020/iclr/A Target-Agnostic Attack on Deep Models: Exploiting Security Vulnerabilities of Transfer Learning @@ -0,0 +1 @@ +Due to insufficient training data and the high computational cost to train a deep neural network from scratch, transfer learning has been extensively used in many deep-neural-network-based applications. A commonly used transfer learning approach involves taking a part of a pre-trained model, adding a few layers at the end, and re-training the new layers with a small dataset. This approach, while efficient and widely used, introduces a security vulnerability because the pre-trained model used in transfer learning is usually publicly available, including to potential attackers. In this paper, we show that without any additional knowledge other than the pre-trained model, an attacker can launch an effective and efficient brute force attack that can craft instances of input to trigger each target class with high confidence. We assume that the attacker has no access to any target-specific information, including samples from target classes, the re-trained model, and the probabilities assigned by Softmax to each class, thus making the attack target-agnostic. These assumptions render all previous attack models inapplicable, to the best of our knowledge. To evaluate the proposed attack, we perform a set of experiments on face recognition and speech recognition tasks and show the effectiveness of the attack. Our work reveals a fundamental security weakness of the Softmax layer when used in transfer learning settings. 
\ No newline at end of file diff --git a/data/2020/iclr/A Theoretical Analysis of the Number of Shots in Few-Shot Learning b/data/2020/iclr/A Theoretical Analysis of the Number of Shots in Few-Shot Learning new file mode 100644 index 0000000000..70348df253 --- /dev/null +++ b/data/2020/iclr/A Theoretical Analysis of the Number of Shots in Few-Shot Learning @@ -0,0 +1 @@ +Few-shot classification is the task of predicting the category of an example from a set of few labeled examples. The number of labeled examples per category is called the number of shots (or shot number). Recent works tackle this task through meta-learning, where a meta-learner extracts information from observed tasks during meta-training to quickly adapt to new tasks during meta-testing. In this formulation, the number of shots exploited during meta-training has an impact on the recognition performance at meta-test time. Generally, the shot number used in meta-training should match the one used in meta-testing to obtain the best performance. We introduce a theoretical analysis of the impact of the shot number on Prototypical Networks, a state-of-the-art few-shot classification method. From our analysis, we propose a simple method that is robust to the choice of shot number used during meta-training, which is a crucial hyperparameter. Our model, trained with an arbitrary meta-training shot number, performs well across different values of the meta-testing shot number. We experimentally demonstrate our approach on different few-shot classification benchmarks. 
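Since the analysis above targets Prototypical Networks, a minimal sketch of that method's classification rule may help: each class prototype is the mean embedding of that class's shot examples, and each query is assigned to its nearest prototype. This is a generic NumPy illustration on made-up toy embeddings; all names and data here are assumptions, not from the paper.

```python
import numpy as np

def prototypes(support_x, support_y):
    """Class prototypes = mean embedding of each class's support (shot) examples."""
    classes = np.unique(support_y)
    return classes, np.stack([support_x[support_y == c].mean(axis=0) for c in classes])

def classify(query_x, classes, protos):
    """Assign each query to the class of its nearest prototype (squared Euclidean)."""
    d = ((query_x[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return classes[d.argmin(axis=1)]

rng = np.random.default_rng(0)
# Toy episode: 2 classes, 5 shots each, in a 4-d embedding space.
support_x = np.concatenate([rng.normal(0, 0.1, (5, 4)), rng.normal(1, 0.1, (5, 4))])
support_y = np.array([0] * 5 + [1] * 5)
classes, protos = prototypes(support_x, support_y)
pred = classify(np.array([[0.0, 0, 0, 0], [1.0, 1, 1, 1]]), classes, protos)
```

The shot number enters exactly through how many support examples are averaged into each prototype, which is the quantity the paper's analysis studies.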
\ No newline at end of file diff --git a/data/2020/iclr/A critical analysis of self-supervision, or what we can learn from a single image b/data/2020/iclr/A critical analysis of self-supervision, or what we can learn from a single image new file mode 100644 index 0000000000..7f44c17d0e --- /dev/null +++ b/data/2020/iclr/A critical analysis of self-supervision, or what we can learn from a single image @@ -0,0 +1 @@ +We look critically at popular self-supervision techniques for learning deep convolutional neural networks without manual labels. We show that three different and representative methods, BiGAN, RotNet and DeepCluster, can learn the first few layers of a convolutional network from a single image as well as using millions of images and manual labels, provided that strong data augmentation is used. However, for deeper layers the gap with manual supervision cannot be closed even if millions of unlabelled images are used for training. We conclude that: (1) the weights of the early layers of deep networks contain limited information about the statistics of natural images, that (2) such low-level statistics can be learned through self-supervision just as well as through strong supervision, and that (3) the low-level statistics can be captured via synthetic transformations instead of using a large image dataset. \ No newline at end of file diff --git a/data/2020/iclr/AMRL: Aggregated Memory For Reinforcement Learning b/data/2020/iclr/AMRL: Aggregated Memory For Reinforcement Learning new file mode 100644 index 0000000000..659572235a --- /dev/null +++ b/data/2020/iclr/AMRL: Aggregated Memory For Reinforcement Learning @@ -0,0 +1 @@ +In many partially observable scenarios, Reinforcement Learning (RL) agents must rely on long-term memory in order to learn an optimal policy. We demonstrate that using techniques from NLP and supervised learning fails at RL tasks due to stochasticity from the environment and from exploration. 
Utilizing our insights on the limitations of traditional memory methods in RL, we propose AMRL, a class of models that can learn better policies with greater sample efficiency and are resilient to noisy inputs. Specifically, our models use a standard memory module to summarize short-term context, and then aggregate all prior states from the standard model without respect to order. We show that this provides advantages both in terms of gradient decay and signal-to-noise ratio over time. Evaluating in Minecraft and maze environments that test long-term memory, we find that our model improves average return by 19% over a baseline that has the same number of parameters and by 9% over a stronger baseline that has far more parameters. \ No newline at end of file diff --git a/data/2020/iclr/Accelerating SGD with momentum for over-parameterized learning b/data/2020/iclr/Accelerating SGD with momentum for over-parameterized learning new file mode 100644 index 0000000000..a6c8adbb0a --- /dev/null +++ b/data/2020/iclr/Accelerating SGD with momentum for over-parameterized learning @@ -0,0 +1,4 @@ +Nesterov SGD is widely used for training modern neural networks and other machine learning models. Yet, its advantages over SGD have not been theoretically clarified. Indeed, as we show in our paper, both theoretically and empirically, Nesterov SGD with any parameter selection does not in general provide acceleration over ordinary SGD. Furthermore, Nesterov SGD may diverge for step sizes that ensure convergence of ordinary SGD. This is in contrast to the classical results in the deterministic scenario, where the same step size ensures accelerated convergence of Nesterov's method over optimal gradient descent. +To address the non-acceleration issue, we introduce a compensation term to Nesterov SGD. The resulting algorithm, which we call MaSS, converges for the same step sizes as SGD. We prove that MaSS obtains an accelerated convergence rate over SGD for any mini-batch size in the linear setting. For full batch, the convergence rate of MaSS matches the well-known accelerated rate of Nesterov's method. +We also analyze the practically important question of the dependence of the convergence rate and optimal hyper-parameters on the mini-batch size, demonstrating three distinct regimes: linear scaling, diminishing returns and saturation. +Experimental evaluation of MaSS for several standard architectures of deep networks, including ResNet and convolutional networks, shows improved performance over SGD, Nesterov SGD and Adam. \ No newline at end of file diff --git a/data/2020/iclr/Action Semantics Network: Considering the Effects of Actions in Multiagent Systems b/data/2020/iclr/Action Semantics Network: Considering the Effects of Actions in Multiagent Systems new file mode 100644 index 0000000000..ce017c4309 --- /dev/null +++ b/data/2020/iclr/Action Semantics Network: Considering the Effects of Actions in Multiagent Systems @@ -0,0 +1 @@ +In multiagent systems (MASs), each agent makes individual decisions but all of them contribute globally to the system evolution. Learning in MASs is difficult since each agent's selection of actions must take place in the presence of other co-learning agents. Moreover, the environmental stochasticity and uncertainties increase exponentially with the increase in the number of agents. Previous works have incorporated various multiagent coordination mechanisms into deep learning architectures to facilitate multiagent coordination. However, none of them explicitly considers the action semantics between agents, i.e., that different actions have different influences on other agents. In this paper, we propose a novel network architecture, named Action Semantics Network (ASN), that explicitly represents such action semantics between agents. 
ASN characterizes different actions' influence on other agents using neural networks based on the action semantics between them. ASN can be easily combined with existing deep reinforcement learning (DRL) algorithms to boost their performance. Experimental results on StarCraft II micromanagement and Neural MMO show ASN significantly improves the performance of state-of-the-art DRL approaches compared with several network architectures. \ No newline at end of file diff --git a/data/2020/iclr/Actor-Critic Provably Finds Nash Equilibria of Linear-Quadratic Mean-Field Games b/data/2020/iclr/Actor-Critic Provably Finds Nash Equilibria of Linear-Quadratic Mean-Field Games new file mode 100644 index 0000000000..a2a8b4ae62 --- /dev/null +++ b/data/2020/iclr/Actor-Critic Provably Finds Nash Equilibria of Linear-Quadratic Mean-Field Games @@ -0,0 +1 @@ +We study discrete-time mean-field Markov games with an infinite number of agents where each agent aims to minimize its ergodic cost. We consider the setting where the agents have identical linear state transitions and quadratic cost functions, while the aggregated effect of the agents is captured by the population mean of their states, namely, the mean-field state. For such a game, based on the Nash certainty equivalence principle, we provide sufficient conditions for the existence and uniqueness of its Nash equilibrium. Moreover, to find the Nash equilibrium, we propose a mean-field actor-critic algorithm with linear function approximation, which does not require knowing the model of dynamics. Specifically, at each iteration of our algorithm, we use the single-agent actor-critic algorithm to approximately obtain the optimal policy of each agent given the current mean-field state, and then update the mean-field state. In particular, we prove that our algorithm converges to the Nash equilibrium at a linear rate. 
To the best of our knowledge, this is the first success of applying model-free reinforcement learning with function approximation to discrete-time mean-field Markov games with provable non-asymptotic global convergence guarantees. \ No newline at end of file diff --git a/data/2020/iclr/Adaptive Structural Fingerprints for Graph Attention Networks b/data/2020/iclr/Adaptive Structural Fingerprints for Graph Attention Networks new file mode 100644 index 0000000000..9c6c3c3eac --- /dev/null +++ b/data/2020/iclr/Adaptive Structural Fingerprints for Graph Attention Networks @@ -0,0 +1 @@ +Many real-world data sets are represented as graphs, such as citation links, social media, and biological interaction. The volatile graph structure makes it non-trivial to employ convolutional neural networks (CNNs) for graph data processing. Recently, graph attention network (GAT) has proven a promising attempt by combining graph neural networks with an attention mechanism, so as to achieve message passing in graphs with arbitrary structures. However, the attention in GAT is computed mainly based on the similarity between node contents, while the structure of the graph remains largely unexploited (except in masking the attention out of one-hop neighbors). In this paper, we propose an "ADaptive Structural Fingerprint" (ADSF) model to fully exploit both topological details of the graph and content features of the nodes. The key idea is to contextualize each node with a weighted, learnable receptive field encoding rich and diverse local graph structures. By doing this, structural interactions between the nodes can be inferred accurately, thus improving the subsequent attention layer as well as the convergence of learning. 
Furthermore, our model provides a useful platform for different subspaces of node features and various scales of graph structures to "cross-talk" with each other through the learning of multi-head attention, being particularly useful in handling complex real-world data. Encouraging performance is observed on a number of benchmark data sets in node classification. \ No newline at end of file diff --git a/data/2020/iclr/Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks b/data/2020/iclr/Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks new file mode 100644 index 0000000000..79d34cc752 --- /dev/null +++ b/data/2020/iclr/Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks @@ -0,0 +1 @@ +We propose Additive Powers-of-Two (APoT) quantization, an efficient non-uniform quantization scheme for the bell-shaped and long-tailed distribution of weights and activations in neural networks. By constraining all quantization levels as the sum of Powers-of-Two terms, APoT quantization enjoys high computational efficiency and a good match with the distribution of weights. A simple reparameterization of the clipping function is applied to generate a better-defined gradient for learning the clipping threshold. Moreover, weight normalization is presented to refine the distribution of weights to make the training more stable and consistent. Experimental results show that our proposed method outperforms state-of-the-art methods, and is even competitive with the full-precision models, demonstrating the effectiveness of our proposed APoT quantization. For example, our 4-bit quantized ResNet-50 on ImageNet achieves 76.6% top-1 accuracy without bells and whistles; meanwhile, our model reduces computational cost by 22% compared with its uniformly quantized counterpart. 
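The additive construction behind APoT can be sketched as follows: every quantization level is a sum of a fixed number of power-of-two terms, and each weight is mapped to the nearest level. This is a simplified illustration only; the exact term sets, bit allocation, and clipping scheme in the paper differ, and all names here are made up.

```python
import itertools
import numpy as np

def apot_levels(n_terms=2, powers_per_term=4):
    """All non-negative levels expressible as a sum of n power-of-two terms.

    Each term is 0 or 2**-i; the exact term sets in the paper differ, this
    just illustrates the additive powers-of-two construction.
    """
    term = [0.0] + [2.0 ** -i for i in range(1, powers_per_term + 1)]
    sums = {round(sum(c), 8) for c in itertools.product(term, repeat=n_terms)}
    levels = np.array(sorted(sums))
    return levels / levels.max()      # normalize levels to [0, 1]

def quantize(w, levels):
    """Map each |w| to its nearest level, keeping the sign (weights in [-1, 1])."""
    idx = np.abs(np.abs(w)[..., None] - levels).argmin(-1)
    return np.sign(w) * levels[idx]

levels = apot_levels()
wq = quantize(np.array([0.37, -0.8, 0.02]), levels)
```

Because the levels are sums of powers of two, multiplication by a quantized weight reduces to a few shift-and-add operations, which is the source of the computational efficiency the abstract mentions.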
\ No newline at end of file diff --git a/data/2020/iclr/Adjustable Real-time Style Transfer b/data/2020/iclr/Adjustable Real-time Style Transfer new file mode 100644 index 0000000000..94f7149ab6 --- /dev/null +++ b/data/2020/iclr/Adjustable Real-time Style Transfer @@ -0,0 +1 @@ +Artistic style transfer is the problem of synthesizing an image with content similar to a given image and style similar to another. Although recent feed-forward neural networks can generate stylized images in real-time, these models produce a single stylization given a pair of style/content images, and the user doesn't have control over the synthesized output. Moreover, the style transfer depends on the hyper-parameters of the model with varying "optimum" for different input images. Therefore, if the stylized output is not appealing to the user, she/he has to try multiple models or retrain one with different hyper-parameters to get a favorite stylization. In this paper, we address these issues by proposing a novel method which allows adjustment of crucial hyper-parameters, after the training and in real-time, through a set of manually adjustable parameters. These parameters enable the user to modify the synthesized outputs from the same pair of style/content images, in search of a favorite stylized image. Our quantitative and qualitative experiments indicate how adjusting these parameters is comparable to retraining the model with different hyper-parameters. We also demonstrate how these parameters can be randomized to generate results which are diverse but still very similar in style and content. 
\ No newline at end of file diff --git a/data/2020/iclr/Adversarial Policies: Attacking Deep Reinforcement Learning b/data/2020/iclr/Adversarial Policies: Attacking Deep Reinforcement Learning new file mode 100644 index 0000000000..7c119759c1 --- /dev/null +++ b/data/2020/iclr/Adversarial Policies: Attacking Deep Reinforcement Learning @@ -0,0 +1 @@ +Deep reinforcement learning (RL) policies are known to be vulnerable to adversarial perturbations to their observations, similar to adversarial examples for classifiers. However, an attacker is not usually able to directly modify another agent's observations. This might lead one to wonder: is it possible to attack an RL agent simply by choosing an adversarial policy acting in a multi-agent environment so as to create natural observations that are adversarial? We demonstrate the existence of adversarial policies in zero-sum games between simulated humanoid robots with proprioceptive observations, against state-of-the-art victims trained via self-play to be robust to opponents. The adversarial policies reliably win against the victims but generate seemingly random and uncoordinated behavior. We find that these policies are more successful in high-dimensional environments, and induce substantially different activations in the victim policy network than when the victim plays against a normal opponent. Videos are available at this https URL. \ No newline at end of file diff --git a/data/2020/iclr/Adversarially Robust Representations with Smooth Encoders b/data/2020/iclr/Adversarially Robust Representations with Smooth Encoders new file mode 100644 index 0000000000..7ab9fba6b5 --- /dev/null +++ b/data/2020/iclr/Adversarially Robust Representations with Smooth Encoders @@ -0,0 +1 @@ +This paper studies the undesired phenomenon of over-sensitivity of representations learned by deep networks to semantically-irrelevant changes in data. 
We identify a cause for this shortcoming in the classical Variational Auto-encoder (VAE) objective, the evidence lower bound (ELBO). We show that the ELBO fails to control the behaviour of the encoder out of the support of the empirical data distribution and this behaviour of the VAE can lead to extreme errors in the learned representation. This is a key hurdle in the effective use of representations for data-efficient learning and transfer. To address this problem, we propose to augment the data with specifications that enforce insensitivity of the representation with respect to families of transformations. To incorporate these specifications, we propose a regularization method that is based on a selection mechanism that creates a fictive data point by explicitly perturbing an observed true data point. For certain choices of parameters, our formulation naturally leads to the minimization of the entropy regularized Wasserstein distance between representations. We illustrate our approach on standard datasets and experimentally show that significant improvements in the downstream adversarial accuracy can be achieved by learning robust representations completely in an unsupervised manner, without a reference to a particular downstream task and without a costly supervised adversarial training procedure. \ No newline at end of file diff --git a/data/2020/iclr/Adversarially robust transfer learning b/data/2020/iclr/Adversarially robust transfer learning new file mode 100644 index 0000000000..c91c0a96bb --- /dev/null +++ b/data/2020/iclr/Adversarially robust transfer learning @@ -0,0 +1 @@ +Transfer learning, in which a network is trained on one task and re-purposed on another, is often used to produce neural network classifiers when data is scarce or full-scale training is too costly. When the goal is to produce a model that is not only accurate but also adversarially robust, data scarcity and computational limitations become even more cumbersome. 
We consider robust transfer learning, in which we transfer not only performance but also robustness from a source model to a target domain. We start by observing that robust networks contain robust feature extractors. By training classifiers on top of these feature extractors, we produce new models that inherit the robustness of their parent networks. We then consider the case of fine-tuning a network by re-training end-to-end in the target domain. When using lifelong learning strategies, this process preserves the robustness of the source network while achieving high accuracy. By using such strategies, it is possible to produce accurate and robust models with little data, and without the cost of adversarial training. Additionally, we can improve the generalization of adversarially trained models, while maintaining their robustness. \ No newline at end of file diff --git a/data/2020/iclr/Ae-OT: a New Generative Model based on Extended Semi-discrete Optimal transport b/data/2020/iclr/Ae-OT: a New Generative Model based on Extended Semi-discrete Optimal transport new file mode 100644 index 0000000000..754be15559 --- /dev/null +++ b/data/2020/iclr/Ae-OT: a New Generative Model based on Extended Semi-discrete Optimal transport @@ -0,0 +1 @@ +Generative adversarial networks (GANs) have attracted huge attention due to their capability to generate visually realistic images. However, most of the existing models suffer from the mode collapse or mode mixture problems. In this work, we give a theoretical explanation of both problems by Figalli’s regularity theory of optimal transportation maps. Basically, the generator computes the transportation maps between the white noise distributions and the data distributions, which are in general discontinuous. However, DNNs can only represent continuous maps. This intrinsic conflict induces mode collapse and mode mixture.
To tackle both problems, we explicitly separate the manifold embedding and the optimal transportation; the first part is carried out using an autoencoder to map the images onto the latent space; the second part is accomplished using GPU-based convex optimization to find the discontinuous transportation maps. Composing the extended OT map and the decoder, we can finally generate new images from the white noise. This AE-OT model avoids representing discontinuous maps by DNNs, and therefore effectively prevents mode collapse and mode mixture. \ No newline at end of file diff --git a/data/2020/iclr/An Inductive Bias for Distances: Neural Nets that Respect the Triangle Inequality b/data/2020/iclr/An Inductive Bias for Distances: Neural Nets that Respect the Triangle Inequality new file mode 100644 index 0000000000..b45b915641 --- /dev/null +++ b/data/2020/iclr/An Inductive Bias for Distances: Neural Nets that Respect the Triangle Inequality @@ -0,0 +1 @@ +Distances are pervasive in machine learning. They serve as similarity measures, loss functions, and learning targets; it is said that a good distance measure solves a task. When defining distances, the triangle inequality has proven to be a useful constraint, both theoretically (to prove convergence and optimality guarantees) and empirically (as an inductive bias). Deep metric learning architectures that respect the triangle inequality rely, almost exclusively, on Euclidean distance in the latent space. Though effective, this fails to model two broad classes of subadditive distances, common in graphs and reinforcement learning: asymmetric metrics, and metrics that cannot be embedded into Euclidean space. To address these problems, we introduce novel architectures that are guaranteed to satisfy the triangle inequality. We prove our architectures universally approximate norm-induced metrics on $\mathbb{R}^n$, and present a similar result for modified Input Convex Neural Networks.
We show that our architectures outperform existing metric approaches when modeling graph distances and have a better inductive bias than non-metric approaches when training data is limited in the multi-goal reinforcement learning setting. \ No newline at end of file diff --git a/data/2020/iclr/Analysis of Video Feature Learning in Two-Stream CNNs on the Example of Zebrafish Swim Bout Classification b/data/2020/iclr/Analysis of Video Feature Learning in Two-Stream CNNs on the Example of Zebrafish Swim Bout Classification new file mode 100644 index 0000000000..7e1cac583f --- /dev/null +++ b/data/2020/iclr/Analysis of Video Feature Learning in Two-Stream CNNs on the Example of Zebrafish Swim Bout Classification @@ -0,0 +1 @@ +Semmelhack et al. (2014) have achieved high classification accuracy in distinguishing swim bouts of zebrafish using a Support Vector Machine (SVM). Convolutional Neural Networks (CNNs) have reached superior performance in various image recognition tasks over SVMs, but these powerful networks remain a black box. Reaching better transparency helps to build trust in their classifications and makes learned features interpretable to experts. Using a recently developed technique called Deep Taylor Decomposition, we generated heatmaps to highlight input regions of high relevance for predictions. We find that our CNN makes predictions by analyzing the steadiness of the tail's trunk, which markedly differs from the manually extracted features used by Semmelhack et al. (2014). We further uncovered that the network paid attention to experimental artifacts. Removing these artifacts ensured the validity of predictions. After correction, our best CNN beats the SVM by 6.12%, achieving a classification accuracy of 96.32%. Our work thus demonstrates the utility of AI explainability for CNNs. \ No newline at end of file diff --git a/data/2020/iclr/Are Pre-trained Language Models Aware of Phrases? 
Simple but Strong Baselines for Grammar Induction b/data/2020/iclr/Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction new file mode 100644 index 0000000000..ac5ee5c36d --- /dev/null +++ b/data/2020/iclr/Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction @@ -0,0 +1 @@ +With the recent success and popularity of pre-trained language models (LMs) in natural language processing, there has been a rise in efforts to understand their inner workings. In line with such interest, we propose a novel method that assists us in investigating the extent to which pre-trained LMs capture the syntactic notion of constituency. Our method provides an effective way of extracting constituency trees from the pre-trained LMs without training. In addition, we report intriguing findings in the induced trees, including the fact that pre-trained LMs outperform other approaches in correctly demarcating adverb phrases in sentences. \ No newline at end of file diff --git a/data/2020/iclr/Are Transformers universal approximators of sequence-to-sequence functions? b/data/2020/iclr/Are Transformers universal approximators of sequence-to-sequence functions? new file mode 100644 index 0000000000..c723898710 --- /dev/null +++ b/data/2020/iclr/Are Transformers universal approximators of sequence-to-sequence functions? @@ -0,0 +1 @@ +Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the number of shared parameters in these models.
Furthermore, using positional encodings, we circumvent the restriction of permutation equivariance, and show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain. Interestingly, our proof techniques clearly highlight the different roles of the self-attention and the feed-forward layers in Transformers. In particular, we prove that fixed-width self-attention layers can compute contextual mappings of the input sequences, playing a key role in the universal approximation property of Transformers. Based on this insight from our analysis, we consider other architectures that can compute contextual mappings and empirically evaluate them. \ No newline at end of file diff --git a/data/2020/iclr/AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures b/data/2020/iclr/AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures new file mode 100644 index 0000000000..199d3c56eb --- /dev/null +++ b/data/2020/iclr/AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures @@ -0,0 +1 @@ +Learning to represent videos is a very challenging task both algorithmically and computationally. Standard video CNN architectures have been designed by directly extending architectures devised for image understanding to include the time dimension, using modules such as 3D convolutions, or by using a two-stream design to capture both appearance and motion in videos. We interpret a video CNN as a collection of multi-stream convolutional blocks connected to each other, and propose the approach of automatically finding neural architectures with better connectivity and spatio-temporal interactions for video understanding. This is done by evolving a population of overly-connected architectures guided by connection weight learning.
Architectures combining representations that abstract different input types (i.e., RGB and optical flow) at multiple temporal resolutions are searched for, allowing different types or sources of information to interact with each other. Our method, referred to as AssembleNet, outperforms prior approaches on public video datasets, in some cases by a great margin. We obtain 58.6% mAP on Charades and 34.27% accuracy on Moments-in-Time. \ No newline at end of file diff --git a/data/2020/iclr/Augmenting Genetic Algorithms with Deep Neural Networks for Exploring the Chemical Space b/data/2020/iclr/Augmenting Genetic Algorithms with Deep Neural Networks for Exploring the Chemical Space new file mode 100644 index 0000000000..e8659d9bb0 --- /dev/null +++ b/data/2020/iclr/Augmenting Genetic Algorithms with Deep Neural Networks for Exploring the Chemical Space @@ -0,0 +1 @@ +Challenges in natural sciences can often be phrased as optimization problems. Machine learning techniques have recently been applied to solve such problems. One example in chemistry is the design of tailor-made organic materials and molecules, which requires efficient methods to explore the chemical space. We present a genetic algorithm (GA) that is enhanced with a neural network (DNN) based discriminator model to improve the diversity of generated molecules and at the same time steer the GA. We show that our algorithm outperforms other generative models in optimization tasks. We furthermore present a way to increase interpretability of genetic algorithms, which helped us to derive design principles. 
\ No newline at end of file diff --git a/data/2020/iclr/AutoQ: Automated Kernel-Wise Neural Network Quantization b/data/2020/iclr/AutoQ: Automated Kernel-Wise Neural Network Quantization new file mode 100644 index 0000000000..a56e8fc93e --- /dev/null +++ b/data/2020/iclr/AutoQ: Automated Kernel-Wise Neural Network Quantization @@ -0,0 +1 @@ +Network quantization is one of the most hardware-friendly techniques to enable the deployment of convolutional neural networks (CNNs) on low-power mobile devices. Recent network quantization techniques quantize each weight kernel in a convolutional layer independently for higher inference accuracy, since the weight kernels in a layer exhibit different variances and hence have different amounts of redundancy. The quantization bitwidth or bit number (QBN) directly decides the inference accuracy, latency, energy and hardware overhead. To effectively reduce the redundancy and accelerate CNN inferences, various weight kernels should be quantized with different QBNs. However, prior works use only one QBN to quantize each convolutional layer or the entire CNN, because the design space of searching a QBN for each weight kernel is too large. The hand-crafted heuristic of the kernel-wise QBN search is so sophisticated that domain experts can obtain only sub-optimal results. It is difficult even for deep reinforcement learning (DRL) agents based on Deep Deterministic Policy Gradient (DDPG) to find a kernel-wise QBN configuration that can achieve reasonable inference accuracy. In this paper, we propose a hierarchical-DRL-based kernel-wise network quantization technique, AutoQ, to automatically search a QBN for each weight kernel, and choose another QBN for each activation layer. Compared to the models quantized by the state-of-the-art DRL-based schemes, on average, the same models quantized by AutoQ reduce the inference latency by 54.06%, and decrease the inference energy consumption by 50.69%, while achieving the same inference accuracy.
\ No newline at end of file diff --git a/data/2020/iclr/Automated Relational Meta-learning b/data/2020/iclr/Automated Relational Meta-learning new file mode 100644 index 0000000000..c0cda39050 --- /dev/null +++ b/data/2020/iclr/Automated Relational Meta-learning @@ -0,0 +1 @@ +In order to learn efficiently with a small amount of data on new tasks, meta-learning transfers knowledge learned from previous tasks to the new ones. However, a critical challenge in meta-learning is task heterogeneity, which cannot be well handled by traditional globally shared meta-learning methods. In addition, current task-specific meta-learning methods may either suffer from hand-crafted structure design or lack the capability to capture complex relations between tasks. In this paper, motivated by the way of knowledge organization in knowledge bases, we propose an automated relational meta-learning (ARML) framework that automatically extracts the cross-task relations and constructs the meta-knowledge graph. When a new task arrives, it can quickly find the most relevant structure and tailor the learned structure knowledge to the meta-learner. As a result, the proposed framework not only addresses the challenge of task heterogeneity by a learned meta-knowledge graph, but also increases the model interpretability. We conduct extensive experiments on 2D toy regression and few-shot image classification and the results demonstrate the superiority of ARML over state-of-the-art baselines. \ No newline at end of file diff --git a/data/2020/iclr/Automated curriculum generation through setter-solver interactions b/data/2020/iclr/Automated curriculum generation through setter-solver interactions new file mode 100644 index 0000000000..b3771d5996 --- /dev/null +++ b/data/2020/iclr/Automated curriculum generation through setter-solver interactions @@ -0,0 +1 @@ +Reinforcement learning algorithms use correlations between policies and rewards to improve agent performance.
But in dynamic or sparsely rewarding environments these correlations are often too small, or rewarding events are too infrequent to make learning feasible. Human education instead relies on curricula (the breakdown of tasks into simpler, static challenges with dense rewards) to build up to complex behaviors. While curricula are also useful for artificial agents, hand-crafting them is time-consuming. This has led researchers to explore automatic curriculum generation. Here we explore automatic curriculum generation in rich, dynamic environments. Using a setter-solver paradigm, we show the importance of considering goal validity, goal feasibility, and goal coverage to construct useful curricula. We demonstrate the success of our approach in rich but sparsely rewarding 2D and 3D environments, where an agent is tasked to achieve a single goal selected from a set of possible goals that varies between episodes, and identify challenges for future work. Finally, we demonstrate the value of a novel technique that guides agents towards a desired goal distribution. Altogether, these results represent a substantial step towards applying automatic task curricula to learn complex, otherwise unlearnable goals, and to our knowledge are the first to demonstrate automated curriculum generation for goal-conditioned agents in environments where the possible goals vary between episodes. \ No newline at end of file diff --git a/data/2020/iclr/Automatically Discovering and Learning New Visual Categories with Ranking Statistics b/data/2020/iclr/Automatically Discovering and Learning New Visual Categories with Ranking Statistics new file mode 100644 index 0000000000..76e3bffdab --- /dev/null +++ b/data/2020/iclr/Automatically Discovering and Learning New Visual Categories with Ranking Statistics @@ -0,0 +1 @@ +We tackle the problem of discovering novel classes in an image collection given labelled examples of other classes.
This setting is similar to semi-supervised learning, but significantly harder because there are no labelled examples for the new classes. The challenge, then, is to leverage the information contained in the labelled images in order to learn a general-purpose clustering model and use the latter to identify the new classes in the unlabelled data. In this work we address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using the labeled data only introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use rank statistics to transfer the model's knowledge of the labelled classes to the problem of clustering the unlabelled images; and, (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data. We evaluate our approach on standard classification benchmarks and outperform current methods for novel category discovery by a significant margin. \ No newline at end of file diff --git a/data/2020/iclr/Black-Box Adversarial Attack with Transferable Model-based Embedding b/data/2020/iclr/Black-Box Adversarial Attack with Transferable Model-based Embedding new file mode 100644 index 0000000000..ff3364ce40 --- /dev/null +++ b/data/2020/iclr/Black-Box Adversarial Attack with Transferable Model-based Embedding @@ -0,0 +1 @@ +We present a new method for black-box adversarial attack. Unlike previous methods that combined transfer-based and scored-based methods by using the gradient or initialization of a surrogate white-box model, this new method tries to learn a low-dimensional embedding using a pretrained model, and then performs efficient search within the embedding space to attack an unknown target network. 
The method produces adversarial perturbations with high-level semantic patterns that are easily transferable. We show that this approach can greatly improve the query efficiency of black-box adversarial attack across different target network architectures. We evaluate our approach on MNIST, ImageNet and Google Cloud Vision API, resulting in a significant reduction in the number of queries. We also attack adversarially defended networks on CIFAR10 and ImageNet, where our method not only reduces the number of queries, but also improves the attack success rate. \ No newline at end of file diff --git a/data/2020/iclr/Bounds on Over-Parameterization for Guaranteed Existence of Descent Paths in Shallow ReLU Networks b/data/2020/iclr/Bounds on Over-Parameterization for Guaranteed Existence of Descent Paths in Shallow ReLU Networks new file mode 100644 index 0000000000..8030cf08da --- /dev/null +++ b/data/2020/iclr/Bounds on Over-Parameterization for Guaranteed Existence of Descent Paths in Shallow ReLU Networks @@ -0,0 +1 @@ +We study the landscape of squared loss in neural networks with one hidden layer and ReLU activation functions. Let $m$ and $d$ be the widths of hidden and input layers, respectively. We show that there exist poor local minima with positive curvature for some training sets of size $n\geq m+2d-2$. By positive curvature of a local minimum, we mean that within a small neighborhood the loss function is strictly increasing in all directions. Consequently, for such training sets, there are initializations of weights from which there is no descent path to global optima. It is known that for $n\le m$, there always exist descent paths to global optima from all initial weights. From this perspective, our results provide a somewhat sharp characterization of the over-parameterization required for "existence of descent paths" in the loss landscape.
\ No newline at end of file diff --git a/data/2020/iclr/Bridging Mode Connectivity in Loss Landscapes and Adversarial Robustness b/data/2020/iclr/Bridging Mode Connectivity in Loss Landscapes and Adversarial Robustness new file mode 100644 index 0000000000..6e2114eea9 --- /dev/null +++ b/data/2020/iclr/Bridging Mode Connectivity in Loss Landscapes and Adversarial Robustness @@ -0,0 +1 @@ +Mode connectivity provides novel geometric insights on analyzing loss landscapes and enables building high-accuracy pathways between well-trained neural networks. In this work, we propose to employ mode connectivity in loss landscapes to study the adversarial robustness of deep neural networks, and provide novel methods for improving this robustness. Our experiments cover various types of adversarial attacks applied to different network architectures and datasets. When network models are tampered with backdoor or error-injection attacks, our results demonstrate that the path connection learned using limited amount of bonafide data can effectively mitigate adversarial effects while maintaining the original accuracy on clean data. Therefore, mode connectivity provides users with the power to repair backdoored or error-injected models. We also use mode connectivity to investigate the loss landscapes of regular and robust models against evasion attacks. Experiments show that there exists a barrier in adversarial robustness loss on the path connecting regular and adversarially-trained models. A high correlation is observed between the adversarial robustness loss and the largest eigenvalue of the input Hessian matrix, for which theoretical justifications are provided. Our results suggest that mode connectivity offers a holistic tool and practical means for evaluating and improving adversarial robustness. 
\ No newline at end of file diff --git a/data/2020/iclr/Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints b/data/2020/iclr/Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints new file mode 100644 index 0000000000..83c2cbd400 --- /dev/null +++ b/data/2020/iclr/Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints @@ -0,0 +1 @@ +In most practical settings and theoretical analyses, one assumes that a model can be trained until convergence. However, the growing complexity of machine learning datasets and models may violate such assumptions. Indeed, current approaches for hyper-parameter tuning and neural architecture search tend to be limited by practical resource constraints. Therefore, we introduce a formal setting for studying training under the non-asymptotic, resource-constrained regime, i.e., budgeted training. We analyze the following problem: "given a dataset, algorithm, and fixed resource budget, what is the best achievable performance?" We focus on the number of optimization iterations as the representative resource. Under such a setting, we show that it is critical to adjust the learning rate schedule according to the given budget. Among budget-aware learning schedules, we find simple linear decay to be both robust and high-performing. We support our claim through extensive experiments with state-of-the-art models on ImageNet (image classification), Kinetics (video classification), MS COCO (object detection and instance segmentation), and Cityscapes (semantic segmentation). We also analyze our results and find that the key to a good schedule is budgeted convergence, a phenomenon whereby the gradient vanishes at the end of each allowed budget. We also revisit existing approaches for fast convergence and show that budget-aware learning schedules readily outperform such approaches under (the practical but under-explored) budgeted training setting. 
\ No newline at end of file diff --git a/data/2020/iclr/CAQL: Continuous Action Q-Learning b/data/2020/iclr/CAQL: Continuous Action Q-Learning new file mode 100644 index 0000000000..5c2f99b644 --- /dev/null +++ b/data/2020/iclr/CAQL: Continuous Action Q-Learning @@ -0,0 +1 @@ +Value-based reinforcement learning (RL) methods like Q-learning have shown success in a variety of domains. One challenge in applying Q-learning to continuous-action RL problems, however, is the continuous action maximization (max-Q) required for optimal Bellman backup. In this work, we develop CAQL, a (class of) algorithm(s) for continuous-action Q-learning that can use several plug-and-play optimizers for the max-Q problem. Leveraging recent optimization results for deep neural networks, we show that max-Q can be solved optimally using mixed-integer programming (MIP). When the Q-function representation has sufficient power, MIP-based optimization gives rise to better policies and is more robust than approximate methods (e.g., gradient ascent, cross-entropy search). We further develop several techniques to accelerate inference in CAQL, which despite their approximate nature, perform well. We compare CAQL with state-of-the-art RL algorithms on benchmark continuous-control problems that have different degrees of action constraints and show that CAQL outperforms policy-based methods in heavily constrained environments, often dramatically. \ No newline at end of file diff --git a/data/2020/iclr/CLN2INV: Learning Loop Invariants with Continuous Logic Networks b/data/2020/iclr/CLN2INV: Learning Loop Invariants with Continuous Logic Networks new file mode 100644 index 0000000000..29c17c123d --- /dev/null +++ b/data/2020/iclr/CLN2INV: Learning Loop Invariants with Continuous Logic Networks @@ -0,0 +1 @@ +Program verification offers a framework for ensuring program correctness and therefore systematically eliminating different classes of bugs. 
Inferring loop invariants is one of the main challenges behind automated verification of real-world programs which often contain many loops. In this paper, we present Continuous Logic Network (CLN), a novel neural architecture for automatically learning loop invariants directly from program execution traces. Unlike existing neural networks, CLNs can learn precise and explicit representations of formulas in Satisfiability Modulo Theories (SMT) for loop invariants from program execution traces. We develop a new sound and complete semantic mapping for assigning SMT formulas to continuous truth values that allows CLNs to be trained efficiently. We use CLNs to implement a new inference system for loop invariants, CLN2INV, that significantly outperforms existing approaches on the popular Code2Inv dataset. CLN2INV is the first tool to solve all 124 theoretically solvable problems in the Code2Inv dataset. Moreover, CLN2INV takes only 1.1 seconds on average for each problem, which is 40 times faster than existing approaches. We further demonstrate that CLN2INV can even learn 12 significantly more complex loop invariants than the ones required for the Code2Inv dataset. \ No newline at end of file diff --git a/data/2020/iclr/CM3: Cooperative Multi-goal Multi-stage Multi-agent Reinforcement Learning b/data/2020/iclr/CM3: Cooperative Multi-goal Multi-stage Multi-agent Reinforcement Learning new file mode 100644 index 0000000000..ee7f0523da --- /dev/null +++ b/data/2020/iclr/CM3: Cooperative Multi-goal Multi-stage Multi-agent Reinforcement Learning @@ -0,0 +1 @@ +A variety of cooperative multi-agent control problems require agents to achieve individual goals while contributing to collective success. 
This multi-goal multi-agent setting poses difficulties for recent algorithms, which primarily target settings with a single global reward, due to two new challenges: efficient exploration for learning both individual goal attainment and cooperation for others' success, and credit-assignment for interactions between actions and goals of different agents. To address both challenges, we restructure the problem into a novel two-stage curriculum, in which single-agent goal attainment is learned prior to learning multi-agent cooperation, and we derive a new multi-goal multi-agent policy gradient with a credit function for localized credit assignment. We use a function augmentation scheme to bridge value and policy functions across the curriculum. The complete architecture, called CM3, learns significantly faster than direct adaptations of existing algorithms on three challenging multi-goal multi-agent problems: cooperative navigation in difficult formations, negotiating multi-vehicle lane changes in the SUMO traffic simulator, and strategic cooperation in a Checkers environment. \ No newline at end of file diff --git a/data/2020/iclr/Can gradient clipping mitigate label noise? b/data/2020/iclr/Can gradient clipping mitigate label noise? new file mode 100644 index 0000000000..446c5cd7da --- /dev/null +++ b/data/2020/iclr/Can gradient clipping mitigate label noise? @@ -0,0 +1 @@ +Gradient clipping is a widely-used technique in the training of deep networks, and is generally motivated from an optimisation lens: informally, it controls the dynamics of iterates, thus enhancing the rate of convergence to a local minimum. This intuition has been made precise in a line of recent works, which show that suitable clipping can yield significantly faster convergence than vanilla gradient descent. 
In this paper, we propose a new lens for studying gradient clipping, namely, robustness: informally, one expects clipping to provide robustness to noise, since one does not overly trust any single sample. Surprisingly, we prove that for the common problem of label noise in classification, standard gradient clipping does not in general provide robustness. On the other hand, we show that a simple variant of gradient clipping is provably robust, and corresponds to suitably modifying the underlying loss function. This yields a simple, noise-robust alternative to the standard cross-entropy loss which performs well empirically. \ No newline at end of file diff --git a/data/2020/iclr/Certified Robustness for Top-k Predictions against Adversarial Perturbations via Randomized Smoothing b/data/2020/iclr/Certified Robustness for Top-k Predictions against Adversarial Perturbations via Randomized Smoothing new file mode 100644 index 0000000000..1da8831b27 --- /dev/null +++ b/data/2020/iclr/Certified Robustness for Top-k Predictions against Adversarial Perturbations via Randomized Smoothing @@ -0,0 +1 @@ +It is well-known that classifiers are vulnerable to adversarial perturbations. To defend against adversarial perturbations, various certified robustness results have been derived. However, existing certified robustnesses are limited to top-1 predictions. In many real-world applications, top-$k$ predictions are more relevant. In this work, we aim to derive certified robustness for top-$k$ predictions. In particular, our certified robustness is based on randomized smoothing, which turns any classifier to a new classifier via adding noise to an input example. We adopt randomized smoothing because it is scalable to large-scale neural networks and applicable to any classifier. We derive a tight robustness in $\ell_2$ norm for top-$k$ predictions when using randomized smoothing with Gaussian noise. 
We find that generalizing the certified robustness from top-1 to top-$k$ predictions faces significant technical challenges. We also empirically evaluate our method on CIFAR10 and ImageNet. For example, our method can obtain an ImageNet classifier with a certified top-5 accuracy of 62.8\% when the $\ell_2$-norms of the adversarial perturbations are less than 0.5 (=127/255). Our code is publicly available at: \url{https://github.com/jjy1994/Certify_Topk}. \ No newline at end of file diff --git a/data/2020/iclr/Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation b/data/2020/iclr/Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation new file mode 100644 index 0000000000..b790bc99e9 --- /dev/null +++ b/data/2020/iclr/Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation @@ -0,0 +1 @@ +Achieving faster execution with shorter compilation time can foster further diversity and innovation in neural networks. However, the current paradigm of executing neural networks either relies on hand-optimized libraries, traditional compilation heuristics, or very recently genetic algorithms and other stochastic methods. These methods suffer from frequent costly hardware measurements rendering them not only too time consuming but also suboptimal. As such, we devise a solution that can learn to quickly adapt to a previously unseen design space for code optimization, both accelerating the search and improving the output performance. This solution dubbed CHAMELEON leverages reinforcement learning whose solution takes fewer steps to converge, and develops an adaptive sampling algorithm that not only focuses on the costly samples (real hardware measurements) on representative points but also uses a domain knowledge inspired logic to improve the samples itself. 
Experimentation with real hardware shows that CHAMELEON provides a 4.45× speedup in optimization time over AutoTVM, while also improving the inference time of modern deep networks by 5.6%. \ No newline at end of file diff --git a/data/2020/iclr/Compositional languages emerge in a neural iterated learning model b/data/2020/iclr/Compositional languages emerge in a neural iterated learning model new file mode 100644 index 0000000000..10326940de --- /dev/null +++ b/data/2020/iclr/Compositional languages emerge in a neural iterated learning model @@ -0,0 +1 @@ +The principle of compositionality, which enables natural language to represent complex concepts via a structured combination of simpler ones, allows us to convey an open-ended set of messages using a limited vocabulary. If compositionality is indeed a natural property of language, we may expect it to appear in communication protocols that are created by neural agents via grounded language learning. Inspired by the iterated learning framework, which simulates the process of language evolution, we propose an effective neural iterated learning algorithm that, when applied to interacting neural agents, facilitates the emergence of a more structured type of language. Indeed, these languages provide specific advantages to neural agents during training, which translates into a larger posterior probability, which is then incrementally amplified via the iterated learning procedure. Our experiments confirm our analysis, and also demonstrate that the emergent languages largely improve the generalization of neural agent communication.
\ No newline at end of file diff --git a/data/2020/iclr/Computation Reallocation for Object Detection b/data/2020/iclr/Computation Reallocation for Object Detection new file mode 100644 index 0000000000..5ed1d5181b --- /dev/null +++ b/data/2020/iclr/Computation Reallocation for Object Detection @@ -0,0 +1 @@ +The allocation of computation resources in the backbone is a crucial issue in object detection. However, the allocation pattern used for classification is usually adopted directly for object detectors, which proves to be sub-optimal. In order to reallocate the engaged computation resources in a more efficient way, we present CR-NAS (Computation Reallocation Neural Architecture Search) that can learn computation reallocation strategies across different feature resolutions and spatial positions directly on the target detection dataset. A two-level reallocation space is proposed for both stage and spatial reallocation. A novel hierarchical search procedure is adopted to cope with the complex search space. We apply CR-NAS to multiple backbones and achieve consistent improvements. Our CR-ResNet50 and CR-MobileNetV2 outperform the baseline by 1.9% and 1.7% COCO AP respectively, without any additional computation budget. The models discovered by CR-NAS can be equipped with other powerful detection necks/heads and easily transferred to other datasets, e.g. PASCAL VOC, and other vision tasks, e.g. instance segmentation. Our CR-NAS can be used as a plugin to improve the performance of various networks, which is in high demand. \ No newline at end of file diff --git a/data/2020/iclr/Continual Learning with Adaptive Weights (CLAW) b/data/2020/iclr/Continual Learning with Adaptive Weights (CLAW) new file mode 100644 index 0000000000..f9da99e189 --- /dev/null +++ b/data/2020/iclr/Continual Learning with Adaptive Weights (CLAW) @@ -0,0 +1 @@ +Approaches to continual learning aim to successfully learn a set of related tasks that arrive in an online manner.
Recently, several frameworks have been developed which enable deep learning to be deployed in this learning scenario. A key modelling decision is to what extent the architecture should be shared across tasks. On the one hand, separately modelling each task avoids catastrophic forgetting but it does not support transfer learning and leads to large models. On the other hand, rigidly specifying a shared component and a task-specific part enables task transfer and limits the model size, but it is vulnerable to catastrophic forgetting and restricts the form of task-transfer that can occur. Ideally, the network should adaptively identify which parts of the network to share in a data driven way. Here we introduce such an approach called Continual Learning with Adaptive Weights (CLAW), which is based on probabilistic modelling and variational inference. Experiments show that CLAW achieves state-of-the-art performance on six benchmarks in terms of overall continual learning performance, as measured by classification accuracy, and in terms of addressing catastrophic forgetting. \ No newline at end of file diff --git a/data/2020/iclr/Continual Learning with Bayesian Neural Networks for Non-Stationary Data b/data/2020/iclr/Continual Learning with Bayesian Neural Networks for Non-Stationary Data new file mode 100644 index 0000000000..c4033ad794 --- /dev/null +++ b/data/2020/iclr/Continual Learning with Bayesian Neural Networks for Non-Stationary Data @@ -0,0 +1 @@ +This work addresses continual learning for non-stationary data, using Bayesian neural networks and memory-based online variational Bayes. We represent the posterior approximation of the network weights by a diagonal Gaussian distribution and a complementary memory of raw data. This raw data corresponds to likelihood terms that cannot be well approximated by the Gaussian. We introduce a novel method for sequentially updating both components of the posterior approximation. 
Furthermore, we propose Bayesian forgetting and a Gaussian diffusion process for adapting to non-stationary data. The experimental results show that our update method improves on existing approaches for streaming data. Additionally, the adaptation methods lead to better predictive performance for non-stationary data. \ No newline at end of file diff --git a/data/2020/iclr/Counterfactuals uncover the modular structure of deep generative models b/data/2020/iclr/Counterfactuals uncover the modular structure of deep generative models new file mode 100644 index 0000000000..4dda48b4f6 --- /dev/null +++ b/data/2020/iclr/Counterfactuals uncover the modular structure of deep generative models @@ -0,0 +1 @@ +Deep generative models can emulate the perceptual properties of complex image datasets, providing a latent representation of the data. However, manipulating such representation to perform meaningful and controllable transformations in the data space remains challenging without some form of supervision. While previous work has focused on exploiting statistical independence to disentangle latent factors, we argue that such requirement is too restrictive and propose instead a non-statistical framework that relies on counterfactual manipulations to uncover a modular structure of the network composed of disentangled groups of internal variables. Experiments with a variety of generative models trained on complex image datasets show the obtained modules can be used to design targeted interventions. This opens the way to applications such as computationally efficient style transfer and the automated assessment of robustness to contextual changes in pattern recognition systems. \ No newline at end of file diff --git a/data/2020/iclr/Curvature Graph Network b/data/2020/iclr/Curvature Graph Network new file mode 100644 index 0000000000..ba9decd6d5 --- /dev/null +++ b/data/2020/iclr/Curvature Graph Network @@ -0,0 +1 @@ +Graph-structured data is prevalent in many domains. 
Despite the widely celebrated success of deep neural networks, their power on graph-structured data is yet to be fully explored. We propose a novel network architecture that incorporates advanced graph structural features. In particular, we leverage discrete graph curvature, which measures how the neighborhoods of a pair of nodes are structurally related. The curvature of an edge (x, y) defines the distance taken to travel from neighbors of x to neighbors of y, compared with the length of edge (x, y). It is a much more descriptive feature compared to previously used features that only focus on node-specific attributes or limited topological information such as degree. Our curvature graph convolution network outperforms the state of the art on various synthetic and real-world graphs, especially the larger and denser ones. \ No newline at end of file diff --git a/data/2020/iclr/DBA: Distributed Backdoor Attacks against Federated Learning b/data/2020/iclr/DBA: Distributed Backdoor Attacks against Federated Learning new file mode 100644 index 0000000000..44ffc1af61 --- /dev/null +++ b/data/2020/iclr/DBA: Distributed Backdoor Attacks against Federated Learning @@ -0,0 +1 @@ +Backdoor attacks aim to manipulate a subset of training data by injecting adversarial triggers such that machine learning models trained on the tampered dataset will make arbitrary (targeted) incorrect predictions on the test set with the same trigger embedded. While federated learning (FL) is capable of aggregating information provided by different parties for training a better model, its distributed learning methodology and inherently heterogeneous data distribution across parties may bring new vulnerabilities. In addition to recent centralized backdoor attacks on FL where each party embeds the same global trigger during training, we propose the distributed backdoor attack (DBA) --- a novel threat assessment framework developed by fully exploiting the distributed nature of FL.
DBA decomposes a global trigger pattern into separate local patterns and embeds them into the training sets of different adversarial parties, respectively. Compared to standard centralized backdoors, we show that DBA is substantially more persistent and stealthy against FL on diverse datasets such as finance and image data. We conduct extensive experiments to show that the attack success rate of DBA is significantly higher than centralized backdoors under different settings. Moreover, we find that distributed attacks are indeed more insidious, as DBA can evade two state-of-the-art robust FL algorithms against centralized backdoors. We also provide explanations for the effectiveness of DBA via feature visual interpretation and feature importance ranking. To further explore the properties of DBA, we test the attack performance by varying different trigger factors, including local trigger variations (size, gap, and location), scaling factor in FL, data distribution, and poison ratio and interval. Our proposed DBA and thorough evaluation results shed light on characterizing the robustness of FL. \ No newline at end of file diff --git a/data/2020/iclr/DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames b/data/2020/iclr/DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames new file mode 100644 index 0000000000..4a3e94da18 --- /dev/null +++ b/data/2020/iclr/DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames @@ -0,0 +1,3 @@ +We present Decentralized Distributed Proximal Policy Optimization (DD-PPO), a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever "stale"), making it conceptually simple and easy to implement.
In our experiments on training virtual robots to navigate in Habitat-Sim, DD-PPO exhibits near-linear scaling -- achieving a speedup of 107x on 128 GPUs over a serial implementation. We leverage this scaling to train an agent for 2.5 Billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs. + +This massive-scale training not only sets the state of the art on the Habitat Autonomous Navigation Challenge 2019, but essentially "solves" the task -- near-perfect autonomous navigation in an unseen environment without access to a map, directly from an RGB-D camera and a GPS+Compass sensor. Fortuitously, error vs. computation exhibits a power-law-like distribution; thus, 90% of peak performance is obtained relatively early (at 100 million steps) and relatively cheaply (under 1 day with 8 GPUs). Finally, we show that the scene understanding and navigation policies learned can be transferred to other navigation tasks -- the analog of "ImageNet pre-training + task-specific fine-tuning" for embodied AI. Our model outperforms ImageNet pre-trained CNNs on these transfer tasks and can serve as a universal resource (all models and code are publicly available). \ No newline at end of file diff --git a/data/2020/iclr/Data-Independent Neural Pruning via Coresets b/data/2020/iclr/Data-Independent Neural Pruning via Coresets new file mode 100644 index 0000000000..150f74b0af --- /dev/null +++ b/data/2020/iclr/Data-Independent Neural Pruning via Coresets @@ -0,0 +1 @@ +Previous work showed empirically that large neural networks can be significantly reduced in size while preserving their accuracy. Model compression became a central research topic, as it is crucial for deployment of neural networks on devices with limited computational and memory resources.
The majority of the compression methods are based on heuristics and offer no worst-case guarantees on the trade-off between the compression rate and the approximation error for an arbitrarily new sample. We propose the first efficient, data-independent neural pruning algorithm with a provable trade-off between its compression rate and the approximation error for any future test sample. Our method is based on the coreset framework, which finds a small weighted subset of points that provably approximates the original inputs. Specifically, we approximate the output of a layer of neurons by a coreset of neurons in the previous layer and discard the rest. We apply this framework in a layer-by-layer fashion from the top to the bottom. Unlike previous works, our coreset is data independent, meaning that it provably guarantees the accuracy of the function for any input $x\in \mathbb{R}^d$, including an adversarial one. We demonstrate the effectiveness of our method on popular network architectures. In particular, our coresets yield 90\% compression of the LeNet-300-100 architecture on MNIST while improving the accuracy. \ No newline at end of file diff --git a/data/2020/iclr/DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling b/data/2020/iclr/DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling new file mode 100644 index 0000000000..c00eccb683 --- /dev/null +++ b/data/2020/iclr/DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling @@ -0,0 +1 @@ +For sequence models with large vocabularies, a majority of network parameters lie in the input and output layers. In this work, we describe a new method, DeFINE, for learning deep token representations efficiently. 
Our architecture uses a hierarchical structure with novel skip-connections which allow for the use of low dimensional input and output layers, reducing total parameters and training time while delivering similar or better performance versus existing methods. DeFINE can be incorporated easily in new or existing sequence models. Compared to state-of-the-art methods including adaptive input representations, this technique results in a 6% to 20% drop in perplexity. On WikiText-103, DeFINE reduces the total parameters of Transformer-XL by half with minimal impact on performance. On the Penn Treebank, DeFINE improves AWD-LSTM by 4 points with a 17% reduction in parameters, achieving comparable performance to state-of-the-art methods with fewer parameters. For machine translation, DeFINE improves the efficiency of the Transformer model by about 1.4 times while delivering similar performance. \ No newline at end of file diff --git "a/data/2020/iclr/Deep 3D Pan via local adaptive \"t-shaped\" convolutions with global and local adaptive dilations" "b/data/2020/iclr/Deep 3D Pan via local adaptive \"t-shaped\" convolutions with global and local adaptive dilations" new file mode 100644 index 0000000000..560f317b68 --- /dev/null +++ "b/data/2020/iclr/Deep 3D Pan via local adaptive \"t-shaped\" convolutions with global and local adaptive dilations" @@ -0,0 +1 @@ +Recent advances in deep learning have shown promising results in many low-level vision tasks. However, solving single-image-based view synthesis is still an open problem. In particular, the generation of new images at parallel camera views given a single input image is of great interest, as it enables 3D visualization of the 2D input scenery. We propose a novel network architecture to perform stereoscopic view synthesis at arbitrary camera positions along the X-axis, or Deep 3D Pan, with "t-shaped" adaptive kernels equipped with globally and locally adaptive dilations.
Our proposed network architecture, the monster-net, is devised with a novel t-shaped adaptive kernel with globally and locally adaptive dilation, which can efficiently incorporate global camera shift and handle local 3D geometries of the target image's pixels for the synthesis of natural-looking 3D panned views given a 2D input image. Extensive experiments were performed on the KITTI, CityScapes, and our VXXLXX_STEREO indoor datasets to prove the efficacy of our method. Our monster-net significantly outperforms the state-of-the-art method by a large margin on all metrics (RMSE, PSNR, and SSIM). Our proposed monster-net is capable of reconstructing more reliable image structures in synthesized images with coherent geometry. Moreover, the disparity information that can be extracted from the "t-shaped" kernel is much more reliable than that of the state-of-the-art method for the unsupervised monocular depth estimation task, confirming the effectiveness of our method. \ No newline at end of file diff --git a/data/2020/iclr/Deep Imitative Models for Flexible Inference, Planning, and Control b/data/2020/iclr/Deep Imitative Models for Flexible Inference, Planning, and Control new file mode 100644 index 0000000000..cd00d903d4 --- /dev/null +++ b/data/2020/iclr/Deep Imitative Models for Flexible Inference, Planning, and Control @@ -0,0 +1 @@ +Imitation Learning (IL) is an appealing approach to learn desirable autonomous behavior. However, directing IL to achieve arbitrary goals is difficult. In contrast, planning-based algorithms use dynamics models and reward functions to achieve goals. Yet, reward functions that evoke desirable behavior are often difficult to specify. In this paper, we propose Imitative Models to combine the benefits of IL and goal-directed planning. Imitative Models are probabilistic predictive models of desirable behavior able to plan interpretable expert-like trajectories to achieve specified goals.
We derive families of flexible goal objectives, including constrained goal regions, unconstrained goal sets, and energy-based goals. We show that our method can use these objectives to successfully direct behavior. Our method substantially outperforms six IL approaches and a planning-based approach in a dynamic simulated autonomous driving task, and is efficiently learned from expert demonstrations without online data collection. We also show our approach is robust to poorly specified goals, such as goals on the wrong side of the road. \ No newline at end of file diff --git a/data/2020/iclr/Deep Learning of Determinantal Point Processes via Proper Spectral Sub-gradient b/data/2020/iclr/Deep Learning of Determinantal Point Processes via Proper Spectral Sub-gradient new file mode 100644 index 0000000000..52adaf5b3c --- /dev/null +++ b/data/2020/iclr/Deep Learning of Determinantal Point Processes via Proper Spectral Sub-gradient @@ -0,0 +1 @@ +Determinantal point processes (DPPs) are an effective tool for delivering diversity in multiple machine learning and computer vision tasks. Under the deep learning framework, DPPs are typically optimized via approximation, which is not straightforward and conflicts with the diversity requirement. We note, however, that there have been no deep learning paradigms that optimize DPPs directly, since doing so involves matrix inversion, which may result in high computational instability. This fact greatly hinders the wide use of DPPs on specific objectives where a DPP serves as a term to measure feature diversity. In this paper, we devise a simple but effective algorithm to address this issue and optimize the DPP term directly, expressed with an L-ensemble in the spectral domain over the Gram matrix, which is more flexible than learning on parametric kernels. By further taking into account some geometric constraints, our algorithm seeks to generate valid sub-gradients of the DPP term in cases where the DPP Gram matrix is not invertible (no gradients exist in this case).
In this sense, our algorithm can be easily incorporated with multiple deep learning tasks. Experiments show the effectiveness of our algorithm, indicating promising performance for practical learning problems. \ No newline at end of file diff --git a/data/2020/iclr/Deep Network Classification by Scattering and Homotopy Dictionary Learning b/data/2020/iclr/Deep Network Classification by Scattering and Homotopy Dictionary Learning new file mode 100644 index 0000000000..cfd03de9ef --- /dev/null +++ b/data/2020/iclr/Deep Network Classification by Scattering and Homotopy Dictionary Learning @@ -0,0 +1 @@ +We introduce a sparse scattering deep convolutional neural network, which provides a simple model to analyze properties of deep representation learning for classification. Learning a single dictionary matrix with a classifier yields a higher classification accuracy than AlexNet over the ImageNet 2012 dataset. The network first applies a scattering transform that linearizes variabilities due to geometric transformations such as translations and small deformations. A sparse $\ell^1$ dictionary coding reduces intra-class variability while preserving class separation through projections over unions of linear spaces. It is implemented in a deep convolutional network with a homotopy algorithm having an exponential convergence. A convergence proof is given in a general framework that includes ALISTA. Classification results are analyzed on ImageNet. \ No newline at end of file diff --git a/data/2020/iclr/Deep Semi-Supervised Anomaly Detection b/data/2020/iclr/Deep Semi-Supervised Anomaly Detection new file mode 100644 index 0000000000..e117396cdf --- /dev/null +++ b/data/2020/iclr/Deep Semi-Supervised Anomaly Detection @@ -0,0 +1 @@ +Deep approaches to anomaly detection have recently shown promising results over shallow methods on large and complex datasets. Typically anomaly detection is treated as an unsupervised learning problem. 
In practice, however, one may have---in addition to a large set of unlabeled samples---access to a small pool of labeled samples, e.g. a subset verified by some domain expert as being normal or anomalous. Semi-supervised approaches to anomaly detection aim to utilize such labeled samples, but most proposed methods are limited to merely including labeled normal samples. Only a few methods take advantage of labeled anomalies, with existing deep approaches being domain-specific. In this work, we present Deep SAD, an end-to-end deep methodology for general semi-supervised anomaly detection. We further introduce an information-theoretic framework for deep anomaly detection based on the idea that the entropy of the latent distribution for normal data should be lower than the entropy of the anomalous distribution, which can serve as a theoretical interpretation for our method. In extensive experiments on MNIST, Fashion-MNIST, and CIFAR-10, along with other anomaly detection benchmark datasets, we demonstrate that our method is on par with or outperforms shallow, hybrid, and deep competitors, yielding appreciable performance improvements even when provided with only a little labeled data. \ No newline at end of file diff --git a/data/2020/iclr/DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures b/data/2020/iclr/DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures new file mode 100644 index 0000000000..2d1019493b --- /dev/null +++ b/data/2020/iclr/DeepHoyer: Learning Sparser Neural Network with Differentiable Scale-Invariant Sparsity Measures @@ -0,0 +1 @@ +In seeking sparse and efficient neural network models, many previous works investigated enforcing L1 or L0 regularizers to encourage weight sparsity during training.
The L0 regularizer measures the parameter sparsity directly and is invariant to the scaling of parameter values, but it cannot provide useful gradients, and therefore requires complex optimization techniques. The L1 regularizer is almost everywhere differentiable and can be easily optimized with gradient descent. Yet it is not scale-invariant, applying the same shrinking rate to all parameters, which is inefficient for increasing sparsity. Inspired by the Hoyer measure (the ratio between L1 and L2 norms) used in traditional compressed sensing problems, we present DeepHoyer, a set of sparsity-inducing regularizers that are both differentiable almost everywhere and scale-invariant. Our experiments show that enforcing DeepHoyer regularizers can produce even sparser neural network models than previous works, under the same accuracy level. We also show that DeepHoyer can be applied to both element-wise and structural pruning. \ No newline at end of file diff --git a/data/2020/iclr/DeepV2D: Video to Depth with Differentiable Structure from Motion b/data/2020/iclr/DeepV2D: Video to Depth with Differentiable Structure from Motion new file mode 100644 index 0000000000..9d447aba77 --- /dev/null +++ b/data/2020/iclr/DeepV2D: Video to Depth with Differentiable Structure from Motion @@ -0,0 +1 @@ +We propose DeepV2D, an end-to-end deep learning architecture for predicting depth from video. DeepV2D combines the representation ability of neural networks with the geometric principles governing image formation. We compose a collection of classical geometric algorithms, which are converted into trainable modules and combined into an end-to-end differentiable architecture. DeepV2D interleaves two stages: motion estimation and depth estimation. During inference, motion and depth estimation are alternated and converge to accurate depth. Code is available at this https URL.
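The Hoyer measure that the DeepHoyer abstract above builds on (the ratio between L1 and L2 norms) is easy to compute; here is a hedged NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def hoyer(w):
    """Ratio of L1 to L2 norm: scale-invariant, and smaller for sparser w."""
    w = np.asarray(w, dtype=float)
    return np.abs(w).sum() / np.linalg.norm(w)

w_sparse = np.array([1.0, 0.0, 0.0, 0.0])
w_dense = np.array([1.0, 1.0, 1.0, 1.0])
print(hoyer(w_sparse))  # 1.0 for a 1-hot vector (minimal)
print(hoyer(w_dense))   # 2.0 = sqrt(4) for a uniform vector (maximal)
```

Because both norms scale linearly, `hoyer(c * w) == hoyer(w)` for any nonzero scalar `c`, which is the scale-invariance property the abstract contrasts with plain L1 regularization.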
\ No newline at end of file diff --git a/data/2020/iclr/Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation b/data/2020/iclr/Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation new file mode 100644 index 0000000000..4a4951e5fb --- /dev/null +++ b/data/2020/iclr/Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation @@ -0,0 +1,2 @@ +Convolutional networks are not aware of an object's geometric variations, which leads to inefficient utilization of model and data capacity. To overcome this issue, recent works on deformation modeling seek to spatially reconfigure the data towards a common arrangement such that semantic recognition suffers less from deformation. This is typically done by augmenting static operators with learned free-form sampling grids in the image space, dynamically tuned to the data and task for adapting the receptive field. Yet adapting the receptive field does not quite reach the actual goal -- what really matters to the network is the "effective" receptive field (ERF), which reflects how much each pixel contributes. It is thus natural to design other approaches to adapt the ERF directly during runtime. +In this work, we instantiate one possible solution as Deformable Kernels (DKs), a family of novel and generic convolutional operators for handling object deformations by directly adapting the ERF while leaving the receptive field untouched. At the heart of our method is the ability to resample the original kernel space towards recovering the deformation of objects. This approach is justified with theoretical insights that the ERF is strictly determined by data sampling locations and kernel values. We implement DKs as generic drop-in replacements of rigid kernels and conduct a series of empirical studies whose results conform with our theories. Over several tasks and standard base models, our approach compares favorably against prior works that adapt during runtime. 
In addition, further experiments suggest a working mechanism orthogonal and complementary to previous works. \ No newline at end of file diff --git a/data/2020/iclr/Depth-Adaptive Transformer b/data/2020/iclr/Depth-Adaptive Transformer new file mode 100644 index 0000000000..d574342554 --- /dev/null +++ b/data/2020/iclr/Depth-Adaptive Transformer @@ -0,0 +1 @@ +State of the art sequence-to-sequence models for large scale tasks perform a fixed number of computations for each input sequence regardless of whether it is easy or hard to process. In this paper, we train Transformer models which can make output predictions at different stages of the network and we investigate different ways to predict how much computation is required for a particular sequence. Unlike dynamic computation in Universal Transformers, which applies the same set of layers iteratively, we apply different layers at every step to adjust both the amount of computation as well as the model capacity. On IWSLT German-English translation our approach matches the accuracy of a well tuned baseline Transformer while using less than a quarter of the decoder layers. \ No newline at end of file diff --git a/data/2020/iclr/Detecting Extrapolation with Local Ensembles b/data/2020/iclr/Detecting Extrapolation with Local Ensembles new file mode 100644 index 0000000000..8a680d7c7b --- /dev/null +++ b/data/2020/iclr/Detecting Extrapolation with Local Ensembles @@ -0,0 +1 @@ +We present local ensembles, a method for detecting extrapolation at test time in a pre-trained model. We focus on underdetermination as a key component of extrapolation: we aim to detect when many possible predictions are consistent with the training data and model class. Our method uses local second-order information to approximate the variance of predictions across an ensemble of models from the same class. 
We compute this approximation by estimating the norm of the component of a test point's gradient that aligns with the low-curvature directions of the Hessian, and provide a tractable method for estimating this quantity. Experimentally, we show that our method is capable of detecting when a pre-trained model is extrapolating on test data, with applications to out-of-distribution detection, detecting spurious correlates, and active learning. \ No newline at end of file diff --git a/data/2020/iclr/Detecting and Diagnosing Adversarial Images with Class-Conditional Capsule Reconstructions b/data/2020/iclr/Detecting and Diagnosing Adversarial Images with Class-Conditional Capsule Reconstructions new file mode 100644 index 0000000000..24543dcaa3 --- /dev/null +++ b/data/2020/iclr/Detecting and Diagnosing Adversarial Images with Class-Conditional Capsule Reconstructions @@ -0,0 +1 @@ +Adversarial examples raise questions about whether neural network models are sensitive to the same visual features as humans. In this paper, we first detect adversarial examples or otherwise corrupted images based on a class-conditional reconstruction of the input. To specifically attack our detection mechanism, we propose the Reconstructive Attack, which seeks both to cause a misclassification and to achieve a low reconstruction error. This reconstructive attack produces undetected adversarial examples, but with a much lower success rate. Among all these attacks, we find that CapsNets always perform better than convolutional networks. Then, we diagnose the adversarial examples for CapsNets and find that the success of the reconstructive attack is highly related to the visual similarity between the source and target class. Additionally, the resulting perturbations can cause the input image to appear visually more like the target class and hence become non-adversarial.
This suggests that CapsNets use features that are more aligned with human perception and have the potential to address the central issue raised by adversarial examples. \ No newline at end of file diff --git a/data/2020/iclr/Difference-Seeking Generative Adversarial Network-Unseen Sample Generation b/data/2020/iclr/Difference-Seeking Generative Adversarial Network-Unseen Sample Generation new file mode 100644 index 0000000000..db5df43176 --- /dev/null +++ b/data/2020/iclr/Difference-Seeking Generative Adversarial Network-Unseen Sample Generation @@ -0,0 +1 @@ +Unseen data, which are not samples from the distribution of training data and are difficult to collect, have proven important in numerous applications ({\em e.g.,} novelty detection, semi-supervised learning, and adversarial training). In this paper, we introduce a general framework called \textbf{d}ifference-\textbf{s}eeking \textbf{g}enerative \textbf{a}dversarial \textbf{n}etwork (DSGAN) to generate various types of unseen data. Its novelty lies in treating the probability density of the unseen data distribution as the difference between two distributions $p_{\bar{d}}$ and $p_{d}$ whose samples are relatively easy to collect. The DSGAN can learn the target distribution, $p_{t}$ (or the unseen data distribution), from only the samples from the two distributions, $p_{d}$ and $p_{\bar{d}}$. In our scenario, $p_d$ is the distribution of the seen data, and $p_{\bar{d}}$ can be obtained from $p_{d}$ via simple operations, so that we only need the samples of $p_{d}$ during the training. Two key applications, semi-supervised learning and novelty detection, are taken as case studies to illustrate that the DSGAN enables the production of various unseen data. We also provide theoretical analyses about the convergence of the DSGAN.
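To make the difference-seeking premise concrete, here is a toy 1-D numerical sketch (my own construction for illustration, not the DSGAN training procedure): p_dbar is obtained from the seen distribution p_d by a simple operation (adding noise), and the positive part of p_dbar - p_d, an unnormalized "unseen" density, indeed concentrates outside the bulk of the seen data.

```python
import numpy as np

# Toy 1-D illustration of the difference-seeking premise (an
# illustrative construction, not the DSGAN training procedure).
rng = np.random.default_rng(0)
seen = rng.normal(0.0, 1.0, 100_000)              # samples from p_d
blurred = seen + rng.normal(0.0, 3.0, seen.size)  # samples from p_dbar

bins = np.linspace(-10, 10, 81)
p_d, _ = np.histogram(seen, bins, density=True)
p_dbar, _ = np.histogram(blurred, bins, density=True)
diff = np.clip(p_dbar - p_d, 0.0, None)  # positive part: "unseen" mass

centers = (bins[:-1] + bins[1:]) / 2
frac_outside = diff[np.abs(centers) > 2].sum() / diff.sum()
print(frac_outside)  # most of the difference lies outside the seen bulk
```

Because the two Gaussian densities cross near |x| = 1.6, nearly all of the clipped difference sits outside the seen data's bulk, which is exactly the region where "unseen" samples should live.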
\ No newline at end of file diff --git a/data/2020/iclr/Differentially Private Meta-Learning b/data/2020/iclr/Differentially Private Meta-Learning new file mode 100644 index 0000000000..12d5520f46 --- /dev/null +++ b/data/2020/iclr/Differentially Private Meta-Learning @@ -0,0 +1 @@ +Parameter-transfer is a well-known and versatile approach for meta-learning, with applications including few-shot learning, federated learning, and reinforcement learning. However, parameter-transfer algorithms often require sharing models that have been trained on the samples from specific tasks, thus leaving the task-owners susceptible to breaches of privacy. We conduct the first formal study of privacy in this setting and formalize the notion of task-global differential privacy as a practical relaxation of more commonly studied threat models. We then propose a new differentially private algorithm for gradient-based parameter transfer that not only satisfies this privacy requirement but also retains provable transfer learning guarantees in convex settings. Empirically, we apply our analysis to the problems of federated learning with personalization and few-shot classification, showing that allowing the relaxation to task-global privacy from the more commonly studied notion of local privacy leads to dramatically increased performance in recurrent neural language modeling and image classification. \ No newline at end of file diff --git a/data/2020/iclr/Disentangling Factors of Variations Using Few Labels b/data/2020/iclr/Disentangling Factors of Variations Using Few Labels new file mode 100644 index 0000000000..38cc8254c4 --- /dev/null +++ b/data/2020/iclr/Disentangling Factors of Variations Using Few Labels @@ -0,0 +1 @@ +Learning disentangled representations is considered a cornerstone problem in representation learning. Recently, Locatello et al. 
(2019) demonstrated that unsupervised disentanglement learning without inductive biases is theoretically impossible and that existing inductive biases and unsupervised methods do not allow one to consistently learn disentangled representations. However, in many practical settings, one might have access to a limited amount of supervision, for example through manual labeling of (some) factors of variation in a few training examples. In this paper, we investigate the impact of such supervision on state-of-the-art disentanglement methods and perform a large-scale study, training over 52000 models under well-defined and reproducible experimental conditions. We observe that a small number of labeled examples (0.01--0.5% of the data set), with potentially imprecise and incomplete labels, is sufficient to perform model selection on state-of-the-art unsupervised models. Further, we investigate the benefit of incorporating supervision into the training process. Overall, we empirically validate that with little and imprecise supervision it is possible to reliably learn disentangled representations. \ No newline at end of file diff --git a/data/2020/iclr/Distance-Based Learning from Errors for Confidence Calibration b/data/2020/iclr/Distance-Based Learning from Errors for Confidence Calibration new file mode 100644 index 0000000000..94d8ee3d27 --- /dev/null +++ b/data/2020/iclr/Distance-Based Learning from Errors for Confidence Calibration @@ -0,0 +1 @@ +Deep neural networks (DNNs) are poorly calibrated when trained in conventional ways. To improve confidence calibration of DNNs, we propose a novel training method, distance-based learning from errors (DBLE). DBLE bases its confidence estimation on distances in the representation space. We first adapt prototypical learning to train a classification model for DBLE. It yields a representation space where a test sample's distance to its ground-truth class center can calibrate the model's performance.
At inference, however, these distances are not available due to the lack of ground-truth labels. To circumvent this, we approximately infer the distance for every test sample by training a confidence model jointly with the classification model, learning merely from mis-classified training samples, which we show to be highly beneficial for effective learning. On multiple data sets and DNN architectures, we demonstrate that DBLE outperforms alternative single-modal confidence calibration approaches. DBLE also achieves performance comparable to computationally expensive ensemble approaches, at lower computational cost and with fewer parameters. \ No newline at end of file diff --git a/data/2020/iclr/Diverse Trajectory Forecasting with Determinantal Point Processes b/data/2020/iclr/Diverse Trajectory Forecasting with Determinantal Point Processes new file mode 100644 index 0000000000..dd30743f9e --- /dev/null +++ b/data/2020/iclr/Diverse Trajectory Forecasting with Determinantal Point Processes @@ -0,0 +1 @@ +The ability to forecast a set of likely yet diverse possible future behaviors of an agent (e.g., future trajectories of a pedestrian) is essential for safety-critical perception systems (e.g., autonomous vehicles). In particular, a set of possible future behaviors generated by the system must be diverse to account for all possible outcomes in order to take necessary safety precautions. It is not sufficient to maintain a set of the most likely future outcomes because the set may only contain perturbations of a single outcome. While generative models such as variational autoencoders (VAEs) have been shown to be a powerful tool for learning a distribution over future trajectories, randomly drawn samples from the learned implicit likelihood model may not be diverse -- the likelihood model is derived from the training data distribution and the samples will concentrate around the major mode that has the most data.
In this work, we propose to learn a diversity sampling function (DSF) that generates a diverse and likely set of future trajectories. The DSF maps forecasting context features to a set of latent codes which can be decoded by a generative model (e.g., VAE) into a set of diverse trajectory samples. Concretely, the process of identifying the diverse set of samples is posed as a parameter estimation of the DSF. To learn the parameters of the DSF, the diversity of the trajectory samples is evaluated by a diversity loss based on a determinantal point process (DPP). Gradient descent is performed over the DSF parameters, which in turn move the latent codes of the sample set to find an optimal diverse and likely set of trajectories. Our method is a novel application of DPPs to optimize a set of items (trajectories) in continuous space. We demonstrate the diversity of the trajectories produced by our approach on both low-dimensional 2D trajectory data and high-dimensional human motion data. \ No newline at end of file diff --git a/data/2020/iclr/DivideMix: Learning with Noisy Labels as Semi-supervised Learning b/data/2020/iclr/DivideMix: Learning with Noisy Labels as Semi-supervised Learning new file mode 100644 index 0000000000..eebbffc0f5 --- /dev/null +++ b/data/2020/iclr/DivideMix: Learning with Noisy Labels as Semi-supervised Learning @@ -0,0 +1 @@ +Deep neural networks are known to be annotation-hungry. Numerous efforts have been devoted to reducing the annotation cost when learning with deep networks. Two prominent directions include learning with noisy labels and semi-supervised learning by exploiting unlabeled data. In this work, we propose DivideMix, a novel framework for learning with noisy labels by leveraging semi-supervised learning techniques. 
In particular, DivideMix models the per-sample loss distribution with a mixture model to dynamically divide the training data into a labeled set with clean samples and an unlabeled set with noisy samples, and trains the model on both the labeled and unlabeled data in a semi-supervised manner. To avoid confirmation bias, we simultaneously train two diverged networks where each network uses the dataset division from the other network. During the semi-supervised training phase, we improve the MixMatch strategy by performing label co-refinement and label co-guessing on labeled and unlabeled samples, respectively. Experiments on multiple benchmark datasets demonstrate substantial improvements over state-of-the-art methods. Code is available at this https URL . \ No newline at end of file diff --git a/data/2020/iclr/Dynamic Time Lag Regression: Predicting What & When b/data/2020/iclr/Dynamic Time Lag Regression: Predicting What & When new file mode 100644 index 0000000000..3b3bd91ce2 --- /dev/null +++ b/data/2020/iclr/Dynamic Time Lag Regression: Predicting What & When @@ -0,0 +1 @@ +This paper tackles a new regression problem, called Dynamic Time-Lag Regression (DTLR), where a cause signal drives an effect signal with an unknown time delay. The motivating application, pertaining to space weather modelling, aims to predict the near-Earth solar wind speed based on estimates of the Sun's coronal magnetic field. DTLR differs from mainstream regression and from sequence-to-sequence learning in two respects: firstly, no ground truth (e.g., pairs of associated sub-sequences) is available; secondly, the cause signal contains much information irrelevant to the effect signal (the solar magnetic field governs the solar wind propagation in the heliosphere, of which the Earth's magnetosphere is but a minuscule region). A Bayesian approach is presented to tackle the specifics of the DTLR problem, with theoretical justifications based on linear stability analysis. 
A proof of concept on synthetic problems is presented. Finally, the empirical results on the solar wind modelling task improve on the state of the art in solar wind forecasting. \ No newline at end of file diff --git a/data/2020/iclr/Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery b/data/2020/iclr/Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery new file mode 100644 index 0000000000..f6f0678021 --- /dev/null +++ b/data/2020/iclr/Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery @@ -0,0 +1 @@ +Reinforcement learning requires manual specification of a reward function to learn a task. While in principle this reward function only needs to specify the task goal, in practice reinforcement learning can be very time-consuming or even infeasible unless the reward function is shaped so as to provide a smooth gradient towards a successful outcome. This shaping is difficult to specify by hand, particularly when the task is learned from raw observations, such as images. In this paper, we study how we can automatically learn dynamical distances: a measure of the expected number of time steps to reach a given goal state from any other state. These dynamical distances can be used to provide well-shaped reward functions for reaching new goals, making it possible to learn complex tasks efficiently. We show that dynamical distances can be used in a semi-supervised regime, where unsupervised interaction with the environment is used to learn the dynamical distances, while a small amount of preference supervision is used to determine the task goal, without any manually engineered reward function or goal examples. We evaluate our method both on a real-world robot and in simulation. We show that our method can learn to turn a valve with a real-world 9-DoF hand, using raw image observations and just ten preference labels, without any other supervision. 
Videos of the learned skills can be found on the project website: this https URL. \ No newline at end of file diff --git a/data/2020/iclr/Dynamically Pruned Message Passing Networks for Large-scale Knowledge Graph Reasoning b/data/2020/iclr/Dynamically Pruned Message Passing Networks for Large-scale Knowledge Graph Reasoning new file mode 100644 index 0000000000..7f29b7de46 --- /dev/null +++ b/data/2020/iclr/Dynamically Pruned Message Passing Networks for Large-scale Knowledge Graph Reasoning @@ -0,0 +1 @@ +We propose Dynamically Pruned Message Passing Networks (DPMPN) for large-scale knowledge graph reasoning. In contrast to existing models, embedding-based or path-based, we learn an input-dependent subgraph to explicitly model the reasoning process. Subgraphs are dynamically constructed and expanded by applying a graphical attention mechanism conditioned on input queries. In this way, we not only construct graph-structured explanations but also enable message passing designed in Graph Neural Networks (GNNs) to scale with graph sizes. We take inspiration from the consciousness prior proposed by Bengio (2017) and develop a two-GNN framework to simultaneously encode an input-agnostic full-graph representation and learn an input-dependent local one, coordinated by an attention module. Experiments demonstrate the reasoning capability of our model, which provides clear graphical explanations as well as accurate predictions, outperforming most state-of-the-art methods in knowledge base completion tasks. \ No newline at end of file diff --git a/data/2020/iclr/ES-MAML: Simple Hessian-Free Meta Learning b/data/2020/iclr/ES-MAML: Simple Hessian-Free Meta Learning new file mode 100644 index 0000000000..490ff0efde --- /dev/null +++ b/data/2020/iclr/ES-MAML: Simple Hessian-Free Meta Learning @@ -0,0 +1 @@ +We introduce ES-MAML, a new framework for solving the model-agnostic meta-learning (MAML) problem based on Evolution Strategies (ES).
Existing algorithms for MAML are based on policy gradients, and incur significant difficulties when attempting to estimate second derivatives using backpropagation on stochastic policies. We show how ES can be applied to MAML to obtain an algorithm which avoids the problem of estimating second derivatives, and is also conceptually simple and easy to implement. Moreover, ES-MAML can handle new types of nonsmooth adaptation operators, and other techniques for improving performance and estimation of ES methods become applicable. We show empirically that ES-MAML is competitive with existing methods and often yields better adaptation with fewer queries. \ No newline at end of file diff --git a/data/2020/iclr/Editable Neural Networks b/data/2020/iclr/Editable Neural Networks new file mode 100644 index 0000000000..eb837c84db --- /dev/null +++ b/data/2020/iclr/Editable Neural Networks @@ -0,0 +1 @@ +These days deep neural networks are ubiquitously used in a wide range of tasks, from image classification and machine translation to face identification and self-driving cars. In many applications, a single model error can lead to devastating financial, reputational and even life-threatening consequences. Therefore, it is crucially important to correct model mistakes quickly as they appear. In this work, we investigate the problem of neural network editing - how one can efficiently patch a mistake of the model on a particular sample, without influencing the model behavior on other samples. Namely, we propose Editable Training, a model-agnostic training technique that encourages fast editing of the trained model. We empirically demonstrate the effectiveness of this method on large-scale image classification and machine translation tasks. 
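The editing problem from "Editable Neural Networks" above can be sketched on a toy logistic model (a hypothetical illustration; Editable Training additionally meta-trains the model so that such edits converge in few steps with little drift): take gradient steps on the mistaken sample while penalizing prediction drift on a reference batch.

```python
import numpy as np

# Toy sketch of editing: fix one mistake, preserve behavior elsewhere.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def edit(w, x_err, y_err, X_ref, lr=0.5, lam=0.3, steps=300):
    """Correct the mistake on (x_err, y_err) while keeping predictions
    on the reference batch X_ref close to their pre-edit values."""
    p_ref = sigmoid(X_ref @ w)  # predictions to preserve ("locality")
    for _ in range(steps):
        g_err = (sigmoid(x_err @ w) - y_err) * x_err            # fix the mistake
        g_drift = lam * X_ref.T @ (sigmoid(X_ref @ w) - p_ref)  # stay close elsewhere
        w = w - lr * (g_err + g_drift)
    return w

w = np.array([1.0, -1.0])
x_err = np.array([1.0, 1.0])  # scored at 0.5 by w, but the true label is 1
X_ref = np.array([[2.0, 0.0], [0.0, 2.0]])
before = sigmoid(X_ref @ w)
w = edit(w, x_err, 1.0, X_ref)
print(sigmoid(x_err @ w))                         # mistake now scored correctly
print(np.abs(sigmoid(X_ref @ w) - before).max())  # drift elsewhere stays small
```

The penalty weight lam trades off how aggressively the mistake is corrected against how much the model is allowed to move on the reference batch; the paper's contribution is training the model so that this trade-off is easy at edit time.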
\ No newline at end of file diff --git a/data/2020/iclr/Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform b/data/2020/iclr/Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform new file mode 100644 index 0000000000..660a6d983b --- /dev/null +++ b/data/2020/iclr/Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform @@ -0,0 +1 @@ +Strictly enforcing orthonormality constraints on parameter matrices has been shown advantageous in deep learning. This amounts to Riemannian optimization on the Stiefel manifold, which, however, is computationally expensive. To address this challenge, we present two main contributions: (1) A new efficient retraction map based on an iterative Cayley transform for optimization updates, and (2) An implicit vector transport mechanism based on the combination of a projection of the momentum and the Cayley transform on the Stiefel manifold. We specify two new optimization algorithms: Cayley SGD with momentum, and Cayley ADAM on the Stiefel manifold. Convergence of Cayley SGD is theoretically analyzed. Our experiments for CNN training demonstrate that both algorithms: (a) Use less running time per iteration relative to existing approaches that enforce orthonormality of CNN parameters; and (b) Achieve faster convergence rates than the baseline SGD and ADAM algorithms without compromising the performance of the CNN. Cayley SGD and Cayley ADAM are also shown to reduce the training time for optimizing the unitary transition matrices in RNNs. 
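A single Cayley-transform update from the abstract above can be sketched as follows (using a direct matrix solve for clarity; the paper's efficiency gains come from an iterative approximation of this inverse plus an implicit vector transport for momentum). Since the direction matrix is skew-symmetric, the Cayley factor is orthogonal and the update stays on the Stiefel manifold.

```python
import numpy as np

# One Cayley-transform step on the Stiefel manifold (direct solve for
# clarity; the paper approximates the inverse iteratively).
def cayley_step(W, G, lr=0.1):
    """W: n x p with W.T @ W = I; G: Euclidean gradient at W."""
    A = G @ W.T - W @ G.T  # skew-symmetric direction matrix
    n = W.shape[0]
    # Q = (I + lr/2 A)^-1 (I - lr/2 A) is orthogonal because A is skew.
    Q = np.linalg.solve(np.eye(n) + lr / 2 * A,
                        np.eye(n) - lr / 2 * A)
    return Q @ W

rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((5, 3)))  # point on the Stiefel manifold
G = rng.standard_normal((5, 3))                   # arbitrary Euclidean gradient
W_new = cayley_step(W, G)
print(np.allclose(W_new.T @ W_new, np.eye(3)))  # orthonormality is preserved
```

This is why no explicit re-orthogonalization is needed after each update: the constraint is maintained by construction, up to floating-point error.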
\ No newline at end of file diff --git a/data/2020/iclr/Efficient and Information-Preserving Future Frame Prediction and Beyond b/data/2020/iclr/Efficient and Information-Preserving Future Frame Prediction and Beyond new file mode 100644 index 0000000000..4e8a57e7ab --- /dev/null +++ b/data/2020/iclr/Efficient and Information-Preserving Future Frame Prediction and Beyond @@ -0,0 +1 @@ +Applying resolution-preserving blocks is a common practice to maximize information preservation in video prediction, yet their high memory consumption greatly limits their application scenarios. We propose CrevNet, a Conditionally Reversible Network that uses reversible architectures to build a bijective two-way autoencoder and its complementary recurrent predictor. Our model enjoys the theoretically guaranteed property of no information loss during the feature extraction, much lower memory consumption and computational efficiency. The lightweight nature of our model enables us to incorporate 3D convolutions without concern of memory bottleneck, enhancing the model's ability to capture both short-term and long-term temporal dependencies. Our proposed approach achieves state-of-the-art results on Moving MNIST, Traffic4cast and KITTI datasets. We further demonstrate the transferability of our self-supervised learning method by exploiting its learnt features for object detection on KITTI. Our competitive results indicate the potential of using CrevNet as a generative pre-training strategy to guide downstream tasks. 
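CrevNet's information-preservation guarantee rests on reversible (bijective) blocks. A minimal dense additive-coupling block, sketched here as a stand-in for the paper's convolutional couplings, can be inverted exactly, so feature extraction loses no information:

```python
import numpy as np

# Additive-coupling reversible block (toy dense version; CrevNet uses
# convolutional couplings). The sub-network f never needs inverting.
def f(h):
    return np.tanh(1.7 * h + 0.3)

def forward(x1, x2):
    y1 = x1 + f(x2)
    y2 = x2 + f(y1)
    return y1, y2

def inverse(y1, y2):
    # Undo the coupling in reverse order by subtracting the same terms.
    x2 = y2 - f(y1)
    x1 = y1 - f(x2)
    return x1, x2

x1, x2 = np.random.default_rng(0).standard_normal((2, 4))
r1, r2 = inverse(*forward(x1, x2))
print(np.allclose(r1, x1) and np.allclose(r2, x2))  # bijective: exact reconstruction
```

Because activations can be recomputed from the block's outputs rather than stored, memory cost stays low, which is what lets the paper add 3D convolutions without hitting a memory bottleneck.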
\ No newline at end of file diff --git a/data/2020/iclr/Enhancing Transformation-Based Defenses Against Adversarial Attacks with a Distribution Classifier b/data/2020/iclr/Enhancing Transformation-Based Defenses Against Adversarial Attacks with a Distribution Classifier new file mode 100644 index 0000000000..31ef9c7efe --- /dev/null +++ b/data/2020/iclr/Enhancing Transformation-Based Defenses Against Adversarial Attacks with a Distribution Classifier @@ -0,0 +1 @@ +Adversarial attacks on convolutional neural networks (CNNs) have gained significant attention, and there have been active research efforts on defense mechanisms. Stochastic input transformation methods have been proposed, where the idea is to recover the image from an adversarial attack by random transformation, and to take the majority vote as consensus among the random samples. However, the transformation improves the accuracy on adversarial images at the expense of the accuracy on clean images. While it is intuitive that the accuracy on clean images would deteriorate, the exact mechanism by which this occurs is unclear. In this paper, we study the distribution of softmax induced by stochastic transformations. We observe that with random transformations on the clean images, although the mass of the softmax distribution could shift to the wrong class, the resulting distribution of softmax could be used to correct the prediction. Furthermore, on the adversarial counterparts, with the image transformation, the resulting shapes of the distribution of softmax are similar to the distributions from the clean images. With these observations, we propose a method to improve existing transformation-based defenses. We train a separate lightweight distribution classifier to recognize distinct features in the distributions of softmax outputs of transformed images.
Our empirical studies show that our distribution classifier, by training on distributions obtained from clean images only, outperforms majority voting for both clean and adversarial images. Our method is generic and can be integrated with existing transformation-based defenses. \ No newline at end of file diff --git a/data/2020/iclr/Ensemble Distribution Distillation b/data/2020/iclr/Ensemble Distribution Distillation new file mode 100644 index 0000000000..c90275b749 --- /dev/null +++ b/data/2020/iclr/Ensemble Distribution Distillation @@ -0,0 +1 @@ +Ensembles of models often yield improvements in system performance. These ensemble approaches have also been empirically shown to yield robust measures of uncertainty, and are capable of distinguishing between different \emph{forms} of uncertainty. However, ensembles come at a computational and memory cost which may be prohibitive for many applications. There has been significant work done on the distillation of an ensemble into a single model. Such approaches decrease computational cost and allow a single model to achieve an accuracy comparable to that of an ensemble. However, information about the \emph{diversity} of the ensemble, which can yield estimates of different forms of uncertainty, is lost. This work considers the novel task of \emph{Ensemble Distribution Distillation} (EnD$^2$) --- distilling the distribution of the predictions from an ensemble, rather than just the average prediction, into a single model. EnD$^2$ enables a single model to retain both the improved classification performance of ensemble distillation as well as information about the diversity of the ensemble, which is useful for uncertainty estimation. A solution for EnD$^2$ based on Prior Networks, a class of models which allow a single neural network to explicitly model a distribution over output distributions, is proposed in this work. 
The properties of EnD$^2$ are investigated both on an artificial dataset and on the CIFAR-10, CIFAR-100, and TinyImageNet datasets, where it is shown that EnD$^2$ can approach the classification performance of an ensemble, and outperforms both standard DNNs and Ensemble Distillation on the tasks of misclassification and out-of-distribution input detection. \ No newline at end of file diff --git a/data/2020/iclr/Escaping Saddle Points Faster with Stochastic Momentum b/data/2020/iclr/Escaping Saddle Points Faster with Stochastic Momentum new file mode 100644 index 0000000000..4e6c1cb3ed --- /dev/null +++ b/data/2020/iclr/Escaping Saddle Points Faster with Stochastic Momentum @@ -0,0 +1 @@ +Stochastic gradient descent (SGD) with stochastic momentum is popular in nonconvex stochastic optimization and particularly for the training of deep neural networks. In standard SGD, parameters are updated by improving along the path of the gradient at the current iterate on a batch of examples, where the addition of a ``momentum'' term biases the update in the direction of the previous change in parameters. In non-stochastic convex optimization one can show that a momentum adjustment provably reduces convergence time in many settings, yet such results have been elusive in the stochastic and non-convex settings. At the same time, a widely-observed empirical phenomenon is that in training deep networks, stochastic momentum appears to significantly improve convergence time, and variants of it have flourished in the development of other popular update methods, e.g., ADAM and AMSGrad. Yet theoretical justification for the use of stochastic momentum has remained a significant open question. In this paper we propose an answer: stochastic momentum improves deep network training because it modifies SGD to escape saddle points faster and, consequently, to more quickly find a second-order stationary point.
Our theoretical results also shed light on the related question of how to choose the ideal momentum parameter--our analysis suggests that $\beta \in [0,1)$ should be large (close to 1), which comports with empirical findings. We also provide experimental findings that further validate these conclusions. \ No newline at end of file diff --git a/data/2020/iclr/Evaluating The Search Phase of Neural Architecture Search b/data/2020/iclr/Evaluating The Search Phase of Neural Architecture Search new file mode 100644 index 0000000000..5903855ed4 --- /dev/null +++ b/data/2020/iclr/Evaluating The Search Phase of Neural Architecture Search @@ -0,0 +1 @@ +Neural Architecture Search (NAS) aims to facilitate the design of deep networks for new tasks. Existing techniques rely on two stages: searching over the architecture space and validating the best architecture. NAS algorithms are currently compared solely based on their results on the downstream task. While intuitive, this fails to explicitly evaluate the effectiveness of their search strategies. In this paper, we propose to evaluate the NAS search phase. To this end, we compare the quality of the solutions obtained by NAS search policies with that of random architecture selection. We find that: (i) On average, the state-of-the-art NAS algorithms perform similarly to the random policy; (ii) the widely-used weight sharing strategy degrades the ranking of the NAS candidates to the point of not reflecting their true performance, thus reducing the effectiveness of the search process. We believe that our evaluation framework will be key to designing NAS strategies that consistently discover architectures superior to random ones. 
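The heavy-ball update analyzed in "Escaping Saddle Points Faster with Stochastic Momentum" above is simple to state; here is a toy sketch on a convex quadratic (illustrative only; the paper's results concern saddle-point escape in nonconvex stochastic settings, with beta close to 1):

```python
import numpy as np

# SGD with (heavy-ball) momentum on a toy quadratic objective.
def sgd_momentum(grad, x0, lr=0.01, beta=0.9, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x) + 0.01 * rng.standard_normal(x.shape)  # stochastic gradient
        v = beta * v + g   # momentum buffer: previous change biases the update
        x = x - lr * v
    return x

quad_grad = lambda x: 2 * x  # gradient of f(x) = ||x||^2
x = sgd_momentum(quad_grad, [5.0, -3.0])
print(np.linalg.norm(x))  # iterate ends near the minimum at the origin
```

The buffer `v` accumulates past gradients geometrically, which is the "bias in the direction of the previous change in parameters" described in the abstract.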
\ No newline at end of file diff --git a/data/2020/iclr/Exploration in Reinforcement Learning with Deep Covering Options b/data/2020/iclr/Exploration in Reinforcement Learning with Deep Covering Options new file mode 100644 index 0000000000..0f030c75ea --- /dev/null +++ b/data/2020/iclr/Exploration in Reinforcement Learning with Deep Covering Options @@ -0,0 +1 @@ +While many option discovery methods have been proposed to accelerate exploration in reinforcement learning, they are often heuristic. Recently, covering options was proposed to discover a set of options that provably reduce the upper bound of the environment's cover time, a measure of the difficulty of exploration. Covering options are computed using the eigenvectors of the graph Laplacian, but they are constrained to tabular tasks and are not applicable to tasks with large or continuous state-spaces. We introduce deep covering options, an online method that extends covering options to large state spaces, automatically discovering task-agnostic options that encourage exploration. We evaluate our method in several challenging sparse-reward domains and we show that our approach identifies less explored regions of the state-space and successfully generates options to visit these regions, substantially improving both the exploration and the total accumulated reward. \ No newline at end of file diff --git a/data/2020/iclr/Exploring Model-based Planning with Policy Networks b/data/2020/iclr/Exploring Model-based Planning with Policy Networks new file mode 100644 index 0000000000..31026dc402 --- /dev/null +++ b/data/2020/iclr/Exploring Model-based Planning with Policy Networks @@ -0,0 +1 @@ +Model-based reinforcement learning (MBRL) with model-predictive control or online planning has shown great potential for locomotion control tasks in terms of both sample efficiency and asymptotic performance. 
Despite their initial successes, the existing planning methods search from candidate sequences randomly generated in the action space, which is inefficient in complex high-dimensional environments. In this paper, we propose a novel MBRL algorithm, model-based policy planning (POPLIN), that combines policy networks with online planning. More specifically, we formulate action planning at each time-step as an optimization problem using neural networks. We experiment with both optimization w.r.t. the action sequences initialized from the policy network, and also online optimization directly w.r.t. the parameters of the policy network. We show that POPLIN obtains state-of-the-art performance in the MuJoCo benchmarking environments, being about 3x more sample efficient than the state-of-the-art algorithms, such as PETS, TD3 and SAC. To explain the effectiveness of our algorithm, we show that the optimization surface in parameter space is smoother than in action space. Furthermore, we found that the distilled policy network can be effectively applied without the expensive model predictive control at test time for some environments, such as Cheetah. Code is released at this https URL. \ No newline at end of file diff --git a/data/2020/iclr/FSPool: Learning Set Representations with Featurewise Sort Pooling b/data/2020/iclr/FSPool: Learning Set Representations with Featurewise Sort Pooling new file mode 100644 index 0000000000..34228a29ed --- /dev/null +++ b/data/2020/iclr/FSPool: Learning Set Representations with Featurewise Sort Pooling @@ -0,0 +1 @@ +Traditional set prediction models can struggle with simple datasets due to an issue we call the responsibility problem. We introduce a pooling method for sets of feature vectors based on sorting features across elements of the set. This can be used to construct a permutation-equivariant auto-encoder that avoids this responsibility problem.
On a toy dataset of polygons and a set version of MNIST, we show that such an auto-encoder produces considerably better reconstructions and representations. Replacing the pooling function in existing set encoders with FSPool improves accuracy and convergence speed on a variety of datasets. \ No newline at end of file diff --git a/data/2020/iclr/Fast is better than free: Revisiting adversarial training b/data/2020/iclr/Fast is better than free: Revisiting adversarial training new file mode 100644 index 0000000000..fef813e939 --- /dev/null +++ b/data/2020/iclr/Fast is better than free: Revisiting adversarial training @@ -0,0 +1 @@ +Adversarial training, a method for learning robust deep networks, is typically assumed to be more expensive than traditional training due to the necessity of constructing adversarial examples via a first-order method like projected gradient descent (PGD). In this paper, we make the surprising discovery that it is possible to train empirically robust models using a much weaker and cheaper adversary, an approach that was previously believed to be ineffective, rendering the method no more costly than standard training in practice. Specifically, we show that adversarial training with the fast gradient sign method (FGSM), when combined with random initialization, is as effective as PGD-based training but has significantly lower cost. Furthermore, we show that FGSM adversarial training can be further accelerated by using standard techniques for efficient training of deep networks, allowing us to learn a robust CIFAR10 classifier with 45% robust accuracy to PGD attacks with $\epsilon=8/255$ in 6 minutes, and a robust ImageNet classifier with 43% robust accuracy at $\epsilon=2/255$ in 12 hours, in comparison to past work based on "free" adversarial training which took 10 and 50 hours to reach the same respective thresholds.
Finally, we identify a failure mode referred to as "catastrophic overfitting" which may have caused previous attempts to use FGSM adversarial training to fail. All code for reproducing the experiments in this paper as well as pretrained model weights are at this https URL. \ No newline at end of file diff --git a/data/2020/iclr/FasterSeg: Searching for Faster Real-time Semantic Segmentation b/data/2020/iclr/FasterSeg: Searching for Faster Real-time Semantic Segmentation new file mode 100644 index 0000000000..c5f1756a92 --- /dev/null +++ b/data/2020/iclr/FasterSeg: Searching for Faster Real-time Semantic Segmentation @@ -0,0 +1 @@ +We present FasterSeg, an automatically designed semantic segmentation network with not only state-of-the-art performance but also faster speed than current methods. Utilizing neural architecture search (NAS), FasterSeg is discovered from a novel and broader search space integrating multi-resolution branches, which has recently been found to be vital in manually designed segmentation models. To better calibrate the balance between the goals of high accuracy and low latency, we propose a decoupled and fine-grained latency regularization, which effectively overcomes our observed phenomenon that the searched networks are prone to "collapsing" to low-latency yet poor-accuracy models. Moreover, we seamlessly extend FasterSeg to a new collaborative search (co-searching) framework, simultaneously searching for a teacher and a student network in the same single run. The teacher-student distillation further boosts the student model’s accuracy. Experiments on popular segmentation benchmarks demonstrate the competency of FasterSeg. For example, FasterSeg can run over 30% faster than the closest manually designed competitor on Cityscapes, while maintaining comparable accuracy.
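The random-initialized FGSM perturbation step described in the "Fast is better than free" abstract above can be sketched in NumPy. This is an illustrative sketch, not the authors' code: the toy gradient function and the step size `alpha` are assumptions for demonstration.

```python
import numpy as np

def fgsm_random_init_step(x, grad_fn, epsilon, alpha):
    """One FGSM perturbation with random initialization (a sketch of the
    scheme described in "Fast is better than free"): start from uniform
    noise in [-epsilon, epsilon], take a single signed-gradient step of
    size alpha, then project back onto the epsilon-ball."""
    delta = np.random.uniform(-epsilon, epsilon, size=x.shape)
    delta = delta + alpha * np.sign(grad_fn(x + delta))
    return np.clip(delta, -epsilon, epsilon)

# toy quadratic-loss gradient, purely for illustration
grad = lambda x: 2.0 * x
x = np.zeros(4)
d = fgsm_random_init_step(x, grad, epsilon=8 / 255, alpha=10 / 255)
# the returned perturbation always stays inside the epsilon-ball
```

In a real training loop this perturbation would be recomputed per minibatch and added to the inputs before the usual gradient update.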
\ No newline at end of file diff --git a/data/2020/iclr/Feature Interaction Interpretability: A Case for Explaining Ad-Recommendation Systems via Neural Interaction Detection b/data/2020/iclr/Feature Interaction Interpretability: A Case for Explaining Ad-Recommendation Systems via Neural Interaction Detection new file mode 100644 index 0000000000..5bdf24130a --- /dev/null +++ b/data/2020/iclr/Feature Interaction Interpretability: A Case for Explaining Ad-Recommendation Systems via Neural Interaction Detection @@ -0,0 +1 @@ +Recommendation is a prevalent application of machine learning that affects many users; therefore, it is important for recommender models to be accurate and interpretable. In this work, we propose a method to both interpret and augment the predictions of black-box recommender systems. In particular, we propose to interpret feature interactions from a source recommender model and explicitly encode these interactions in a target recommender model, where both source and target models are black-boxes. By not assuming the structure of the recommender system, our approach can be used in general settings. In our experiments, we focus on a prominent use of machine learning recommendation: ad-click prediction. We found that our interaction interpretations are both informative and predictive, e.g., significantly outperforming existing recommender models. What's more, the same approach to interpret interactions can provide new insights into domains even beyond recommendation, such as text and image classification. 
\ No newline at end of file diff --git a/data/2020/iclr/Federated Adversarial Domain Adaptation b/data/2020/iclr/Federated Adversarial Domain Adaptation new file mode 100644 index 0000000000..90185826f5 --- /dev/null +++ b/data/2020/iclr/Federated Adversarial Domain Adaptation @@ -0,0 +1 @@ +Federated learning improves data privacy and efficiency in machine learning performed over networks of distributed devices, such as mobile phones, IoT and wearable devices. Yet models trained with federated learning can still fail to generalize to new devices due to the problem of domain shift. Domain shift occurs when the labeled data collected by source nodes statistically differs from the target node's unlabeled data. In this work, we present a principled approach to the problem of federated domain adaptation, which aims to align the representations learned among the different nodes with the data distribution of the target node. Our approach extends adversarial adaptation techniques to the constraints of the federated setting. In addition, we devise a dynamic attention mechanism and leverage feature disentanglement to enhance knowledge transfer. Empirically, we perform extensive experiments on several image and text classification tasks and show promising results under the unsupervised federated domain adaptation setting. \ No newline at end of file diff --git a/data/2020/iclr/Few-Shot Learning on graphs via super-Classes based on Graph spectral Measures b/data/2020/iclr/Few-Shot Learning on graphs via super-Classes based on Graph spectral Measures new file mode 100644 index 0000000000..24f5604b55 --- /dev/null +++ b/data/2020/iclr/Few-Shot Learning on graphs via super-Classes based on Graph spectral Measures @@ -0,0 +1 @@ +We propose to study the problem of few-shot graph classification in graph neural networks (GNNs) to recognize unseen classes, given limited labeled graph examples.
Despite several interesting GNN variants being proposed recently for node and graph classification tasks, when faced with scarce labeled examples in the few-shot setting, these GNNs exhibit significant loss in classification performance. Here, we present an approach where a probability measure is assigned to each graph based on the spectrum of the graph’s normalized Laplacian. This enables us to accordingly cluster the graph base-labels associated with each graph into super-classes, where the L^p Wasserstein distance serves as our underlying distance metric. Subsequently, a super-graph constructed based on the super-classes is then fed to our proposed GNN framework which exploits the latent inter-class relationships made explicit by the super-graph to achieve better class label separation among the graphs. We conduct exhaustive empirical evaluations of our proposed method and show that it outperforms both the adaptation of state-of-the-art graph classification methods to the few-shot scenario and our naive baseline GNNs. Additionally, we also extend our method to semi-supervised and active learning scenarios and study its behavior. \ No newline at end of file diff --git a/data/2020/iclr/Few-shot Text Classification with Distributional Signatures b/data/2020/iclr/Few-shot Text Classification with Distributional Signatures new file mode 100644 index 0000000000..453b6bc12f --- /dev/null +++ b/data/2020/iclr/Few-shot Text Classification with Distributional Signatures @@ -0,0 +1 @@ +In this paper, we explore meta-learning for few-shot text classification. Meta-learning has shown strong performance in computer vision, where low-level patterns are transferable across learning tasks. However, directly applying this approach to text is challenging--lexical features highly informative for one task may be insignificant for another.
Thus, rather than learning solely from words, our model also leverages their distributional signatures, which encode pertinent word occurrence patterns. Our model is trained within a meta-learning framework to map these signatures into attention scores, which are then used to weight the lexical representations of words. We demonstrate that our model consistently outperforms prototypical networks learned on lexical knowledge (Snell et al., 2017) in both few-shot text classification and relation classification by a significant margin across six benchmark datasets (20.0% on average in 1-shot classification). \ No newline at end of file diff --git a/data/2020/iclr/Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents b/data/2020/iclr/Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents new file mode 100644 index 0000000000..671e68ea5d --- /dev/null +++ b/data/2020/iclr/Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents @@ -0,0 +1 @@ +As deep reinforcement learning driven by visual perception becomes more widely used, there is a growing need to better understand and probe the learned agents. Understanding the decision making process and its relationship to visual inputs can be very valuable to identify problems in learned behavior. However, this topic has been relatively under-explored in the research community. In this work, we present a method for synthesizing visual inputs of interest for a trained agent. Such inputs or states could be situations in which specific actions are necessary. Further, critical states in which a very high or a very low reward can be achieved are often interesting to understand the situational awareness of the system as they can correspond to risky states. To this end, we learn a generative model over the state space of the environment and use its latent space to optimize a target function for the state of interest.
In our experiments, we show that this method can generate insights for a variety of environments and reinforcement learning methods. We explore results in the standard Atari benchmark games as well as in an autonomous driving simulator. Based on the efficiency with which we have been able to identify behavioural weaknesses with this technique, we believe this general approach could serve as an important tool for AI safety applications. \ No newline at end of file diff --git a/data/2020/iclr/Fooling Detection Alone is Not Enough: Adversarial Attack against Multiple Object Tracking b/data/2020/iclr/Fooling Detection Alone is Not Enough: Adversarial Attack against Multiple Object Tracking new file mode 100644 index 0000000000..4803d2a116 --- /dev/null +++ b/data/2020/iclr/Fooling Detection Alone is Not Enough: Adversarial Attack against Multiple Object Tracking @@ -0,0 +1 @@ +Recent work in adversarial machine learning started to focus on the visual perception in autonomous driving and studied Adversarial Examples (AEs) for object detection models. However, in such a visual perception pipeline the detected objects must also be tracked, in a process called Multiple Object Tracking (MOT), to build the moving trajectories of surrounding obstacles. Since MOT is designed to be robust against errors in object detection, it poses a general challenge to existing attack techniques that blindly target object detection: we find that a success rate of over 98% is needed for them to actually affect the tracking results, a requirement that no existing attack technique can satisfy. In this paper, we are the first to study adversarial machine learning attacks against the complete visual perception pipeline in autonomous driving, and discover a novel attack technique, tracker hijacking, that can effectively fool MOT using AEs on object detection.
Using our technique, successful AEs on as few as one single frame can move an existing object into or out of the headway of an autonomous vehicle to cause potential safety hazards. We perform evaluation using the Berkeley Deep Drive dataset and find that on average when 3 frames are attacked, our attack can have a nearly 100% success rate while attacks that blindly target object detection only reach up to 25%. \ No newline at end of file diff --git a/data/2020/iclr/Four Things Everyone Should Know to Improve Batch Normalization b/data/2020/iclr/Four Things Everyone Should Know to Improve Batch Normalization new file mode 100644 index 0000000000..d5a90aab4c --- /dev/null +++ b/data/2020/iclr/Four Things Everyone Should Know to Improve Batch Normalization @@ -0,0 +1 @@ +A key component of most neural network architectures is the use of normalization layers, such as Batch Normalization. Despite its common use and large utility in optimizing deep architectures, it has been challenging both to generically improve upon Batch Normalization and to understand the circumstances that lend themselves to other enhancements. In this paper, we identify four improvements to the generic form of Batch Normalization and the circumstances under which they work, yielding performance gains across all batch sizes while requiring no additional computation during training. These contributions include proposing a method for reasoning about the current example in inference normalization statistics, fixing a training vs. inference discrepancy; recognizing and validating the powerful regularization effect of Ghost Batch Normalization for small and medium batch sizes; examining the effect of weight decay regularization on the scaling and shifting parameters gamma and beta; and identifying a new normalization algorithm for very small batch sizes by combining the strengths of Batch and Group Normalization.
We validate our results empirically on six datasets: CIFAR-100, SVHN, Caltech-256, Oxford Flowers-102, CUB-2011, and ImageNet. \ No newline at end of file diff --git a/data/2020/iclr/From Variational to Deterministic Autoencoders b/data/2020/iclr/From Variational to Deterministic Autoencoders new file mode 100644 index 0000000000..cd70f27809 --- /dev/null +++ b/data/2020/iclr/From Variational to Deterministic Autoencoders @@ -0,0 +1 @@ +Variational Autoencoders (VAEs) provide a theoretically-backed and popular framework for deep generative models. However, learning a VAE from data poses still unanswered theoretical questions and considerable practical challenges. In this work, we propose an alternative framework for generative modeling that is simpler, easier to train, and deterministic, yet has many of the advantages of VAEs. We observe that sampling a stochastic encoder in a Gaussian VAE can be interpreted as simply injecting noise into the input of a deterministic decoder. We investigate how substituting this kind of stochasticity, with other explicit and implicit regularization schemes, can lead to an equally smooth and meaningful latent space without forcing it to conform to an arbitrarily chosen prior. To retrieve a generative mechanism to sample new data, we introduce an ex-post density estimation step that can be readily applied also to existing VAEs, improving their sample quality. We show, in a rigorous empirical study, that the proposed regularized deterministic autoencoders are able to generate samples that are comparable to, or better than, those of VAEs and more powerful alternatives when applied to images as well as to structured data such as molecules. \footnote{An implementation is available at: \url{this https URL}} \ No newline at end of file diff --git a/data/2020/iclr/Functional vs. parametric equivalence of ReLU networks b/data/2020/iclr/Functional vs. 
parametric equivalence of ReLU networks new file mode 100644 index 0000000000..3acf078a4a --- /dev/null +++ b/data/2020/iclr/Functional vs. parametric equivalence of ReLU networks @@ -0,0 +1 @@ +We address the following question: How redundant is the parameterisation of ReLU networks? Specifically, we consider transformations of the weight space which leave the function implemented by the network intact. Two such transformations are known for feed-forward architectures: permutation of neurons within a layer, and positive scaling of all incoming weights of a neuron coupled with inverse scaling of its outgoing weights. In this work, we show for architectures with non-increasing widths that permutation and scaling are in fact the only function-preserving weight transformations. For any eligible architecture we give an explicit construction of a neural network such that any other network that implements the same function can be obtained from the original one by the application of permutations and rescaling. The proof relies on a geometric understanding of boundaries between linear regions of ReLU networks, and we hope the developed mathematical tools are of independent interest. \ No newline at end of file diff --git a/data/2020/iclr/GAT: Generative Adversarial Training for Adversarial Example Detection and Robust Classification b/data/2020/iclr/GAT: Generative Adversarial Training for Adversarial Example Detection and Robust Classification new file mode 100644 index 0000000000..76f4811e40 --- /dev/null +++ b/data/2020/iclr/GAT: Generative Adversarial Training for Adversarial Example Detection and Robust Classification @@ -0,0 +1 @@ +The vulnerabilities of deep neural networks against adversarial examples have become a significant concern for deploying these models in sensitive domains. 
Devising a definitive defense against such attacks has proven to be challenging, and the methods relying on detecting adversarial samples are only valid when the attacker is oblivious to the detection mechanism. In this paper, we consider the adversarial detection problem under the robust optimization framework. We partition the input space into subspaces and train adversarial robust subspace detectors using asymmetrical adversarial training (AAT). The integration of the classifier and detectors presents a detection mechanism that provides a performance guarantee with respect to the adversary it considers. We demonstrate that AAT promotes the learning of class-conditional distributions, which further gives rise to generative detection/classification approaches that are both robust and more interpretable. We provide comprehensive evaluations of the above methods, and demonstrate their competitive performances and compelling properties on adversarial detection and robust classification problems. \ No newline at end of file diff --git a/data/2020/iclr/GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations b/data/2020/iclr/GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations new file mode 100644 index 0000000000..bd4b3095ef --- /dev/null +++ b/data/2020/iclr/GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations @@ -0,0 +1 @@ +Generative latent-variable models are emerging as promising tools in robotics and reinforcement learning. Yet, even though tasks in these domains typically involve distinct objects, most state-of-the-art generative models do not explicitly capture the compositional nature of visual scenes. Two recent exceptions, MONet and IODINE, decompose scenes into objects in an unsupervised fashion. Their underlying generative processes, however, do not account for component interactions. Hence, neither of them allows for principled sampling of novel scenes.
Here we present GENESIS, the first object-centric generative model of 3D visual scenes capable of both decomposing and generating scenes by capturing relationships between scene components. GENESIS parameterises a spatial GMM over images which is decoded from a set of object-centric latent variables that are either inferred sequentially in an amortised fashion or sampled from an autoregressive prior. We train GENESIS on several publicly available datasets and evaluate its performance on scene generation, decomposition, and semi-supervised learning. \ No newline at end of file diff --git a/data/2020/iclr/GLAD: Learning Sparse Graph Recovery b/data/2020/iclr/GLAD: Learning Sparse Graph Recovery new file mode 100644 index 0000000000..bf043cf95c --- /dev/null +++ b/data/2020/iclr/GLAD: Learning Sparse Graph Recovery @@ -0,0 +1 @@ +Recovering sparse conditional independence graphs from data is a fundamental problem in machine learning with wide applications. A popular formulation of the problem is an $\ell_1$ regularized maximum likelihood estimation. Many convex optimization algorithms have been designed to solve this formulation to recover the graph structure. Recently, there is a surge of interest to learn algorithms directly based on data, and in this case, learn to map empirical covariance to the sparse precision matrix. However, it is a challenging task in this case, since the symmetric positive definiteness (SPD) and sparsity of the matrix are not easy to enforce in learned algorithms, and a direct mapping from data to precision matrix may contain many parameters. We propose a deep learning architecture, GLAD, which uses an Alternating Minimization (AM) algorithm as our model inductive bias, and learns the model parameters via supervised learning. We show that GLAD learns a very compact and effective model for recovering sparse graphs from data. 
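The sparsity-enforcing step inside an unrolled alternating-minimization solver like GLAD, described in the abstract above, can be illustrated with elementwise soft-thresholding. A minimal sketch, not the GLAD architecture itself: in GLAD the threshold would be a learned parameter, while here `lam` is a fixed illustrative constant.

```python
import numpy as np

def soft_threshold(theta, lam):
    """Elementwise soft-thresholding: shrink each entry of the
    precision-matrix iterate toward zero by lam, zeroing small entries.
    This is the classic sparsity-inducing step that unrolled
    alternating-minimization solvers apply at each layer."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

theta = np.array([[2.0, 0.3],
                  [0.3, 2.0]])
sparse = soft_threshold(theta, lam=0.5)
# the weak off-diagonal entries (0.3) are zeroed; the strong
# diagonal entries (2.0) shrink to 1.5 but survive
```

Stacking such steps, with the thresholds learned by supervision, is the inductive bias the abstract refers to.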
\ No newline at end of file diff --git a/data/2020/iclr/Gap-Aware Mitigation of Gradient Staleness b/data/2020/iclr/Gap-Aware Mitigation of Gradient Staleness new file mode 100644 index 0000000000..9c38a3a9fa --- /dev/null +++ b/data/2020/iclr/Gap-Aware Mitigation of Gradient Staleness @@ -0,0 +1 @@ +Cloud computing is becoming increasingly popular as a platform for distributed training of deep neural networks. Synchronous stochastic gradient descent (SSGD) suffers from substantial slowdowns due to stragglers if the environment is non-dedicated, as is common in cloud computing. Asynchronous SGD (ASGD) methods are immune to these slowdowns but are scarcely used due to gradient staleness, which encumbers the convergence process. Recent techniques have had limited success mitigating the gradient staleness when scaling up to many workers (computing nodes). In this paper, we define the Gap as a measure of gradient staleness and propose Gap-Aware (GA), a novel asynchronous-distributed method that penalizes stale gradients linearly in the Gap and performs well even when scaling to large numbers of workers. Our evaluation on the CIFAR, ImageNet, and WikiText-103 datasets shows that GA outperforms the currently accepted gradient penalization method in final test accuracy. We also provide a convergence rate proof for GA. Despite prior beliefs, we show that if GA is applied, momentum becomes beneficial in asynchronous environments, even when the number of workers scales up. \ No newline at end of file diff --git a/data/2020/iclr/Generalization bounds for deep convolutional neural networks b/data/2020/iclr/Generalization bounds for deep convolutional neural networks new file mode 100644 index 0000000000..bbb8c8c24f --- /dev/null +++ b/data/2020/iclr/Generalization bounds for deep convolutional neural networks @@ -0,0 +1 @@ +We prove bounds on the generalization error of convolutional networks.
The bounds are in terms of the training loss, the number of parameters, the Lipschitz constant of the loss and the distance from the weights to the initial weights. They are independent of the number of pixels in the input, and the height and width of hidden feature maps. We present experiments using CIFAR-10 with varying hyperparameters of a deep convolutional network, comparing our bounds with practical generalization gaps. \ No newline at end of file diff --git a/data/2020/iclr/Generative Ratio Matching Networks b/data/2020/iclr/Generative Ratio Matching Networks new file mode 100644 index 0000000000..a7f9797ebb --- /dev/null +++ b/data/2020/iclr/Generative Ratio Matching Networks @@ -0,0 +1 @@ +Deep generative models can learn to generate realistic-looking images, but many of the most effective methods are adversarial and involve a saddlepoint optimization, which requires careful balancing of training between a generator network and a critic network. Maximum mean discrepancy networks (MMD-nets) avoid this issue by using a kernel as a fixed adversary, but unfortunately they have not on their own been able to match the generative quality of adversarial training. In this work, we take their insight of using kernels as fixed adversaries further and present a novel method for training deep generative models that does not involve saddlepoint optimization. We call our method generative ratio matching or GRAM for short. In GRAM, the generator and the critic networks do not play a zero-sum game against each other; instead, they do so against a fixed kernel. Thus GRAM networks are not only stable to train like MMD-nets but they also match and beat the generative quality of adversarially trained generative networks.
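The fixed-kernel adversary that MMD-nets use, and that GRAM (abstract above) builds on in ratio form, is the maximum mean discrepancy. A minimal NumPy sketch of the biased squared-MMD estimate with an RBF kernel; the bandwidth `sigma` and toy data are illustrative assumptions, and GRAM itself matches density ratios rather than this plain MMD.

```python
import numpy as np

def mmd2_rbf(x, y, sigma=1.0):
    """Biased estimate of squared maximum mean discrepancy between
    samples x and y under an RBF kernel:
    MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(0)
same = mmd2_rbf(rng.normal(size=(100, 2)), rng.normal(size=(100, 2)))
shifted = mmd2_rbf(rng.normal(size=(100, 2)),
                   rng.normal(3.0, 1.0, size=(100, 2)))
# samples from the same distribution give a much smaller MMD
# than samples from a shifted distribution
```

Because the kernel is fixed, no saddlepoint game is needed to compute this discrepancy, which is the stability property the abstract highlights.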
\ No newline at end of file diff --git a/data/2020/iclr/Geometric Insights into the Convergence of Nonlinear TD Learning b/data/2020/iclr/Geometric Insights into the Convergence of Nonlinear TD Learning new file mode 100644 index 0000000000..5f520acfa5 --- /dev/null +++ b/data/2020/iclr/Geometric Insights into the Convergence of Nonlinear TD Learning @@ -0,0 +1 @@ +While there are convergence guarantees for temporal difference (TD) learning when using linear function approximators, the situation for nonlinear models is far less understood, and divergent examples are known. Here we take a first step towards extending theoretical convergence guarantees to TD learning with nonlinear function approximation. More precisely, we consider the expected learning dynamics of the TD(0) algorithm for value estimation. As the step-size converges to zero, these dynamics are defined by a nonlinear ODE which depends on the geometry of the space of function approximators, the structure of the underlying Markov chain, and their interaction. We find a set of function approximators that includes ReLU networks and has geometry amenable to TD learning regardless of environment, so that the solution performs about as well as linear TD in the worst case. Then, we show how environments that are more reversible induce dynamics that are better for TD learning and prove global convergence to the true value function for well-conditioned function approximators. Finally, we generalize a divergent counterexample to a family of divergent problems to demonstrate how the interaction between approximator and environment can go wrong and to motivate the assumptions needed to prove convergence. 
\ No newline at end of file diff --git a/data/2020/iclr/Global Relational Models of Source Code b/data/2020/iclr/Global Relational Models of Source Code new file mode 100644 index 0000000000..2c01dd4d76 --- /dev/null +++ b/data/2020/iclr/Global Relational Models of Source Code @@ -0,0 +1 @@ +Models of code can learn distributed representations of a program's syntax and semantics to predict many non-trivial properties of a program. Recent state-of-the-art models leverage highly structured representations of programs, such as trees, graphs and paths therein (e.g. data-flow relations), which are precise and abundantly available for code. This provides a strong inductive bias towards semantically meaningful relations, yielding more generalizable representations than classical sequence-based models. Unfortunately, these models primarily rely on graph-based message passing to represent relations in code, which makes them de facto local due to the high cost of message-passing steps, quite in contrast to modern, global sequence-based models, such as the Transformer. In this work, we bridge this divide between global and structured models by introducing two new hybrid model families that are both global and incorporate structural bias: Graph Sandwiches, which wrap traditional (gated) graph message-passing layers in sequential message-passing layers; and Graph Relational Embedding Attention Transformers (GREAT for short), which bias traditional Transformers with relational information from graph edge types. By studying a popular, non-trivial program repair task, variable-misuse identification, we explore the relative merits of traditional and hybrid model families for code representation. Starting with a graph-based model that already improves upon the prior state-of-the-art for this task by 20%, we show that our proposed hybrid models improve an additional 10-15%, while training both faster and using fewer parameters. 
\ No newline at end of file diff --git a/data/2020/iclr/Graph inference learning for semi-supervised classification b/data/2020/iclr/Graph inference learning for semi-supervised classification new file mode 100644 index 0000000000..7197604584 --- /dev/null +++ b/data/2020/iclr/Graph inference learning for semi-supervised classification @@ -0,0 +1 @@ +In this work, we address the semi-supervised classification of graph data, where the categories of those unlabeled nodes are inferred from labeled nodes as well as graph structures. Recent works often solve this problem with the advanced graph convolution in a conventional supervised manner, but the performance could be heavily affected when labeled data is scarce. Here we propose a Graph Inference Learning (GIL) framework to boost the performance of node classification by learning the inference of node labels on graph topology. To bridge the connection of two nodes, we formally define a structure relation by encapsulating node attributes, between-node paths and local topological structures together, which can make inference conveniently deduced from one node to another node. For learning the inference process, we further introduce meta-optimization on structure relations from training nodes to validation nodes, such that the learnt graph inference capability can be better self-adapted into test nodes. Comprehensive evaluations on four benchmark datasets (including Cora, Citeseer, Pubmed and NELL) demonstrate the superiority of our GIL when compared with other state-of-the-art methods in the semi-supervised node classification task. 
\ No newline at end of file diff --git a/data/2020/iclr/Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation b/data/2020/iclr/Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation new file mode 100644 index 0000000000..d53106e592 --- /dev/null +++ b/data/2020/iclr/Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation @@ -0,0 +1 @@ +Video prediction models combined with planning algorithms have shown promise in enabling robots to learn to perform many vision-based tasks through only self-supervision, reaching novel goals in cluttered scenes with unseen objects. However, due to the compounding uncertainty in long horizon video prediction and poor scalability of sampling-based planning optimizers, one significant limitation of these approaches is their limited ability to plan over long horizons to reach distant goals. To that end, we propose a framework for subgoal generation and planning, hierarchical visual foresight (HVF), which generates subgoal images conditioned on a goal image, and uses them for planning. The subgoal images are directly optimized to decompose the task into easy-to-plan segments, and as a result, we observe that the method naturally identifies semantically meaningful states as subgoals. Across three out of four simulated vision-based manipulation tasks, we find that our method achieves nearly a 200% performance improvement over planning without subgoals and model-free RL approaches. Further, our experiments illustrate that our approach extends to real, cluttered visual scenes.
Project page: this https URL \ No newline at end of file diff --git a/data/2020/iclr/I Am Going MAD: Maximum Discrepancy Competition for Comparing Classifiers Adaptively b/data/2020/iclr/I Am Going MAD: Maximum Discrepancy Competition for Comparing Classifiers Adaptively new file mode 100644 index 0000000000..9c74e09e17 --- /dev/null +++ b/data/2020/iclr/I Am Going MAD: Maximum Discrepancy Competition for Comparing Classifiers Adaptively @@ -0,0 +1 @@ +The learning of hierarchical representations for image classification has experienced an impressive series of successes due in part to the availability of large-scale labeled data for training. On the other hand, the trained classifiers have traditionally been evaluated on small and fixed sets of test images, which are deemed to be extremely sparsely distributed in the space of all natural images. It is thus questionable whether recent performance improvements on the excessively re-used test sets generalize to real-world natural images with much richer content variations. Inspired by efficient stimulus selection for testing perceptual models in psychophysical and physiological studies, we present an alternative framework for comparing image classifiers, which we name the MAximum Discrepancy (MAD) competition. Rather than comparing image classifiers using fixed test images, we adaptively sample a small test set from an arbitrarily large corpus of unlabeled images so as to maximize the discrepancies between the classifiers, measured by the distance over WordNet hierarchy. Human labeling on the resulting model-dependent image sets reveals the relative performance of the competing classifiers, and provides useful insights on potential ways to improve them. We report the MAD competition results of eleven ImageNet classifiers while noting that the framework is readily extensible and cost-effective to add future classifiers into the competition. Codes can be found at this https URL. 
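The adaptive test-set selection at the core of the MAD competition can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the WordNet-hierarchy distance is replaced by a hypothetical 0/1 disagreement proxy, and the classifiers are toy stand-ins.

```python
import numpy as np

def mad_select(pool, clf_a, clf_b, dist, k):
    """Pick the k samples on which two classifiers disagree most.

    dist(y_a, y_b) measures the discrepancy between predicted labels;
    the paper uses a distance over the WordNet hierarchy, but any label
    metric slots in here.
    """
    scores = np.array([dist(clf_a(x), clf_b(x)) for x in pool])
    return np.argsort(scores)[::-1][:k]

# Toy demo: two linear classifiers and a 0/1 disagreement proxy.
rng = np.random.default_rng(0)
pool = rng.normal(size=(100, 8))
clf_a = lambda x: int(x[0] > 0)
clf_b = lambda x: int(x[0] + 0.5 * x[1] > 0)
idx = mad_select(pool, clf_a, clf_b, dist=lambda a, b: float(a != b), k=5)
print(idx.shape)  # (5,)
```

The selected samples are then the ones a human labeler would annotate to rank the competing classifiers.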
\ No newline at end of file diff --git a/data/2020/iclr/Identifying through Flows for Recovering Latent Representations b/data/2020/iclr/Identifying through Flows for Recovering Latent Representations new file mode 100644 index 0000000000..fc10f5ccf8 --- /dev/null +++ b/data/2020/iclr/Identifying through Flows for Recovering Latent Representations @@ -0,0 +1 @@ +Identifiability, or recovery of the true latent representations from which the observed data originates, is de facto a fundamental goal of representation learning. Yet, most deep generative models do not address the question of identifiability, and thus fail to deliver on the promise of the recovery of the true latent sources that generate the observations. Recent work proposed identifiable generative modelling using variational autoencoders (iVAE) with a theory of identifiability. Due to the intractablity of KL divergence between variational approximate posterior and the true posterior, however, iVAE has to maximize the evidence lower bound (ELBO) of the marginal likelihood, leading to suboptimal solutions in both theory and practice. In contrast, we propose an identifiable framework for estimating latent representations using a flow-based model (iFlow). Our approach directly maximizes the marginal likelihood, allowing for theoretical guarantees on identifiability, thereby dispensing with variational approximations. We derive its optimization objective in analytical form, making it possible to train iFlow in an end-to-end manner. Simulations on synthetic data validate the correctness and effectiveness of our proposed method and demonstrate its practical advantages over other existing methods. 
\ No newline at end of file diff --git a/data/2020/iclr/Identity Crisis: Memorization and Generalization Under Extreme Overparameterization b/data/2020/iclr/Identity Crisis: Memorization and Generalization Under Extreme Overparameterization new file mode 100644 index 0000000000..cf8c328fdd --- /dev/null +++ b/data/2020/iclr/Identity Crisis: Memorization and Generalization Under Extreme Overparameterization @@ -0,0 +1 @@ +We study the interplay between memorization and generalization of overparameterized networks in the extreme case of a single training example and an identity-mapping task. We examine fully-connected and convolutional networks (FCN and CNN), both linear and nonlinear, initialized randomly and then trained to minimize the reconstruction error. The trained networks stereotypically take one of two forms: the constant function (memorization) and the identity function (generalization). We formally characterize generalization in single-layer FCNs and CNNs. We show empirically that different architectures exhibit strikingly different inductive biases. For example, CNNs of up to 10 layers are able to generalize from a single example, whereas FCNs cannot learn the identity function reliably from 60k examples. Deeper CNNs often fail, but nonetheless do astonishing work to memorize the training output: because CNN biases are location invariant, the model must progressively grow an output pattern from the image boundaries via the coordination of many layers. Our work helps to quantify and visualize the sensitivity of inductive biases to architectural choices such as depth, kernel width, and number of channels. 
\ No newline at end of file diff --git a/data/2020/iclr/Image-guided Neural Object Rendering b/data/2020/iclr/Image-guided Neural Object Rendering new file mode 100644 index 0000000000..d02e915185 --- /dev/null +++ b/data/2020/iclr/Image-guided Neural Object Rendering @@ -0,0 +1 @@ +We present a novel method for photo-realistic re-rendering of reconstructed objects. The digital reproduction of object appearances is of paramount importance nowadays. Augmented and virtual reality relies on such 3D content. It enables virtual showrooms, virtual tours & sightseeing, the digital inspection of historical artifacts and many other applications. Classical approaches use methods to reconstruct the geometry of an object and textures to capture the appearance properties. Instead, we propose a learned image-guided rendering technique that combines the benefits of image-based rendering and GAN-based image synthesis. A core component of our work is the handling of view-dependent effects. Specifically, we directly train an object-specific deep neural network to synthesize the view-dependent appearance of an object. As input data we are using an RGB video of the object. This video is used to reconstruct a proxy geometry of the object via multi-view stereo. Based on this 3D proxy, the appearance of a captured view can be warped into a new target view. This warping assumes diffuse surfaces, in case of view-dependent effects, such as specular highlights, it leads to artifacts. To this end, we propose EffectsNet, a deep neural network that predicts view-dependent effects. Based on these estimations, we are able to convert observed images to diffuse images. These diffuse images can be projected into other views. In the target view, our pipeline reinserts the new view-dependent effects. To composite multiple reprojected images to a final output, we learn a composition network that outputs photo-realistic results. 
Using this image-guided approach, the network does not have to allocate capacity on ``remembering'' object appearance; instead, it learns how to combine the appearance of captured images. We demonstrate the effectiveness of our approach both qualitatively and quantitatively on synthetic as well as on real data. \ No newline at end of file diff --git a/data/2020/iclr/Imitation Learning via Off-Policy Distribution Matching b/data/2020/iclr/Imitation Learning via Off-Policy Distribution Matching new file mode 100644 index 0000000000..29a1b0932a --- /dev/null +++ b/data/2020/iclr/Imitation Learning via Off-Policy Distribution Matching @@ -0,0 +1 @@ +When performing imitation learning from expert demonstrations, distribution matching is a popular approach, in which one alternates between estimating distribution ratios and then using these ratios as rewards in a standard reinforcement learning (RL) algorithm. Traditionally, estimation of the distribution ratio requires on-policy data, which has caused previous work to either be exorbitantly data-inefficient or alter the original objective in a manner that can drastically change its optimum. In this work, we show how the original distribution ratio estimation objective may be transformed in a principled manner to yield a completely off-policy objective. In addition to the data-efficiency that this provides, we are able to show that this objective also renders the use of a separate RL optimization unnecessary. Rather, an imitation policy may be learned directly from this objective without the use of explicit rewards. We call the resulting algorithm ValueDICE and evaluate it on a suite of popular imitation learning benchmarks, finding that it can achieve state-of-the-art sample efficiency and performance. 
\ No newline at end of file diff --git a/data/2020/iclr/Implicit Bias of Gradient Descent based Adversarial Training on Separable Data b/data/2020/iclr/Implicit Bias of Gradient Descent based Adversarial Training on Separable Data new file mode 100644 index 0000000000..eef3755eb8 --- /dev/null +++ b/data/2020/iclr/Implicit Bias of Gradient Descent based Adversarial Training on Separable Data @@ -0,0 +1 @@ +Adversarial training is a principled approach for training robust neural networks. Despite tremendous successes in practice, its theoretical properties still remain largely unexplored. In this paper, we provide new theoretical insights into gradient descent based adversarial training by studying its computational properties, specifically its implicit bias. We take the binary classification task on linearly separable data as an illustrative example, where the loss asymptotically attains its infimum as the parameter diverges to infinity along certain directions. Specifically, we show that for any fixed iteration $T$, when the adversarial perturbation during training has proper bounded L2 norm, the classifier learned by gradient descent based adversarial training converges in direction to the maximum L2 norm margin classifier at the rate of $O(1/\sqrt{T})$, significantly faster than the rate $O(1/\log T)$ of training with clean data. In addition, when the adversarial perturbation during training has bounded Lq norm, the resulting classifier converges in direction to a maximum mixed-norm margin classifier, which has a natural interpretation of robustness, as being the maximum L2 norm margin classifier under worst-case bounded Lq norm perturbation to the data. Our findings provide theoretical backing for adversarial training, showing that it indeed promotes robustness against adversarial perturbation. 
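The clean-data baseline the paper compares against, gradient descent on logistic loss converging in direction to the max-L2-margin classifier, can be observed numerically. A toy sketch; the dataset and step size are illustrative choices, not the paper's:

```python
import numpy as np

# Separable toy data whose max-L2-margin direction is (1, 1)/sqrt(2),
# with support vectors (1, 1) and (-1, -1).
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
for _ in range(5000):                    # plain GD on the logistic loss
    margins = y * (X @ w)
    grad = -((y / (1.0 + np.exp(margins))) @ X) / len(y)
    w -= 1.0 * grad

direction = w / np.linalg.norm(w)
print(np.round(direction, 3))            # approaches [0.707 0.707]
```

The norm of w keeps growing while the direction settles on the max-margin one, the slow O(1/log T) convergence that adversarial training accelerates to O(1/sqrt(T)).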
\ No newline at end of file diff --git a/data/2020/iclr/Improved Sample Complexities for Deep Neural Networks and Robust Classification via an All-Layer Margin b/data/2020/iclr/Improved Sample Complexities for Deep Neural Networks and Robust Classification via an All-Layer Margin new file mode 100644 index 0000000000..bffc81572a --- /dev/null +++ b/data/2020/iclr/Improved Sample Complexities for Deep Neural Networks and Robust Classification via an All-Layer Margin @@ -0,0 +1 @@ +For linear classifiers, the relationship between (normalized) output margin and generalization is captured in a clear and simple bound -- a large output margin implies good generalization. Unfortunately, for deep models, this relationship is less clear: existing analyses of the output margin give complicated bounds which sometimes depend exponentially on depth. In this work, we propose to instead analyze a new notion of margin, which we call the "all-layer margin." Our analysis reveals that the all-layer margin has a clear and direct relationship with generalization for deep models. This enables the following concrete applications of the all-layer margin: 1) by analyzing the all-layer margin, we obtain tighter generalization bounds for neural nets which depend on Jacobian and hidden layer norms and remove the exponential dependency on depth 2) our neural net results easily translate to the adversarially robust setting, giving the first direct analysis of robust test error for deep networks, and 3) we present a theoretically inspired training algorithm for increasing the all-layer margin and demonstrate that our algorithm improves test performance over strong baselines in practice. 
\ No newline at end of file diff --git a/data/2020/iclr/Improving Adversarial Robustness Requires Revisiting Misclassified Examples b/data/2020/iclr/Improving Adversarial Robustness Requires Revisiting Misclassified Examples new file mode 100644 index 0000000000..3ad7c4abf3 --- /dev/null +++ b/data/2020/iclr/Improving Adversarial Robustness Requires Revisiting Misclassified Examples @@ -0,0 +1 @@ +Deep neural networks (DNNs) are vulnerable to adversarial examples crafted by imperceptible perturbations. A range of defense techniques have been proposed to improve DNN robustness to adversarial examples, among which adversarial training has been demonstrated to be the most effective. Adversarial training is often formulated as a min-max optimization problem, with the inner maximization for generating adversarial examples. However, there exists a simple, yet easily overlooked fact that adversarial examples are only defined on correctly classified (natural) examples, but inevitably, some (natural) examples will be misclassified during training. In this paper, we investigate the distinctive influence of misclassified and correctly classified examples on the final robustness of adversarial training. Specifically, we find that misclassified examples indeed have a significant impact on the final robustness. More surprisingly, we find that different maximization techniques on misclassified examples may have a negligible influence on the final robustness, while different minimization techniques are crucial. Motivated by the above discovery, we propose a new defense algorithm called {\em Misclassification Aware adveRsarial Training} (MART), which explicitly differentiates the misclassified and correctly classified examples during the training. We also propose a semi-supervised extension of MART, which can leverage the unlabeled data to further improve the robustness. 
Experimental results show that MART and its variant could significantly improve the state-of-the-art adversarial robustness. \ No newline at end of file diff --git a/data/2020/iclr/In Search for a SAT-friendly Binarized Neural Network Architecture b/data/2020/iclr/In Search for a SAT-friendly Binarized Neural Network Architecture new file mode 100644 index 0000000000..84511c8cdf --- /dev/null +++ b/data/2020/iclr/In Search for a SAT-friendly Binarized Neural Network Architecture @@ -0,0 +1 @@ +Analyzing the behavior of neural networks is one of the most pressing challenges in deep learning. Binarized Neural Networks are an important class of networks that allow equivalent representation in Boolean logic and can be analyzed formally with logic-based reasoning tools like SAT solvers. Such tools can be used to answer existential and probabilistic queries about the network, perform explanation generation, etc. However, the main bottleneck for all methods is their ability to reason about large BNNs efficiently. In this work, we analyze architectural design choices of BNNs and discuss how they affect the performance of logic-based reasoners. We propose changes to the BNN architecture and the training procedure to get a simpler network for SAT solvers without sacrificing accuracy on the primary task. Our experimental results demonstrate that our approach scales to larger deep neural networks compared to existing work for existential and probabilistic queries, leading to significant speed ups on all tested datasets. 
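The property that makes BNNs amenable to SAT-based reasoning is that a binarized neuron is an exact Boolean threshold function. A brute-force sketch (the weights and bias here are arbitrary illustrative values):

```python
import itertools

# A binarized neuron sign(w . x + b) over +/-1 activations is a Boolean
# threshold function, so its full truth table -- and hence an exact
# propositional encoding for a SAT solver -- always exists.
w, b = [1, -1, 1], 0

def neuron(x):
    """Binarized activation: +1 if the weighted sum clears the threshold."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

truth_table = {x: neuron(x) for x in itertools.product([-1, 1], repeat=len(w))}
print(truth_table[(1, -1, 1)], truth_table[(-1, 1, -1)])  # 1 -1
```

A real encoding compiles the threshold into clauses (e.g. via sequential counters) rather than enumerating the table, which is exactly where the paper's architectural choices determine how hard the solver's job becomes.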
\ No newline at end of file diff --git a/data/2020/iclr/Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models b/data/2020/iclr/Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models new file mode 100644 index 0000000000..9cfab8f46a --- /dev/null +++ b/data/2020/iclr/Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models @@ -0,0 +1 @@ +Likelihood-based generative models are a promising resource to detect out-of-distribution (OOD) inputs which could compromise the robustness or reliability of a machine learning system. However, likelihoods derived from such models have been shown to be problematic for detecting certain types of inputs that significantly differ from training data. In this paper, we posit that this problem is due to the excessive influence that input complexity has on generative models' likelihoods. We report a set of experiments supporting this hypothesis, and use an estimate of input complexity to derive an efficient and parameter-free OOD score, which can be seen as a likelihood-ratio, akin to Bayesian model comparison. We find such a score to perform comparably to, or even better than, existing OOD detection approaches under a wide range of data sets, models, model sizes, and complexity estimates. \ No newline at end of file diff --git a/data/2020/iclr/Interpretable Complex-Valued Neural Networks for Privacy Protection b/data/2020/iclr/Interpretable Complex-Valued Neural Networks for Privacy Protection new file mode 100644 index 0000000000..08065f31ac --- /dev/null +++ b/data/2020/iclr/Interpretable Complex-Valued Neural Networks for Privacy Protection @@ -0,0 +1 @@ +Previous studies have found that an adversary can often infer unintended input information from intermediate-layer features. We study the possibility of preventing such adversarial inference, yet without too much accuracy degradation. 
We propose a generic method to revise the neural network to boost the challenge of inferring input attributes from features, while maintaining highly accurate outputs. In particular, the method transforms real-valued features into complex-valued ones, in which the input is hidden in a randomized phase of the transformed features. The knowledge of the phase acts like a key, with which any party can easily recover the output from the processing result, but without which the party can neither recover the output nor distinguish the original input. Preliminary experiments on various datasets and network structures have shown that our method significantly diminishes the adversary's ability in inferring about the input while largely preserves the resulting accuracy. \ No newline at end of file diff --git a/data/2020/iclr/Intrinsic Motivation for Encouraging Synergistic Behavior b/data/2020/iclr/Intrinsic Motivation for Encouraging Synergistic Behavior new file mode 100644 index 0000000000..c8872d0a1c --- /dev/null +++ b/data/2020/iclr/Intrinsic Motivation for Encouraging Synergistic Behavior @@ -0,0 +1 @@ +We study the role of intrinsic motivation as an exploration bias for reinforcement learning in sparse-reward synergistic tasks, which are tasks where multiple agents must work together to achieve a goal they could not individually. Our key idea is that a good guiding principle for intrinsic motivation in synergistic tasks is to take actions which affect the world in ways that would not be achieved if the agents were acting on their own. Thus, we propose to incentivize agents to take (joint) actions whose effects cannot be predicted via a composition of the predicted effect for each individual agent. We study two instantiations of this idea, one based on the true states encountered, and another based on a dynamics model trained concurrently with the policy. 
While the former is simpler, the latter has the benefit of being analytically differentiable with respect to the action taken. We validate our approach in robotic bimanual manipulation tasks with sparse rewards; we find that our approach yields more efficient learning than both 1) training with only the sparse reward and 2) using the typical surprise-based formulation of intrinsic motivation, which does not bias toward synergistic behavior. Videos are available on the project webpage: https://sites.google.com/view/iclr2020-synergistic. \ No newline at end of file diff --git a/data/2020/iclr/Knowledge Consistency between Neural Networks and Beyond b/data/2020/iclr/Knowledge Consistency between Neural Networks and Beyond new file mode 100644 index 0000000000..92589caa69 --- /dev/null +++ b/data/2020/iclr/Knowledge Consistency between Neural Networks and Beyond @@ -0,0 +1 @@ +This paper aims to analyze knowledge consistency between pre-trained deep neural networks. We propose a generic definition for knowledge consistency between neural networks at different fuzziness levels. A task-agnostic method is designed to disentangle feature components, which represent the consistent knowledge, from raw intermediate-layer features of each neural network. As a generic tool, our method can be broadly used for different applications. In preliminary experiments, we have used knowledge consistency as a tool to diagnose knowledge representations of neural networks. Knowledge consistency provides new insights to explain the success of existing deep-learning techniques, such as knowledge distillation and network compression. More crucially, knowledge consistency can also be used to refine pre-trained networks and boost performance. 
\ No newline at end of file diff --git a/data/2020/iclr/LAMOL: LAnguage MOdeling for Lifelong Language Learning b/data/2020/iclr/LAMOL: LAnguage MOdeling for Lifelong Language Learning new file mode 100644 index 0000000000..69b29587e9 --- /dev/null +++ b/data/2020/iclr/LAMOL: LAnguage MOdeling for Lifelong Language Learning @@ -0,0 +1 @@ +Most research on lifelong learning applies to images or games, but not language. We present LAMOL, a simple yet effective method for lifelong language learning (LLL) based on language modeling. LAMOL replays pseudo-samples of previous tasks while requiring no extra memory or model capacity. Specifically, LAMOL is a language model that simultaneously learns to solve the tasks and generate training samples. When the model is trained for a new task, it generates pseudo-samples of previous tasks for training alongside data for the new task. The results show that LAMOL prevents catastrophic forgetting without any sign of intransigence and can perform five very different language tasks sequentially with only one model. Overall, LAMOL outperforms previous methods by a considerable margin and is only 2-3% worse than multitasking, which is usually considered the LLL upper bound. The source code is available at this https URL. \ No newline at end of file diff --git a/data/2020/iclr/Language GANs Falling Short b/data/2020/iclr/Language GANs Falling Short new file mode 100644 index 0000000000..ca2f936236 --- /dev/null +++ b/data/2020/iclr/Language GANs Falling Short @@ -0,0 +1 @@ +Generating high-quality text with sufficient diversity is essential for a wide range of Natural Language Generation (NLG) tasks. 
Maximum-Likelihood (MLE) models trained with teacher forcing have consistently been reported as weak baselines, where poor performance is attributed to exposure bias (Bengio et al., 2015; Ranzato et al., 2015); at inference time, the model is fed its own prediction instead of a ground-truth token, which can lead to accumulating errors and poor samples. This line of reasoning has led to an outbreak of adversarial based approaches for NLG, on the account that GANs do not suffer from exposure bias. In this work, we make several surprising observations which contradict common beliefs. First, we revisit the canonical evaluation framework for NLG, and point out fundamental flaws with quality-only evaluation: we show that one can outperform such metrics using a simple, well-known temperature parameter to artificially reduce the entropy of the model's conditional distributions. Second, we leverage the control over the quality / diversity trade-off given by this parameter to evaluate models over the whole quality-diversity spectrum and find MLE models constantly outperform the proposed GAN variants over the whole quality-diversity space. Our results have several implications: 1) The impact of exposure bias on sample quality is less severe than previously thought, 2) temperature tuning provides a better quality / diversity trade-off than adversarial training while being easier to train, easier to cross-validate, and less computationally expensive. 
Code to reproduce the experiments is available at github.com/pclucas14/GansFallingShort \ No newline at end of file diff --git a/data/2020/iclr/Large Batch Optimization for Deep Learning: Training BERT in 76 minutes b/data/2020/iclr/Large Batch Optimization for Deep Learning: Training BERT in 76 minutes new file mode 100644 index 0000000000..48120d86d2 --- /dev/null +++ b/data/2020/iclr/Large Batch Optimization for Deep Learning: Training BERT in 76 minutes @@ -0,0 +1 @@ +Training large deep neural networks on massive datasets is computationally very challenging. There has been a recent surge of interest in using large batch stochastic optimization methods to tackle this issue. The most prominent algorithm in this line of research is LARS, which by employing layerwise adaptive learning rates trains ResNet on ImageNet in a few minutes. However, LARS performs poorly for attention models like BERT, indicating that its performance gains are not consistent across tasks. In this paper, we first study a principled layerwise adaptation strategy to accelerate training of deep neural networks using large mini-batches. Using this strategy, we develop a new layerwise adaptive large batch optimization technique called LAMB; we then provide convergence analysis of LAMB as well as LARS, showing convergence to a stationary point in general nonconvex settings. Our empirical results demonstrate the superior performance of LAMB across various tasks such as BERT and ResNet-50 training with very little hyperparameter tuning. In particular, for BERT training, our optimizer enables use of very large batch sizes of 32868 without any degradation of performance. By increasing the batch size to the memory limit of a TPUv3 Pod, BERT training time can be reduced from 3 days to just 76 minutes (Table 1). 
The LAMB implementation is available at this https URL \ No newline at end of file diff --git a/data/2020/iclr/Lazy-CFR: fast and near-optimal regret minimization for extensive games with imperfect information b/data/2020/iclr/Lazy-CFR: fast and near-optimal regret minimization for extensive games with imperfect information new file mode 100644 index 0000000000..3d8acfe902 --- /dev/null +++ b/data/2020/iclr/Lazy-CFR: fast and near-optimal regret minimization for extensive games with imperfect information @@ -0,0 +1 @@ +Counterfactual regret minimization (CFR) is the most popular algorithm on solving two-player zero-sum extensive games with imperfect information and achieves state-of-the-art performance in practice. However, the performance of CFR is not fully understood, since empirical results on the regret are much better than the upper bound proved in \cite{zinkevich2008regret}. Another issue is that CFR has to traverse the whole game tree in each round, which is time-consuming in large scale games. In this paper, we present a novel technique, lazy update, which can avoid traversing the whole game tree in CFR, as well as a novel analysis on the regret of CFR with lazy update. Our analysis can also be applied to the vanilla CFR, resulting in a much tighter regret bound than that in \cite{zinkevich2008regret}. Inspired by lazy update, we further present a novel CFR variant, named Lazy-CFR. Compared to traversing $O(|\mathcal{I}|)$ information sets in vanilla CFR, Lazy-CFR needs only to traverse $O(\sqrt{|\mathcal{I}|})$ information sets per round while keeping the regret bound almost the same, where $\mathcal{I}$ is the class of all information sets. As a result, Lazy-CFR shows better convergence result compared with vanilla CFR. Experimental results consistently show that Lazy-CFR outperforms the vanilla CFR significantly. 
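At each information set, CFR and variants like Lazy-CFR share the same local update, regret matching: play each action with probability proportional to its positive cumulative counterfactual regret. A minimal sketch of that update:

```python
import numpy as np

def regret_matching(cum_regret):
    """Current strategy from a vector of cumulative counterfactual regrets."""
    pos = np.maximum(cum_regret, 0.0)
    total = pos.sum()
    if total == 0.0:                      # no positive regret: play uniformly
        return np.full(len(cum_regret), 1.0 / len(cum_regret))
    return pos / total

print(regret_matching(np.array([3.0, -1.0, 1.0])))  # [0.75 0.   0.25]
```

Lazy-CFR's saving is orthogonal to this update: it postpones applying it, touching only $O(\sqrt{|\mathcal{I}|})$ information sets per round instead of all $O(|\mathcal{I}|)$.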
\ No newline at end of file diff --git a/data/2020/iclr/Learned Step Size quantization b/data/2020/iclr/Learned Step Size quantization new file mode 100644 index 0000000000..7f358d6c5d --- /dev/null +++ b/data/2020/iclr/Learned Step Size quantization @@ -0,0 +1 @@ +Deep networks run with low precision operations at inference time offer power and space advantages over high precision alternatives, but need to overcome the challenge of maintaining high accuracy as precision decreases. Here, we present a method for training such networks, Learned Step Size Quantization, that achieves the highest accuracy to date on the ImageNet dataset when using models, from a variety of architectures, with weights and activations quantized to 2-, 3- or 4-bits of precision, and that can train 3-bit models that reach full precision baseline accuracy. Our approach builds upon existing methods for learning weights in quantized networks by improving how the quantizer itself is configured. Specifically, we introduce a novel means to estimate and scale the task loss gradient at each weight and activation layer's quantizer step size, such that it can be learned in conjunction with other network parameters. This approach works using different levels of precision as needed for a given system and requires only a simple modification of existing training code. \ No newline at end of file diff --git a/data/2020/iclr/Learning Disentangled Representations for CounterFactual Regression b/data/2020/iclr/Learning Disentangled Representations for CounterFactual Regression new file mode 100644 index 0000000000..aa65043ad7 --- /dev/null +++ b/data/2020/iclr/Learning Disentangled Representations for CounterFactual Regression @@ -0,0 +1 @@ +We consider the challenge of estimating treatment effects from observational data; and point out that, in general, only some factors based on the observed covariates X contribute to selection of the treatment T, and only some to determining the outcomes Y. 
We model this by considering three underlying sources of {X, T, Y} and show that explicitly modeling these sources offers great insight to guide designing models that better handle selection bias. This paper is an attempt to conceptualize this line of thought and provide a path to explore it further. In this work, we propose an algorithm to (1) identify disentangled representations of the above-mentioned underlying factors from any given observational dataset D and (2) leverage this knowledge to reduce, as well as account for, the negative impact of selection bias on estimating the treatment effects from D. Our empirical results show that the proposed method (i) achieves state-of-the-art performance in both individual and population based evaluation measures and (ii) is highly robust under various data generating scenarios. \ No newline at end of file diff --git a/data/2020/iclr/Learning Efficient Parameter Server Synchronization Policies for Distributed SGD b/data/2020/iclr/Learning Efficient Parameter Server Synchronization Policies for Distributed SGD new file mode 100644 index 0000000000..dd690814c9 --- /dev/null +++ b/data/2020/iclr/Learning Efficient Parameter Server Synchronization Policies for Distributed SGD @@ -0,0 +1 @@ +We apply a reinforcement learning (RL) based approach to learning optimal synchronization policies used for Parameter Server-based distributed training of machine learning models with Stochastic Gradient Descent (SGD). Utilizing a formal synchronization policy description in the PS-setting, we are able to derive a suitable and compact description of states and actions, allowing us to efficiently use the standard off-the-shelf deep Q-learning algorithm. 
As a result, we are able to learn synchronization policies which generalize to different cluster environments, different training datasets and small model variations and (most importantly) lead to considerable decreases in training time when compared to standard policies such as bulk synchronous parallel (BSP), asynchronous parallel (ASP), or stale synchronous parallel (SSP). To support our claims, we present extensive numerical results obtained from experiments performed in simulated cluster environments. In our experiments, training time is reduced by 44% on average and learned policies generalize to multiple unseen circumstances. \ No newline at end of file diff --git a/data/2020/iclr/Learning Execution through Neural Code fusion b/data/2020/iclr/Learning Execution through Neural Code fusion new file mode 100644 index 0000000000..3fca47656e --- /dev/null +++ b/data/2020/iclr/Learning Execution through Neural Code fusion @@ -0,0 +1 @@ +As the performance of computer systems stagnates due to the end of Moore's Law, there is a need for new models that can understand and optimize the execution of general purpose code. While there is a growing body of work on using Graph Neural Networks (GNNs) to learn representations of source code, these representations do not understand how code dynamically executes. In this work, we propose a new approach to use GNNs to learn fused representations of general source code and its execution. Our approach defines a multi-task GNN over low-level representations of source code and program state (i.e., assembly code and dynamic memory states), converting complex source code constructs and complex data structures into a simpler, more uniform format. We show that this leads to improved performance over similar methods that do not use execution and it opens the door to applying GNN models to new tasks that would not be feasible from static code alone. 
As an illustration of this, we apply the new model to challenging dynamic tasks (branch prediction and prefetching) from the SPEC CPU benchmark suite, outperforming the state-of-the-art by 26% and 45% respectively. Moreover, we use the learned fused graph embeddings to demonstrate transfer learning with high performance on an indirectly related task (algorithm classification). \ No newline at end of file diff --git a/data/2020/iclr/Learning Expensive Coordination: An Event-Based Deep RL Approach b/data/2020/iclr/Learning Expensive Coordination: An Event-Based Deep RL Approach new file mode 100644 index 0000000000..89b109b964 --- /dev/null +++ b/data/2020/iclr/Learning Expensive Coordination: An Event-Based Deep RL Approach @@ -0,0 +1 @@ +Existing works in deep Multi-Agent Reinforcement Learning (MARL) mainly focus on coordinating cooperative agents to complete certain tasks jointly. However, in many cases of the real world, agents are self-interested such as employees in a company and clubs in a league. Therefore, the leader, i.e., the manager of the company or the league, needs to provide bonuses to followers for efficient coordination, which we call expensive coordination. The main difficulties of expensive coordination are that i) the leader has to consider the long-term effect and predict the followers' behaviors when assigning bonuses and ii) the complex interactions between followers make the training process hard to converge, especially when the leader's policy changes with time. In this work, we address this problem through an event-based deep RL approach. Our main contributions are threefold. (1) We model the leader's decision-making process as a semi-Markov Decision Process and propose a novel multi-agent event-based policy gradient to learn the leader's long-term policy. 
(2) We exploit the leader-follower consistency scheme to design a follower-aware module and a follower-specific attention module to predict the followers' behaviors and respond accurately to them. (3) We propose an action abstraction-based policy gradient algorithm to reduce the followers' decision space and thus accelerate the training process of followers. Experiments in resource collections, navigation, and the predator-prey game reveal that our approach outperforms the state-of-the-art methods dramatically. \ No newline at end of file diff --git a/data/2020/iclr/Learning Heuristics for Quantified Boolean Formulas through Reinforcement Learning b/data/2020/iclr/Learning Heuristics for Quantified Boolean Formulas through Reinforcement Learning new file mode 100644 index 0000000000..22a681cabd --- /dev/null +++ b/data/2020/iclr/Learning Heuristics for Quantified Boolean Formulas through Reinforcement Learning @@ -0,0 +1 @@ +We demonstrate how to learn efficient heuristics for automated reasoning algorithms for quantified Boolean formulas through deep reinforcement learning. We focus on a backtracking search algorithm, which can already solve formulas of impressive size - up to hundreds of thousands of variables. The main challenge is to find a representation of these formulas that lends itself to making predictions in a scalable way. For a family of challenging problems, we learned a heuristic that solves significantly more formulas compared to the existing handwritten heuristics. 
\ No newline at end of file diff --git a/data/2020/iclr/Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling b/data/2020/iclr/Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling new file mode 100644 index 0000000000..f432012659 --- /dev/null +++ b/data/2020/iclr/Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling @@ -0,0 +1 @@ +Imitation learning, followed by reinforcement learning algorithms, is a promising paradigm to solve complex control tasks sample-efficiently. However, learning from demonstrations often suffers from the covariate shift problem, which results in cascading errors of the learned policy. We introduce a notion of conservatively-extrapolated value functions, which provably lead to policies with self-correction. We design an algorithm, Value Iteration with Negative Sampling (VINS), that practically learns such value functions with conservative extrapolation. We show that VINS can correct mistakes of the behavioral cloning policy on simulated robotics benchmark tasks. We also propose using VINS to initialize a reinforcement learning algorithm, which is shown to significantly outperform prior work in sample efficiency. \ No newline at end of file diff --git a/data/2020/iclr/Learning Space Partitions for Nearest Neighbor Search b/data/2020/iclr/Learning Space Partitions for Nearest Neighbor Search new file mode 100644 index 0000000000..d0d24d5983 --- /dev/null +++ b/data/2020/iclr/Learning Space Partitions for Nearest Neighbor Search @@ -0,0 +1 @@ +Space partitions of $\mathbb{R}^d$ underlie a vast and important class of fast nearest neighbor search (NNS) algorithms. Inspired by recent theoretical work on NNS for general metric spaces (Andoni et al. 
2018b,c), we develop a new framework for building space partitions, reducing the problem to balanced graph partitioning followed by supervised classification. We instantiate this general approach with the KaHIP graph partitioner (Sanders and Schulz 2013) and neural networks, respectively, to obtain a new partitioning procedure called Neural Locality-Sensitive Hashing (Neural LSH). On several standard benchmarks for NNS (Aumuller et al. 2017), our experiments show that the partitions obtained by Neural LSH consistently outperform partitions found by quantization-based and tree-based methods as well as classic, data-oblivious LSH. \ No newline at end of file diff --git a/data/2020/iclr/Learning deep graph matching with channel-independent embedding and Hungarian attention b/data/2020/iclr/Learning deep graph matching with channel-independent embedding and Hungarian attention new file mode 100644 index 0000000000..88d5b2299f --- /dev/null +++ b/data/2020/iclr/Learning deep graph matching with channel-independent embedding and Hungarian attention @@ -0,0 +1 @@ +Graph matching aims to establish node-wise correspondence between two graphs, which is a classic combinatorial problem and in general NP-complete. Only very recently have deep graph matching methods started to resort to deep networks to achieve unprecedented matching accuracy. Along this direction, this paper makes two complementary contributions which can also be reused as plugins in existing works: i) a novel node and edge embedding strategy which emulates the multi-head strategy in attention models and allows the information in each channel to be merged independently. In contrast, only node embedding is accounted for in previous works; ii) a general masking mechanism over the loss function is devised to improve the smoothness of objective learning for graph matching. 
Using the Hungarian algorithm, it dynamically constructs a structured and sparsely connected layer, taking into account the most contributing matching pairs as hard attention. Our approach performs competitively, and can also improve state-of-the-art methods as a plugin, in terms of matching accuracy on three public benchmarks. \ No newline at end of file diff --git a/data/2020/iclr/Learning the Arrow of Time for Problems in Reinforcement Learning b/data/2020/iclr/Learning the Arrow of Time for Problems in Reinforcement Learning new file mode 100644 index 0000000000..04531f7be3 --- /dev/null +++ b/data/2020/iclr/Learning the Arrow of Time for Problems in Reinforcement Learning @@ -0,0 +1 @@ +We humans have an innate understanding of the asymmetric progression of time, which we use to efficiently and safely perceive and manipulate our environment. Drawing inspiration from that, we approach the problem of learning an arrow of time in a Markov (Decision) Process. We illustrate how a learned arrow of time can capture salient information about the environment, which in turn can be used to measure reachability, detect side-effects, and obtain an intrinsic reward signal. Finally, we propose a simple yet effective algorithm to parameterize the problem at hand and learn an arrow of time with a function approximator (here, a deep neural network). Our empirical results span a selection of discrete and continuous environments, and demonstrate for a class of stochastic processes that the learned arrow of time agrees reasonably well with a well-known notion of an arrow of time due to Jordan, Kinderlehrer and Otto (1998). 
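The learned arrow of time described above can be illustrated with a small toy (a hedged, tabular sketch of my own, not the authors' implementation; the 5-state chain, the right-drifting random walk, the learning rate, and the decay term are all illustrative assumptions): a scalar h over states is raised at successor states and lowered at predecessor states, so h tends to grow along the direction of time.

```python
# Hedged tabular sketch: learn a scalar "arrow of time" h(s) that tends to
# increase along trajectories of a toy Markov process (all settings illustrative).
import random

random.seed(0)
n_states = 5
h = [0.0] * n_states  # candidate arrow-of-time function over states

def step(s):
    # biased random walk: drifts toward higher-numbered states
    return min(n_states - 1, s + 1) if random.random() < 0.8 else max(0, s - 1)

lr = 0.05
for _ in range(2000):
    s = random.randrange(n_states)
    s_next = step(s)
    # raise h at the successor and lower it at the source; the small decay
    # terms keep the values bounded
    h[s_next] += lr * (1 - 0.01 * h[s_next])
    h[s] -= lr * (1 + 0.01 * h[s])

print(h)  # the drift pushes h[0] well below h[4]
```

Under the drift, mass leaves state 0 and accumulates at state 4, so h separates the two ends of the chain, which is the sense in which h captures the asymmetric progression of time.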
\ No newline at end of file diff --git a/data/2020/iclr/Learning to Learn by Zeroth-Order Oracle b/data/2020/iclr/Learning to Learn by Zeroth-Order Oracle new file mode 100644 index 0000000000..0194144b9a --- /dev/null +++ b/data/2020/iclr/Learning to Learn by Zeroth-Order Oracle @@ -0,0 +1 @@ +In the learning to learn (L2L) framework, we cast the design of optimization algorithms as a machine learning problem and use deep neural networks to learn the update rules. In this paper, we extend the L2L framework to the zeroth-order (ZO) optimization setting, where no explicit gradient information is available. Our learned optimizer, modeled as a recurrent neural network (RNN), first approximates the gradient with a ZO gradient estimator and then produces a parameter update utilizing the knowledge of previous iterations. To reduce the high-variance effect of the ZO gradient estimator, we further introduce another RNN to learn the Gaussian sampling rule and dynamically guide the query direction sampling. Our learned optimizer outperforms hand-designed algorithms in terms of convergence rate and final solution on both synthetic and practical ZO optimization tasks (in particular, the black-box adversarial attack task, which is one of the most widely used tasks of ZO optimization). We finally conduct extensive analytical experiments to demonstrate the effectiveness of our proposed optimizer. \ No newline at end of file diff --git a/data/2020/iclr/Learning to Link b/data/2020/iclr/Learning to Link new file mode 100644 index 0000000000..ae4e3b8319 --- /dev/null +++ b/data/2020/iclr/Learning to Link @@ -0,0 +1,2 @@ +This paper describes how to automatically cross-reference documents with Wikipedia: the largest knowledge base ever known. It explains how machine learning can be used to identify significant terms within unstructured text, and enrich it with links to the appropriate Wikipedia articles. 
The resulting link detector and disambiguator performs very well, with recall and precision of almost 75%. This performance is constant whether the system is evaluated on Wikipedia articles or "real world" documents. + This work has implications far beyond enriching documents with explanatory links. It can provide structured knowledge about any unstructured fragment of text. Any task that is currently addressed with bags of words - indexing, clustering, retrieval, and summarization to name a few - could use the techniques described here to draw on a vast network of concepts and semantics. \ No newline at end of file diff --git a/data/2020/iclr/Learning to Represent Programs with Property Signatures b/data/2020/iclr/Learning to Represent Programs with Property Signatures new file mode 100644 index 0000000000..4615671cc6 --- /dev/null +++ b/data/2020/iclr/Learning to Represent Programs with Property Signatures @@ -0,0 +1 @@ +We introduce the notion of property signatures, a representation for programs and program specifications meant for consumption by machine learning algorithms. Given a function with input type $\tau_{in}$ and output type $\tau_{out}$, a property is a function of type: $(\tau_{in}, \tau_{out}) \rightarrow \texttt{Bool}$ that (informally) describes some simple property of the function under consideration. For instance, if $\tau_{in}$ and $\tau_{out}$ are both lists of the same type, one property might ask `is the input list the same length as the output list?'. If we have a list of such properties, we can evaluate them all for our function to get a list of outputs that we will call the property signature. Crucially, we can `guess' the property signature for a function given only a set of input/output pairs meant to specify that function. 
We discuss several potential applications of property signatures and show experimentally that they can be used to improve over a baseline synthesizer so that it emits twice as many programs in less than one-tenth of the time. \ No newline at end of file diff --git a/data/2020/iclr/Learning to solve the credit assignment problem b/data/2020/iclr/Learning to solve the credit assignment problem new file mode 100644 index 0000000000..efb21088a2 --- /dev/null +++ b/data/2020/iclr/Learning to solve the credit assignment problem @@ -0,0 +1 @@ +Backpropagation is driving today's artificial neural networks (ANNs). However, despite extensive research, it remains unclear if the brain implements this algorithm. Among neuroscientists, reinforcement learning (RL) algorithms are often seen as a realistic alternative: neurons can randomly introduce change, and use unspecific feedback signals to observe their effect on the cost and thus approximate their gradient. However, the convergence rate of such learning scales poorly with the number of involved neurons. Here we propose a hybrid learning approach. Each neuron uses an RL-type strategy to learn how to approximate the gradients that backpropagation would provide. We provide proof that our approach converges to the true gradient for certain classes of networks. In both feedforward and convolutional networks, we empirically show that our approach learns to approximate the gradient, and can match or exceed the performance of exact gradient-based learning. Learning feedback weights provides a biologically plausible mechanism of achieving good performance, without the need for precise, pre-specified learning rules. 
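The RL-type strategy the abstract alludes to can be shown with a one-parameter toy (entirely my construction, not the paper's algorithm, which applies the idea per neuron to learn feedback weights): perturb the parameter with noise, observe the change in cost, and treat noise * delta_cost / sigma^2 as a stochastic gradient estimate.

```python
# Hedged one-parameter toy of perturbation-based gradient estimation
# (illustrative only; objective, noise scale, and learning rate are made up).
import random

random.seed(1)

def cost(w):
    return (w - 3.0) ** 2  # toy quadratic objective with minimizer at w = 3

w, sigma, lr = 0.0, 0.1, 0.05
for _ in range(500):
    xi = random.gauss(0.0, sigma)  # random perturbation of the parameter
    # unbiased-to-first-order gradient estimate from the observed cost change
    g_hat = xi * (cost(w + xi) - cost(w)) / sigma ** 2
    w -= lr * g_hat
print(w)  # approaches the minimizer 3.0, up to estimator noise
```

The estimator's variance (not its bias) is what makes this scale poorly with the number of parameters, which is the weakness the paper's hybrid approach targets.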
\ No newline at end of file diff --git a/data/2020/iclr/Linear Symmetric Quantization of Neural Networks for Low-precision Integer Hardware b/data/2020/iclr/Linear Symmetric Quantization of Neural Networks for Low-precision Integer Hardware new file mode 100644 index 0000000000..6d66052917 --- /dev/null +++ b/data/2020/iclr/Linear Symmetric Quantization of Neural Networks for Low-precision Integer Hardware @@ -0,0 +1 @@ +With the proliferation of specialized neural network processors that operate on low-precision integers, the performance of Deep Neural Network inference becomes increasingly dependent on the result of quantization. Despite plenty of prior work on the quantization of weights or activations for neural networks, there is still a wide gap between the software quantizers and the low-precision accelerator implementation, which degrades either the efficiency of networks or that of the hardware due to the lack of software-hardware coordination at design time. In this paper, we propose a learned linear symmetric quantizer for integer neural network processors, which not only quantizes neural parameters and activations to low-bit integers but also accelerates hardware inference by using batch normalization fusion and low-precision accumulators (e.g., 16-bit) and multipliers (e.g., 4-bit). We use a unified way to quantize weights and activations, and the results outperform many previous approaches for various networks such as AlexNet, ResNet, and lightweight models like MobileNet while remaining friendly to the accelerator architecture. Additionally, we apply the method to object detection models and witness high performance and accuracy in YOLO-v2. Finally, we deploy the quantized models on our specialized integer-arithmetic-only DNN accelerator to show the effectiveness of the proposed quantizer. We show that even with linear symmetric quantization, the results can be better than asymmetric or non-linear methods in 4-bit networks. 
In evaluation, the proposed quantizer induces less than 0.4\% accuracy drop in ResNet18, ResNet34, and AlexNet when quantizing the whole network as required by the integer processors. \ No newline at end of file diff --git a/data/2020/iclr/Locality and Compositionality in Zero-Shot Learning b/data/2020/iclr/Locality and Compositionality in Zero-Shot Learning new file mode 100644 index 0000000000..530006008d --- /dev/null +++ b/data/2020/iclr/Locality and Compositionality in Zero-Shot Learning @@ -0,0 +1 @@ +In this work we study locality and compositionality in the context of learning representations for Zero Shot Learning (ZSL). In order to well-isolate the importance of these properties in learned representations, we impose the additional constraint that, differently from most recent work in ZSL, no pre-training on different datasets (e.g. ImageNet) is performed. The results of our experiments show how locality, in terms of small parts of the input, and compositionality, i.e. how well can the learned representations be expressed as a function of a smaller vocabulary, are both deeply related to generalization and motivate the focus on more local-aware models in future research directions for representation learning. \ No newline at end of file diff --git a/data/2020/iclr/Logic and the 2-Simplicial Transformer b/data/2020/iclr/Logic and the 2-Simplicial Transformer new file mode 100644 index 0000000000..e31e7d3207 --- /dev/null +++ b/data/2020/iclr/Logic and the 2-Simplicial Transformer @@ -0,0 +1 @@ +We introduce the $2$-simplicial Transformer, an extension of the Transformer which includes a form of higher-dimensional attention generalising the dot-product attention, and uses this attention to update entity representations with tensor products of value vectors. We show that this architecture is a useful inductive bias for logical reasoning in the context of deep reinforcement learning. 
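As a rough illustration of the higher-dimensional attention described above, here is a hedged pure-Python sketch of my own (the trilinear logit and the flattened outer-product values are simplifications; the actual architecture uses learned projections and combines this with standard dot-product attention): attention runs over *pairs* of entities (j, k) rather than single entities.

```python
# Hedged sketch of 2-simplicial-style attention: softmax over entity pairs,
# with values formed from flattened tensor (outer) products of value vectors.
import math

def two_simplicial_attention(q, keys1, keys2, values):
    """q: query vector; keys1, keys2, values: one vector per entity."""
    n, d = len(values), len(q)
    logits, pairs = [], []
    for j in range(n):
        for k in range(n):
            # trilinear logit over the triple (query, key_j, key_k)
            logits.append(sum(q[a] * keys1[j][a] * keys2[k][a] for a in range(d)))
            # value for the pair: flattened outer product v_j (x) v_k
            pairs.append([vj * vk for vj in values[j] for vk in values[k]])
    m = max(logits)  # softmax over all (j, k) pairs, numerically stabilized
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    out = [0.0] * len(pairs[0])
    for wi, p in zip(w, pairs):
        for t, pt in enumerate(p):
            out[t] += (wi / z) * pt
    return out

# Single-entity toy: the output is just v (x) v, flattened
print(two_simplicial_attention([1.0, 0.0], [[1.0, 1.0]], [[1.0, 1.0]], [[2.0, 3.0]]))
```

With one entity the softmax is trivial and the output reduces to the outer product of the single value vector with itself, which makes the "pairwise" value construction easy to see.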
\ No newline at end of file diff --git a/data/2020/iclr/Low-Resource Knowledge-Grounded Dialogue Generation b/data/2020/iclr/Low-Resource Knowledge-Grounded Dialogue Generation new file mode 100644 index 0000000000..f398243abd --- /dev/null +++ b/data/2020/iclr/Low-Resource Knowledge-Grounded Dialogue Generation @@ -0,0 +1 @@ +Responding with knowledge has been recognized as an important capability for an intelligent conversational agent. Yet knowledge-grounded dialogues, as training data for learning such a response generation model, are difficult to obtain. Motivated by the challenge in practice, we consider knowledge-grounded dialogue generation under a natural assumption that only limited training examples are available. In such a low-resource setting, we devise a disentangled response decoder in order to isolate parameters that depend on knowledge-grounded dialogues from the entire generation model. By this means, the major part of the model can be learned from a large number of ungrounded dialogues and unstructured documents, while the remaining small set of parameters can be well fitted using the limited training examples. Evaluation results on two benchmarks indicate that with only $1/8$ of the training data, our model can achieve state-of-the-art performance and generalize well on out-of-domain knowledge. \ No newline at end of file diff --git a/data/2020/iclr/MACER: Attack-free and Scalable Robust Training via Maximizing Certified Radius b/data/2020/iclr/MACER: Attack-free and Scalable Robust Training via Maximizing Certified Radius new file mode 100644 index 0000000000..d172a0ec08 --- /dev/null +++ b/data/2020/iclr/MACER: Attack-free and Scalable Robust Training via Maximizing Certified Radius @@ -0,0 +1 @@ +Adversarial training is one of the most popular ways to learn robust models but is usually attack-dependent and time-costly. 
In this paper, we propose the MACER algorithm, which learns robust models without using adversarial training but performs better than all existing provable l2-defenses. Recent work shows that randomized smoothing can be used to provide a certified l2 radius to smoothed classifiers, and our algorithm trains provably robust smoothed classifiers via MAximizing the CErtified Radius (MACER). The attack-free characteristic makes MACER faster to train and easier to optimize. In our experiments, we show that our method can be applied to modern deep neural networks on a wide range of datasets, including Cifar-10, ImageNet, MNIST, and SVHN. For all tasks, MACER spends less training time than state-of-the-art adversarial training algorithms, and the learned models achieve a larger average certified radius. \ No newline at end of file diff --git a/data/2020/iclr/Maxmin Q-learning: Controlling the Estimation Bias of Q-learning b/data/2020/iclr/Maxmin Q-learning: Controlling the Estimation Bias of Q-learning new file mode 100644 index 0000000000..62aeca0314 --- /dev/null +++ b/data/2020/iclr/Maxmin Q-learning: Controlling the Estimation Bias of Q-learning @@ -0,0 +1 @@ +Q-learning suffers from overestimation bias, because it approximates the maximum action value using the maximum estimated action value. Algorithms have been proposed to reduce overestimation bias, but we lack an understanding of how bias interacts with performance, and the extent to which existing algorithms mitigate bias. 
In this paper, we 1) highlight that the effect of overestimation bias on learning efficiency is environment-dependent; 2) propose a generalization of Q-learning, called \emph{Maxmin Q-learning}, which provides a parameter to flexibly control bias; 3) show theoretically that there exists a parameter choice for Maxmin Q-learning that leads to unbiased estimation with a lower approximation variance than Q-learning; and 4) prove the convergence of our algorithm in the tabular case, as well as convergence of several previous Q-learning variants, using a novel Generalized Q-learning framework. We empirically verify that our algorithm better controls estimation bias in toy environments, and that it achieves superior performance on several benchmark problems. \ No newline at end of file diff --git a/data/2020/iclr/Measuring Compositional Generalization: A Comprehensive Method on Realistic Data b/data/2020/iclr/Measuring Compositional Generalization: A Comprehensive Method on Realistic Data new file mode 100644 index 0000000000..8105099b3b --- /dev/null +++ b/data/2020/iclr/Measuring Compositional Generalization: A Comprehensive Method on Realistic Data @@ -0,0 +1 @@ +State-of-the-art machine learning methods exhibit limited compositional generalization. At the same time, there is a lack of realistic benchmarks that comprehensively measure this ability, which makes it challenging to find and evaluate improvements. We introduce a novel method to systematically construct such benchmarks by maximizing compound divergence while guaranteeing a small atom divergence between train and test sets, and we quantitatively compare this method to other approaches for creating compositional generalization benchmarks. We present a large and realistic natural language question answering dataset that is constructed according to this method, and we use it to analyze the compositional generalization ability of three machine learning architectures. 
We find that they fail to generalize compositionally and that there is a surprisingly strong negative correlation between compound divergence and accuracy. We also demonstrate how our method can be used to create new compositionality benchmarks on top of the existing SCAN dataset, which confirms these findings. \ No newline at end of file diff --git a/data/2020/iclr/Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples b/data/2020/iclr/Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples new file mode 100644 index 0000000000..3516d76e17 --- /dev/null +++ b/data/2020/iclr/Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples @@ -0,0 +1 @@ +Few-shot classification refers to learning a classifier for new classes given only a few examples. While a plethora of models have emerged to tackle it, we find the procedure and datasets that are used to assess their progress lacking. To address this limitation, we propose Meta-Dataset: a new benchmark for training and evaluating models that is large-scale, consists of diverse datasets, and presents more realistic tasks. We experiment with popular baselines and meta-learners on Meta-Dataset, along with a competitive method that we propose. We analyze performance as a function of various characteristics of test tasks and examine the models' ability to leverage diverse training sources for improving their generalization. We also propose a new set of baselines for quantifying the benefit of meta-learning in Meta-Dataset. Our extensive experimentation has uncovered important research challenges and we hope to inspire work in these directions. 
\ No newline at end of file diff --git a/data/2020/iclr/MetaPix: Few-Shot Video Retargeting b/data/2020/iclr/MetaPix: Few-Shot Video Retargeting new file mode 100644 index 0000000000..9ed6534f9f --- /dev/null +++ b/data/2020/iclr/MetaPix: Few-Shot Video Retargeting @@ -0,0 +1 @@ +We address the task of unsupervised retargeting of human actions from one video to another. We consider the challenging setting where only a few frames of the target are available. The core of our approach is a conditional generative model that can transcode input skeletal poses (automatically extracted with an off-the-shelf pose estimator) to output target frames. However, it is challenging to build a universal transcoder because humans can appear wildly different due to clothing and background scene geometry. Instead, we learn to adapt - or personalize - a universal generator to the particular human and background in the target. To do so, we make use of meta-learning to discover effective strategies for on-the-fly personalization. One significant benefit of meta-learning is that the personalized transcoder naturally enforces temporal coherence across its generated frames; all frames contain consistent clothing and background geometry of the target. We experiment on in-the-wild internet videos and images and show our approach improves over widely-used baselines for the task. \ No newline at end of file diff --git a/data/2020/iclr/Minimizing FLOPs to Learn Efficient Sparse Representations b/data/2020/iclr/Minimizing FLOPs to Learn Efficient Sparse Representations new file mode 100644 index 0000000000..d4e04b519b --- /dev/null +++ b/data/2020/iclr/Minimizing FLOPs to Learn Efficient Sparse Representations @@ -0,0 +1 @@ +Deep representation learning has become one of the most widely adopted approaches for visual search, recommendation, and identification. Retrieval of such representations from a large database is however computationally challenging. 
Approximate methods based on learning compact representations have been widely explored for this problem, such as locality sensitive hashing, product quantization, and PCA. In this work, in contrast to learning compact representations, we propose to learn high-dimensional and sparse representations that have similar representational capacity as dense embeddings while being more efficient due to sparse matrix multiplication operations which can be much faster than dense multiplication. Following the key insight that the number of operations decreases quadratically with the sparsity of embeddings provided the non-zero entries are distributed uniformly across dimensions, we propose a novel approach to learn such distributed sparse embeddings via the use of a carefully constructed regularization function that directly minimizes a continuous relaxation of the number of floating-point operations (FLOPs) incurred during retrieval. Our experiments show that our approach is competitive with the other baselines and yields a similar or better speed-vs-accuracy tradeoff on practical datasets. \ No newline at end of file diff --git a/data/2020/iclr/Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models b/data/2020/iclr/Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models new file mode 100644 index 0000000000..eeb64d4f1f --- /dev/null +++ b/data/2020/iclr/Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models @@ -0,0 +1 @@ +In natural language processing, it has been observed recently that generalization could be greatly improved by finetuning a large-scale language model pretrained on a large unlabeled corpus. Despite its recent success and wide adoption, finetuning a large pretrained language model on a downstream task is prone to degenerate performance when there are only a small number of training instances available. 
In this paper, we introduce a new regularization technique, to which we refer as "mixout", motivated by dropout. Mixout stochastically mixes the parameters of two models. We show that our mixout technique regularizes learning to minimize the deviation from one of the two models and that the strength of regularization adapts along the optimization trajectory. We empirically evaluate the proposed mixout and its variants on finetuning a pretrained language model on downstream tasks. More specifically, we demonstrate that the stability of finetuning and the average accuracy greatly increase when we use the proposed approach to regularize finetuning of BERT on downstream tasks in GLUE. \ No newline at end of file diff --git a/data/2020/iclr/Mixup Inference: Better Exploiting Mixup to Defend Adversarial Attacks b/data/2020/iclr/Mixup Inference: Better Exploiting Mixup to Defend Adversarial Attacks new file mode 100644 index 0000000000..b5b8ef263f --- /dev/null +++ b/data/2020/iclr/Mixup Inference: Better Exploiting Mixup to Defend Adversarial Attacks @@ -0,0 +1 @@ +It has been widely recognized that adversarial examples can be easily crafted to fool deep networks, which mainly stems from the locally non-linear behavior near input examples. Applying mixup in training provides an effective mechanism to improve generalization performance and model robustness against adversarial perturbations, which introduces the globally linear behavior in-between training examples. However, in previous work, the mixup-trained models only passively defend against adversarial attacks at inference by directly classifying the inputs, where the induced global linearity is not well exploited. Namely, owing to the locality of the adversarial perturbations, it would be more efficient to actively break the locality via the globality of the model predictions. Inspired by simple geometric intuition, we develop an inference principle, named mixup inference (MI), for mixup-trained models. 
MI mixes the input with other random clean samples, which can shrink and transfer the equivalent perturbation if the input is adversarial. Our experiments on CIFAR-10 and CIFAR-100 demonstrate that MI can further improve the adversarial robustness of models trained with mixup and its variants. \ No newline at end of file diff --git a/data/2020/iclr/Multi-agent Reinforcement Learning for Networked System Control b/data/2020/iclr/Multi-agent Reinforcement Learning for Networked System Control new file mode 100644 index 0000000000..3d6d56bce7 --- /dev/null +++ b/data/2020/iclr/Multi-agent Reinforcement Learning for Networked System Control @@ -0,0 +1 @@ +This paper considers multi-agent reinforcement learning (MARL) in networked system control. Specifically, each agent learns a decentralized control policy based on local observations and messages from connected neighbors. We formulate such a networked MARL (NMARL) problem as a spatiotemporal Markov decision process and introduce a spatial discount factor to stabilize the training of each local agent. Further, we propose a new differentiable communication protocol, called NeurComm, to reduce information loss and non-stationarity in NMARL. Experiments in realistic NMARL scenarios of adaptive traffic signal control and cooperative adaptive cruise control show that an appropriate spatial discount factor effectively enhances the learning curves of non-communicative MARL algorithms, and that NeurComm outperforms existing communication protocols in both learning efficiency and control performance. 
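One way to read the spatial discount factor above (my interpretation for illustration; the paper's exact formulation may also weight observations, and the chain topology and alpha value here are made up) is that each agent trains on neighbors' rewards decayed by graph distance:

```python
# Hedged sketch of a spatial discount factor: agent i's training reward
# blends all agents' rewards, weighted by alpha ** hop_distance.

def spatially_discounted_rewards(rewards, distances, alpha):
    """rewards[j]: local reward of agent j; distances[i][j]: hop count i -> j."""
    n = len(rewards)
    return [
        sum(alpha ** distances[i][j] * rewards[j] for j in range(n))
        for i in range(n)
    ]

# Three agents on a toy chain 0 - 1 - 2
distances = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
rewards = [1.0, 0.0, -1.0]
print(spatially_discounted_rewards(rewards, distances, 0.5))  # [0.75, 0.0, -0.75]
```

Setting alpha = 0 recovers fully local (selfish) training rewards and alpha = 1 recovers the fully global sum, so alpha interpolates between decentralized and global objectives, which is what stabilizes each local agent's training.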
\ No newline at end of file diff --git a/data/2020/iclr/Multiplicative Interactions and Where to Find Them b/data/2020/iclr/Multiplicative Interactions and Where to Find Them new file mode 100644 index 0000000000..d3c1abbf53 --- /dev/null +++ b/data/2020/iclr/Multiplicative Interactions and Where to Find Them @@ -0,0 +1 @@ +We explore the role of multiplicative interaction as a unifying framework to describe a range of classical and modern neural network architectural motifs, such as gating, attention layers, hypernetworks, and dynamic convolutions, amongst others. Multiplicative interaction layers as primitive operations have a long-established presence in the literature, though this is often not emphasized and thus under-appreciated. We begin by showing that such layers strictly enrich the representable function classes of neural networks. We conjecture that multiplicative interactions offer a particularly powerful inductive bias when fusing multiple streams of information or when conditional computation is required. We therefore argue that they should be considered in many situations where multiple compute or information paths need to be combined, in place of the simple and oft-used concatenation operation. Finally, we back up our claims and demonstrate the potential of multiplicative interactions by applying them in large-scale complex RL and sequence modelling tasks, where their use allows us to deliver state-of-the-art results, thereby providing new evidence that multiplicative interactions deserve a more prominent role in the design of new neural network architectures.
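The gating-style motif the abstract above unifies can be sketched minimally (a toy dense layer; the projection-then-elementwise-product form is a common instance, not the paper's only formulation):

```python
def matvec(M, v):
    """Plain matrix-vector product over nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def multiplicative_layer(x, context, W, U, b):
    """Minimal multiplicative-interaction sketch: project the input and
    a context stream separately, then fuse them by elementwise
    multiplication (plus bias) instead of concatenation."""
    h = matvec(W, x)
    g = matvec(U, context)
    return [hi * gi + bi for hi, gi, bi in zip(h, g, b)]
```

Setting `U` and `b` appropriately recovers an ordinary linear layer, which is one way to see that the multiplicative form strictly enriches the representable function class.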
\ No newline at end of file diff --git a/data/2020/iclr/Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification b/data/2020/iclr/Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification new file mode 100644 index 0000000000..4bb242a9bb --- /dev/null +++ b/data/2020/iclr/Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification @@ -0,0 +1 @@ +Person re-identification (re-ID) aims at identifying the same persons' images across different cameras. However, domain diversities between different datasets pose an evident challenge for adapting the re-ID model trained on one dataset to another one. State-of-the-art unsupervised domain adaptation methods for person re-ID transfer the learned knowledge from the source domain by optimizing with pseudo labels created by clustering algorithms on the target domain. Although they achieved state-of-the-art performances, the inevitable label noise caused by the clustering procedure was ignored. Such noisy pseudo labels substantially hinder the model's capability to further improve feature representations on the target domain. To mitigate the effects of noisy pseudo labels, we propose an unsupervised framework, Mutual Mean-Teaching (MMT), which softly refines the pseudo labels in the target domain, learning better features via off-line refined hard pseudo labels and on-line refined soft pseudo labels in an alternating training manner. In addition, the common practice is to adopt both the classification loss and the triplet loss jointly for achieving optimal performances in person re-ID models. However, the conventional triplet loss cannot work with softly refined labels. To solve this problem, a novel soft softmax-triplet loss is proposed to support learning with soft pseudo triplet labels for achieving the optimal domain adaptation performance.
The proposed MMT framework achieves considerable improvements of 14.4%, 18.2%, 13.1% and 16.4% mAP on the Market-to-Duke, Duke-to-Market, Market-to-MSMT and Duke-to-MSMT unsupervised domain adaptation tasks. \ No newline at end of file diff --git a/data/2020/iclr/N-BEATS: Neural basis expansion analysis for interpretable time series forecasting b/data/2020/iclr/N-BEATS: Neural basis expansion analysis for interpretable time series forecasting new file mode 100644 index 0000000000..5fa20c9c2f --- /dev/null +++ b/data/2020/iclr/N-BEATS: Neural basis expansion analysis for interpretable time series forecasting @@ -0,0 +1 @@ +We focus on solving the univariate time series point forecasting problem using deep learning. We propose a deep neural architecture based on backward and forward residual links and a very deep stack of fully-connected layers. The architecture has a number of desirable properties, being interpretable, applicable without modification to a wide array of target domains, and fast to train. We test the proposed architecture on several well-known datasets, including the M3, M4 and TOURISM competition datasets containing time series from diverse domains. We demonstrate state-of-the-art performance for two configurations of N-BEATS on all the datasets, improving forecast accuracy by 11% over a statistical benchmark and by 3% over last year's winner of the M4 competition, a domain-adjusted hand-crafted hybrid between neural network and statistical time series models. The first configuration of our model does not employ any time-series-specific components, and its performance on heterogeneous datasets strongly suggests that, contrary to received wisdom, deep learning primitives such as residual blocks are by themselves sufficient to solve a wide range of forecasting problems. Finally, we demonstrate how the proposed architecture can be augmented to provide outputs that are interpretable without considerable loss in accuracy.
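The backward/forward residual links of N-BEATS above can be sketched as a doubly residual loop (blocks here are arbitrary callables standing in for the paper's deep fully-connected stacks):

```python
def doubly_residual_forecast(x, blocks):
    """Sketch of N-BEATS-style doubly residual stacking: each block maps
    its residual input to a (backcast, forecast) pair; the backcast is
    subtracted from the running residual (backward link) and the
    forecasts are summed into the final prediction (forward link)."""
    residual = list(x)
    total_forecast = None
    for block in blocks:
        backcast, forecast = block(residual)
        residual = [r - b for r, b in zip(residual, backcast)]
        if total_forecast is None:
            total_forecast = list(forecast)
        else:
            total_forecast = [t + f for t, f in zip(total_forecast, forecast)]
    return total_forecast
```

Each block only has to explain the part of the signal its predecessors left behind, which is what makes the per-block basis expansions interpretable.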
\ No newline at end of file diff --git a/data/2020/iclr/NAS evaluation is frustratingly hard b/data/2020/iclr/NAS evaluation is frustratingly hard new file mode 100644 index 0000000000..71f7d2b88f --- /dev/null +++ b/data/2020/iclr/NAS evaluation is frustratingly hard @@ -0,0 +1 @@ +Neural Architecture Search (NAS) is an exciting new field which promises to be as much of a game-changer as Convolutional Neural Networks were in 2012. Despite many great works leading to substantial improvements on a variety of tasks, comparison between different methods is still very much an open issue. While most algorithms are tested on the same datasets, there is no shared experimental protocol followed by all. As such, and due to the under-use of ablation studies, there is a lack of clarity regarding why certain methods are more effective than others. Our first contribution is a benchmark of 8 NAS methods on 5 datasets. To overcome the hurdle of comparing methods with different search spaces, we propose using a method’s relative improvement over the randomly sampled average architecture, which effectively removes advantages arising from expertly engineered search spaces or training protocols. Surprisingly, we find that many NAS techniques struggle to significantly beat the average architecture baseline. We perform further experiments with the commonly used DARTS search space in order to understand the contribution of each component in the NAS pipeline. These experiments highlight that: (i) the use of tricks in the evaluation protocol has a predominant impact on the reported performance of architectures; (ii) the cell-based search space has a very narrow accuracy range, such that the seed has a considerable impact on architecture rankings; (iii) the hand-designed macro-structure (cells) is more important than the searched micro-structure (operations); and (iv) the depth-gap is a real phenomenon, evidenced by the change in rankings between 8 and 20 cell architectures.
To conclude, we suggest best practices that we hope will prove useful for the community and help mitigate current NAS pitfalls, e.g. difficulties in reproducibility and comparison of search methods. We provide the code used for our experiments at link-to-come. \ No newline at end of file diff --git a/data/2020/iclr/Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data b/data/2020/iclr/Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data new file mode 100644 index 0000000000..3a73e5fc7b --- /dev/null +++ b/data/2020/iclr/Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data @@ -0,0 +1 @@ +Nowadays, deep neural networks (DNNs) have become the main instrument for machine learning tasks within a wide range of domains, including vision, NLP, and speech. Meanwhile, in the important case of heterogeneous tabular data, the advantage of DNNs over shallow counterparts remains questionable. In particular, there is insufficient evidence that deep learning machinery allows constructing methods that outperform gradient boosting decision trees (GBDT), which are often the top choice for tabular problems. In this paper, we introduce Neural Oblivious Decision Ensembles (NODE), a new deep learning architecture, designed to work with any tabular data. In a nutshell, the proposed NODE architecture generalizes ensembles of oblivious decision trees, but benefits from both end-to-end gradient-based optimization and the power of multi-layer hierarchical representation learning. With an extensive experimental comparison to the leading GBDT packages on a large number of tabular datasets, we demonstrate the advantage of the proposed NODE architecture, which outperforms the competitors on most of the tasks. We open-source the PyTorch implementation of NODE and believe that it will become a universal framework for machine learning on tabular data.
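The oblivious decision trees that NODE generalizes have a very regular structure, which a hard (non-differentiable) sketch makes concrete (NODE itself learns soft, differentiable versions of the splits and leaf lookup; this shows only the underlying structure):

```python
def oblivious_tree(x, features, thresholds, leaves):
    """Hard oblivious decision tree sketch: every level of the tree
    tests the SAME (feature, threshold) pair in all of its nodes, so a
    depth-d tree reduces to d comparisons whose outcome bits index
    into 2**d leaf values."""
    idx = 0
    for f, t in zip(features, thresholds):
        idx = (idx << 1) | int(x[f] >= t)
    return leaves[idx]
```

This regularity is why oblivious trees evaluate as a handful of vectorizable comparisons, making them a natural building block for an end-to-end differentiable ensemble.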
\ No newline at end of file diff --git a/data/2020/iclr/Neural Stored-program Memory b/data/2020/iclr/Neural Stored-program Memory new file mode 100644 index 0000000000..4c841f29be --- /dev/null +++ b/data/2020/iclr/Neural Stored-program Memory @@ -0,0 +1 @@ +Neural networks powered with external memory simulate computer behaviors. These models, which use the memory to store data for a neural controller, can learn algorithms and other complex tasks. In this paper, we introduce a new memory to store weights for the controller, analogous to the stored-program memory in modern computer architectures. The proposed model, dubbed Neural Stored-program Memory, augments current memory-augmented neural networks, creating differentiable machines that can switch programs through time, adapt to variable contexts and thus resemble the Universal Turing Machine. A wide range of experiments demonstrate that the resulting machines not only excel in classical algorithmic problems, but also have potential for compositional, continual, few-shot learning and question-answering tasks. \ No newline at end of file diff --git a/data/2020/iclr/Neural Text Generation With Unlikelihood Training b/data/2020/iclr/Neural Text Generation With Unlikelihood Training new file mode 100644 index 0000000000..ae84b16d9c --- /dev/null +++ b/data/2020/iclr/Neural Text Generation With Unlikelihood Training @@ -0,0 +1 @@ +Neural text generation is a key tool in natural language applications, but it is well known there are major problems at its core. In particular, standard likelihood training and decoding leads to dull and repetitive outputs. While some post-hoc fixes have been proposed, in particular top-$k$ and nucleus sampling, they do not address the fact that the token-level probabilities predicted by the model are poor. 
In this paper we show that the likelihood objective itself is at fault, resulting in a model that assigns too much probability to sequences containing repeats and frequent words, unlike those from the human training distribution. We propose a new objective, unlikelihood training, which forces unlikely generations to be assigned lower probability by the model. We show that both token and sequence level unlikelihood training give less repetitive, less dull text while maintaining perplexity, giving superior generations using standard greedy or beam search. According to human evaluations, our approach with standard beam search also outperforms the currently popular decoding methods of nucleus sampling or beam blocking, thus providing a strong alternative to existing techniques. \ No newline at end of file diff --git a/data/2020/iclr/Novelty Detection Via Blurring b/data/2020/iclr/Novelty Detection Via Blurring new file mode 100644 index 0000000000..500c17222c --- /dev/null +++ b/data/2020/iclr/Novelty Detection Via Blurring @@ -0,0 +1 @@ +Conventional out-of-distribution (OOD) detection schemes based on variational autoencoders or Random Network Distillation (RND) are known to assign lower uncertainty to OOD data than to the target distribution. In this work, we discover that such conventional novelty detection schemes are also vulnerable to blurred images. Based on this observation, we construct a novel RND-based OOD detector, SVD-RND, that utilizes blurred images during training. Our detector is simple, efficient at test time, and outperforms baseline OOD detectors in various domains. Further results show that SVD-RND learns a better target distribution representation than the baselines. Finally, SVD-RND combined with geometric transforms achieves near-perfect detection accuracy in the CelebA domain.
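The token-level objective of "Neural Text Generation With Unlikelihood Training" above can be sketched in a few lines (a simplified form: real training operates on model logits over a vocabulary, and the negative candidates are typically tokens from the previous context):

```python
import math

def unlikelihood_loss(probs, target, negatives):
    """Token-level sketch: the usual negative log-likelihood of the
    target token, plus an 'unlikelihood' term that penalizes any
    probability mass assigned to negative candidates (e.g. tokens the
    model has already repeated)."""
    nll = -math.log(probs[target])
    ul = -sum(math.log(1.0 - probs[c]) for c in negatives)
    return nll + ul
```

When the candidate set is empty the loss reduces to standard maximum likelihood, so the penalty is a strict add-on that only pushes down repeat-prone tokens.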
\ No newline at end of file diff --git a/data/2020/iclr/Observational Overfitting in Reinforcement Learning b/data/2020/iclr/Observational Overfitting in Reinforcement Learning new file mode 100644 index 0000000000..56d3501926 --- /dev/null +++ b/data/2020/iclr/Observational Overfitting in Reinforcement Learning @@ -0,0 +1 @@ +A major component of overfitting in model-free reinforcement learning (RL) involves the case where the agent may mistakenly correlate reward with certain spurious features from the observations generated by the Markov Decision Process (MDP). We provide a general framework for analyzing this scenario, which we use to design multiple synthetic benchmarks by modifying only the observation space of an MDP. When an agent overfits to different observation spaces even though the underlying MDP dynamics is fixed, we term this observational overfitting. Our experiments expose intriguing properties, especially with regard to implicit regularization, and also corroborate results from previous works in RL generalization and supervised learning (SL). \ No newline at end of file diff --git a/data/2020/iclr/On Computation and Generalization of Generative Adversarial Imitation Learning b/data/2020/iclr/On Computation and Generalization of Generative Adversarial Imitation Learning new file mode 100644 index 0000000000..331fcf3d1c --- /dev/null +++ b/data/2020/iclr/On Computation and Generalization of Generative Adversarial Imitation Learning @@ -0,0 +1 @@ +Generative Adversarial Imitation Learning (GAIL) is a powerful and practical approach for learning sequential decision-making policies. Different from Reinforcement Learning (RL), GAIL takes advantage of demonstration data by experts (e.g., humans), and learns both the policy and reward function of the unknown environment. Despite significant empirical progress, the theory behind GAIL is still largely unknown.
The major difficulty comes from the underlying temporal dependency of the demonstration data and the minimax computational formulation of GAIL without convex-concave structure. To bridge such a gap between theory and practice, this paper investigates the theoretical properties of GAIL. Specifically, we show: (1) For GAIL with general reward parameterization, generalization can be guaranteed as long as the class of the reward functions is properly controlled; (2) When the reward is parameterized as a reproducing kernel function, GAIL can be efficiently solved by stochastic first-order optimization algorithms, which attain sublinear convergence to a stationary solution. To the best of our knowledge, these are the first results on statistical and computational guarantees of imitation learning with reward/policy function approximation. Numerical experiments are provided to support our analysis. \ No newline at end of file diff --git a/data/2020/iclr/On Identifiability in Transformers b/data/2020/iclr/On Identifiability in Transformers new file mode 100644 index 0000000000..d4b1aa6043 --- /dev/null +++ b/data/2020/iclr/On Identifiability in Transformers @@ -0,0 +1 @@ +In this paper we delve deep into the Transformer architecture by investigating two of its core components: self-attention and contextual embeddings. In particular, we study the identifiability of attention weights and token embeddings, and the aggregation of context into hidden tokens. We show that, for sequences longer than the attention head dimension, attention weights are not identifiable. We propose effective attention as a complementary tool for improving explanatory interpretations based on attention. Furthermore, we show that input tokens retain their identity across the model to a large degree. We also find evidence suggesting that identity information is mainly encoded in the angle of the embeddings and gradually decreases with depth.
Finally, we demonstrate strong mixing of input information in the generation of contextual embeddings by means of a novel quantification method based on gradient attribution. Overall, we show that self-attention distributions are not directly interpretable and present tools to better understand and further investigate Transformer models. \ No newline at end of file diff --git a/data/2020/iclr/On Mutual Information Maximization for Representation Learning b/data/2020/iclr/On Mutual Information Maximization for Representation Learning new file mode 100644 index 0000000000..25d472fb08 --- /dev/null +++ b/data/2020/iclr/On Mutual Information Maximization for Representation Learning @@ -0,0 +1 @@ +Many recent methods for unsupervised or self-supervised representation learning train feature extractors by maximizing an estimate of the mutual information (MI) between different views of the data. This comes with several immediate problems: For example, MI is notoriously hard to estimate, and using it as an objective for representation learning may lead to highly entangled representations due to its invariance under arbitrary invertible transformations. Nevertheless, these methods have been repeatedly shown to excel in practice. In this paper we argue, and provide empirical evidence, that the success of these methods cannot be attributed to the properties of MI alone, and that they strongly depend on the inductive bias in both the choice of feature extractor architectures and the parametrization of the employed MI estimators. Finally, we establish a connection to deep metric learning and argue that this interpretation may be a plausible explanation for the success of the recently introduced methods. 
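One of the MI estimators the representation-learning abstract above analyzes, InfoNCE, is easy to sketch given a precomputed critic-score matrix; the batch-size cap of log N on the bound, one of the paper's observed limitations, is visible directly in the code:

```python
import math

def info_nce(scores):
    """InfoNCE lower bound on mutual information: scores[i][j] is the
    critic value f(x_i, y_j) for a batch of N paired views, with the
    positives on the diagonal. The bound is the average log-softmax of
    each positive plus log N, and therefore can never exceed log N."""
    n = len(scores)
    total = 0.0
    for i in range(n):
        log_denom = math.log(sum(math.exp(s) for s in scores[i]))
        total += scores[i][i] - log_denom
    return total / n + math.log(n)
```

An uninformative critic (all scores equal) yields a bound of 0, while a perfect critic saturates at log N regardless of the true MI.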
\ No newline at end of file diff --git "a/data/2020/iclr/On the \"steerability\" of generative adversarial networks" "b/data/2020/iclr/On the \"steerability\" of generative adversarial networks" new file mode 100644 index 0000000000..2b33f5ad8c --- /dev/null +++ "b/data/2020/iclr/On the \"steerability\" of generative adversarial networks" @@ -0,0 +1 @@ +An open secret in contemporary machine learning is that many models work beautifully on standard benchmarks but fail to generalize outside the lab. This has been attributed to biased training data, which provide poor coverage over real world events. Generative models are no exception, but recent advances in generative adversarial networks (GANs) suggest otherwise - these models can now synthesize strikingly realistic and diverse images. Is generative modeling of photos a solved problem? We show that although current GANs can fit standard datasets very well, they still fall short of being comprehensive models of the visual manifold. In particular, we study their ability to fit simple transformations such as camera movements and color changes. We find that the models reflect the biases of the datasets on which they are trained (e.g., centered objects), but that they also exhibit some capacity for generalization: by "steering" in latent space, we can shift the distribution while still creating realistic images. We hypothesize that the degree of distributional shift is related to the breadth of the training data distribution. Thus, we conduct experiments to quantify the limits of GAN transformations and introduce techniques to mitigate the problem. 
Code is released on our project page: this https URL \ No newline at end of file diff --git a/data/2020/iclr/On the Variance of the Adaptive Learning Rate and Beyond b/data/2020/iclr/On the Variance of the Adaptive Learning Rate and Beyond new file mode 100644 index 0000000000..875c47dd02 --- /dev/null +++ b/data/2020/iclr/On the Variance of the Adaptive Learning Rate and Beyond @@ -0,0 +1 @@ +The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in detail. Pursuing the theory behind warmup, we identify a problem of the adaptive learning rate (i.e., it has problematically large variance in the early stage), suggest warmup works as a variance reduction technique, and provide both empirical and theoretical evidence to verify our hypothesis. We further propose RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate. Extensive experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the effectiveness and robustness of our proposed method. All implementations are available at: this https URL. \ No newline at end of file diff --git a/data/2020/iclr/On the Weaknesses of Reinforcement Learning for Neural Machine Translation b/data/2020/iclr/On the Weaknesses of Reinforcement Learning for Neural Machine Translation new file mode 100644 index 0000000000..7c635e7e71 --- /dev/null +++ b/data/2020/iclr/On the Weaknesses of Reinforcement Learning for Neural Machine Translation @@ -0,0 +1 @@ +Reinforcement learning (RL) is frequently used to increase performance in text generation tasks, including machine translation (MT), notably through the use of Minimum Risk Training (MRT) and Generative Adversarial Networks (GAN).
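The variance-rectification term of RAdam ("On the Variance of the Adaptive Learning Rate and Beyond", above) follows directly from the paper's update rule; this sketch covers only the scheduling logic, not a full optimizer:

```python
import math

def radam_rectifier(t, beta2=0.999):
    """RAdam's rectification term at step t (1-indexed): when the
    approximated SMA length rho_t is at most 4, the adaptive step is
    skipped in favor of a plain momentum update (returned as None);
    otherwise the adaptive step is scaled by r_t < 1 to compensate for
    the large early-stage variance of the adaptive learning rate."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t <= 4.0:
        return None  # fall back to the non-adaptive (momentum) update
    return math.sqrt((rho_t - 4.0) * (rho_t - 2.0) * rho_inf /
                     ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))
```

Because rho_t grows toward rho_inf with t, the scale r_t rises toward 1, reproducing the "warmup-like" schedule the paper derives rather than hand-tunes.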
However, little is known about what and how these methods learn in the context of MT. We prove that one of the most common RL methods for MT does not optimize the expected reward, and show that other methods take an infeasibly long time to converge. In fact, our results suggest that RL practices in MT are likely to improve performance only where the pre-trained parameters are already close to yielding the correct translation. Our findings further suggest that observed gains may be due not to the training signal but rather to changes in the shape of the distribution curve. \ No newline at end of file diff --git a/data/2020/iclr/One-Shot Pruning of Recurrent Neural Networks by Jacobian Spectrum Evaluation b/data/2020/iclr/One-Shot Pruning of Recurrent Neural Networks by Jacobian Spectrum Evaluation new file mode 100644 index 0000000000..9ddfe668a6 --- /dev/null +++ b/data/2020/iclr/One-Shot Pruning of Recurrent Neural Networks by Jacobian Spectrum Evaluation @@ -0,0 +1 @@ +Recent advances in the sparse neural network literature have made it possible to prune many large feed-forward and convolutional networks with only a small quantity of data. Yet, these same techniques often falter when applied to the problem of recovering sparse recurrent networks. These failures are quantitative: when pruned with recent techniques, RNNs typically obtain worse performance than they do under a simple random pruning scheme. The failures are also qualitative: the distribution of active weights in a pruned LSTM or GRU network tends to be concentrated in specific neurons and gates, and not well dispersed across the entire architecture. We seek to rectify both the quantitative and qualitative issues with recurrent network pruning by introducing a new recurrent pruning objective derived from the spectrum of the recurrent Jacobian.
Our objective is data efficient (requiring only 64 data points to prune the network), easy to implement, and produces 95% sparse GRUs that significantly improve on existing baselines. We evaluate on sequential MNIST, Billion Words, and Wikitext. \ No newline at end of file diff --git a/data/2020/iclr/Optimistic Exploration even with a Pessimistic Initialisation b/data/2020/iclr/Optimistic Exploration even with a Pessimistic Initialisation new file mode 100644 index 0000000000..b60e3a1cd0 --- /dev/null +++ b/data/2020/iclr/Optimistic Exploration even with a Pessimistic Initialisation @@ -0,0 +1 @@ +Optimistic initialisation is an effective strategy for efficient exploration in reinforcement learning (RL). In the tabular case, all provably efficient model-free algorithms rely on it. However, model-free deep RL algorithms do not use optimistic initialisation despite taking inspiration from these provably efficient tabular algorithms. In particular, in scenarios with only positive rewards, Q-values are initialised at their lowest possible values due to commonly used network initialisation schemes, a pessimistic initialisation. Merely initialising the network to output optimistic Q-values is not enough, since we cannot ensure that they remain optimistic for novel state-action pairs, which is crucial for exploration. We propose a simple count-based augmentation to pessimistically initialised Q-values that separates the source of optimism from the neural network. We show that this scheme is provably efficient in the tabular setting and extend it to the deep RL setting. Our algorithm, Optimistic Pessimistically Initialised Q-Learning (OPIQ), augments the Q-value estimates of a DQN-based agent with count-derived bonuses to ensure optimism during both action selection and bootstrapping. 
We show that OPIQ outperforms non-optimistic DQN variants that utilise a pseudocount-based intrinsic motivation in hard exploration tasks, and that it predicts optimistic estimates for novel state-action pairs. \ No newline at end of file diff --git a/data/2020/iclr/Option Discovery using Deep Skill Chaining b/data/2020/iclr/Option Discovery using Deep Skill Chaining new file mode 100644 index 0000000000..d47177164e --- /dev/null +++ b/data/2020/iclr/Option Discovery using Deep Skill Chaining @@ -0,0 +1 @@ +Autonomously discovering temporally extended actions, or skills, is a longstanding goal of hierarchical reinforcement learning. We propose a new algorithm that combines skill chaining with deep neural networks to autonomously discover skills in high-dimensional, continuous domains. The resulting algorithm, deep skill chaining, constructs skills with the property that executing one enables the agent to execute another. We demonstrate that deep skill chaining significantly outperforms both non-hierarchical agents and other state-of-the-art skill discovery techniques in challenging continuous control tasks. \ No newline at end of file diff --git a/data/2020/iclr/Order Learning and Its Application to Age Estimation b/data/2020/iclr/Order Learning and Its Application to Age Estimation new file mode 100644 index 0000000000..4b02337e2e --- /dev/null +++ b/data/2020/iclr/Order Learning and Its Application to Age Estimation @@ -0,0 +1 @@ +We propose order learning to determine the order graph of classes, representing ranks or priorities, and classify an object instance into one of the classes. To this end, we design a pairwise comparator to categorize the relationship between two instances into one of three cases: one instance is `greater than,' `similar to,' or `smaller than' the other. Then, by comparing an input instance with reference instances and maximizing the consistency among the comparison results, the class of the input can be estimated reliably. 
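The pairwise comparator and consistency-maximizing estimation of "Order Learning and Its Application to Age Estimation" above can be sketched with scalar scores standing in for the learned comparator network (the threshold `tau` and the scoring scheme are illustrative assumptions):

```python
def compare(a, b, tau=1.0):
    """Three-way ordering: 'greater', 'similar', or 'smaller',
    depending on the score difference and a similarity threshold."""
    d = a - b
    if d > tau:
        return 'greater'
    if d < -tau:
        return 'smaller'
    return 'similar'

def estimate_class(score, references, tau=1.0):
    """references: list of (reference_score, reference_class) pairs.
    Return the candidate class whose ideal comparison pattern agrees
    most often with the observed comparisons against the references,
    i.e. maximize consistency as described in the abstract."""
    candidates = sorted({c for _, c in references})
    def agreement(k):
        return sum(compare(score, s, tau) == compare(k, c, 0)
                   for s, c in references)
    return max(candidates, key=agreement)
```

Even a noisy comparator can yield a reliable class estimate here, because each reference contributes an independent vote.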
We apply order learning to develop a facial age estimator, which achieves state-of-the-art performance. Moreover, the performance is further improved when the order graph is divided into disjoint chains using gender and ethnic group information or even in an unsupervised manner. \ No newline at end of file diff --git a/data/2020/iclr/Overlearning Reveals Sensitive Attributes b/data/2020/iclr/Overlearning Reveals Sensitive Attributes new file mode 100644 index 0000000000..dbdb6e40ed --- /dev/null +++ b/data/2020/iclr/Overlearning Reveals Sensitive Attributes @@ -0,0 +1,3 @@ +"Overlearning" means that a model trained for a seemingly simple objective implicitly learns to recognize attributes and concepts that are (1) not part of the learning objective, and (2) sensitive from a privacy or bias perspective. For example, a binary gender classifier of facial images also learns to recognize races\textemdash even races that are not represented in the training data\textemdash and identities. +We demonstrate overlearning in several vision and NLP models and analyze its harmful consequences. First, inference-time representations of an overlearned model reveal sensitive attributes of the input, breaking privacy protections such as model partitioning. Second, an overlearned model can be "re-purposed" for a different, privacy-violating task even in the absence of the original training data. +We show that overlearning is intrinsic for some tasks and cannot be prevented by censoring unwanted attributes. Finally, we investigate where, when, and why overlearning happens during model training.
\ No newline at end of file diff --git a/data/2020/iclr/Physics-as-Inverse-Graphics: Unsupervised Physical Parameter Estimation from Video b/data/2020/iclr/Physics-as-Inverse-Graphics: Unsupervised Physical Parameter Estimation from Video new file mode 100644 index 0000000000..598183a292 --- /dev/null +++ b/data/2020/iclr/Physics-as-Inverse-Graphics: Unsupervised Physical Parameter Estimation from Video @@ -0,0 +1 @@ +We propose a model that is able to perform physical parameter estimation of systems from video, where the differential equations governing the scene dynamics are known, but labeled states or objects are not available. Existing physical scene understanding methods require either object state supervision, or do not integrate with differentiable physics to learn interpretable system parameters and states. We address this problem through a \textit{physics-as-inverse-graphics} approach that brings together vision-as-inverse-graphics and differentiable physics engines, where objects and explicit state and velocity representations are discovered by the model. This framework allows us to perform long term extrapolative video prediction, as well as vision-based model-predictive control. Our approach significantly outperforms related unsupervised methods in long-term future frame prediction of systems with interacting objects (such as ball-spring or 3-body gravitational systems), due to its ability to build dynamics into the model as an inductive bias. We further show the value of this tight vision-physics integration by demonstrating data-efficient learning of vision-actuated model-based control for a pendulum system. We also show that the controller's interpretability provides unique capabilities in goal-driven control and physical reasoning for zero-data adaptation. 
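The core idea of the physics-as-inverse-graphics abstract above, recovering physical parameters when the governing differential equation is known, can be illustrated on a toy spring system (this sketch fits a scalar from observed states by direct search; the paper instead learns from raw video with a differentiable physics engine):

```python
def simulate(k, x0, v0, dt, steps):
    """Semi-implicit Euler integration of a unit-mass spring x'' = -k*x,
    the kind of known governing equation the method assumes."""
    xs, x, v = [], x0, v0
    for _ in range(steps):
        v -= k * x * dt
        x += v * dt
        xs.append(x)
    return xs

def estimate_k(observed, x0, v0, dt, lo=0.0, hi=5.0, rounds=25):
    """Fit the spring constant by ternary search on the squared
    trajectory error (assumes the error is unimodal in k, which holds
    for this short horizon)."""
    def loss(k):
        sim = simulate(k, x0, v0, dt, len(observed))
        return sum((s - o) ** 2 for s, o in zip(sim, observed))
    for _ in range(rounds):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if loss(a) < loss(b):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2.0
```

Because the simulator is the model, the recovered parameter is directly interpretable, which is what enables the goal-driven control and zero-data adaptation the abstract highlights.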
\ No newline at end of file diff --git a/data/2020/iclr/Piecewise linear activations substantially shape the loss surfaces of neural networks b/data/2020/iclr/Piecewise linear activations substantially shape the loss surfaces of neural networks new file mode 100644 index 0000000000..2f0ac5887f --- /dev/null +++ b/data/2020/iclr/Piecewise linear activations substantially shape the loss surfaces of neural networks @@ -0,0 +1 @@ +Understanding the loss surface of a neural network is fundamentally important to the understanding of deep learning. This paper shows how piecewise linear activation functions substantially shape the loss surfaces of neural networks. We first prove that the loss surfaces of many neural networks have infinitely many spurious local minima, which are defined as local minima with higher empirical risk than the global minima. Our result holds for any neural network with arbitrary depth and arbitrary piecewise linear activation functions (excluding linear functions) under most practical loss functions, with some mild assumptions. This result demonstrates that networks with piecewise linear activations differ substantially from the well-studied linear neural networks. Essentially, the underlying assumptions for the above result are consistent with most practical circumstances, where the output layer is narrower than any hidden layer. In addition, the loss surface of a neural network with piecewise linear activations is partitioned into multiple smooth and multilinear cells by nondifferentiable boundaries. The constructed spurious local minima are concentrated in one cell as a valley: they are connected with each other by a continuous path, on which the empirical risk is invariant. Further, for one-hidden-layer networks, we prove that all local minima in a cell constitute an equivalence class; they are concentrated in a valley; and they are all global minima in the cell.
\ No newline at end of file diff --git a/data/2020/iclr/Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP b/data/2020/iclr/Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP new file mode 100644 index 0000000000..2c77d95e6e --- /dev/null +++ b/data/2020/iclr/Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP @@ -0,0 +1 @@ +The lottery ticket hypothesis proposes that over-parameterization of deep neural networks (DNNs) aids training by increasing the probability of a "lucky" sub-network initialization being present rather than by helping the optimization process (Frankle & Carbin, 2019). Intriguingly, this phenomenon suggests that initialization strategies for DNNs can be improved substantially, but the lottery ticket hypothesis has only previously been tested in the context of supervised learning for natural image tasks. Here, we evaluate whether "winning ticket" initializations exist in two different domains: natural language processing (NLP) and reinforcement learning (RL). For NLP, we examined both recurrent LSTM models and large-scale Transformer models (Vaswani et al., 2017). For RL, we analyzed a number of discrete-action space tasks, including both classic control and pixel control. Consistent with work in supervised image classification, we confirm that winning ticket initializations generally outperform parameter-matched random initializations, even at extreme pruning rates for both NLP and RL. Notably, we are able to find winning ticket initializations for Transformers which enable models one-third the size to achieve nearly equivalent performance. Together, these results suggest that the lottery ticket hypothesis is not restricted to supervised learning of natural images, but rather represents a broader phenomenon in DNNs.
\ No newline at end of file diff --git a/data/2020/iclr/Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring b/data/2020/iclr/Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring new file mode 100644 index 0000000000..d2f598f47b --- /dev/null +++ b/data/2020/iclr/Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring @@ -0,0 +1 @@ +The use of deep pre-trained transformers has led to remarkable progress in a number of applications (Devlin et al., 2018). For tasks that make pairwise comparisons between sequences, matching a given input with a corresponding label, two approaches are common: Cross-encoders performing full self-attention over the pair and Bi-encoders encoding the pair separately. The former often performs better, but is too slow for practical use. In this work, we develop a new transformer architecture, the Poly-encoder, that learns global rather than token level self-attention features. We perform a detailed comparison of all three approaches, including what pre-training and fine-tuning strategies work best. We show our models achieve state-of-the-art results on four tasks; that Poly-encoders are faster than Cross-encoders and more accurate than Bi-encoders; and that the best results are obtained by pre-training on large datasets similar to the downstream tasks. \ No newline at end of file diff --git a/data/2020/iclr/Population-Guided Parallel Policy Search for Reinforcement Learning b/data/2020/iclr/Population-Guided Parallel Policy Search for Reinforcement Learning new file mode 100644 index 0000000000..0fff32167f --- /dev/null +++ b/data/2020/iclr/Population-Guided Parallel Policy Search for Reinforcement Learning @@ -0,0 +1 @@ +In this paper, a new population-guided parallel learning scheme is proposed to enhance the performance of off-policy reinforcement learning (RL). 
In the proposed scheme, multiple identical learners with their own value-functions and policies share a common experience replay buffer, and collaboratively search for a good policy under the guidance of the best policy's information. The key point is that the best policy's information is fused in a soft manner by constructing an augmented loss function for the policy update, which enlarges the overall region searched by the multiple learners. The guidance by the previous best policy and the enlarged search range enable faster and better policy search. Monotone improvement of the expected cumulative return by the proposed scheme is proved theoretically. Working algorithms are constructed by applying the proposed scheme to the twin delayed deep deterministic (TD3) policy gradient algorithm. Numerical results show that the constructed algorithm outperforms most of the current state-of-the-art RL algorithms, and the gain is significant in sparse-reward environments. \ No newline at end of file diff --git a/data/2020/iclr/Pre-training Tasks for Embedding-based Large-scale Retrieval b/data/2020/iclr/Pre-training Tasks for Embedding-based Large-scale Retrieval new file mode 100644 index 0000000000..db496d3df8 --- /dev/null +++ b/data/2020/iclr/Pre-training Tasks for Embedding-based Large-scale Retrieval @@ -0,0 +1 @@ +We consider the large-scale query-document retrieval problem: given a query (e.g., a question), return the set of relevant documents (e.g., paragraphs containing the answer) from a large document corpus. This problem is often solved in two steps. The retrieval phase first reduces the solution space, returning a subset of candidate documents. The scoring phase then re-ranks the documents. Critically, the retrieval algorithm must not only achieve high recall but also be highly efficient, returning candidates in time sublinear in the number of documents.
Unlike the scoring phase, which has recently witnessed significant advances thanks to BERT-style pre-training tasks on cross-attention models, the retrieval phase remains less well studied. Most previous works rely on classic Information Retrieval (IR) methods such as BM-25 (token matching + TF-IDF weights). These models only accept sparse handcrafted features and cannot be optimized for different downstream tasks of interest. In this paper, we conduct a comprehensive study of embedding-based retrieval models. We show that the key ingredient for learning a strong embedding-based Transformer model is the set of pre-training tasks. With adequately designed paragraph-level pre-training tasks, Transformer models can remarkably improve over the widely-used BM-25 as well as embedding models without Transformers. The paragraph-level pre-training tasks we studied are Inverse Cloze Task (ICT), Body First Selection (BFS), Wiki Link Prediction (WLP), and the combination of all three. \ No newline at end of file diff --git a/data/2020/iclr/Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model b/data/2020/iclr/Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model new file mode 100644 index 0000000000..191cae57ce --- /dev/null +++ b/data/2020/iclr/Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model @@ -0,0 +1 @@ +Recent breakthroughs in pretrained language models have shown the effectiveness of self-supervised learning for a wide range of natural language processing (NLP) tasks. In addition to standard syntactic and semantic NLP tasks, pretrained models achieve strong improvements on tasks that involve real-world knowledge, suggesting that large-scale language modeling could be an implicit method for capturing knowledge. In this work, we further investigate the extent to which pretrained models such as BERT capture knowledge using a zero-shot fact completion task.
Moreover, we propose a simple yet effective weakly supervised pretraining objective, which explicitly forces the model to incorporate knowledge about real-world entities. Models trained with our new objective yield significant improvements on the fact completion task. When applied to downstream tasks, our model consistently outperforms BERT on four entity-related question answering datasets (i.e., WebQuestions, TriviaQA, SearchQA and Quasar-T) with an average improvement of 2.7 F1 points, and on a standard fine-grained entity typing dataset (i.e., FIGER) with a 5.7-point accuracy gain. \ No newline at end of file diff --git a/data/2020/iclr/Progressive Memory Banks for Incremental Domain Adaptation b/data/2020/iclr/Progressive Memory Banks for Incremental Domain Adaptation new file mode 100644 index 0000000000..b61318748a --- /dev/null +++ b/data/2020/iclr/Progressive Memory Banks for Incremental Domain Adaptation @@ -0,0 +1 @@ +This paper addresses the problem of incremental domain adaptation (IDA) in natural language processing (NLP). We assume each domain arrives one after another, and that we can only access data in the current domain. The goal of IDA is to build a unified model performing well on all the domains that we have encountered. We adopt the recurrent neural network (RNN) widely used in NLP, but augment it with a directly parameterized memory bank, which is retrieved by an attention mechanism at each step of the RNN transition. The memory bank provides a natural way of IDA: when adapting our model to a new domain, we progressively add new slots to the memory bank, which increases the number of parameters, and thus the model capacity. We learn the new memory slots and fine-tune existing parameters by back-propagation. Experimental results show that our approach achieves significantly better performance than fine-tuning alone. Compared with expanding hidden states, our approach is more robust on old domains, as shown by both empirical and theoretical results.
Our model also outperforms previous IDA methods, including elastic weight consolidation and progressive neural networks, in the experiments. \ No newline at end of file diff --git a/data/2020/iclr/ProxSGD: Training Structured Neural Networks under Regularization and Constraints b/data/2020/iclr/ProxSGD: Training Structured Neural Networks under Regularization and Constraints new file mode 100644 index 0000000000..9d02b56341 --- /dev/null +++ b/data/2020/iclr/ProxSGD: Training Structured Neural Networks under Regularization and Constraints @@ -0,0 +1 @@ +In this paper, we consider the problem of training neural networks (NN). To promote an NN with specific structures, we explicitly take into consideration nonsmooth regularization (such as the L1-norm) and constraints (such as interval constraints). This is formulated as a constrained nonsmooth nonconvex optimization problem, and we propose a convergent proximal-type stochastic gradient descent (Prox-SGD) algorithm. We show that, under properly selected learning rates, momentum eventually resembles the unknown real gradient and is thus crucial in analyzing the convergence. We establish that, with probability 1, every limit point of the sequence generated by the proposed Prox-SGD is a stationary point. Prox-SGD is then tailored to train a sparse neural network and a binary neural network, and the theoretical analysis is also supported by extensive numerical tests. \ No newline at end of file diff --git a/data/2020/iclr/Pruned Graph Scattering Transforms b/data/2020/iclr/Pruned Graph Scattering Transforms new file mode 100644 index 0000000000..670bb7f9c7 --- /dev/null +++ b/data/2020/iclr/Pruned Graph Scattering Transforms @@ -0,0 +1 @@ +Graph convolutional networks (GCNs) have achieved remarkable performance in a variety of network science learning tasks. However, theoretical analysis of such approaches is still in its infancy.
Graph scattering transforms (GSTs) are non-trainable deep GCN models that are amenable to generalization and stability analyses. The present work addresses some limitations of GSTs by introducing a novel so-termed pruned (p)GST approach. The resultant pruning algorithm is guided by a graph-spectrum-inspired criterion, and retains informative scattering features on-the-fly while bypassing the exponential complexity associated with GSTs. It is further established that pGSTs are stable to perturbations of the input graph signals with bounded energy. Experiments showcase that i) pGST performs comparably to the baseline GST that uses all scattering features, while achieving significant computational savings; ii) pGST achieves comparable performance to state-of-the-art GCNs; and iii) Graph data from various domains lead to different scattering patterns, suggesting domain-adaptive pGST network architectures. \ No newline at end of file diff --git a/data/2020/iclr/Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving b/data/2020/iclr/Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving new file mode 100644 index 0000000000..a8049c8f75 --- /dev/null +++ b/data/2020/iclr/Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving @@ -0,0 +1 @@ +Detecting objects such as cars and pedestrians in 3D plays an indispensable role in autonomous driving. Existing approaches largely rely on expensive LiDAR sensors for accurate depth information. While recently pseudo-LiDAR has been introduced as a promising alternative, at a much lower cost based solely on stereo images, there is still a notable performance gap. In this paper we provide substantial advances to the pseudo-LiDAR framework through improvements in stereo depth estimation. 
Concretely, we adapt the stereo network architecture and loss function to be more aligned with accurate depth estimation of faraway objects --- currently the primary weakness of pseudo-LiDAR. Further, we explore the idea of leveraging cheaper but extremely sparse LiDAR sensors, which alone provide insufficient information for 3D detection, to de-bias our depth estimation. We propose a depth-propagation algorithm, guided by the initial depth estimates, to diffuse these few exact measurements across the entire depth map. We show on the KITTI object detection benchmark that our combined approach yields substantial improvements in depth estimation and stereo-based 3D object detection --- outperforming the previous state-of-the-art detection accuracy for faraway objects by 40%. Our code is available at this https URL. \ No newline at end of file diff --git a/data/2020/iclr/Pure and Spurious Critical Points: a Geometric Study of Linear Networks b/data/2020/iclr/Pure and Spurious Critical Points: a Geometric Study of Linear Networks new file mode 100644 index 0000000000..9e885c73c4 --- /dev/null +++ b/data/2020/iclr/Pure and Spurious Critical Points: a Geometric Study of Linear Networks @@ -0,0 +1 @@ +The critical locus of the loss function of a neural network is determined by the geometry of the functional space and by the parameterization of this space by the network's weights. We introduce a natural distinction between pure critical points, which only depend on the functional space, and spurious critical points, which arise from the parameterization. We apply this perspective to revisit and extend the literature on the loss function of linear neural networks. For this type of network, the functional space is either the set of all linear maps from input to output space, or a determinantal variety, i.e., a set of linear maps with bounded rank.
We use geometric properties of determinantal varieties to derive new results on the landscape of linear networks with different loss functions and different parameterizations. Our analysis clearly illustrates that the absence of "bad" local minima in the loss landscape of linear networks is due to two distinct phenomena that apply in different settings: it is true for arbitrary smooth convex losses in the case of architectures that can express all linear maps ("filling architectures") but it holds only for the quadratic loss when the functional space is a determinantal variety ("non-filling architectures"). Without any assumption on the architecture, smooth convex losses may lead to landscapes with many bad minima. \ No newline at end of file diff --git a/data/2020/iclr/Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP b/data/2020/iclr/Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP new file mode 100644 index 0000000000..df834aa6d8 --- /dev/null +++ b/data/2020/iclr/Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP @@ -0,0 +1 @@ +A fundamental question in reinforcement learning is whether model-free algorithms are sample efficient. Recently, Jin et al. \cite{jin2018q} proposed a Q-learning algorithm with UCB exploration policy, and proved it has nearly optimal regret bound for finite-horizon episodic MDP. In this paper, we adapt Q-learning with UCB-exploration bonus to infinite-horizon MDP with discounted rewards \emph{without} accessing a generative model. We show that the \textit{sample complexity of exploration} of our algorithm is bounded by $\tilde{O}({\frac{SA}{\epsilon^2(1-\gamma)^7}})$. This improves the previously best known result of $\tilde{O}({\frac{SA}{\epsilon^4(1-\gamma)^8}})$ in this setting achieved by delayed Q-learning \cite{strehl2006pac}, and matches the lower bound in terms of $\epsilon$ as well as $S$ and $A$ except for logarithmic factors. 
\ No newline at end of file diff --git a/data/2020/iclr/Quantifying the Cost of Reliable Photo Authentication via High-Performance Learned Lossy Representations b/data/2020/iclr/Quantifying the Cost of Reliable Photo Authentication via High-Performance Learned Lossy Representations new file mode 100644 index 0000000000..15830209b9 --- /dev/null +++ b/data/2020/iclr/Quantifying the Cost of Reliable Photo Authentication via High-Performance Learned Lossy Representations @@ -0,0 +1 @@ +Detection of photo manipulation relies on subtle statistical traces, notoriously removed by aggressive lossy compression employed online. We demonstrate that end-to-end modeling of complex photo dissemination channels allows for codec optimization with explicit provenance objectives. We design a lightweight trainable lossy image codec that delivers competitive rate-distortion performance, on par with the best hand-engineered alternatives, but with a lower computational footprint on modern GPU-enabled platforms. Our results show that significant improvements in manipulation detection accuracy are possible at fractional costs in bandwidth/storage. Our codec improved the accuracy from 37% to 86% even at very low bit-rates, well below the practicality of JPEG (QF 20). \ No newline at end of file diff --git a/data/2020/iclr/RTFM: Generalising to New Environment Dynamics via Reading b/data/2020/iclr/RTFM: Generalising to New Environment Dynamics via Reading new file mode 100644 index 0000000000..611b089d75 --- /dev/null +++ b/data/2020/iclr/RTFM: Generalising to New Environment Dynamics via Reading @@ -0,0 +1 @@ +Obtaining policies that can generalise to new environments in reinforcement learning is challenging. In this work, we demonstrate that language understanding via a reading policy learner is a promising vehicle for generalisation to new environments.
We propose a grounded policy learning problem, Read to Fight Monsters (RTFM), in which the agent must jointly reason over a language goal, relevant dynamics described in a document, and environment observations. We procedurally generate environment dynamics and corresponding language descriptions of the dynamics, such that agents must read to understand new environment dynamics instead of memorising any particular information. In addition, we propose txt2π, a model that captures three-way interactions between the goal, document, and observations. On RTFM, txt2π generalises to new environments with dynamics not seen during training via reading. Furthermore, our model outperforms baselines such as FiLM and language-conditioned CNNs on RTFM. Through curriculum learning, txt2π produces policies that excel on complex RTFM tasks requiring several reasoning and coreference steps. \ No newline at end of file diff --git a/data/2020/iclr/RaCT: Toward Amortized Ranking-Critical Training For Collaborative Filtering b/data/2020/iclr/RaCT: Toward Amortized Ranking-Critical Training For Collaborative Filtering new file mode 100644 index 0000000000..92f4cf6976 --- /dev/null +++ b/data/2020/iclr/RaCT: Toward Amortized Ranking-Critical Training For Collaborative Filtering @@ -0,0 +1 @@ +We investigate new methods for training collaborative filtering models based on actor-critic reinforcement learning, to more directly maximize ranking-based objective functions. Specifically, we train a critic network to approximate ranking-based metrics, and then update the actor network to directly optimize against the learned metrics. In contrast to traditional learning-to-rank methods that require re-running the optimization procedure for new lists, our critic-based method amortizes the scoring process with a neural network, and can directly provide the (approximate) ranking scores for new lists. 
We demonstrate the actor-critic's ability to significantly improve the performance of a variety of prediction models, and to achieve better or comparable performance to the state-of-the-art on three large-scale datasets. \ No newline at end of file diff --git a/data/2020/iclr/Ranking Policy Gradient b/data/2020/iclr/Ranking Policy Gradient new file mode 100644 index 0000000000..3d9ff08368 --- /dev/null +++ b/data/2020/iclr/Ranking Policy Gradient @@ -0,0 +1 @@ +Sample inefficiency is a long-standing problem in reinforcement learning (RL). The state-of-the-art estimates the optimal action values, but this usually involves an extensive search over the state-action space and unstable optimization. Towards sample-efficient RL, we propose ranking policy gradient (RPG), a policy gradient method that learns the optimal rank of a set of discrete actions. To accelerate the learning of policy gradient methods, we establish the equivalence between maximizing the lower bound of return and imitating a near-optimal policy without accessing any oracles. These results lead to a general off-policy learning framework, which preserves optimality, reduces variance, and improves sample efficiency. Furthermore, the sample complexity of RPG does not depend on the dimension of the state space, which enables RPG for large-scale problems. We conduct extensive experiments showing that, when combined with the off-policy learning framework, RPG substantially reduces the sample complexity compared to the state-of-the-art. \ No newline at end of file diff --git a/data/2020/iclr/Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML b/data/2020/iclr/Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML new file mode 100644 index 0000000000..a81e55aa07 --- /dev/null +++ b/data/2020/iclr/Rapid Learning or Feature Reuse?
Towards Understanding the Effectiveness of MAML @@ -0,0 +1 @@ +An important research direction in machine learning has centered around developing meta-learning algorithms to tackle few-shot learning. An especially successful algorithm has been Model Agnostic Meta-Learning (MAML), a method that consists of two optimization loops, with the outer loop finding a meta-initialization, from which the inner loop can efficiently learn new tasks. Despite MAML's popularity, a fundamental open question remains -- is the effectiveness of MAML due to the meta-initialization being primed for rapid learning (large, efficient changes in the representations) or due to feature reuse, with the meta initialization already containing high quality features? We investigate this question, via ablation studies and analysis of the latent representations, finding that feature reuse is the dominant factor. This leads to the ANIL (Almost No Inner Loop) algorithm, a simplification of MAML where we remove the inner loop for all but the (task-specific) head of a MAML-trained network. ANIL matches MAML's performance on benchmark few-shot image classification and RL and offers computational improvements over MAML. We further study the precise contributions of the head and body of the network, showing that performance on the test tasks is entirely determined by the quality of the learned features, and we can remove even the head of the network (the NIL algorithm). We conclude with a discussion of the rapid learning vs feature reuse question for meta-learning algorithms more broadly. 
\ No newline at end of file diff --git a/data/2020/iclr/ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning b/data/2020/iclr/ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning new file mode 100644 index 0000000000..6189b8d06a --- /dev/null +++ b/data/2020/iclr/ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning @@ -0,0 +1 @@ +Recent powerful pre-trained language models have achieved remarkable performance on most of the popular datasets for reading comprehension. It is time to introduce more challenging datasets to push the development of this field towards more comprehensive reasoning over text. In this paper, we introduce a new Reading Comprehension dataset requiring logical reasoning (ReClor) extracted from standardized graduate admission examinations. As earlier studies suggest, human-annotated datasets usually contain biases, which are often exploited by models to achieve high accuracy without truly understanding the text. In order to comprehensively evaluate the logical reasoning ability of models on ReClor, we propose to identify biased data points and separate them into an EASY set, with the rest forming a HARD set. Empirical results show that state-of-the-art models have an outstanding ability to capture the biases contained in the dataset, achieving high accuracy on the EASY set. However, they struggle on the HARD set, with poor performance near that of random guessing, indicating that more research is needed to truly enhance the logical reasoning ability of current models.
\ No newline at end of file diff --git a/data/2020/iclr/ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring b/data/2020/iclr/ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring new file mode 100644 index 0000000000..0d68f6b002 --- /dev/null +++ b/data/2020/iclr/ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring @@ -0,0 +1 @@ +We improve the recently-proposed "MixMatch" semi-supervised learning algorithm by introducing two new techniques: distribution alignment and augmentation anchoring. Distribution alignment encourages the marginal distribution of predictions on unlabeled data to be close to the marginal distribution of ground-truth labels. Augmentation anchoring feeds multiple strongly augmented versions of an input into the model and encourages each output to be close to the prediction for a weakly-augmented version of the same input. To produce strong augmentations, we propose a variant of AutoAugment which learns the augmentation policy while the model is being trained. Our new algorithm, dubbed ReMixMatch, is significantly more data-efficient than prior work, requiring between 5 times and 16 times less data to reach the same accuracy. For example, on CIFAR-10 with 250 labeled examples we reach 93.73% accuracy (compared to MixMatch's accuracy of 93.58% with 4000 examples) and a median accuracy of 84.92% with just four labels per class.
\ No newline at end of file diff --git a/data/2020/iclr/Reanalysis of Variance Reduced Temporal Difference Learning b/data/2020/iclr/Reanalysis of Variance Reduced Temporal Difference Learning new file mode 100644 index 0000000000..c010eee833 --- /dev/null +++ b/data/2020/iclr/Reanalysis of Variance Reduced Temporal Difference Learning @@ -0,0 +1 @@ +Temporal difference (TD) learning is a popular algorithm for policy evaluation in reinforcement learning, but the vanilla TD can substantially suffer from the inherent optimization variance. A variance reduced TD (VRTD) algorithm was proposed by Korda and La (2015), which applies the variance reduction technique directly to the online TD learning with Markovian samples. In this work, we first point out the technical errors in the analysis of VRTD in Korda and La (2015), and then provide a mathematically solid analysis of the non-asymptotic convergence of VRTD and its variance reduction performance. We show that VRTD is guaranteed to converge to a neighborhood of the fixed-point solution of TD at a linear convergence rate. Furthermore, the variance error (for both i.i.d. and Markovian sampling) and the bias error (for Markovian sampling) of VRTD are significantly reduced by the batch size of variance reduction in comparison to those of vanilla TD. \ No newline at end of file diff --git a/data/2020/iclr/Recurrent neural circuits for contour detection b/data/2020/iclr/Recurrent neural circuits for contour detection new file mode 100644 index 0000000000..57c4024970 --- /dev/null +++ b/data/2020/iclr/Recurrent neural circuits for contour detection @@ -0,0 +1 @@ +We introduce a deep recurrent neural network architecture that approximates visual cortical circuits (Mely et al., 2018). 
We show that this architecture, which we refer to as the 𝜸-net, learns to solve contour detection tasks with better sample efficiency than state-of-the-art feedforward networks, while also exhibiting a classic perceptual illusion, known as the orientation-tilt illusion. Correcting this illusion significantly reduces the 𝜸-net's contour detection accuracy by driving it to prefer low-level edges over high-level object boundary contours. Overall, our study suggests that the orientation-tilt illusion is a byproduct of neural circuits that help biological visual systems achieve robust and efficient contour detection, and that incorporating these circuits in artificial neural networks can improve computer vision. \ No newline at end of file diff --git a/data/2020/iclr/Reinforced active learning for image segmentation b/data/2020/iclr/Reinforced active learning for image segmentation new file mode 100644 index 0000000000..9b2676be8f --- /dev/null +++ b/data/2020/iclr/Reinforced active learning for image segmentation @@ -0,0 +1 @@ +Learning-based approaches for semantic segmentation have two inherent challenges. First, acquiring pixel-wise labels is expensive and time-consuming. Second, realistic segmentation datasets are highly unbalanced: some categories are much more abundant than others, biasing the performance to the most represented ones. In this paper, we are interested in focusing human labelling effort on a small subset of a larger pool of data, minimizing this effort while maximizing the performance of a segmentation model on a hold-out set. We present a new active learning strategy for semantic segmentation based on deep reinforcement learning (RL). An agent learns a policy to select a subset of small informative image regions -- as opposed to entire images -- to be labeled, from a pool of unlabeled data. The region selection decision is made based on the predictions and uncertainties of the segmentation model being trained.
We propose a new modification of the deep Q-network (DQN) formulation for active learning, adapting it to the large-scale nature of semantic segmentation problems. We test the proof of concept on CamVid and provide results on the large-scale dataset Cityscapes. On Cityscapes, our deep RL region-based DQN approach requires roughly 30% less additional labeled data than our most competitive baseline to reach the same performance. Moreover, we find that our method asks for more labels of under-represented categories compared to the baselines, improving their performance and helping to mitigate class imbalance. \ No newline at end of file diff --git a/data/2020/iclr/Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation b/data/2020/iclr/Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation new file mode 100644 index 0000000000..e4c9863536 --- /dev/null +++ b/data/2020/iclr/Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation @@ -0,0 +1 @@ +Natural question generation (QG) aims to generate questions from a passage and an answer. Previous works on QG either (i) ignore the rich structure information hidden in text, (ii) solely rely on cross-entropy loss, which leads to issues like exposure bias and inconsistency between train/test measurement, or (iii) fail to fully exploit the answer information. To address these limitations, in this paper, we propose a reinforcement learning (RL) based graph-to-sequence (Graph2Seq) model for QG. Our model consists of a Graph2Seq generator with a novel Bidirectional Gated Graph Neural Network based encoder to embed the passage, and a hybrid evaluator with a mixed objective combining both cross-entropy and RL losses to ensure the generation of syntactically and semantically valid text. We also introduce an effective Deep Alignment Network for incorporating the answer information into the passage at both the word and contextual levels.
Our model is end-to-end trainable and achieves new state-of-the-art scores, outperforming existing methods by a significant margin on the standard SQuAD benchmark. \ No newline at end of file diff --git a/data/2020/iclr/Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives b/data/2020/iclr/Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives new file mode 100644 index 0000000000..e6c407ee8a --- /dev/null +++ b/data/2020/iclr/Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives @@ -0,0 +1 @@ +Reinforcement learning agents that operate in diverse and complex environments can benefit from the structured decomposition of their behavior. Often, this is addressed in the context of hierarchical reinforcement learning, where the aim is to decompose a policy into lower-level primitives or options, and a higher-level meta-policy that triggers the appropriate behaviors for a given situation. However, the meta-policy must still produce appropriate decisions in all states. In this work, we propose a policy design that decomposes into primitives, similarly to hierarchical reinforcement learning, but without a high-level meta-policy. Instead, each primitive decides for itself whether it wishes to act in the current state. We use an information-theoretic mechanism for enabling this decentralized decision: each primitive chooses how much information it needs about the current state to make a decision, and the primitive that requests the most information about the current state acts in the world. The primitives are regularized to use as little information as possible, which leads to natural competition and specialization. We experimentally demonstrate that this policy architecture improves over both flat and hierarchical policies in terms of generalization.
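The decentralized competition described in this abstract can be sketched in a few lines of Python. This is a toy sketch: the information costs and action labels below are hypothetical stand-ins for each primitive's state-encoder KL term and policy output.

```python
import numpy as np

def act_and_penalty(info_costs, actions, reg_coef):
    """Toy sketch of competitive, information-constrained primitives.

    Each primitive reports an information cost (in the paper, the KL term
    of its state encoder; the numbers here are hypothetical) together with
    its proposed action. The primitive requesting the most information
    about the state wins and acts; the summed costs enter the loss as a
    regularizer, so primitives learn to request information only in the
    states they specialize in.
    """
    winner = int(np.argmax(info_costs))
    penalty = reg_coef * float(np.sum(info_costs))
    return actions[winner], penalty

# Three primitives; the second is specialized to the current state.
action, penalty = act_and_penalty([0.1, 2.3, 0.4], ["left", "jump", "wait"], 0.01)
assert action == "jump"
```

The selection itself is not differentiable here; the paper's regularized training objective is what shapes which primitive wins where.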
\ No newline at end of file diff --git a/data/2020/iclr/Relational State-Space Model for Stochastic Multi-Object Systems b/data/2020/iclr/Relational State-Space Model for Stochastic Multi-Object Systems new file mode 100644 index 0000000000..df077199b6 --- /dev/null +++ b/data/2020/iclr/Relational State-Space Model for Stochastic Multi-Object Systems @@ -0,0 +1 @@ +Real-world dynamical systems often consist of multiple stochastic subsystems that interact with each other. Modeling and forecasting the behavior of such dynamics are generally not easy, due to the inherent hardness in understanding the complicated interactions and evolutions of their constituents. This paper introduces the relational state-space model (R-SSM), a sequential hierarchical latent variable model that makes use of graph neural networks (GNNs) to simulate the joint state transitions of multiple correlated objects. By letting GNNs cooperate with SSM, R-SSM provides a flexible way to incorporate relational information into the modeling of multi-object dynamics. We further suggest augmenting the model with normalizing flows instantiated for vertex-indexed random variables and propose two auxiliary contrastive objectives to facilitate the learning. The utility of R-SSM is empirically evaluated on synthetic and real time series datasets. \ No newline at end of file diff --git a/data/2020/iclr/Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness b/data/2020/iclr/Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness new file mode 100644 index 0000000000..bb16857547 --- /dev/null +++ b/data/2020/iclr/Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness @@ -0,0 +1 @@ +Pang et al. [1] presented the Max-Mahalanobis center (MMC) loss and argued that the MMC loss is more adversarially robust than the SCE loss. The authors argue that the SCE loss conveys inappropriate supervisory signals to the model, leading to sparse sample density in the feature space.
In this reproducibility challenge, we verify the claims that training with the MMC loss produces adversarially robust models while also achieving accuracy comparable to models trained with the SCE loss. \ No newline at end of file diff --git a/data/2020/iclr/Robust And Interpretable Blind Image Denoising Via Bias-Free Convolutional Neural Networks b/data/2020/iclr/Robust And Interpretable Blind Image Denoising Via Bias-Free Convolutional Neural Networks new file mode 100644 index 0000000000..3600d783bb --- /dev/null +++ b/data/2020/iclr/Robust And Interpretable Blind Image Denoising Via Bias-Free Convolutional Neural Networks @@ -0,0 +1 @@ +Deep convolutional networks often append additive constant ("bias") terms to their convolution operations, enabling a richer repertoire of functional mappings. Biases are also used to facilitate training, by subtracting mean response over batches of training images (a component of "batch normalization"). Recent state-of-the-art blind denoising methods (e.g., DnCNN) seem to require these terms for their success. Here, however, we show that these networks systematically overfit the noise levels for which they are trained: when deployed at noise levels outside the training range, performance degrades dramatically. In contrast, a bias-free architecture -- obtained by removing the constant terms in every layer of the network, including those used for batch normalization -- generalizes robustly across noise levels, while preserving state-of-the-art performance within the training range. Locally, the bias-free network acts linearly on the noisy image, enabling direct analysis of network behavior via standard linear-algebraic tools. These analyses provide interpretations of network functionality in terms of nonlinear adaptive filtering, and projection onto a union of low-dimensional subspaces, connecting the learning-based method to more traditional denoising methodology.
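The homogeneity argument behind the bias-free architecture can be illustrated with a toy 1-D convolution. This is a sketch under simplifying assumptions, not the paper's DnCNN-style network: a purely linear, bias-free operator scales with its input, so behavior learned at one noise level transfers to others, while an additive bias breaks this property.

```python
import numpy as np

def conv1d(x, w, b=0.0):
    # "valid" 1-D convolution with an optional additive bias term
    return np.convolve(x, w, mode="valid") + b

rng = np.random.default_rng(0)
x = rng.normal(size=32)   # toy "noisy signal"
w = rng.normal(size=5)    # toy filter

# A bias-free convolution is homogeneous: scaling the input by alpha
# scales the output by alpha.
alpha = 3.0
bias_free = conv1d(alpha * x, w)
assert np.allclose(bias_free, alpha * conv1d(x, w))

# With a nonzero bias, homogeneity fails: the bias does not rescale
# with the input, which is the paper's intuition for why biased
# denoisers overfit their training noise range.
biased = conv1d(alpha * x, w, b=0.5)
assert not np.allclose(biased, alpha * conv1d(x, w, b=0.5))
```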
\ No newline at end of file diff --git a/data/2020/iclr/Robust Local Features for Improving the Generalization of Adversarial Training b/data/2020/iclr/Robust Local Features for Improving the Generalization of Adversarial Training new file mode 100644 index 0000000000..904befe9f0 --- /dev/null +++ b/data/2020/iclr/Robust Local Features for Improving the Generalization of Adversarial Training @@ -0,0 +1 @@ +Adversarial training has been demonstrated as one of the most effective methods for training robust models to defend against adversarial examples. However, adversarially trained models often lack adversarially robust generalization on unseen testing data. Recent works show that adversarially trained models are more biased towards global structure features. Instead, in this work, we would like to investigate the relationship between the generalization of adversarial training and the robust local features, as the robust local features generalize well for unseen shape variation. To learn the robust local features, we develop a Random Block Shuffle (RBS) transformation to break up the global structure features on normal adversarial examples. We further propose a new approach called Robust Local Features for Adversarial Training (RLFAT), which first learns the robust local features by adversarial training on the RBS-transformed adversarial examples, and then transfers the robust local features into the training of normal adversarial examples. To demonstrate the generality of our argument, we implement RLFAT in current state-of-the-art adversarial training frameworks. Extensive experiments on STL-10, CIFAR-10 and CIFAR-100 show that RLFAT significantly improves both the adversarially robust generalization and the standard generalization of adversarial training. Additionally, we demonstrate that our models capture more local features of the objects in the images, aligning better with human perception.
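The Random Block Shuffle (RBS) transformation lends itself to a short sketch (assuming square images whose side is divisible by the grid size; a toy illustration, not the paper's training pipeline): split the image into a grid of blocks and randomly permute them, destroying global structure while leaving local statistics intact.

```python
import numpy as np

def random_block_shuffle(img, k, rng):
    """Split an image into a k x k grid of blocks and randomly permute them.

    Toy sketch of the RBS idea. Assumes H and W are divisible by k.
    """
    h, w = img.shape[:2]
    bh, bw = h // k, w // k
    blocks = [img[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
              for i in range(k) for j in range(k)]
    perm = rng.permutation(len(blocks))
    out = np.empty_like(img)
    for idx, p in enumerate(perm):
        i, j = divmod(idx, k)
        out[i*bh:(i+1)*bh, j*bw:(j+1)*bw] = blocks[p]
    return out

rng = np.random.default_rng(0)
img = np.arange(64, dtype=float).reshape(8, 8)
shuffled = random_block_shuffle(img, k=2, rng=rng)
# Local content is preserved: the multiset of pixel values is unchanged.
assert np.allclose(np.sort(shuffled.ravel()), np.sort(img.ravel()))
```

In RLFAT this transformation is applied to adversarial examples before the first adversarial-training stage, so the model can only rely on local features.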
\ No newline at end of file diff --git a/data/2020/iclr/Robust training with ensemble consensus b/data/2020/iclr/Robust training with ensemble consensus new file mode 100644 index 0000000000..1b8f9797ef --- /dev/null +++ b/data/2020/iclr/Robust training with ensemble consensus @@ -0,0 +1 @@ +Since deep neural networks are over-parametrized, they may memorize noisy examples. We address this memorization issue in the presence of annotation noise. From the fact that deep neural networks cannot generalize neighborhoods of the features acquired via memorization, we find that noisy examples do not consistently incur small losses on the network in the presence of perturbation. Based on this, we propose a novel training method called Learning with Ensemble Consensus (LEC) whose goal is to prevent overfitting noisy examples by eliminating those identified via the consensus of an ensemble of perturbed networks. One of the proposed LECs, LTEC, outperforms the current state-of-the-art methods on MNIST, CIFAR-10, and CIFAR-100 while remaining memory-efficient. \ No newline at end of file diff --git a/data/2020/iclr/SAdam: A Variant of Adam for Strongly Convex Functions b/data/2020/iclr/SAdam: A Variant of Adam for Strongly Convex Functions new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2020/iclr/SELF: Learning to Filter Noisy Labels with Self-Ensembling b/data/2020/iclr/SELF: Learning to Filter Noisy Labels with Self-Ensembling new file mode 100644 index 0000000000..f8301b0726 --- /dev/null +++ b/data/2020/iclr/SELF: Learning to Filter Noisy Labels with Self-Ensembling @@ -0,0 +1 @@ +Deep neural networks (DNNs) have been shown to over-fit a dataset when trained with noisy labels for long enough. To overcome this problem, we present a simple and effective method, self-ensemble label filtering (SELF), to progressively filter out the wrong labels during training.
Our method improves the task performance by gradually allowing supervision only from the potentially non-noisy (clean) labels and stops learning on the filtered noisy labels. For the filtering, we form running averages of predictions over the entire training dataset using the network output at different training epochs. We show that these ensemble estimates yield more accurate identification of inconsistent predictions throughout training than the single estimates of the network at the most recent training epoch. While filtered samples are removed entirely from the supervised training loss, we dynamically leverage them via semi-supervised learning in the unsupervised loss. We demonstrate the positive effect of such an approach on various image classification tasks under both symmetric and asymmetric label noise and at different noise ratios. It substantially outperforms all previous works on noise-aware learning across different datasets and can be applied to a broad set of network architectures. \ No newline at end of file diff --git a/data/2020/iclr/SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards b/data/2020/iclr/SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards new file mode 100644 index 0000000000..ae6c997af1 --- /dev/null +++ b/data/2020/iclr/SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards @@ -0,0 +1 @@ +Learning to imitate expert behavior from demonstrations can be challenging, especially in environments with high-dimensional, continuous observations and unknown dynamics. Supervised learning methods based on behavioral cloning (BC) suffer from distribution shift: because the agent greedily imitates demonstrated actions, it can drift away from demonstrated states due to error accumulation. 
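SELF's epoch-wise ensembling step might look roughly like this. This is a toy sketch with a hypothetical exponential moving average and made-up predictions; the paper's exact averaging scheme may differ, but the idea is the same: average predictions across epochs and keep only examples whose averaged prediction agrees with the given label.

```python
import numpy as np

def filter_noisy(pred_history, labels, momentum=0.9):
    """Toy sketch of SELF-style label filtering.

    `pred_history` holds the network's class probabilities for every
    training example at successive epochs. A running (exponential moving)
    average over epochs is formed, and only examples whose averaged
    prediction agrees with the given label are kept for the supervised
    loss; the rest are treated as potentially noisy.
    """
    avg = pred_history[0]
    for preds in pred_history[1:]:
        avg = momentum * avg + (1 - momentum) * preds
    return np.argmax(avg, axis=1) == labels

# Two examples, two classes, three epochs of (hypothetical) predictions.
history = [np.array([[0.9, 0.1], [0.4, 0.6]]),
           np.array([[0.8, 0.2], [0.3, 0.7]]),
           np.array([[0.9, 0.1], [0.4, 0.6]])]
labels = np.array([0, 0])   # the ensemble consistently disputes the second label
keep = filter_noisy(history, labels)
assert keep.tolist() == [True, False]
```

In the full method the filtered-out examples are not discarded; they re-enter training through the unsupervised (semi-supervised) loss.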
Recent methods based on reinforcement learning (RL), such as inverse RL and generative adversarial imitation learning (GAIL), overcome this issue by training an RL agent to match the demonstrations over a long horizon. Since the true reward function for the task is unknown, these methods learn a reward function from the demonstrations, often using complex and brittle approximation techniques that involve adversarial training. We propose a simple alternative that still uses RL, but does not require learning a reward function. The key idea is to provide the agent with an incentive to match the demonstrations over a long horizon, by encouraging it to return to demonstrated states upon encountering new, out-of-distribution states. We accomplish this by giving the agent a constant reward of r=+1 for matching the demonstrated action in a demonstrated state, and a constant reward of r=0 for all other behavior. Our method, which we call soft Q imitation learning (SQIL), can be implemented with a handful of minor modifications to any standard Q-learning or off-policy actor-critic algorithm. Theoretically, we show that SQIL can be interpreted as a regularized variant of BC that uses a sparsity prior to encourage long-horizon imitation. Empirically, we show that SQIL outperforms BC and achieves competitive results compared to GAIL, on a variety of image-based and low-dimensional tasks in Box2D, Atari, and MuJoCo. \ No newline at end of file diff --git a/data/2020/iclr/Sampling-Free Learning of Bayesian Quantized Neural Networks b/data/2020/iclr/Sampling-Free Learning of Bayesian Quantized Neural Networks new file mode 100644 index 0000000000..ec8a5682e4 --- /dev/null +++ b/data/2020/iclr/Sampling-Free Learning of Bayesian Quantized Neural Networks @@ -0,0 +1 @@ +Bayesian learning of model parameters in neural networks is important in scenarios where estimates with well-calibrated uncertainty are important. 
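SQIL's constant-reward relabeling is concrete enough to sketch with tabular Q-learning. The chain MDP and demonstrations below are hypothetical toys; the paper applies the same relabeling inside deep Q-learning or off-policy actor-critic.

```python
import random
import numpy as np

def sqil_reward(transition, demo_set):
    """SQIL's reward relabeling: r = +1 for a demonstrated (state, action)
    pair and r = 0 for everything else; the true task reward is never used."""
    s, a = transition
    return 1.0 if (s, a) in demo_set else 0.0

# Toy 4-state chain (states 0..3). A hypothetical expert always moves right,
# so the demonstrations are (s, +1) for s = 0, 1, 2. Any standard Q-learning
# loop works; only the reward is replaced.
demos = {(s, +1) for s in range(3)}
Q = np.zeros((4, 2))                      # action index 0 -> step -1, 1 -> step +1
rng = random.Random(0)
for _ in range(2000):
    s = rng.randrange(3)
    a_idx = rng.randrange(2)
    a = -1 if a_idx == 0 else +1
    s_next = min(max(s + a, 0), 3)
    r = sqil_reward((s, a), demos)        # relabeled reward, not the env's
    Q[s, a_idx] += 0.5 * (r + 0.5 * Q[s_next].max() - Q[s, a_idx])

# The greedy policy imitates the expert: move right in every visited state.
assert all(Q[s, 1] > Q[s, 0] for s in range(3))
```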
In this paper, we propose Bayesian quantized networks (BQNs), quantized neural networks (QNNs) for which we learn a posterior distribution over their discrete parameters. We provide a set of efficient algorithms for learning and prediction in BQNs without the need to sample from their parameters or activations, which not only allows for differentiable learning in QNNs, but also reduces the variance in gradients. We evaluate BQNs on the MNIST, Fashion-MNIST, KMNIST and CIFAR10 image classification datasets, compared against a bootstrap ensemble of QNNs (E-QNN). We demonstrate that BQNs achieve both lower predictive errors and better-calibrated uncertainties than E-QNN (with less than 20% of the negative log-likelihood). \ No newline at end of file diff --git a/data/2020/iclr/Scalable Model Compression by Entropy Penalized Reparameterization b/data/2020/iclr/Scalable Model Compression by Entropy Penalized Reparameterization new file mode 100644 index 0000000000..e0c0c0c38f --- /dev/null +++ b/data/2020/iclr/Scalable Model Compression by Entropy Penalized Reparameterization @@ -0,0 +1 @@ +We describe a simple and general neural network weight compression approach, in which the network parameters (weights and biases) are represented in a "latent" space, amounting to a reparameterization. This space is equipped with a learned probability model, which is used to impose an entropy penalty on the parameter representation during training, and to compress the representation using a simple arithmetic coder after training. Classification accuracy and model compressibility are maximized jointly, with the bitrate--accuracy trade-off specified by a hyperparameter. We evaluate the method on the MNIST, CIFAR-10 and ImageNet classification benchmarks using six distinct model architectures. Our results show that state-of-the-art model compression can be achieved in a scalable and general way without requiring complex procedures such as multi-stage training.
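The joint rate-accuracy objective of the entropy-penalized approach can be sketched as follows. This is a toy: the symbol probabilities, latent weights, and loss values are hypothetical, and the real method learns the probability model end-to-end and compresses with an arithmetic coder after training.

```python
import numpy as np

def rate_bits(w_latent, probs):
    """Code length (in bits) of quantized latent weights under a pmf.

    `probs` plays the role of the learned probability model that an
    arithmetic coder would use after training.
    """
    return float(-np.sum(np.log2(probs[w_latent])))

def penalized_loss(task_loss, w_latent, probs, lam):
    # Joint objective: accuracy term plus a rate (entropy) penalty,
    # with `lam` setting the bitrate--accuracy trade-off.
    return task_loss + lam * rate_bits(w_latent, probs)

# Hypothetical example: 5 quantized weights over a 4-symbol alphabet.
probs = np.array([0.7, 0.1, 0.1, 0.1])    # learned pmf over symbols
w_latent = np.array([0, 0, 1, 0, 3])      # quantized latent weights
rate = rate_bits(w_latent, probs)         # about 8.19 bits for these weights
loss = penalized_loss(1.25, w_latent, probs, lam=0.01)
```

Frequent symbols (here symbol 0) are cheap to code, so the penalty pushes the latent representation toward a low-entropy, highly compressible distribution.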
\ No newline at end of file diff --git a/data/2020/iclr/Scalable Neural Methods for Reasoning With a Symbolic Knowledge Base b/data/2020/iclr/Scalable Neural Methods for Reasoning With a Symbolic Knowledge Base new file mode 100644 index 0000000000..e8079dc9f5 --- /dev/null +++ b/data/2020/iclr/Scalable Neural Methods for Reasoning With a Symbolic Knowledge Base @@ -0,0 +1 @@ +We describe a novel way of representing a symbolic knowledge base (KB) called a sparse-matrix reified KB. This representation enables neural modules that are fully differentiable, faithful to the original semantics of the KB, expressive enough to model multi-hop inferences, and scalable enough to use with realistically large KBs. The sparse-matrix reified KB can be distributed across multiple GPUs, can scale to tens of millions of entities and facts, and is orders of magnitude faster than naive sparse-matrix implementations. The reified KB enables very simple end-to-end architectures to obtain competitive performance on several benchmarks representing two families of tasks: KB completion, and learning semantic parsers from denotations. \ No newline at end of file diff --git a/data/2020/iclr/Scalable and Order-robust Continual Learning with Additive Parameter Decomposition b/data/2020/iclr/Scalable and Order-robust Continual Learning with Additive Parameter Decomposition new file mode 100644 index 0000000000..ddf47a38a5 --- /dev/null +++ b/data/2020/iclr/Scalable and Order-robust Continual Learning with Additive Parameter Decomposition @@ -0,0 +1 @@ +While recent continual learning methods largely alleviate the catastrophic forgetting problem on toy-sized datasets, some issues remain to be tackled to apply them to real-world problem domains. First, a continual learning model should effectively handle catastrophic forgetting and be efficient to train even with a large number of tasks.
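Relation-following in a matrix-encoded KB reduces to a matrix-vector product, which a dense toy version makes explicit. The actual reified KB stores all relations in a few large sparse matrices; dense numpy stands in here to keep the sketch dependency-free, and the entities and relation are hypothetical.

```python
import numpy as np

# Toy KB: entities 0..3 and one relation "capital_of", stored as a matrix
# M with M[s, o] = 1 iff the fact (s, capital_of, o) is in the KB.
n = 4
M_capital_of = np.zeros((n, n))
M_capital_of[0, 2] = 1.0   # hypothetical fact: 0 capital_of 2
M_capital_of[1, 3] = 1.0   # hypothetical fact: 1 capital_of 3

def follow(x, M):
    """Differentiable relation-following: a weighted set of subject
    entities `x` maps to the weighted set of objects reachable via M.
    Multi-hop inference is just repeated multiplication."""
    return x @ M

x = np.array([1.0, 0.0, 0.0, 0.0])         # the entity set {0}
y = follow(x, M_capital_of)
assert y.tolist() == [0.0, 0.0, 1.0, 0.0]  # reaches the set {2}
```

Because `follow` is linear in `x`, it is differentiable, faithful to the KB's semantics, and (with sparse matrices) scales to very large KBs, which is the core of the paper's argument.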
Secondly, it needs to tackle the problem of order-sensitivity, where the performance of the tasks largely varies based on the order of the task arrival sequence, as it may cause serious problems where fairness plays a critical role (e.g. medical diagnosis). To tackle these practical challenges, we propose a novel continual learning method that is scalable as well as order-robust, which instead of learning a completely shared set of weights, represents the parameters for each task as a sum of task-shared and sparse task-adaptive parameters. With our Additive Parameter Decomposition (APD), the task-adaptive parameters for earlier tasks remain mostly unaffected, where we update them only to reflect the changes made to the task-shared parameters. This decomposition of parameters effectively prevents catastrophic forgetting and order-sensitivity, while being computation- and memory-efficient. Further, we can achieve even better scalability with APD using hierarchical knowledge consolidation, which clusters the task-adaptive parameters to obtain hierarchically shared parameters. We validate our network with APD, APD-Net, on multiple benchmark datasets against state-of-the-art continual learning methods, which it largely outperforms in accuracy, scalability, and order-robustness. \ No newline at end of file diff --git a/data/2020/iclr/Selection via Proxy: Efficient Data Selection for Deep Learning b/data/2020/iclr/Selection via Proxy: Efficient Data Selection for Deep Learning new file mode 100644 index 0000000000..e54e21f84a --- /dev/null +++ b/data/2020/iclr/Selection via Proxy: Efficient Data Selection for Deep Learning @@ -0,0 +1 @@ +Data selection methods such as active learning and core-set selection are useful tools for machine learning on large datasets, but they can be prohibitively expensive to apply in deep learning. 
Unlike in other areas of machine learning, the feature representations that these techniques depend on are learned in deep learning rather than given, which takes a substantial amount of training time. In this work, we show that we can significantly improve the computational efficiency of data selection in deep learning by using a much smaller proxy model to perform data selection for tasks that will eventually require a large target model (e.g., selecting data points to label for active learning). In deep learning, we can scale down models by removing hidden layers or reducing their dimension to create proxies that are an order of magnitude faster. Although these small proxy models have significantly higher error, we find that they empirically provide useful rankings for data selection that have a high correlation with those of larger models. We evaluate this "selection via proxy" (SVP) approach on several data selection tasks. For active learning, applying SVP to Sener and Savarese [2018]'s recent method for active learning in deep learning gives a 4x improvement in execution time while yielding the same model accuracy. For core-set selection, we show that a proxy model that trains 10x faster than a target ResNet164 model on CIFAR10 can be used to remove 50% of the training data without compromising the accuracy of the target model, making end-to-end training time improvements via core-set selection possible. 
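A minimal version of the "selection via proxy" loop can be sketched with predictive entropy as the uncertainty measure. The acquisition function and pool are assumptions for illustration; the paper also studies core-set selection, and the key point is only that the ranking comes from a small, cheap proxy model rather than the large target model.

```python
import numpy as np

def select_via_proxy(proxy_probs, k):
    """Pick the k most uncertain pool examples using a cheap proxy model.

    `proxy_probs` holds the proxy's predicted class probabilities for each
    unlabeled example; uncertainty is measured by predictive entropy.
    The selected examples are then labeled and used to train the large
    target model.
    """
    entropy = -np.sum(proxy_probs * np.log(proxy_probs + 1e-12), axis=1)
    return np.argsort(-entropy)[:k]

# Hypothetical pool of 4 examples, 2 classes.
probs = np.array([[0.99, 0.01],   # confident: low value for labeling
                  [0.55, 0.45],   # uncertain: high value
                  [0.90, 0.10],
                  [0.50, 0.50]])  # most uncertain
chosen = select_via_proxy(probs, k=2)
assert set(chosen) == {1, 3}
```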
\ No newline at end of file diff --git a/data/2020/iclr/Self-Adversarial Learning with Comparative Discrimination for Text Generation b/data/2020/iclr/Self-Adversarial Learning with Comparative Discrimination for Text Generation new file mode 100644 index 0000000000..52867e6495 --- /dev/null +++ b/data/2020/iclr/Self-Adversarial Learning with Comparative Discrimination for Text Generation @@ -0,0 +1 @@ +Conventional Generative Adversarial Networks (GANs) for text generation tend to have issues of reward sparsity and mode collapse that affect the quality and diversity of generated samples. To address the issues, we propose a novel self-adversarial learning (SAL) paradigm for improving GANs' performance in text generation. In contrast to standard GANs that use a binary classifier as its discriminator to predict whether a sample is real or generated, SAL employs a comparative discriminator which is a pairwise classifier for comparing the text quality between a pair of samples. During training, SAL rewards the generator when its currently generated sentence is found to be better than its previously generated samples. This self-improvement reward mechanism allows the model to receive credits more easily and avoid collapsing towards the limited number of real samples, which not only helps alleviate the reward sparsity issue but also reduces the risk of mode collapse. Experiments on text generation benchmark datasets show that our proposed approach substantially improves both the quality and the diversity, and yields more stable performance compared to the previous GANs for text generation. 
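The comparative-discriminator reward at the heart of SAL can be sketched as follows. The scorer below is a toy stand-in; in SAL the pairwise classifier is learned jointly with the generator, and "previous" samples come from the generator's own earlier outputs.

```python
def self_adversarial_reward(score_fn, current, previous):
    """Reward for the generator under self-adversarial learning (SAL).

    Instead of a real/fake probability, a comparative discriminator scores
    a *pair* of samples; the generator is rewarded when its current sample
    beats its own previously generated one. `score_fn(a, b)` is assumed to
    return the probability that `a` is better text than `b` (a hypothetical
    stand-in for the trained pairwise classifier).
    """
    return score_fn(current, previous) - 0.5   # positive iff current improves

# Toy stand-in scorer: "quality" is just sample length here.
def score(a, b):
    if len(a) > len(b):
        return 1.0
    return 0.5 if len(a) == len(b) else 0.0

assert self_adversarial_reward(score, "a longer sentence", "short") > 0
assert self_adversarial_reward(score, "tiny", "a longer sentence") < 0
```

Because the reference point moves with the generator itself, credit is easier to obtain early in training, which is the mechanism the abstract credits for alleviating reward sparsity.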
\ No newline at end of file diff --git a/data/2020/iclr/Semantically-Guided Representation Learning for Self-Supervised Monocular Depth b/data/2020/iclr/Semantically-Guided Representation Learning for Self-Supervised Monocular Depth new file mode 100644 index 0000000000..dfa53859c4 --- /dev/null +++ b/data/2020/iclr/Semantically-Guided Representation Learning for Self-Supervised Monocular Depth @@ -0,0 +1 @@ +Self-supervised learning is showing great promise for monocular depth estimation, using geometry as the only source of supervision. Depth networks are indeed capable of learning representations that relate visual appearance to 3D properties by implicitly leveraging category-level patterns. In this work we investigate how to leverage more directly this semantic structure to guide geometric representation learning, while remaining in the self-supervised regime. Instead of using semantic labels and proxy losses in a multi-task approach, we propose a new architecture leveraging fixed pretrained semantic segmentation networks to guide self-supervised representation learning via pixel-adaptive convolutions. Furthermore, we propose a two-stage training process to overcome a common semantic bias on dynamic objects via resampling. Our method improves upon the state of the art for self-supervised monocular depth prediction over all pixels, fine-grained details, and per semantic categories. \ No newline at end of file diff --git a/data/2020/iclr/Sharing Knowledge in Multi-Task Deep Reinforcement Learning b/data/2020/iclr/Sharing Knowledge in Multi-Task Deep Reinforcement Learning new file mode 100644 index 0000000000..1359ce66b3 --- /dev/null +++ b/data/2020/iclr/Sharing Knowledge in Multi-Task Deep Reinforcement Learning @@ -0,0 +1 @@ +We study the benefit of sharing representations among tasks to enable the effective use of deep neural networks in Multi-Task Reinforcement Learning. 
We leverage the assumption that learning from different tasks that share common properties helps to generalize the knowledge across them, resulting in more effective feature extraction than learning a single task. Intuitively, the resulting set of features offers performance benefits when used by Reinforcement Learning algorithms. We prove this by providing theoretical guarantees that highlight the conditions under which it is convenient to share representations among tasks, extending the well-known finite-time bounds of Approximate Value-Iteration to the multi-task setting. In addition, we complement our analysis by proposing multi-task extensions of three Reinforcement Learning algorithms that we empirically evaluate on widely used Reinforcement Learning benchmarks showing significant improvements over the single-task counterparts in terms of sample efficiency and performance. \ No newline at end of file diff --git a/data/2020/iclr/Short and Sparse Deconvolution - A Geometric Approach b/data/2020/iclr/Short and Sparse Deconvolution - A Geometric Approach new file mode 100644 index 0000000000..975bce4b80 --- /dev/null +++ b/data/2020/iclr/Short and Sparse Deconvolution - A Geometric Approach @@ -0,0 +1 @@ +Short-and-sparse deconvolution (SaSD) is the problem of extracting localized, recurring motifs in signals with spatial or temporal structure. Variants of this problem arise in applications such as image deblurring, microscopy, neural spike sorting, and more. The problem is challenging in both theory and practice, as natural optimization formulations are nonconvex. Moreover, practical deconvolution problems involve smooth motifs (kernels) whose spectra decay rapidly, resulting in poor conditioning and numerical challenges. This paper is motivated by recent theoretical advances, which characterize the optimization landscape of a particular nonconvex formulation of SaSD.
This is used to derive a $provable$ algorithm which exactly solves certain non-practical instances of the SaSD problem. We leverage the key ideas from this theory (sphere constraints, data-driven initialization) to develop a $practical$ algorithm, which performs well on data arising from a range of application areas. We highlight key additional challenges posed by the ill-conditioning of real SaSD problems, and suggest heuristics (acceleration, continuation, reweighting) to mitigate them. Experiments demonstrate both the performance and generality of the proposed method. \ No newline at end of file diff --git a/data/2020/iclr/Sign Bits Are All You Need for Black-Box Attacks b/data/2020/iclr/Sign Bits Are All You Need for Black-Box Attacks new file mode 100644 index 0000000000..58fac77e48 --- /dev/null +++ b/data/2020/iclr/Sign Bits Are All You Need for Black-Box Attacks @@ -0,0 +1 @@ +We present a novel black-box adversarial attack algorithm with state-of-the-art model evasion rates for query efficiency under $\ell_\infty$ and $\ell_2$ metrics. It exploits a \textit{sign-based}, rather than magnitude-based, gradient estimation approach that shifts the gradient estimation from continuous to binary black-box optimization. It adaptively constructs queries to estimate the gradient, one query relying upon the previous, rather than re-estimating the gradient each step with random query construction. Its reliance on sign bits yields a smaller memory footprint and it requires neither hyperparameter tuning nor dimensionality reduction. Further, its theoretical performance is guaranteed and it can characterize adversarial subspaces better than white-box gradient-aligned subspaces. On two public black-box attack challenges and a model robustly trained against transfer attacks, the algorithm's evasion rates surpass all submitted attacks.
For a suite of published models, the algorithm is $3.8\times$ less failure-prone while spending $2.5\times$ fewer queries versus the best combination of state of art algorithms. For example, it evades a standard MNIST model using just $12$ queries on average. Similar performance is observed on a standard IMAGENET model with an average of $579$ queries. \ No newline at end of file diff --git a/data/2020/iclr/Sign-OPT: A Query-Efficient Hard-label Adversarial Attack b/data/2020/iclr/Sign-OPT: A Query-Efficient Hard-label Adversarial Attack new file mode 100644 index 0000000000..9067c6b8f2 --- /dev/null +++ b/data/2020/iclr/Sign-OPT: A Query-Efficient Hard-label Adversarial Attack @@ -0,0 +1 @@ +We study the most practical problem setup for evaluating adversarial robustness of a machine learning system with limited access: the hard-label black-box attack setting for generating adversarial examples, where limited model queries are allowed and only the decision is provided to a queried data input. Several algorithms have been proposed for this problem but they typically require a huge number (>20,000) of queries to attack one example. Among them, one of the state-of-the-art approaches (Cheng et al., 2019) showed that hard-label attack can be modeled as an optimization problem where the objective function can be evaluated by binary search with additional model queries, so that a zeroth-order optimization algorithm can be applied. In this paper, we adopt the same optimization formulation but propose to directly estimate the sign of the gradient at any direction instead of the gradient itself, which enjoys the benefit of requiring only a single query. Using this single-query oracle for retrieving the sign of the directional derivative, we develop a novel query-efficient Sign-OPT approach for hard-label black-box attack. We provide a convergence analysis of the new algorithm and conduct experiments on several models on MNIST, CIFAR-10 and ImageNet.
We find that Sign-OPT attack consistently requires 5X to 10X fewer queries when compared to the current state-of-the-art approaches, and usually converges to an adversarial example with smaller perturbation. \ No newline at end of file diff --git a/data/2020/iclr/SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum b/data/2020/iclr/SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum new file mode 100644 index 0000000000..a09320b33a --- /dev/null +++ b/data/2020/iclr/SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum @@ -0,0 +1 @@ +Distributed optimization is essential for training large models on large datasets. Multiple approaches have been proposed to reduce the communication overhead in distributed training, such as synchronizing only after performing multiple local SGD steps, and decentralized methods (e.g., using gossip algorithms) to decouple communications among workers. Although these methods run faster than AllReduce-based methods, which use blocking communication before every update, the resulting models may be less accurate after the same number of updates. Inspired by the BMUF method of Chen & Huo (2016), we propose a slow momentum (SlowMo) framework, where workers periodically synchronize and perform a momentum update, after multiple iterations of a base optimization algorithm. Experiments on image classification and machine translation tasks demonstrate that SlowMo consistently yields improvements in optimization and generalization performance relative to the base optimizer, even when the additional overhead is amortized over many updates so that the SlowMo runtime is on par with that of the base optimizer. We provide theoretical convergence guarantees showing that SlowMo converges to a stationary point of smooth non-convex losses. Since BMUF can be expressed through the SlowMo framework, our results also correspond to the first theoretical convergence guarantees for BMUF. 
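One outer iteration of the slow-momentum scheme can be sketched on a toy quadratic with identical workers. Here `worker_grads_fn` is a hypothetical stand-in for real distributed minibatch gradients, and plain SGD plays the role of the base optimizer.

```python
import numpy as np

def slowmo_round(x_slow, worker_grads_fn, n_workers, inner_steps,
                 lr, slow_lr, slow_mom, u):
    """One outer iteration of the SlowMo framework (toy sketch).

    Each worker runs `inner_steps` of the base optimizer (plain SGD here)
    from the shared slow weights, the workers are averaged, and the slow
    weights take a momentum step toward the average.
    """
    workers = [x_slow.copy() for _ in range(n_workers)]
    for i, w in enumerate(workers):
        for _ in range(inner_steps):
            w -= lr * worker_grads_fn(w, i)
    x_avg = np.mean(workers, axis=0)
    # Slow momentum buffer accumulates the (normalized) averaged displacement.
    u = slow_mom * u + (x_slow - x_avg) / lr
    x_slow = x_slow - slow_lr * lr * u
    return x_slow, u

# Minimize f(w) = ||w||^2 / 2 (gradient w), identical on every worker.
grads = lambda w, i: w
x = np.array([4.0, -2.0])
u = np.zeros_like(x)
for _ in range(30):
    x, u = slowmo_round(x, grads, n_workers=4, inner_steps=5,
                        lr=0.1, slow_lr=1.0, slow_mom=0.5, u=u)
assert np.linalg.norm(x) < 1e-2   # the slow iterate approaches the optimum
```

With `slow_mom = 0` and `slow_lr = 1` the round reduces to local SGD with periodic averaging, which is why the framework also covers methods like BMUF as special cases.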
\ No newline at end of file diff --git a/data/2020/iclr/Stochastic AUC Maximization with Deep Neural Networks b/data/2020/iclr/Stochastic AUC Maximization with Deep Neural Networks new file mode 100644 index 0000000000..da74cbcc65 --- /dev/null +++ b/data/2020/iclr/Stochastic AUC Maximization with Deep Neural Networks @@ -0,0 +1 @@ +Stochastic AUC maximization has garnered increasing interest due to its better fit to imbalanced data classification. However, existing works are limited to stochastic AUC maximization with a linear predictive model, which restricts its predictive power when dealing with extremely complex data. In this paper, we consider the stochastic AUC maximization problem with a deep neural network as the predictive model. Building on the saddle point reformulation of a surrogated loss of AUC, the problem can be cast into a {\it non-convex concave} min-max problem. The main contribution made in this paper is to make stochastic AUC maximization more practical for deep neural networks and big data with theoretical insights as well. In particular, we propose to explore the Polyak-Łojasiewicz (PL) condition, which has been proved and observed in deep learning and enables us to develop new stochastic algorithms with an even faster convergence rate and a more practical step size scheme. An AdaGrad-style algorithm is also analyzed under the PL condition with an adaptive convergence rate. Our experimental results demonstrate the effectiveness of the proposed algorithms.
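The pairwise objective being surrogated can be written down directly. This is a toy sketch of a squared-hinge AUC surrogate: the paper actually optimizes an equivalent min-max saddle-point reformulation of such a surrogate so that it admits stochastic per-example updates, which the pairwise form below does not.

```python
import numpy as np

def auc(scores_pos, scores_neg):
    # Empirical AUC: fraction of (positive, negative) pairs ranked correctly,
    # with ties counted as half.
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def pairwise_sq_loss(scores_pos, scores_neg, margin=1.0):
    # Squared-hinge surrogate of the AUC risk over all positive-negative pairs.
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float(np.mean(np.maximum(0.0, margin - diff) ** 2))

pos = np.array([2.0, 1.5, 0.2])   # hypothetical scores for positive examples
neg = np.array([0.0, -1.0])       # hypothetical scores for negative examples
assert auc(pos, neg) == 1.0       # every positive already outranks every negative
# Better-separated scores give a lower surrogate loss.
assert pairwise_sq_loss(pos + 1.0, neg) < pairwise_sq_loss(pos, neg)
```

Because the surrogate couples every positive with every negative, naive minibatching is awkward; the saddle-point reformulation is what makes stochastic optimization with a deep network tractable.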
\ No newline at end of file diff --git a/data/2020/iclr/Stochastic Conditional Generative Networks with Basis Decomposition b/data/2020/iclr/Stochastic Conditional Generative Networks with Basis Decomposition new file mode 100644 index 0000000000..679d753453 --- /dev/null +++ b/data/2020/iclr/Stochastic Conditional Generative Networks with Basis Decomposition @@ -0,0 +1 @@ +While generative adversarial networks (GANs) have revolutionized machine learning, a number of open questions remain to fully understand them and exploit their power. One of these questions is how to efficiently achieve proper diversity and sampling of the multi-mode data space. To address this, we introduce BasisGAN, a stochastic conditional multi-mode image generator. By exploiting the observation that a convolutional filter can be well approximated as a linear combination of a small set of basis elements, we learn a plug-and-play basis generator that stochastically generates basis elements, with just a few hundred parameters, to fully embed stochasticity into convolutional filters. By sampling basis elements instead of filters, we dramatically reduce the cost of modeling the parameter space with no sacrifice in either image diversity or fidelity. To illustrate the proposed plug-and-play framework, we construct variants of BasisGAN based on state-of-the-art conditional image generation networks, and train the networks by simply plugging in a basis generator, without additional auxiliary components, hyperparameters, or training objectives. The experimental success is complemented with theoretical results indicating how the perturbations introduced by the proposed sampling of basis elements can propagate to the appearance of generated images. 
\ No newline at end of file diff --git a/data/2020/iclr/Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well b/data/2020/iclr/Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well new file mode 100644 index 0000000000..91e0696677 --- /dev/null +++ b/data/2020/iclr/Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well @@ -0,0 +1 @@ +We propose Stochastic Weight Averaging in Parallel (SWAP), an algorithm to accelerate DNN training. Our algorithm uses large mini-batches to compute an approximate solution quickly and then refines it by averaging the weights of multiple models computed independently and in parallel. The resulting models generalize as well as those trained with small mini-batches but are produced in a substantially shorter time. We demonstrate the reduction in training time and the good generalization performance of the resulting models on the computer vision datasets CIFAR10, CIFAR100, and ImageNet. \ No newline at end of file diff --git a/data/2020/iclr/StructPool: Structured Graph Pooling via Conditional Random Fields b/data/2020/iclr/StructPool: Structured Graph Pooling via Conditional Random Fields new file mode 100644 index 0000000000..55d9c198b3 --- /dev/null +++ b/data/2020/iclr/StructPool: Structured Graph Pooling via Conditional Random Fields @@ -0,0 +1 @@ +Learning high-level representations for graphs is of great importance for graph analysis tasks. In addition to graph convolution, graph pooling is an important but less explored research area. In particular, most existing graph pooling techniques do not consider the graph structural information explicitly. We argue that such information is important and develop a novel graph pooling technique, known as StructPool, in this work. We consider graph pooling as a node clustering problem, which requires the learning of a cluster assignment matrix. 
We propose to formulate it as a structured prediction problem and employ conditional random fields to capture the relationships among assignments of different nodes. We also generalize our method to incorporate graph topological information in designing the Gibbs energy function. Experimental results on multiple datasets demonstrate the effectiveness of our proposed StructPool. \ No newline at end of file diff --git a/data/2020/iclr/TabFact: A Large-scale Dataset for Table-based Fact Verification b/data/2020/iclr/TabFact: A Large-scale Dataset for Table-based Fact Verification new file mode 100644 index 0000000000..effc06e856 --- /dev/null +++ b/data/2020/iclr/TabFact: A Large-scale Dataset for Table-based Fact Verification @@ -0,0 +1 @@ +The problem of verifying whether a textual hypothesis holds based on the given evidence, also known as fact verification, plays an important role in the study of natural language understanding and semantic representation. However, existing studies are mainly restricted to dealing with unstructured evidence (e.g., natural language sentences and documents, news, etc.), while verification under structured evidence, such as tables, graphs, and databases, remains under-explored. This paper specifically aims to study fact verification given semi-structured data as evidence. To this end, we construct a large-scale dataset called TabFact with 16k Wikipedia tables as the evidence for 118k human-annotated natural language statements, which are labeled as either ENTAILED or REFUTED. TabFact is challenging since it involves both soft linguistic reasoning and hard symbolic reasoning. To address these reasoning challenges, we design two different models: Table-BERT and Latent Program Algorithm (LPA). Table-BERT leverages the state-of-the-art pre-trained language model to encode the linearized tables and statements into continuous vectors for verification. 
LPA parses statements into programs and executes them against the tables to obtain the returned binary value for verification. Both methods achieve similar accuracy but still lag far behind human performance. We also perform a comprehensive analysis to demonstrate great future opportunities. The data and code of the dataset are provided in \url{this https URL}. \ No newline at end of file diff --git a/data/2020/iclr/The Implicit Bias of Depth: How Incremental Learning Drives Generalization b/data/2020/iclr/The Implicit Bias of Depth: How Incremental Learning Drives Generalization new file mode 100644 index 0000000000..3beb98778f --- /dev/null +++ b/data/2020/iclr/The Implicit Bias of Depth: How Incremental Learning Drives Generalization @@ -0,0 +1 @@ +A leading hypothesis for the surprising generalization of neural networks is that the dynamics of gradient descent bias the model towards simple solutions, by searching through the solution space in an incremental order of complexity. We formally define the notion of incremental learning dynamics and derive the conditions on depth and initialization for which this phenomenon arises in deep linear models. Our main theoretical contribution is a dynamical depth separation result, proving that while shallow models can exhibit incremental learning dynamics, they require the initialization to be exponentially small for these dynamics to present themselves. However, once the model becomes deeper, the dependence becomes polynomial and incremental learning can arise in more natural settings. We complement our theoretical findings by experimenting with deep matrix sensing, quadratic neural networks and with binary classification using diagonal and convolutional linear networks, showing all of these models exhibit incremental learning. 
\ No newline at end of file diff --git a/data/2020/iclr/The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget b/data/2020/iclr/The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget new file mode 100644 index 0000000000..8e8ab78bd6 --- /dev/null +++ b/data/2020/iclr/The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget @@ -0,0 +1 @@ +In many applications, it is desirable to extract only the relevant information from complex input data, which involves making a decision about which input features are relevant. The information bottleneck method formalizes this as an information-theoretic optimization problem by maintaining an optimal tradeoff between compression (throwing away irrelevant input information), and predicting the target. In many problem settings, including the reinforcement learning problems we consider in this work, we might prefer to compress only part of the input. This is typically the case when we have a standard conditioning input, such as a state observation, and a ``privileged'' input, which might correspond to the goal of a task, the output of a costly planning algorithm, or communication with another agent. In such cases, we might prefer to compress the privileged input, either to achieve better generalization (e.g., with respect to goals) or to minimize access to costly information (e.g., in the case of communication). Practical implementations of the information bottleneck based on variational inference require access to the privileged input in order to compute the bottleneck variable, so although they perform compression, this compression operation itself needs unrestricted, lossless access. 
In this work, we propose the variational bandwidth bottleneck, which decides for each example on the estimated value of the privileged information before seeing it, i.e., only based on the standard input, and then accordingly chooses stochastically, whether to access the privileged input or not. We formulate a tractable approximation to this framework and demonstrate in a series of reinforcement learning experiments that it can improve generalization and reduce access to computationally costly information. \ No newline at end of file diff --git a/data/2020/iclr/The asymptotic spectrum of the Hessian of DNN throughout training b/data/2020/iclr/The asymptotic spectrum of the Hessian of DNN throughout training new file mode 100644 index 0000000000..7b085e8c20 --- /dev/null +++ b/data/2020/iclr/The asymptotic spectrum of the Hessian of DNN throughout training @@ -0,0 +1 @@ +The dynamics of DNNs during gradient descent is described by the so-called Neural Tangent Kernel (NTK). In this article, we show that the NTK allows one to gain precise insight into the Hessian of the cost of DNNs. When the NTK is fixed during training, we obtain a full characterization of the asymptotics of the spectrum of the Hessian, at initialization and during training. In the so-called mean-field limit, where the NTK is not fixed during training, we describe the first two moments of the Hessian at initialization. \ No newline at end of file diff --git a/data/2020/iclr/Theory and Evaluation Metrics for Learning Disentangled Representations b/data/2020/iclr/Theory and Evaluation Metrics for Learning Disentangled Representations new file mode 100644 index 0000000000..53ca3700c5 --- /dev/null +++ b/data/2020/iclr/Theory and Evaluation Metrics for Learning Disentangled Representations @@ -0,0 +1 @@ +We make two theoretical contributions to disentanglement learning by (a) defining precise semantics of disentangled representations, and (b) establishing robust metrics for evaluation. 
First, we characterize the concept "disentangled representations" used in supervised and unsupervised methods along three dimensions - informativeness, separability, and interpretability - which can be expressed and quantified explicitly using information-theoretic constructs. This helps explain the behaviors of several well-known disentanglement learning models. We then propose robust metrics for measuring informativeness, separability and interpretability. Through a comprehensive suite of experiments, we show that our metrics correctly characterize the representations learned by different methods and are consistent with qualitative (visual) results. Thus, the metrics allow disentanglement learning methods to be compared on fair ground. We also empirically uncover interesting new properties of VAE-based methods and interpret them with our formulation. These findings are promising and hopefully will encourage the design of more theoretically driven models for learning disentangled representations. \ No newline at end of file diff --git a/data/2020/iclr/Thieves on Sesame Street! Model Extraction of BERT-based APIs b/data/2020/iclr/Thieves on Sesame Street! Model Extraction of BERT-based APIs new file mode 100644 index 0000000000..562fc74515 --- /dev/null +++ b/data/2020/iclr/Thieves on Sesame Street! Model Extraction of BERT-based APIs @@ -0,0 +1 @@ +We study the problem of model extraction in natural language processing, in which an adversary with only query access to a victim model attempts to reconstruct a local copy of that model. Assuming that both the adversary and victim model fine-tune a large pretrained language model such as BERT (Devlin et al., 2019), we show that the adversary does not need any real training data to successfully mount the attack. 
In fact, the attacker need not even use grammatical or semantically meaningful queries: we show that random sequences of words coupled with task-specific heuristics form effective queries for model extraction on a diverse set of NLP tasks, including natural language inference and question answering. Our work thus highlights an exploit only made feasible by the shift towards transfer learning methods within the NLP community: for a query budget of a few hundred dollars, an attacker can extract a model that performs only slightly worse than the victim model. Finally, we study two defense strategies against model extraction---membership classification and API watermarking---which while successful against naive adversaries, are ineffective against more sophisticated ones. \ No newline at end of file diff --git a/data/2020/iclr/To Relieve Your Headache of Training an MRF, Take AdVIL b/data/2020/iclr/To Relieve Your Headache of Training an MRF, Take AdVIL new file mode 100644 index 0000000000..e7717bcce8 --- /dev/null +++ b/data/2020/iclr/To Relieve Your Headache of Training an MRF, Take AdVIL @@ -0,0 +1 @@ +We propose a black-box algorithm called {\it Adversarial Variational Inference and Learning} (AdVIL) to perform inference and learning on a general Markov random field (MRF). AdVIL employs two variational distributions to approximately infer the latent variables and estimate the partition function of an MRF, respectively. The two variational distributions provide an estimate of the negative log-likelihood of the MRF as a minimax optimization problem, which is solved by stochastic gradient descent. AdVIL is proven convergent under certain conditions. On one hand, compared with contrastive divergence, AdVIL requires a minimal assumption about the model structure and can deal with a broader family of MRFs. 
On the other hand, compared with existing black-box methods, AdVIL provides a tighter estimate of the log partition function and achieves much better empirical results. \ No newline at end of file diff --git a/data/2020/iclr/Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets b/data/2020/iclr/Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets new file mode 100644 index 0000000000..5191d15cd3 --- /dev/null +++ b/data/2020/iclr/Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets @@ -0,0 +1 @@ +Adaptive gradient algorithms perform gradient-based updates using the history of gradients and are ubiquitous in training deep neural networks. While the theory of adaptive gradient methods is well understood for minimization problems, the underlying factors driving their empirical success in min-max problems such as GANs remain unclear. In this paper, we aim at bridging this gap from both theoretical and empirical perspectives. Theoretically, we develop an algorithm (Optimistic Stochastic Gradient, OSG) for solving a class of non-convex non-concave min-max problems and establish $O(\epsilon^{-4})$ complexity for finding an $\epsilon$-first-order stationary point, in which only one stochastic first-order oracle is invoked in each iteration. An adaptive variant of the proposed algorithm (Optimistic Adagrad, OAdagrad) is also analyzed, revealing an \emph{improved} adaptive complexity $\widetilde{O}\left(\epsilon^{-\frac{2}{1-\alpha}}\right)$~\footnote{Here $\widetilde{O}(\cdot)$ compresses a logarithmic factor of $\epsilon$.}, where $\alpha$ characterizes the growth rate of the cumulative stochastic gradient and $0\leq \alpha\leq 1/2$. To the best of our knowledge, this is the first work establishing adaptive complexity in non-convex non-concave min-max optimization. 
Empirically, our experiments show that indeed adaptive gradient algorithms outperform their non-adaptive counterparts in GAN training. Moreover, this observation can be explained by the slow growth rate of the cumulative stochastic gradient, as observed empirically. \ No newline at end of file diff --git a/data/2020/iclr/Transferable Perturbations of Deep Feature Distributions b/data/2020/iclr/Transferable Perturbations of Deep Feature Distributions new file mode 100644 index 0000000000..8e972f4cdc --- /dev/null +++ b/data/2020/iclr/Transferable Perturbations of Deep Feature Distributions @@ -0,0 +1 @@ +Almost all current adversarial attacks of CNN classifiers rely on information derived from the output layer of the network. This work presents a new adversarial attack based on the modeling and exploitation of class-wise and layer-wise deep feature distributions. We achieve state-of-the-art targeted blackbox transfer-based attack results for undefended ImageNet models. Further, we place a priority on explainability and interpretability of the attacking process. Our methodology affords an analysis of how adversarial attacks change the intermediate feature distributions of CNNs, as well as a measure of layer-wise and class-wise feature distributional separability/entanglement. We also conceptualize a transition from task/data-specific to model-specific features within a CNN architecture that directly impacts the transferability of adversarial examples. \ No newline at end of file diff --git a/data/2020/iclr/Tree-Structured Attention with Hierarchical Accumulation b/data/2020/iclr/Tree-Structured Attention with Hierarchical Accumulation new file mode 100644 index 0000000000..8fdf341f28 --- /dev/null +++ b/data/2020/iclr/Tree-Structured Attention with Hierarchical Accumulation @@ -0,0 +1 @@ +Incorporating hierarchical structures like constituency trees has been shown to be effective for various natural language processing (NLP) tasks. 
However, it is evident that state-of-the-art (SOTA) sequence-based models like the Transformer struggle to encode such structures inherently. On the other hand, dedicated models like the Tree-LSTM, while explicitly modeling hierarchical structures, do not perform as efficiently as the Transformer. In this paper, we attempt to bridge this gap with Hierarchical Accumulation to encode parse tree structures into self-attention at constant time complexity. Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German task. It also yields improvements over Transformer and Tree-LSTM on three text classification tasks. We further demonstrate that using hierarchical priors can compensate for data shortage, and that our model prefers phrase-level attentions over token-level attentions. \ No newline at end of file diff --git a/data/2020/iclr/Understanding Architectures Learnt by Cell-based Neural Architecture Search b/data/2020/iclr/Understanding Architectures Learnt by Cell-based Neural Architecture Search new file mode 100644 index 0000000000..13f8ffd6fd --- /dev/null +++ b/data/2020/iclr/Understanding Architectures Learnt by Cell-based Neural Architecture Search @@ -0,0 +1 @@ +Neural architecture search (NAS) generates architectures automatically for given tasks, e.g., image classification and language modeling. Recently, various NAS algorithms have been proposed to improve search efficiency and effectiveness. However, little attention is paid to understand the generated architectures, including whether they share any commonality. In this paper, we analyze the generated architectures and give our explanations of their superior performance. We firstly uncover that the architectures generated by NAS algorithms share a common connection pattern, which contributes to their fast convergence. Consequently, these architectures are selected during architecture search. 
We further show, both empirically and theoretically, that the fast convergence is a consequence of the smooth loss landscape and accurate gradient information induced by the common connection pattern. Contrary to universal recognition, we finally observe that popular NAS architectures do not always generalize better than the candidate architectures, encouraging us to re-think the state-of-the-art NAS algorithms. \ No newline at end of file diff --git a/data/2020/iclr/Understanding Knowledge Distillation in Non-autoregressive Machine Translation b/data/2020/iclr/Understanding Knowledge Distillation in Non-autoregressive Machine Translation new file mode 100644 index 0000000000..49cdbd0594 --- /dev/null +++ b/data/2020/iclr/Understanding Knowledge Distillation in Non-autoregressive Machine Translation @@ -0,0 +1 @@ +Non-autoregressive machine translation (NAT) systems predict a sequence of output tokens in parallel, achieving substantial improvements in generation speed compared to autoregressive models. Existing NAT models usually rely on the technique of knowledge distillation, which creates the training data from a pretrained autoregressive model for better performance. Knowledge distillation is empirically useful, leading to large gains in accuracy for NAT models, but the reason for this success has, as of yet, been unclear. In this paper, we first design systematic experiments to investigate why knowledge distillation is crucial to NAT training. We find that knowledge distillation can reduce the complexity of data sets and help NAT to model the variations in the output data. Furthermore, a strong correlation is observed between the capacity of an NAT model and the optimal complexity of the distilled data for the best translation quality. Based on these findings, we further propose several approaches that can alter the complexity of data sets to improve the performance of NAT models. 
We achieve state-of-the-art performance for NAT-based models, and close the gap with the autoregressive baseline on the WMT14 En-De benchmark. \ No newline at end of file diff --git a/data/2020/iclr/Understanding the Limitations of Variational Mutual Information Estimators b/data/2020/iclr/Understanding the Limitations of Variational Mutual Information Estimators new file mode 100644 index 0000000000..53d862d175 --- /dev/null +++ b/data/2020/iclr/Understanding the Limitations of Variational Mutual Information Estimators @@ -0,0 +1 @@ +Variational approaches based on neural networks are showing promise for estimating mutual information (MI) between high dimensional variables. However, they can be difficult to use in practice due to poorly understood bias/variance tradeoffs. We theoretically show that, under some conditions, estimators such as MINE exhibit variance that could grow exponentially with the true amount of underlying MI. We also empirically demonstrate that existing estimators fail to satisfy basic self-consistency properties of MI, such as data processing and additivity under independence. Based on a unified perspective of variational approaches, we develop a new estimator that focuses on variance reduction. Empirical results demonstrate that our proposed estimator exhibits improved bias-variance trade-offs on standard benchmark tasks. 
\ No newline at end of file diff --git a/data/2020/iclr/Unpaired Point Cloud Completion on Real Scans using Adversarial Training b/data/2020/iclr/Unpaired Point Cloud Completion on Real Scans using Adversarial Training new file mode 100644 index 0000000000..18d7e9c2f6 --- /dev/null +++ b/data/2020/iclr/Unpaired Point Cloud Completion on Real Scans using Adversarial Training @@ -0,0 +1 @@ +As 3D scanning solutions become increasingly popular, several deep learning setups have been developed geared towards the task of scan completion, i.e., plausibly filling in regions that were missed in the raw scans. These methods, however, largely rely on supervision in the form of paired training data, i.e., partial scans with corresponding desired completed scans. While these methods have been successfully demonstrated on synthetic data, the approaches cannot be directly used on real scans in the absence of suitable paired training data. We develop a first approach that works directly on input point clouds, does not require paired training data, and hence can directly be applied to real scans for scan completion. We evaluate the approach qualitatively on several real-world datasets (ScanNet, Matterport, KITTI), quantitatively on the 3D-EPN shape completion benchmark dataset, and demonstrate realistic completions under varying levels of incompleteness. \ No newline at end of file diff --git a/data/2020/iclr/Unsupervised Model Selection for Variational Disentangled Representation Learning b/data/2020/iclr/Unsupervised Model Selection for Variational Disentangled Representation Learning new file mode 100644 index 0000000000..4d1e2417d2 --- /dev/null +++ b/data/2020/iclr/Unsupervised Model Selection for Variational Disentangled Representation Learning @@ -0,0 +1 @@ +Disentangled representations have recently been shown to improve fairness, data efficiency and generalisation in simple supervised and reinforcement learning tasks. 
To extend the benefits of disentangled representations to more complex domains and practical applications, it is important to enable hyperparameter tuning and model selection of existing unsupervised approaches without requiring access to ground truth attribute labels, which are not available for most datasets. This paper addresses this problem by introducing a simple yet robust and reliable method for unsupervised disentangled model selection. Our approach, Unsupervised Disentanglement Ranking (UDR), leverages the recent theoretical results that explain why variational autoencoders disentangle (Rolinek et al., 2019), to quantify the quality of disentanglement by performing pairwise comparisons between trained model representations. We show that our approach performs comparably to the existing supervised alternatives across 5,400 models from six state-of-the-art unsupervised disentangled representation learning model classes. Furthermore, we show that the ranking produced by our approach correlates well with the final task performance on two different domains. \ No newline at end of file diff --git a/data/2020/iclr/V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control b/data/2020/iclr/V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control new file mode 100644 index 0000000000..a3e98f6bd9 --- /dev/null +++ b/data/2020/iclr/V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control @@ -0,0 +1 @@ +Some of the most successful applications of deep reinforcement learning to challenging domains in discrete and continuous control have used policy gradient methods in the on-policy setting. However, policy gradients can suffer from large variance that may limit performance, and in practice require carefully tuned entropy regularization to prevent policy collapse. 
As an alternative to policy gradient algorithms, we introduce V-MPO, an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO) that performs policy iteration based on a learned state-value function. We show that V-MPO surpasses previously reported scores for both the Atari-57 and DMLab-30 benchmark suites in the multi-task setting, and does so reliably without importance weighting, entropy regularization, or population-based tuning of hyperparameters. On individual DMLab and Atari levels, the proposed algorithm can achieve scores that are substantially higher than has previously been reported. V-MPO is also applicable to problems with high-dimensional, continuous action spaces, which we demonstrate in the context of learning to control simulated humanoids with 22 degrees of freedom from full state observations and 56 degrees of freedom from pixel observations, as well as example OpenAI Gym tasks where V-MPO achieves substantially higher asymptotic scores than previously reported. \ No newline at end of file diff --git a/data/2020/iclr/V4D: 4D Convolutional Neural Networks for Video-level Representation Learning b/data/2020/iclr/V4D: 4D Convolutional Neural Networks for Video-level Representation Learning new file mode 100644 index 0000000000..ad76b88b67 --- /dev/null +++ b/data/2020/iclr/V4D: 4D Convolutional Neural Networks for Video-level Representation Learning @@ -0,0 +1 @@ +Most existing 3D CNNs for video representation learning are clip-based methods, and thus do not consider video-level temporal evolution of spatio-temporal features. In this paper, we propose Video-level 4D Convolutional Neural Networks, referred to as V4D, to model the evolution of long-range spatio-temporal representation with 4D convolutions, and at the same time, to preserve strong 3D spatio-temporal representation with residual connections. 
Specifically, we design a new 4D residual block able to capture inter-clip interactions, which could enhance the representation power of the original clip-level 3D CNNs. The 4D residual blocks can be easily integrated into existing 3D CNNs to perform long-range modeling hierarchically. We further introduce the training and inference methods for the proposed V4D. Extensive experiments are conducted on three video recognition benchmarks, where V4D achieves excellent results, surpassing recent 3D CNNs by a large margin. \ No newline at end of file diff --git a/data/2020/iclr/VL-BERT: Pre-training of Generic Visual-Linguistic Representations b/data/2020/iclr/VL-BERT: Pre-training of Generic Visual-Linguistic Representations new file mode 100644 index 0000000000..bd47bea8aa --- /dev/null +++ b/data/2020/iclr/VL-BERT: Pre-training of Generic Visual-Linguistic Representations @@ -0,0 +1 @@ +We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. In it, each element of the input is either a word from the input sentence or a region-of-interest (RoI) from the input image. It is designed to fit most visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset, together with a text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure can better align the visual-linguistic clues and benefit the downstream tasks, such as visual commonsense reasoning, visual question answering and referring expression comprehension. It is worth noting that VL-BERT achieved first place among single models on the leaderboard of the VCR benchmark. Code is released at \url{this https URL}. 
\ No newline at end of file diff --git a/data/2020/iclr/Variational Recurrent Models for Solving Partially Observable Control Tasks b/data/2020/iclr/Variational Recurrent Models for Solving Partially Observable Control Tasks new file mode 100644 index 0000000000..688812d1c1 --- /dev/null +++ b/data/2020/iclr/Variational Recurrent Models for Solving Partially Observable Control Tasks @@ -0,0 +1 @@ +In partially observable (PO) environments, deep reinforcement learning (RL) agents often suffer from unsatisfactory performance, since two problems need to be tackled together: how to extract information from the raw observations to solve the task, and how to improve the policy. In this study, we propose an RL algorithm for solving PO tasks. Our method comprises two parts: a variational recurrent model (VRM) for modeling the environment, and an RL controller that has access to both the environment and the VRM. The proposed algorithm was tested in two types of PO robotic control tasks, those in which either coordinates or velocities were not observable and those that require long-term memorization. Our experiments show that the proposed algorithm achieved better data efficiency and/or learned more optimal policy than other alternative approaches in tasks in which unobserved states cannot be inferred from raw observations in a simple manner. \ No newline at end of file diff --git a/data/2020/iclr/Vid2Game: Controllable Characters Extracted from Real-World Videos b/data/2020/iclr/Vid2Game: Controllable Characters Extracted from Real-World Videos new file mode 100644 index 0000000000..23c3a02304 --- /dev/null +++ b/data/2020/iclr/Vid2Game: Controllable Characters Extracted from Real-World Videos @@ -0,0 +1,2 @@ +We are given a video of a person performing a certain activity, from which we extract a controllable model. 
The model generates novel image sequences of that person, according to arbitrary user-defined control signals, typically marking the displacement of the moving body. The generated video can have an arbitrary background, and effectively capture both the dynamics and appearance of the person. +The method is based on two networks. The first network maps a current pose, and a single-instance control signal to the next pose. The second network maps the current pose, the new pose, and a given background, to an output frame. Both networks include multiple novelties that enable high-quality performance. This is demonstrated on multiple characters extracted from various videos of dancers and athletes. \ No newline at end of file diff --git a/data/2020/iclr/VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation b/data/2020/iclr/VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation new file mode 100644 index 0000000000..ff42d78975 --- /dev/null +++ b/data/2020/iclr/VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation @@ -0,0 +1 @@ +Generative models that can model and predict sequences of future events can, in principle, learn to capture complex real-world phenomena, such as physical interactions. However, a central challenge in video prediction is that the future is highly uncertain: a sequence of past observations of events can imply many possible futures. Although a number of recent works have studied probabilistic models that can represent uncertain futures, such models are either extremely expensive computationally as in the case of pixel-level autoregressive models, or do not directly optimize the likelihood of the data. To our knowledge, our work is the first to propose multi-frame video prediction with normalizing flows, which allows for direct optimization of the data likelihood, and produces high-quality stochastic predictions. 
We describe an approach for modeling the latent space dynamics, and demonstrate that flow-based generative models offer a viable and competitive approach to generative modelling of video. \ No newline at end of file diff --git a/data/2020/iclr/Watch, Try, Learn: Meta-Learning from Demonstrations and Rewards b/data/2020/iclr/Watch, Try, Learn: Meta-Learning from Demonstrations and Rewards new file mode 100644 index 0000000000..7990acf360 --- /dev/null +++ b/data/2020/iclr/Watch, Try, Learn: Meta-Learning from Demonstrations and Rewards @@ -0,0 +1 @@ +Imitation learning allows agents to learn complex behaviors from demonstrations. However, learning a complex vision-based task may require an impractical number of demonstrations. Meta-imitation learning is a promising approach towards enabling agents to learn a new task from one or a few demonstrations by leveraging experience from learning similar tasks. In the presence of task ambiguity or unobserved dynamics, demonstrations alone may not provide enough information; an agent must also try the task to successfully infer a policy. In this work, we propose a method that can learn to learn from both demonstrations and trial-and-error experience with sparse reward feedback. In comparison to meta-imitation, this approach enables the agent to effectively and efficiently improve itself autonomously beyond the demonstration data. In comparison to meta-reinforcement learning, we can scale to substantially broader distributions of tasks, as the demonstration reduces the burden of exploration. Our experiments show that our method significantly outperforms prior approaches on a set of challenging, vision-based control tasks. 
\ No newline at end of file diff --git a/data/2020/iclr/Weakly Supervised Clustering by Exploiting Unique Class Count b/data/2020/iclr/Weakly Supervised Clustering by Exploiting Unique Class Count new file mode 100644 index 0000000000..0e6dbac362 --- /dev/null +++ b/data/2020/iclr/Weakly Supervised Clustering by Exploiting Unique Class Count @@ -0,0 +1 @@ +A weakly supervised learning based clustering framework is proposed in this paper. As the core of this framework, we introduce a novel multiple instance learning task based on a bag level label called unique class count (ucc), which is the number of unique classes among all instances inside the bag. In this task, no annotations on individual instances inside the bag are needed during training of the models. We mathematically prove that with a perfect ucc classifier, perfect clustering of individual instances inside the bags is possible even when no annotations on individual instances are given during training. We have constructed a neural network based ucc classifier and experimentally shown that the clustering performance of our framework with our weakly supervised ucc classifier is comparable to that of fully supervised learning models where labels for all instances are known. Furthermore, we have tested the applicability of our framework to a real world task of semantic segmentation of breast cancer metastases in histological lymph node sections and shown that the performance of our weakly supervised framework is comparable to the performance of a fully supervised Unet model. 
\ No newline at end of file diff --git a/data/2020/iclr/What graph neural networks cannot learn: depth vs width b/data/2020/iclr/What graph neural networks cannot learn: depth vs width new file mode 100644 index 0000000000..99759af92c --- /dev/null +++ b/data/2020/iclr/What graph neural networks cannot learn: depth vs width @@ -0,0 +1 @@ +This paper studies the expressive power of graph neural networks falling within the message-passing framework (GNNmp). Two results are presented. First, GNNmp are shown to be Turing universal under sufficient conditions on their depth, width, node attributes, and layer expressiveness. Second, it is discovered that GNNmp can lose a significant portion of their power when their depth and width are restricted. The proposed impossibility statements stem from a new technique that enables the repurposing of seminal results from distributed computing and leads to lower bounds for an array of decision, optimization, and estimation problems involving graphs. Strikingly, several of these problems are deemed impossible unless the product of a GNNmp's depth and width exceeds a polynomial of the graph size; this dependence remains significant even for tasks that appear simple or when considering approximation. \ No newline at end of file diff --git a/data/2021/iclr/A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning b/data/2021/iclr/A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning new file mode 100644 index 0000000000..a32aeaf6f3 --- /dev/null +++ b/data/2021/iclr/A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning @@ -0,0 +1 @@ +Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed compute systems. A key bottleneck of such systems is the communication overhead for exchanging information across the workers, such as stochastic gradients.
Among the many techniques proposed to remedy this issue, one of the most successful is the framework of compressed communication with error feedback (EF). EF remains the only known technique that can deal with the error induced by contractive compressors, which are not unbiased, such as Top-$K$. In this paper, we propose a new alternative to EF for dealing with contractive compressors that is better both theoretically and practically. In particular, we propose a construction which can transform any contractive compressor into an induced unbiased compressor. Following this transformation, existing methods able to work with unbiased compressors can be applied. We show that our approach leads to vast improvements over EF, including reduced memory requirements, better communication complexity guarantees, and fewer assumptions. We further extend our results to federated learning with partial participation following an arbitrary distribution over the nodes, and demonstrate the benefits thereof. We perform several numerical experiments which validate our theoretical findings. \ No newline at end of file diff --git a/data/2021/iclr/A Block Minifloat Representation for Training Deep Neural Networks b/data/2021/iclr/A Block Minifloat Representation for Training Deep Neural Networks new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/A Critique of Self-Expressive Deep Subspace Clustering b/data/2021/iclr/A Critique of Self-Expressive Deep Subspace Clustering new file mode 100644 index 0000000000..2937fdbd66 --- /dev/null +++ b/data/2021/iclr/A Critique of Self-Expressive Deep Subspace Clustering @@ -0,0 +1 @@ +Subspace clustering is an unsupervised clustering technique designed to cluster data that is supported on a union of linear subspaces, with each subspace defining a cluster with dimension lower than the ambient space.
Many existing formulations for this problem are based on exploiting the self-expressive property of linear subspaces, where any point within a subspace can be represented as a linear combination of other points within the subspace. To extend this approach to data supported on a union of non-linear manifolds, numerous studies have proposed learning an appropriate kernel embedding of the original data using a neural network, which is regularized by a self-expressive loss function to encourage a union-of-linear-subspaces prior on the data in the embedded space. Here we show that there are a number of potential flaws with this approach that have not been adequately addressed in prior work. In particular, we show the model formulation is often ill-posed in multiple ways, which can lead to a degenerate embedding of the data that need not correspond to a union of subspaces at all. We validate our theoretical results experimentally and additionally repeat prior experiments reported in the literature, where we conclude that a significant portion of the previously claimed performance benefits can be attributed to an ad hoc post-processing step rather than the clustering model. \ No newline at end of file diff --git a/data/2021/iclr/A Design Space Study for LISTA and Beyond b/data/2021/iclr/A Design Space Study for LISTA and Beyond new file mode 100644 index 0000000000..faf859ed5a --- /dev/null +++ b/data/2021/iclr/A Design Space Study for LISTA and Beyond @@ -0,0 +1 @@ +In recent years, great success has been witnessed in building problem-specific deep networks from unrolling iterative algorithms, for solving inverse problems and beyond. Unrolling is believed to combine the model-based prior with the learning capacity of deep learning. This paper revisits the role of unrolling as a design approach for deep networks: to what extent is its resulting special architecture superior, and can we find better ones?
Using LISTA for sparse recovery as a representative example, we conduct the first thorough design space study for the unrolled models. Among all possible variations, we focus on extensively varying the connectivity patterns and neuron types, leading to a gigantic design space arising from LISTA. To efficiently explore this space and identify top performers, we leverage the emerging tool of neural architecture search (NAS). We carefully examine the searched top architectures in a number of settings, and are able to discover networks that are consistently better than LISTA. We further present more visualization and analysis to "open the black box", and find that the searched top architectures demonstrate highly consistent and potentially transferable patterns. We hope our study will spark further reflection and exploration on how to better combine model-based optimization priors with data-driven learning. \ No newline at end of file diff --git a/data/2021/iclr/A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima b/data/2021/iclr/A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima new file mode 100644 index 0000000000..51a43d5434 --- /dev/null +++ b/data/2021/iclr/A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima @@ -0,0 +1 @@ +Stochastic Gradient Descent (SGD) and its variants are mainstream methods for training deep networks in practice. SGD is known to find a flat minimum that often generalizes well. However, it is mathematically unclear how deep learning can select a flat minimum among so many minima. To answer the question quantitatively, we develop a density diffusion theory (DDT) to reveal how minima selection quantitatively depends on the minima sharpness and the hyperparameters.
To the best of our knowledge, we are the first to theoretically and empirically prove that, benefiting from the Hessian-dependent covariance of stochastic gradient noise, SGD favors flat minima exponentially more than sharp minima, while Gradient Descent (GD) with injected white noise favors flat minima only polynomially more than sharp minima. We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima in terms of the ratio of the batch size and learning rate. Thus, large-batch training cannot search for flat minima efficiently in a realistic computational time. \ No newline at end of file diff --git a/data/2021/iclr/A Discriminative Gaussian Mixture Model with Sparsity b/data/2021/iclr/A Discriminative Gaussian Mixture Model with Sparsity new file mode 100644 index 0000000000..0956958f41 --- /dev/null +++ b/data/2021/iclr/A Discriminative Gaussian Mixture Model with Sparsity @@ -0,0 +1 @@ +In probabilistic classification, a discriminative model based on the softmax function has a potential limitation in that it assumes unimodality for each class in the feature space. The mixture model can address this issue, although it leads to an increase in the number of parameters. We propose a sparse classifier based on a discriminative GMM, referred to as a sparse discriminative Gaussian mixture (SDGM). In the SDGM, a GMM-based discriminative model is trained via sparse Bayesian learning. Using this sparse learning framework, we can simultaneously remove redundant Gaussian components and reduce the number of parameters used in the remaining components during learning; this learning method reduces the model complexity, thereby improving the generalization capability. Furthermore, the SDGM can be embedded into neural networks (NNs), such as convolutional NNs, and can be trained in an end-to-end manner.
Experimental results demonstrated that the proposed method outperformed the existing softmax-based discriminative models. \ No newline at end of file diff --git a/data/2021/iclr/A Distributional Approach to Controlled Text Generation b/data/2021/iclr/A Distributional Approach to Controlled Text Generation new file mode 100644 index 0000000000..4d4b76e0ea --- /dev/null +++ b/data/2021/iclr/A Distributional Approach to Controlled Text Generation @@ -0,0 +1 @@ +We propose a Distributional Approach for addressing Controlled Text Generation from pre-trained Language Models (LMs). This approach permits specifying, in a single formal framework, both “pointwise” and “distributional” constraints over the target LM — to our knowledge, the first model with such generality — while minimizing KL divergence from the initial LM distribution. The optimal target distribution is then uniquely determined as an explicit EBM (Energy-Based Model) representation. From that optimal representation we then train a target controlled Autoregressive LM through an adaptive distributional variant of Policy Gradient. We conduct a first set of experiments over pointwise constraints showing the advantages of our approach over a set of baselines, in terms of obtaining a controlled LM balancing constraint satisfaction with divergence from the initial LM. We then perform experiments over distributional constraints, a unique feature of our approach, demonstrating its potential as a remedy to the problem of Bias in Language Models.
Through an ablation study, we show the effectiveness of our adaptive technique for obtaining faster convergence. \ No newline at end of file diff --git a/data/2021/iclr/A Geometric Analysis of Deep Generative Image Models and Its Applications b/data/2021/iclr/A Geometric Analysis of Deep Generative Image Models and Its Applications new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/A Good Image Generator Is What You Need for High-Resolution Video Synthesis b/data/2021/iclr/A Good Image Generator Is What You Need for High-Resolution Video Synthesis new file mode 100644 index 0000000000..c2a8a2d334 --- /dev/null +++ b/data/2021/iclr/A Good Image Generator Is What You Need for High-Resolution Video Synthesis @@ -0,0 +1 @@ +Image and video synthesis are closely related areas aiming at generating content from noise. While rapid progress has been demonstrated in improving image-based models to handle large resolutions, high-quality renderings, and wide variations in image content, achieving comparable video generation results remains problematic. We present a framework that leverages contemporary image generators to render high-resolution videos. We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator. Not only does such a framework render high-resolution videos, but it is also an order of magnitude more computationally efficient. We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled. With such a representation, our framework allows for a broad range of applications, including content and motion manipulation. Furthermore, we introduce a new task, which we call cross-domain video synthesis, in which the image and motion generators are trained on disjoint datasets belonging to different domains. This allows for generating moving objects for which the desired video data is not available.
Extensive experiments on various datasets demonstrate the advantages of our methods over existing video generation techniques. Code will be released at https://github.com/snap-research/MoCoGAN-HD. \ No newline at end of file diff --git a/data/2021/iclr/A Gradient Flow Framework For Analyzing Network Pruning b/data/2021/iclr/A Gradient Flow Framework For Analyzing Network Pruning new file mode 100644 index 0000000000..656bce4ed4 --- /dev/null +++ b/data/2021/iclr/A Gradient Flow Framework For Analyzing Network Pruning @@ -0,0 +1 @@ +Recent network pruning methods focus on pruning models early-on in training. To estimate the impact of removing a parameter, these methods use importance measures that were originally designed to prune trained models. Despite lacking justification for their use early-on in training, such measures result in surprisingly low accuracy loss. To better explain this behavior, we develop a general gradient flow based framework that unifies state-of-the-art importance measures through the norm of model parameters. We use this framework to determine the relationship between pruning measures and evolution of model parameters, establishing several results related to pruning models early-on in training: (i) magnitude-based pruning removes parameters that contribute least to reduction in loss, resulting in models that converge faster than magnitude-agnostic methods; (ii) loss-preservation based pruning preserves first-order model evolution dynamics and is therefore appropriate for pruning minimally trained models; and (iii) gradient-norm based pruning affects second-order model evolution dynamics, such that increasing gradient norm via pruning can produce poorly performing models. We validate our claims on several VGG-13, MobileNet-V1, and ResNet-56 models trained on CIFAR-10 and CIFAR-100. Code available at this https URL. 
\ No newline at end of file diff --git a/data/2021/iclr/A Hypergradient Approach to Robust Regression without Correspondence b/data/2021/iclr/A Hypergradient Approach to Robust Regression without Correspondence new file mode 100644 index 0000000000..414aee2028 --- /dev/null +++ b/data/2021/iclr/A Hypergradient Approach to Robust Regression without Correspondence @@ -0,0 +1 @@ +We consider a regression problem, where the correspondence between input and output data is not available. Such shuffled data is commonly observed in many real-world problems. Taking flow cytometry as an example, the measuring instruments are unable to preserve the correspondence between the samples and the measurements. Due to the combinatorial nature, most existing methods are only applicable when the sample size is small, and are limited to linear regression models. To overcome such bottlenecks, we propose a new computational framework, ROBOT, for the shuffled regression problem, which is applicable to large data and complex models. Specifically, we propose to formulate the regression without correspondence as a continuous optimization problem. Then, by exploiting the interaction between the regression model and the data correspondence, we propose to develop a hypergradient approach based on differentiable programming techniques. Such a hypergradient approach essentially views the data correspondence as an operator of the regression, and therefore allows us to find a better descent direction for the model parameter by differentiating through the data correspondence. ROBOT is quite general, and can be further extended to the inexact correspondence setting, where the input and output data are not necessarily exactly aligned. Thorough numerical experiments show that ROBOT achieves better performance than existing methods in both linear and nonlinear regression tasks, including real-world applications such as flow cytometry and multi-object tracking.
\ No newline at end of file diff --git a/data/2021/iclr/A Learning Theoretic Perspective on Local Explainability b/data/2021/iclr/A Learning Theoretic Perspective on Local Explainability new file mode 100644 index 0000000000..590bf2d60f --- /dev/null +++ b/data/2021/iclr/A Learning Theoretic Perspective on Local Explainability @@ -0,0 +1 @@ +In this paper, we explore connections between interpretable machine learning and learning theory through the lens of local approximation explanations. First, we tackle the traditional problem of performance generalization and bound the test-time accuracy of a model using a notion of how locally explainable it is. Second, we explore the novel problem of explanation generalization which is an important concern for a growing class of finite sample-based local approximation explanations. Finally, we validate our theoretical results empirically and show that they reflect what can be seen in practice. \ No newline at end of file diff --git a/data/2021/iclr/A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks b/data/2021/iclr/A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks new file mode 100644 index 0000000000..42a4549603 --- /dev/null +++ b/data/2021/iclr/A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks @@ -0,0 +1 @@ +Autoregressive language models pretrained on large corpora have been successful at solving downstream tasks, even with zero-shot usage. However, there is little theoretical justification for their success. This paper considers the following questions: (1) Why should learning the distribution of natural language help with downstream classification tasks? (2) Why do features learned using language modeling help solve downstream tasks with linear classifiers? 
For (1), we hypothesize, and verify empirically, that classification tasks of interest can be reformulated as next word prediction tasks, thus making language modeling a meaningful pretraining task. For (2), we analyze properties of the cross-entropy objective to show that $\epsilon$-optimal language models in cross-entropy (log-perplexity) learn features that are $\mathcal{O}(\sqrt{\epsilon})$-good on natural linear classification tasks, thus demonstrating mathematically that doing well on language modeling can be beneficial for downstream tasks. We perform experiments to verify assumptions and validate theoretical results. Our theoretical insights motivate a simple alternative to the cross-entropy objective that performs well on some linear classification tasks. \ No newline at end of file diff --git a/data/2021/iclr/A PAC-Bayesian Approach to Generalization Bounds for Graph Neural Networks b/data/2021/iclr/A PAC-Bayesian Approach to Generalization Bounds for Graph Neural Networks new file mode 100644 index 0000000000..f333d81e47 --- /dev/null +++ b/data/2021/iclr/A PAC-Bayesian Approach to Generalization Bounds for Graph Neural Networks @@ -0,0 +1 @@ +In this paper, we derive generalization bounds for the two primary classes of graph neural networks (GNNs), namely graph convolutional networks (GCNs) and message passing GNNs (MPGNNs), via a PAC-Bayesian approach. Our result reveals that the maximum node degree and spectral norm of the weights govern the generalization bounds of both models. We also show that our bound for GCNs is a natural generalization of the results developed in arXiv:1707.09564v2 [cs.LG] for fully-connected and convolutional neural networks. For message passing GNNs, our PAC-Bayes bound improves over the Rademacher complexity based bound in arXiv:2002.06157v1 [cs.LG], showing a tighter dependency on the maximum node degree and the maximum hidden dimension. 
The key ingredients of our proofs are a perturbation analysis of GNNs and the generalization of PAC-Bayes analysis to non-homogeneous GNNs. We perform an empirical study on several real-world graph datasets and verify that our PAC-Bayes bound is tighter than others. \ No newline at end of file diff --git a/data/2021/iclr/A Panda? No, It's a Sloth: Slowdown Attacks on Adaptive Multi-Exit Neural Network Inference b/data/2021/iclr/A Panda? No, It's a Sloth: Slowdown Attacks on Adaptive Multi-Exit Neural Network Inference new file mode 100644 index 0000000000..9d3c6b01ad --- /dev/null +++ b/data/2021/iclr/A Panda? No, It's a Sloth: Slowdown Attacks on Adaptive Multi-Exit Neural Network Inference @@ -0,0 +1 @@ +Recent increases in the computational demands of deep neural networks (DNNs), combined with the observation that most input samples require only simple models, have sparked interest in $input$-$adaptive$ multi-exit architectures, such as MSDNets or Shallow-Deep Networks. These architectures enable faster inferences and could bring DNNs to low-power devices, e.g. in the Internet of Things (IoT). However, it is unknown if the computational savings provided by this approach are robust against adversarial pressure. In particular, an adversary may aim to slow down adaptive DNNs by increasing their average inference time$-$a threat analogous to the $denial$-$of$-$service$ attacks from the Internet. In this paper, we conduct a systematic evaluation of this threat by experimenting with three generic multi-exit DNNs (based on VGG16, MobileNet, and ResNet56) and a custom multi-exit architecture, on two popular image classification benchmarks (CIFAR-10 and Tiny ImageNet). To this end, we show that adversarial sample-crafting techniques can be modified to cause slowdown, and we propose a metric for comparing their impact on different architectures. 
We show that a slowdown attack reduces the efficacy of multi-exit DNNs by 90%-100%, and it amplifies the latency by 1.5-5$\times$ in a typical IoT deployment. We also show that it is possible to craft universal, reusable perturbations and that the attack can be effective in realistic black-box scenarios, where the attacker has limited knowledge about the victim. Finally, we show that adversarial training provides limited protection against slowdowns. These results suggest that further research is needed for defending multi-exit architectures against this emerging threat. \ No newline at end of file diff --git a/data/2021/iclr/A Temporal Kernel Approach for Deep Learning with Continuous-time Information b/data/2021/iclr/A Temporal Kernel Approach for Deep Learning with Continuous-time Information new file mode 100644 index 0000000000..899e2386f1 --- /dev/null +++ b/data/2021/iclr/A Temporal Kernel Approach for Deep Learning with Continuous-time Information @@ -0,0 +1 @@ +Sequential deep learning models such as RNN, causal CNN and attention mechanism do not readily consume continuous-time information. Discretizing the temporal data, as we show, causes inconsistency even for simple continuous-time processes. Current approaches often handle time in a heuristic manner to be consistent with the existing deep learning architectures and implementations. In this paper, we provide a principled way to characterize continuous-time systems using deep learning tools. Notably, the proposed approach applies to all the major deep learning architectures and requires little modifications to the implementation. The critical insight is to represent the continuous-time system by composing neural networks with a temporal kernel, where we gain our intuition from the recent advancements in understanding deep learning with Gaussian process and neural tangent kernel. 
To represent the temporal kernel, we introduce the random feature approach and convert the kernel learning problem to spectral density estimation under reparameterization. We further prove the convergence and consistency results even when the temporal kernel is non-stationary, and the spectral density is misspecified. The simulations and real-data experiments demonstrate the empirical effectiveness of our temporal kernel approach in a broad range of settings. \ No newline at end of file diff --git a/data/2021/iclr/A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention b/data/2021/iclr/A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention new file mode 100644 index 0000000000..34e6004fdd --- /dev/null +++ b/data/2021/iclr/A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention @@ -0,0 +1 @@ +We address the problem of learning on large sets of features, motivated by the need of performing pooling operations in long biological sequences of varying sizes, with long-range dependencies, and possibly few labeled data. To address this challenging task, we introduce a parametrized embedding that aggregates the features from a given set according to the optimal transport plan between the set and a trainable reference. Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost. Our aggregation technique admits two useful interpretations: it may be seen as a mechanism related to attention layers in neural networks, yet that requires less data, or it may be seen as a scalable surrogate of a classical optimal transport-based kernel. 
We experimentally demonstrate the effectiveness of our approach on biological sequences, achieving state-of-the-art results on protein fold recognition and chromatin profile detection tasks, and, as a proof of concept, we show promising results for processing natural language sequences. We provide an open-source implementation of our embedding that can be used alone or as a module in larger learning models. Our code is freely available at \url{https://github.com/claying/OTK}. \ No newline at end of file diff --git a/data/2021/iclr/A Unified Approach to Interpreting and Boosting Adversarial Transferability b/data/2021/iclr/A Unified Approach to Interpreting and Boosting Adversarial Transferability new file mode 100644 index 0000000000..060aecb51a --- /dev/null +++ b/data/2021/iclr/A Unified Approach to Interpreting and Boosting Adversarial Transferability @@ -0,0 +1 @@ +In this paper, we use the interaction inside adversarial perturbations to explain and boost the adversarial transferability. We discover and prove the negative correlation between the adversarial transferability and the interaction inside adversarial perturbations. The negative correlation is further verified through different DNNs with various inputs. Moreover, this negative correlation can be regarded as a unified perspective to understand current transferability-boosting methods. To this end, we prove that some classic methods of enhancing the transferability essentially decrease interactions inside adversarial perturbations. Based on this, we propose to directly penalize interactions during the attacking process, which significantly improves the adversarial transferability.
\ No newline at end of file diff --git a/data/2021/iclr/A Universal Representation Transformer Layer for Few-Shot Image Classification b/data/2021/iclr/A Universal Representation Transformer Layer for Few-Shot Image Classification new file mode 100644 index 0000000000..2d9138c3bf --- /dev/null +++ b/data/2021/iclr/A Universal Representation Transformer Layer for Few-Shot Image Classification @@ -0,0 +1 @@ +Few-shot classification aims to recognize unseen classes when presented with only a small number of samples. We consider the problem of multi-domain few-shot image classification, where unseen classes and examples come from diverse data sources. This problem has seen growing interest and has inspired the development of benchmarks such as Meta-Dataset. A key challenge in this multi-domain setting is to effectively integrate the feature representations from the diverse set of training domains. Here, we propose a Universal Representation Transformer (URT) layer that meta-learns to leverage universal features for few-shot classification by dynamically re-weighting and composing the most appropriate domain-specific representations. In experiments, we show that URT sets a new state-of-the-art result on Meta-Dataset. Specifically, it achieves top performance on the highest number of data sources compared to competing methods. We analyze variants of URT and present a visualization of the attention score heatmaps that sheds light on how the model performs cross-domain generalization. Our code is available at this https URL. 
\ No newline at end of file diff --git a/data/2021/iclr/A Wigner-Eckart Theorem for Group Equivariant Convolution Kernels b/data/2021/iclr/A Wigner-Eckart Theorem for Group Equivariant Convolution Kernels new file mode 100644 index 0000000000..f889c3fc0a --- /dev/null +++ b/data/2021/iclr/A Wigner-Eckart Theorem for Group Equivariant Convolution Kernels @@ -0,0 +1 @@ +Group equivariant convolutional networks (GCNNs) endow classical convolutional networks with additional symmetry priors, which can lead to a considerably improved performance. Recent advances in the theoretical description of GCNNs revealed that such models can generally be understood as performing convolutions with G-steerable kernels, that is, kernels that satisfy an equivariance constraint themselves. While the G-steerability constraint has been derived, it has to date only been solved for specific use cases - a general characterization of G-steerable kernel spaces is still missing. This work provides such a characterization for the practically relevant case of G being any compact group. Our investigation is motivated by a striking analogy between the constraints underlying steerable kernels on the one hand and spherical tensor operators from quantum mechanics on the other hand. By generalizing the famous Wigner-Eckart theorem for spherical tensor operators, we prove that steerable kernel spaces are fully understood and parameterized in terms of 1) generalized reduced matrix elements, 2) Clebsch-Gordan coefficients, and 3) harmonic basis functions on homogeneous spaces. 
\ No newline at end of file diff --git a/data/2021/iclr/A statistical theory of cold posteriors in deep neural networks b/data/2021/iclr/A statistical theory of cold posteriors in deep neural networks new file mode 100644 index 0000000000..ee99245890 --- /dev/null +++ b/data/2021/iclr/A statistical theory of cold posteriors in deep neural networks @@ -0,0 +1 @@ +To get Bayesian neural networks to perform comparably to standard neural networks it is usually necessary to artificially reduce uncertainty using a "tempered" or "cold" posterior. This is extremely concerning: if the prior is accurate, Bayes inference/decision theory is optimal, and any artificial changes to the posterior should harm performance. While this suggests that the prior may be at fault, here we argue that in fact, BNNs for image classification use the wrong likelihood. In particular, standard image benchmark datasets such as CIFAR-10 are carefully curated. We develop a generative model describing curation which gives a principled Bayesian account of cold posteriors, because the likelihood under this new generative model closely matches the tempered likelihoods used in past work. \ No newline at end of file diff --git a/data/2021/iclr/A teacher-student framework to distill future trajectories b/data/2021/iclr/A teacher-student framework to distill future trajectories new file mode 100644 index 0000000000..e827ed7788 --- /dev/null +++ b/data/2021/iclr/A teacher-student framework to distill future trajectories @@ -0,0 +1 @@ +By learning to predict trajectories of dynamical systems, model-based methods can make extensive use of all observations from past experience. However, due to partial observability, stochasticity, compounding errors, and irrelevant dynamics, training to predict observations explicitly often results in poor models. Model-free techniques try to side-step the problem by learning to predict values directly. 
While breaking the explicit dependency on future observations can result in strong performance, this usually comes at the cost of low sample efficiency, as the abundant information about the dynamics contained in future observations goes unused. Here we take a step back from both approaches: Instead of hand-designing how trajectories should be incorporated, a teacher network learns to extract relevant information from the trajectories and to distill it into target activations which guide a student model that can only observe the present. The teacher is trained with meta-gradients to maximize the student’s performance on a validation set. Our approach performs well on tasks that are difficult for model-free and model-based methods, and we study the role of every component through ablation studies. \ No newline at end of file diff --git a/data/2021/iclr/A unifying view on implicit bias in training linear neural networks b/data/2021/iclr/A unifying view on implicit bias in training linear neural networks new file mode 100644 index 0000000000..5f7939ffe3 --- /dev/null +++ b/data/2021/iclr/A unifying view on implicit bias in training linear neural networks @@ -0,0 +1 @@ +We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases, and investigate the linear version of the formulation called linear tensor networks. For $L$-layer linear tensor networks that are orthogonally decomposable, we show that gradient flow on separable classification finds a stationary point of the $\ell_{2/L}$ max-margin problem in a "transformed" input space defined by the network. 
For underdetermined regression, we prove that gradient flow finds a global minimum which minimizes a norm-like function that interpolates between weighted $\ell_1$ and $\ell_2$ norms in the transformed input space. Our theorems subsume existing results in the literature while removing most of the convergence assumptions. We also provide experiments that corroborate our analysis. \ No newline at end of file diff --git a/data/2021/iclr/ALFWorld: Aligning Text and Embodied Environments for Interactive Learning b/data/2021/iclr/ALFWorld: Aligning Text and Embodied Environments for Interactive Learning new file mode 100644 index 0000000000..64c3cd8315 --- /dev/null +++ b/data/2021/iclr/ALFWorld: Aligning Text and Embodied Environments for Interactive Learning @@ -0,0 +1 @@ +Given a simple request (e.g., Put a washed apple in the kitchen fridge), humans can reason in purely abstract terms by imagining action sequences and scoring their likelihood of success, prototypicality, and efficiency, all without moving a muscle. Once we see the kitchen in question, we can update our abstract plans to fit the scene. Embodied agents require the same abilities, but existing work does not yet provide the infrastructure necessary for both reasoning abstractly and executing concretely. We address this limitation by introducing ALFWorld, a simulator that enables agents to learn abstract, text-based policies in TextWorld (Cote et al., 2018) and then execute goals from the ALFRED benchmark (Shridhar et al., 2020) in a rich visual environment. ALFWorld enables the creation of a new BUTLER agent whose abstract knowledge, learned in TextWorld, corresponds directly to concrete, visually grounded actions. In turn, as we demonstrate empirically, this fosters better agent generalization than training only in the visually grounded environment. 
BUTLER's simple, modular design factors the problem to allow researchers to focus on models for improving every piece of the pipeline (language understanding, planning, navigation, visual scene understanding, and so forth). \ No newline at end of file diff --git a/data/2021/iclr/ANOCE: Analysis of Causal Effects with Multiple Mediators via Constrained Structural Learning b/data/2021/iclr/ANOCE: Analysis of Causal Effects with Multiple Mediators via Constrained Structural Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/ARMOURED: Adversarially Robust MOdels using Unlabeled data by REgularizing Diversity b/data/2021/iclr/ARMOURED: Adversarially Robust MOdels using Unlabeled data by REgularizing Diversity new file mode 100644 index 0000000000..70457335bf --- /dev/null +++ b/data/2021/iclr/ARMOURED: Adversarially Robust MOdels using Unlabeled data by REgularizing Diversity @@ -0,0 +1 @@ +Adversarial attacks pose a major challenge for modern deep neural networks. Recent advancements show that adversarially robust generalization requires a huge amount of labeled data for training. If annotation becomes a burden, can unlabeled data help bridge the gap? In this paper, we propose ARMOURED, an adversarially robust training method based on semi-supervised learning that consists of two components. The first component applies multi-view learning to simultaneously optimize multiple independent networks and utilizes unlabeled data to enforce labeling consistency. The second component reduces adversarial transferability among the networks via diversity regularizers inspired by determinantal point processes and entropy maximization. Experimental results show that under small perturbation budgets, ARMOURED is robust against strong adaptive adversaries. Notably, ARMOURED does not rely on generating adversarial samples during training. 
When used in combination with adversarial training, ARMOURED achieves state-of-the-art robustness against $\ell_\infty$ and $\ell_2$ attacks for a range of perturbation budgets, while maintaining high accuracy on clean samples. We demonstrate the robustness of ARMOURED on CIFAR-10 and SVHN datasets against state-of-the-art benchmarks in adversarial robust training. \ No newline at end of file diff --git a/data/2021/iclr/Accelerating Convergence of Replica Exchange Stochastic Gradient MCMC via Variance Reduction b/data/2021/iclr/Accelerating Convergence of Replica Exchange Stochastic Gradient MCMC via Variance Reduction new file mode 100644 index 0000000000..3fe8695746 --- /dev/null +++ b/data/2021/iclr/Accelerating Convergence of Replica Exchange Stochastic Gradient MCMC via Variance Reduction @@ -0,0 +1 @@ +Replica exchange stochastic gradient Langevin dynamics (reSGLD) has shown promise in accelerating the convergence in non-convex learning; however, an excessively large correction for avoiding biases from noisy energy estimators has limited the potential of the acceleration. To address this issue, we study the variance reduction for noisy energy estimators, which promotes much more effective swaps. Theoretically, we provide a non-asymptotic analysis on the exponential acceleration for the underlying continuous-time Markov jump process; moreover, we consider a generalized Girsanov theorem which includes the change of Poisson measure to overcome the crude discretization based on Gronwall's inequality and yields a much tighter error in the 2-Wasserstein ($\mathcal{W}_2$) distance. Numerically, we conduct extensive experiments and obtain the state-of-the-art results in optimization and uncertainty estimates for synthetic experiments and image data. 
\ No newline at end of file diff --git a/data/2021/iclr/Accurate Learning of Graph Representations with Graph Multiset Pooling b/data/2021/iclr/Accurate Learning of Graph Representations with Graph Multiset Pooling new file mode 100644 index 0000000000..86072b695f --- /dev/null +++ b/data/2021/iclr/Accurate Learning of Graph Representations with Graph Multiset Pooling @@ -0,0 +1 @@ +Graph neural networks have been widely used on modeling graph data, achieving impressive results on node classification and link prediction tasks. Yet, obtaining an accurate representation for a graph further requires a pooling function that maps a set of node representations into a compact form. A simple sum or average over all node representations considers all node features equally without consideration of their task relevance, and any structural dependencies among them. Recently proposed hierarchical graph pooling methods, on the other hand, may yield the same representation for two different graphs that are distinguished by the Weisfeiler-Lehman test, as they suboptimally preserve information from the node features. To tackle these limitations of existing graph pooling methods, we first formulate the graph pooling problem as a multiset encoding problem with auxiliary information about the graph structure, and propose a Graph Multiset Transformer (GMT) which is a multi-head attention based global pooling layer that captures the interaction between nodes according to their structural dependencies. We show that GMT satisfies both injectiveness and permutation invariance, such that it is at most as powerful as the Weisfeiler-Lehman graph isomorphism test. Moreover, our methods can be easily extended to the previous node clustering approaches for hierarchical graph pooling. 
Our experimental results show that GMT significantly outperforms state-of-the-art graph pooling methods on graph classification benchmarks with high memory and time efficiency, and obtains even larger performance gain on graph reconstruction and generation tasks. \ No newline at end of file diff --git a/data/2021/iclr/Achieving Linear Speedup with Partial Worker Participation in Non-IID Federated Learning b/data/2021/iclr/Achieving Linear Speedup with Partial Worker Participation in Non-IID Federated Learning new file mode 100644 index 0000000000..f4fc53d08e --- /dev/null +++ b/data/2021/iclr/Achieving Linear Speedup with Partial Worker Participation in Non-IID Federated Learning @@ -0,0 +1 @@ +Federated learning (FL) is a distributed machine learning architecture that leverages a large number of workers to jointly learn a model with decentralized data. FL has received increasing attention in recent years thanks to its data privacy protection, communication efficiency and a linear speedup for convergence in training (i.e., convergence performance increases linearly with respect to the number of workers). However, existing studies on linear speedup for convergence are only limited to the assumptions of i.i.d. datasets across workers and/or full worker participation, both of which rarely hold in practice. So far, it remains an open question whether or not the linear speedup for convergence is achievable under non-i.i.d. datasets with partial worker participation in FL. In this paper, we show that the answer is affirmative. Specifically, we show that the federated averaging (FedAvg) algorithm (with two-sided learning rates) on non-i.i.d. 
datasets in non-convex settings achieves a convergence rate $\mathcal{O}(\frac{1}{\sqrt{mKT}} + \frac{1}{T})$ for full worker participation and a convergence rate $\mathcal{O}(\frac{1}{\sqrt{nKT}} + \frac{1}{T})$ for partial worker participation, where $K$ is the number of local steps, $T$ is the number of total communication rounds, $m$ is the total worker number and $n$ is the worker number in one communication round for partial worker participation. Our results also reveal that the local steps in FL could help the convergence and show that the maximum number of local steps can be improved to $T/m$. We conduct extensive experiments on MNIST and CIFAR-10 to verify our theoretical results. \ No newline at end of file diff --git a/data/2021/iclr/Acting in Delayed Environments with Non-Stationary Markov Policies b/data/2021/iclr/Acting in Delayed Environments with Non-Stationary Markov Policies new file mode 100644 index 0000000000..c351fe926d --- /dev/null +++ b/data/2021/iclr/Acting in Delayed Environments with Non-Stationary Markov Policies @@ -0,0 +1 @@ +The standard Markov Decision Process (MDP) formulation hinges on the assumption that an action is executed immediately after it was chosen. However, this assumption is often unrealistic and can lead to catastrophic failures in applications such as robotic manipulation, cloud computing, and finance. We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed with a delay of m steps. The brute-force state augmentation baseline where the state is concatenated to the last m committed actions suffers from an exponential complexity in m, as we show for policy iteration. We then prove that with execution delay, Markov policies in the original state-space are sufficient for attaining maximal reward, but need to be non-stationary. As for stationary Markov policies, we show they are sub-optimal in general. 
Consequently, we devise a non-stationary Q-learning style model-based algorithm that solves delayed execution tasks without resorting to state-augmentation. Experiments on tabular, physical, and Atari domains reveal that it converges quickly to high performance even for substantial delays, while standard approaches that either ignore the delay or rely on state-augmentation struggle or fail due to divergence. The code will be shared upon publication. \ No newline at end of file diff --git a/data/2021/iclr/Activation-level uncertainty in deep neural networks b/data/2021/iclr/Activation-level uncertainty in deep neural networks new file mode 100644 index 0000000000..41622b4720 --- /dev/null +++ b/data/2021/iclr/Activation-level uncertainty in deep neural networks @@ -0,0 +1 @@ +, \ No newline at end of file diff --git a/data/2021/iclr/Active Contrastive Learning of Audio-Visual Video Representations b/data/2021/iclr/Active Contrastive Learning of Audio-Visual Video Representations new file mode 100644 index 0000000000..cf75ab1e56 --- /dev/null +++ b/data/2021/iclr/Active Contrastive Learning of Audio-Visual Video Representations @@ -0,0 +1 @@ +Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing the lower bound on the mutual information (MI) between different views of an instance. However, obtaining a tight lower bound requires a sample size exponential in MI and thus a large set of negative samples. We can incorporate more samples by building a large queue-based dictionary, but there are theoretical limits to performance improvements even with a large number of negative samples. We hypothesize that random negative sampling leads to a highly redundant dictionary that results in suboptimal representations for downstream tasks. 
In this paper, we propose an active contrastive learning approach that builds an actively sampled dictionary with diverse and informative items \ No newline at end of file diff --git a/data/2021/iclr/AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition b/data/2021/iclr/AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition new file mode 100644 index 0000000000..270a96a12b --- /dev/null +++ b/data/2021/iclr/AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition @@ -0,0 +1 @@ +Temporal modelling is key to efficient video action recognition. While understanding temporal information can improve recognition accuracy for dynamic actions, removing temporal redundancy and reusing past features can significantly save computation, leading to efficient action recognition. In this paper, we introduce an adaptive temporal fusion network, called AdaFuse, that dynamically fuses channels from current and past feature maps for strong temporal modelling. Specifically, the necessary information from the historical convolution feature maps is fused with current pruned feature maps with the goal of improving both recognition accuracy and efficiency. In addition, we use a skipping operation to further reduce the computation cost of action recognition. Extensive experiments on Something-Something V1 & V2, Jester and Mini-Kinetics show that our approach can achieve about 40% computation savings with comparable accuracy to state-of-the-art methods. 
The project page can be found at https://mengyuest.github.io/AdaFuse/ \ No newline at end of file diff --git a/data/2021/iclr/AdaGCN: Adaboosting Graph Convolutional Networks into Deep Models b/data/2021/iclr/AdaGCN: Adaboosting Graph Convolutional Networks into Deep Models new file mode 100644 index 0000000000..e36ac335b9 --- /dev/null +++ b/data/2021/iclr/AdaGCN: Adaboosting Graph Convolutional Networks into Deep Models @@ -0,0 +1 @@ +The design of deep graph models still remains to be investigated and the crucial part is how to explore and exploit the knowledge from different hops of neighbors in an efficient way. In this paper, we propose a novel RNN-like deep graph neural network architecture by incorporating AdaBoost into the computation of the network; the proposed graph convolutional network, called AdaGCN (AdaBoosting Graph Convolutional Network), has the ability to efficiently extract knowledge from high-order neighbors and integrate knowledge from different hops of neighbors into the network in an AdaBoost way. We also present the architectural difference between AdaGCN and existing graph convolutional methods to show the benefits of our proposal. Finally, extensive experiments demonstrate the state-of-the-art prediction performance and the computational advantage of our approach AdaGCN. \ No newline at end of file diff --git a/data/2021/iclr/AdaSpeech: Adaptive Text to Speech for Custom Voice b/data/2021/iclr/AdaSpeech: Adaptive Text to Speech for Custom Voice new file mode 100644 index 0000000000..650b3be955 --- /dev/null +++ b/data/2021/iclr/AdaSpeech: Adaptive Text to Speech for Custom Voice @@ -0,0 +1 @@ +Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims to adapt a source TTS model to synthesize personal voice for a target speaker using limited speech data. 
Custom voice presents two unique challenges for TTS adaptation: 1) to support diverse customers, the adaptation model needs to handle diverse acoustic conditions that could be very different from source speech data, and 2) to support a large number of customers, the adaptation parameters need to be small enough for each target speaker to reduce memory usage while maintaining high voice quality. In this work, we propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices. We design several techniques in AdaSpeech to address the two challenges in custom voice: 1) To handle different acoustic conditions, we use two acoustic encoders to extract an utterance-level vector and a sequence of phoneme-level vectors from the target speech during training; in inference, we extract the utterance-level vector from a reference speech and use an acoustic predictor to predict the phoneme-level vectors. 2) To better trade off the adaptation parameters and voice quality, we introduce conditional layer normalization in the mel-spectrogram decoder of AdaSpeech, and fine-tune this part in addition to speaker embedding for adaptation. We pre-train the source TTS model on LibriTTS datasets and fine-tune it on VCTK and LJSpeech datasets (with different acoustic conditions from LibriTTS) with few adaptation data, e.g., 20 sentences, about 1 minute speech. Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker, which demonstrates its effectiveness for custom voice. Audio samples are available at https://speechresearch.github.io/adaspeech/. 
\ No newline at end of file diff --git a/data/2021/iclr/AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights b/data/2021/iclr/AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights new file mode 100644 index 0000000000..2330cd5806 --- /dev/null +++ b/data/2021/iclr/AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights @@ -0,0 +1 @@ +Normalization techniques are a boon for modern deep learning. They let weights converge more quickly with often better generalization performances. It has been argued that the normalization-induced scale invariance among the weights provides an advantageous ground for gradient descent (GD) optimizers: the effective step sizes are automatically reduced over time, stabilizing the overall training procedure. It is often overlooked, however, that the additional introduction of momentum in GD optimizers results in a far more rapid reduction in effective step sizes for scale-invariant weights, a phenomenon that has not yet been studied and may have caused unwanted side effects in the current practice. This is a crucial issue because arguably the vast majority of modern deep neural networks consist of (1) momentum-based GD (e.g. SGD or Adam) and (2) scale-invariant parameters. In this paper, we verify that the widely-adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performances. We propose a simple and effective remedy, SGDP and AdamP: get rid of the radial component, or the norm-increasing direction, at each optimizer step. Because of the scale invariance, this modification only alters the effective step sizes without changing the effective update directions, thus enjoying the original convergence properties of GD optimizers. Given the ubiquity of momentum GD and scale invariance in machine learning, we have evaluated our methods against the baselines on 13 benchmarks. 
They range from vision tasks like classification (e.g. ImageNet), retrieval (e.g. CUB and SOP), and detection (e.g. COCO) to language modelling (e.g. WikiText) and audio classification (e.g. DCASE) tasks. We verify that our solution brings about uniform gains in those benchmarks. Source code is available at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/Adapting to Reward Progressivity via Spectral Reinforcement Learning b/data/2021/iclr/Adapting to Reward Progressivity via Spectral Reinforcement Learning new file mode 100644 index 0000000000..f2b547083e --- /dev/null +++ b/data/2021/iclr/Adapting to Reward Progressivity via Spectral Reinforcement Learning @@ -0,0 +1 @@ +In this paper we consider reinforcement learning tasks with progressive rewards; that is, tasks where the rewards tend to increase in magnitude over time. We hypothesise that this property may be problematic for value-based deep reinforcement learning agents, particularly if the agent must first succeed in relatively unrewarding regions of the task in order to reach more rewarding regions. To address this issue, we propose Spectral DQN, which decomposes the reward into frequencies such that the high frequencies only activate when large rewards are found. This allows the training loss to be balanced so that it gives more even weighting across small and large reward regions. In two domains with extreme reward progressivity, where standard value-based methods struggle significantly, Spectral DQN is able to make much farther progress. Moreover, when evaluated on a set of six standard Atari games that do not overtly favour the approach, Spectral DQN remains more than competitive: While it underperforms one of the benchmarks in a single game, it comfortably surpasses the benchmarks in three games. These results demonstrate that the approach is not overfit to its target problem, and suggest that Spectral DQN may have advantages beyond addressing reward progressivity. 
\ No newline at end of file diff --git a/data/2021/iclr/Adaptive Extra-Gradient Methods for Min-Max Optimization and Games b/data/2021/iclr/Adaptive Extra-Gradient Methods for Min-Max Optimization and Games new file mode 100644 index 0000000000..3b871b9a17 --- /dev/null +++ b/data/2021/iclr/Adaptive Extra-Gradient Methods for Min-Max Optimization and Games @@ -0,0 +1 @@ +We present a new family of min-max optimization algorithms that automatically exploit the geometry of the gradient data observed at earlier iterations to perform more informative extra-gradient steps in later ones. Thanks to this adaptation mechanism, the proposed method automatically detects whether the problem is smooth or not, without requiring any prior tuning by the optimizer. As a result, the algorithm simultaneously achieves order-optimal convergence rates, i.e., it converges to an $\varepsilon$-optimal solution within $\mathcal{O}(1/\varepsilon)$ iterations in smooth problems, and within $\mathcal{O}(1/\varepsilon^2)$ iterations in non-smooth ones. Importantly, these guarantees do not require any of the standard boundedness or Lipschitz continuity conditions that are typically assumed in the literature; in particular, they apply even to problems with singularities (such as resource allocation problems and the like). This adaptation is achieved through the use of a geometric apparatus based on Finsler metrics and a suitably chosen mirror-prox template that allows us to derive sharp convergence rates for the methods at hand. \ No newline at end of file diff --git a/data/2021/iclr/Adaptive Federated Optimization b/data/2021/iclr/Adaptive Federated Optimization new file mode 100644 index 0000000000..dda6844a11 --- /dev/null +++ b/data/2021/iclr/Adaptive Federated Optimization @@ -0,0 +1 @@ +Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. 
Due to the heterogeneity of the client datasets, standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Adam, and Yogi, and analyze their convergence in the presence of heterogeneous data for general nonconvex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can significantly improve the performance of federated learning. \ No newline at end of file diff --git a/data/2021/iclr/Adaptive Procedural Task Generation for Hard-Exploration Problems b/data/2021/iclr/Adaptive Procedural Task Generation for Hard-Exploration Problems new file mode 100644 index 0000000000..57d7a2fc0c --- /dev/null +++ b/data/2021/iclr/Adaptive Procedural Task Generation for Hard-Exploration Problems @@ -0,0 +1 @@ +We introduce Adaptive Procedural Task Generation (APT-Gen), an approach for progressively generating a sequence of tasks as curricula to facilitate reinforcement learning in hard-exploration problems. At the heart of our approach, a task generator learns to create tasks via a black-box procedural generation module by adaptively sampling from the parameterized task space. To enable curriculum learning in the absence of a direct indicator of learning progress, the task generator is trained by balancing the agent's expected return in the generated tasks and their similarities to the target task. Through adversarial training, the similarity between the generated tasks and the target task is adaptively estimated by a task discriminator defined on the agent's behaviors. 
In this way, our approach can efficiently generate tasks of rich variations for target tasks of unknown parameterization or not covered by the predefined task space. Experiments demonstrate the effectiveness of our approach through quantitative and qualitative analysis in various scenarios. \ No newline at end of file diff --git a/data/2021/iclr/Adaptive Universal Generalized PageRank Graph Neural Network b/data/2021/iclr/Adaptive Universal Generalized PageRank Graph Neural Network new file mode 100644 index 0000000000..b1f4679fe0 --- /dev/null +++ b/data/2021/iclr/Adaptive Universal Generalized PageRank Graph Neural Network @@ -0,0 +1 @@ +In many important graph data processing applications the acquired information includes both node features and observations of the graph topology. Graph neural networks (GNNs) are designed to exploit both sources of evidence, but they do not optimally trade off their utility and integrate them in a manner that is also universal. Here, universality refers to independence of homophily or heterophily graph assumptions. We address these issues by introducing a new Generalized PageRank (GPR) GNN architecture that adaptively learns the GPR weights so as to jointly optimize node feature and topological information extraction, regardless of the extent to which the node labels are homophilic or heterophilic. Learned GPR weights automatically adjust to the node label pattern, regardless of the type of initialization, and thereby guarantee excellent learning performance for label patterns that are usually hard to handle. Furthermore, they allow one to avoid feature over-smoothing, a process which renders feature information nondiscriminative, without requiring the network to be shallow. Our accompanying theoretical analysis of the GPR-GNN method is facilitated by novel synthetic benchmark datasets generated by the so-called contextual stochastic block model. 
We also compare the performance of our GNN architecture with that of several state-of-the-art GNNs on the problem of node-classification, using well-known benchmark homophilic and heterophilic datasets. The results demonstrate that GPR-GNN offers significant performance improvement compared to existing techniques on both synthetic and benchmark data. \ No newline at end of file diff --git a/data/2021/iclr/Adaptive and Generative Zero-Shot Learning b/data/2021/iclr/Adaptive and Generative Zero-Shot Learning new file mode 100644 index 0000000000..7ddefa425d --- /dev/null +++ b/data/2021/iclr/Adaptive and Generative Zero-Shot Learning @@ -0,0 +1 @@ +We address the problem of generalized zero-shot learning (GZSL) where the task is to predict the class label of a target image whether its label belongs to the seen or unseen category. Similar to ZSL, the learning setting assumes that all class-level semantic features are given, while only the images of seen classes are available for training. By exploring the correlation between image features and the corresponding semantic features, the main idea of the proposed approach is to enrich the semantic-to-visual (S2V) embeddings via a seamless fusion of adaptive and generative learning. To this end, we extend the semantic features of each class by supplementing image-adaptive attention so that the learned S2V embedding can account for not only inter-class but also intra-class variations. In addition, to break the limit of training with images only from seen classes, we design a generative scheme to simultaneously generate virtual class labels and their visual features by sampling and interpolating over seen counterparts. In inference, a testing image will give rise to two different S2V embeddings, seen and virtual. The former is used to decide whether the underlying label is of the unseen category or otherwise a specific seen class; the latter is to predict an unseen class label. 
To demonstrate the effectiveness of our method, we report state-of-the-art results on four standard GZSL datasets, including an ablation study of the proposed modules. \ No newline at end of file diff --git a/data/2021/iclr/Adversarial score matching and improved sampling for image generation b/data/2021/iclr/Adversarial score matching and improved sampling for image generation new file mode 100644 index 0000000000..eb1ec6c272 --- /dev/null +++ b/data/2021/iclr/Adversarial score matching and improved sampling for image generation @@ -0,0 +1,2 @@ +Denoising Score Matching with Annealed Langevin Sampling (DSM-ALS) has recently found success in generative modeling. The approach works by first training a neural network to estimate the score of a distribution, and then using Langevin dynamics to sample from the data distribution assumed by the score network. Despite the convincing visual quality of samples, this method appears to perform worse than Generative Adversarial Networks (GANs) under the Frechet Inception Distance, a standard metric for generative models. +We show that this apparent gap vanishes when denoising the final Langevin samples using the score network. In addition, we propose two improvements to DSM-ALS: 1) Consistent Annealed Sampling as a more stable alternative to Annealed Langevin Sampling, and 2) a hybrid training formulation, composed of both Denoising Score Matching and adversarial objectives. By combining these two techniques and exploring different network architectures, we elevate score matching methods and obtain results competitive with state-of-the-art image generation on CIFAR-10. 
\ No newline at end of file diff --git a/data/2021/iclr/Adversarially Guided Actor-Critic b/data/2021/iclr/Adversarially Guided Actor-Critic new file mode 100644 index 0000000000..1eff15f551 --- /dev/null +++ b/data/2021/iclr/Adversarially Guided Actor-Critic @@ -0,0 +1 @@ +Despite definite success in deep reinforcement learning problems, actor-critic algorithms are still confronted with sample inefficiency in complex environments, particularly in tasks where efficient exploration is a bottleneck. These methods consider a policy (the actor) and a value function (the critic) whose respective losses are built using different motivations and approaches. This paper introduces a third protagonist: the adversary. While the adversary mimics the actor by minimizing the KL-divergence between their respective action distributions, the actor, in addition to learning to solve the task, tries to differentiate itself from the adversary predictions. This novel objective stimulates the actor to follow strategies that could not have been correctly predicted from previous trajectories, making its behavior innovative in tasks where the reward is extremely rare. Our experimental analysis shows that the resulting Adversarially Guided Actor-Critic (AGAC) algorithm leads to more exhaustive exploration. Notably, AGAC outperforms current state-of-the-art methods on a set of various hard-exploration and procedurally-generated tasks. 
\ No newline at end of file diff --git a/data/2021/iclr/Adversarially-Trained Deep Nets Transfer Better: Illustration on Image Classification b/data/2021/iclr/Adversarially-Trained Deep Nets Transfer Better: Illustration on Image Classification new file mode 100644 index 0000000000..a66726f09f --- /dev/null +++ b/data/2021/iclr/Adversarially-Trained Deep Nets Transfer Better: Illustration on Image Classification @@ -0,0 +1 @@ +Transfer learning has emerged as a powerful methodology for adapting pre-trained deep neural networks on image recognition tasks to new domains. This process consists of taking a neural network pre-trained on a large feature-rich source dataset, freezing the early layers that encode essential generic image properties, and then fine-tuning the last few layers in order to capture specific information related to the target situation. This approach is particularly useful when only limited or weakly labeled data are available for the new task. In this work, we demonstrate that adversarially-trained models transfer better than non-adversarially-trained models, especially if only limited data are available for the new domain task. Further, we observe that adversarial training biases the learnt representations toward retaining shapes, as opposed to textures, which impacts the transferability of the source models. Finally, through the lens of influence functions, we discover that transferred adversarially-trained models contain more human-identifiable semantic information, which explains – at least partly – why adversarially-trained models transfer better. \ No newline at end of file diff --git a/data/2021/iclr/Aligning AI With Shared Human Values b/data/2021/iclr/Aligning AI With Shared Human Values new file mode 100644 index 0000000000..5949ba66a2 --- /dev/null +++ b/data/2021/iclr/Aligning AI With Shared Human Values @@ -0,0 +1 @@ +We show how to assess a language model's knowledge of basic concepts of morality. 
We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete understanding of basic ethical knowledge. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values. \ No newline at end of file diff --git a/data/2021/iclr/An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale b/data/2021/iclr/An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale new file mode 100644 index 0000000000..b902e22485 --- /dev/null +++ b/data/2021/iclr/An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale @@ -0,0 +1 @@ +While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train. 
\ No newline at end of file diff --git a/data/2021/iclr/An Unsupervised Deep Learning Approach for Real-World Image Denoising b/data/2021/iclr/An Unsupervised Deep Learning Approach for Real-World Image Denoising new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Analyzing the Expressive Power of Graph Neural Networks in a Spectral Perspective b/data/2021/iclr/Analyzing the Expressive Power of Graph Neural Networks in a Spectral Perspective new file mode 100644 index 0000000000..f3d443fb02 --- /dev/null +++ b/data/2021/iclr/Analyzing the Expressive Power of Graph Neural Networks in a Spectral Perspective @@ -0,0 +1 @@ +In the recent literature on Graph Neural Networks (GNNs), the expressive power of models has been studied through their capability to distinguish if two given graphs are isomorphic or not. Since the graph isomorphism problem is NP-intermediate, and the Weisfeiler-Lehman (WL) test can give sufficient but not conclusive evidence in polynomial time, the theoretical power of GNNs is usually evaluated by the equivalence of WL-test order, followed by an empirical analysis of the models on some reference inductive and transductive datasets. However, such analysis does not account for the signal processing pipeline, whose capability is generally evaluated in the spectral domain. In this paper, we argue that a spectral analysis of GNN behavior can provide a complementary point of view to go one step further in the understanding of GNNs. By bridging the gap between the spectral and spatial design of graph convolutions, we theoretically demonstrate some equivalence of the graph convolution process regardless of whether it is designed in the spatial or the spectral domain. Using this connection, we re-formulate most of the state-of-the-art graph neural networks into one common framework. 
This general framework allows us to conduct a spectral analysis of the most popular GNNs, explaining their performance and showing their limits from a spectral point of view. Our theoretical spectral analysis is confirmed by experiments on various graph databases. Furthermore, we demonstrate the necessity of high-pass and/or band-pass filters on a graph dataset, whereas the majority of GNNs are limited to low-pass filtering and inevitably fail. Code available at https://github.com/balcilar/gnn-spectral-expressive-power. \ No newline at end of file diff --git a/data/2021/iclr/Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics b/data/2021/iclr/Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics new file mode 100644 index 0000000000..17f8953e26 --- /dev/null +++ b/data/2021/iclr/Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics @@ -0,0 +1 @@ +A central challenge in developing versatile machine learning systems is catastrophic forgetting: a model trained on tasks in sequence will suffer significant performance drops on earlier tasks. Despite the ubiquity of catastrophic forgetting, there is limited understanding of the underlying process and its causes. In this paper, we address this important knowledge gap, investigating how forgetting affects representations in neural network models. Through representational analysis techniques, we find that deeper layers are disproportionately the source of forgetting. Supporting this, a study of methods to mitigate forgetting illustrates that they act to stabilize deeper layers. These insights enable the development of an analytic argument and empirical picture relating the degree of forgetting to representational similarity between tasks. Consistent with this picture, we observe maximal forgetting occurs for task sequences with intermediate similarity. 
We perform empirical studies on the standard split CIFAR-10 setup and also introduce a novel CIFAR-100 based task approximating realistic input distribution shift. \ No newline at end of file diff --git a/data/2021/iclr/Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies b/data/2021/iclr/Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies new file mode 100644 index 0000000000..7e3c56f66d --- /dev/null +++ b/data/2021/iclr/Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies @@ -0,0 +1 @@ +We present DINO (\textbf{D}ETR with \textbf{I}mproved de\textbf{N}oising anch\textbf{O}r boxes), a state-of-the-art end-to-end object detector. % in this paper. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a mixed query selection method for anchor initialization, and a look forward twice scheme for box prediction. DINO achieves $49.4$AP in $12$ epochs and $51.3$AP in $24$ epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding a significant improvement of $\textbf{+6.0}$\textbf{AP} and $\textbf{+2.7}$\textbf{AP}, respectively, compared to DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO \texttt{val2017} ($\textbf{63.2}$\textbf{AP}) and \texttt{test-dev} (\textbf{$\textbf{63.3}$AP}). Compared to other models on the leaderboard, DINO significantly reduces its model size and pre-training data size while achieving better results. Our code will be available at \url{https://github.com/IDEACVR/DINO}. 
\ No newline at end of file diff --git a/data/2021/iclr/Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval b/data/2021/iclr/Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval new file mode 100644 index 0000000000..45c72da11d --- /dev/null +++ b/data/2021/iclr/Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval @@ -0,0 +1 @@ +We propose a simple and efficient multi-hop dense retrieval approach for answering complex open-domain questions, which achieves state-of-the-art performance on two multi-hop datasets, HotpotQA and multi-evidence FEVER. Contrary to previous work, our method does not require access to any corpus-specific information, such as inter-document hyperlinks or human-annotated entity markers, and can be applied to any unstructured text corpus. Our system also yields a much better efficiency-accuracy trade-off, matching the best published accuracy on HotpotQA while being 10 times faster at inference time. \ No newline at end of file diff --git a/data/2021/iclr/Anytime Sampling for Autoregressive Models via Ordered Autoencoding b/data/2021/iclr/Anytime Sampling for Autoregressive Models via Ordered Autoencoding new file mode 100644 index 0000000000..39b079c55e --- /dev/null +++ b/data/2021/iclr/Anytime Sampling for Autoregressive Models via Ordered Autoencoding @@ -0,0 +1 @@ +Autoregressive models are widely used for tasks such as image and audio generation. The sampling process of these models, however, does not allow interruptions and cannot adapt to real-time computational resources. This challenge impedes the deployment of powerful autoregressive models, which involve a slow sampling process that is sequential in nature and typically scales linearly with respect to the data dimension. To address this difficulty, we propose a new family of autoregressive models that enables anytime sampling. 
Inspired by Principal Component Analysis, we learn a structured representation space where dimensions are ordered based on their importance with respect to reconstruction. Using an autoregressive model in this latent space, we trade off sample quality for computational efficiency by truncating the generation process before decoding into the original data space. Experimentally, we demonstrate in several image and audio generation tasks that sample quality degrades gracefully as we reduce the computational budget for sampling. The approach suffers almost no loss in sample quality (measured by FID) using only 60\% to 80\% of all latent dimensions for image data. Code is available at https://github.com/Newbeeer/Anytime-Auto-Regressive-Model . \ No newline at end of file diff --git a/data/2021/iclr/Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval b/data/2021/iclr/Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval new file mode 100644 index 0000000000..bb8d917219 --- /dev/null +++ b/data/2021/iclr/Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval @@ -0,0 +1 @@ +Conducting text retrieval in a dense learned representation space has many intriguing advantages over sparse retrieval. Yet the effectiveness of dense retrieval (DR) often requires combination with sparse retrieval. In this paper, we identify that the main bottleneck is in the training mechanisms, where the negative instances used in training are not representative of the irrelevant documents in testing. This paper presents Approximate nearest neighbor Negative Contrastive Estimation (ANCE), a training mechanism that constructs negatives from an Approximate Nearest Neighbor (ANN) index of the corpus, which is parallelly updated with the learning process to select more realistic negative training instances. 
This fundamentally resolves the discrepancy between the data distribution used in the training and testing of DR. In our experiments, ANCE boosts the BERT-Siamese DR model to outperform all competitive dense and sparse retrieval baselines. It nearly matches the accuracy of sparse-retrieval-and-BERT-reranking using dot-product in the ANCE-learned representation space and provides almost 100x speed-up. \ No newline at end of file diff --git a/data/2021/iclr/Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks b/data/2021/iclr/Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks new file mode 100644 index 0000000000..65a257fad0 --- /dev/null +++ b/data/2021/iclr/Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks @@ -0,0 +1 @@ +Neural networks (NNs) whose subnetworks implement reusable functions are expected to offer numerous advantages, including compositionality through efficient recombination of functional building blocks, interpretability, preventing catastrophic interference, etc. Understanding if and how NNs are modular could provide insights into how to improve them. Current inspection methods, however, fail to link modules to their functionality. In this paper, we present a novel method based on learning binary weight masks to identify individual weights and subnets responsible for specific functions. Using this powerful tool, we contribute an extensive study of emerging modularity in NNs that covers several standard architectures and datasets. We demonstrate how common NNs fail to reuse submodules and offer new insights into the related issue of systematic generalization on language tasks. \ No newline at end of file diff --git a/data/2021/iclr/Are Neural Rankers still Outperformed by Gradient Boosted Decision Trees? b/data/2021/iclr/Are Neural Rankers still Outperformed by Gradient Boosted Decision Trees? 
new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Are wider nets better given the same number of parameters? b/data/2021/iclr/Are wider nets better given the same number of parameters? new file mode 100644 index 0000000000..2849d23753 --- /dev/null +++ b/data/2021/iclr/Are wider nets better given the same number of parameters? @@ -0,0 +1 @@ +Empirical studies demonstrate that the performance of neural networks improves with an increasing number of parameters. In most of these studies, the number of parameters is increased by increasing the network width. This raises the question: Is the observed improvement due to the larger number of parameters, or is it due to the larger width itself? We compare different ways of increasing model width while keeping the number of parameters constant. We show that for models initialized with a random, static sparsity pattern in the weight tensors, network width is the determining factor for good performance, while the number of weights is secondary, as long as trainability is ensured. As a step towards understanding this effect, we analyze these models in the framework of Gaussian Process kernels. We find that the distance between the sparse finite-width model kernel and the infinite-width kernel at initialization is indicative of model performance. \ No newline at end of file diff --git a/data/2021/iclr/Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning b/data/2021/iclr/Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning new file mode 100644 index 0000000000..960fbe7579 --- /dev/null +++ b/data/2021/iclr/Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning @@ -0,0 +1 @@ +Complex, multi-task problems have proven to be difficult to solve efficiently in a sparse-reward reinforcement learning setting. 
In order to be sample efficient, multi-task learning requires reuse and sharing of low-level policies. To facilitate the automatic decomposition of hierarchical tasks, we propose the use of step-by-step human demonstrations in the form of natural language instructions and action trajectories. We introduce a dataset of such demonstrations in a crafting-based grid world. Our model consists of a high-level language generator and low-level policy, conditioned on language. We find that human demonstrations help solve the most complex tasks. We also find that incorporating natural language allows the model to generalize to unseen tasks in a zero-shot setting and to learn quickly from a few demonstrations. Generalization is not only reflected in the actions of the agent, but also in the generated natural language instructions in unseen tasks. Our approach also gives our trained agent interpretable behaviors because it is able to generate a sequence of high-level descriptions of its actions. \ No newline at end of file diff --git a/data/2021/iclr/Async-RED: A Provably Convergent Asynchronous Block Parallel Stochastic Method using Deep Denoising Priors b/data/2021/iclr/Async-RED: A Provably Convergent Asynchronous Block Parallel Stochastic Method using Deep Denoising Priors new file mode 100644 index 0000000000..e3b3fa4eeb --- /dev/null +++ b/data/2021/iclr/Async-RED: A Provably Convergent Asynchronous Block Parallel Stochastic Method using Deep Denoising Priors @@ -0,0 +1 @@ +Regularization by denoising (RED) is a recently developed framework for solving inverse problems by integrating advanced denoisers as image priors. Recent work has shown its state-of-the-art performance when combined with pre-trained deep denoisers. However, current RED algorithms are inadequate for parallel processing on multicore systems. 
We address this issue by proposing a new asynchronous RED (ASYNC-RED) algorithm that enables asynchronous parallel processing of data, making it significantly faster than its serial counterparts for large-scale inverse problems. The computational complexity of ASYNC-RED is further reduced by using a random subset of measurements at every iteration. We present complete theoretical analysis of the algorithm by establishing its convergence under explicit assumptions on the data-fidelity and the denoiser. We validate ASYNC-RED on image recovery using pre-trained deep denoisers as priors. \ No newline at end of file diff --git a/data/2021/iclr/Attentional Constellation Nets for Few-Shot Learning b/data/2021/iclr/Attentional Constellation Nets for Few-Shot Learning new file mode 100644 index 0000000000..47f0992877 --- /dev/null +++ b/data/2021/iclr/Attentional Constellation Nets for Few-Shot Learning @@ -0,0 +1 @@ +is \ No newline at end of file diff --git a/data/2021/iclr/Auction Learning as a Two-Player Game b/data/2021/iclr/Auction Learning as a Two-Player Game new file mode 100644 index 0000000000..30cd2fc921 --- /dev/null +++ b/data/2021/iclr/Auction Learning as a Two-Player Game @@ -0,0 +1 @@ +Designing an incentive compatible auction that maximizes expected revenue is a central problem in Auction Design. While theoretical approaches to the problem have hit some limits, a recent research direction initiated by Duetting et al. (2019) consists in building neural network architectures to find optimal auctions. We propose two conceptual deviations from their approach which result in enhanced performance. First, we use recent results in theoretical auction design (Rubinstein and Weinberg, 2018) to introduce a time-independent Lagrangian. This not only circumvents the need for an expensive hyper-parameter search (as in prior work), but also provides a principled metric to compare the performance of two auctions (absent from prior work). 
Second, the optimization procedure in previous work uses an inner maximization loop to compute optimal misreports. We amortize this process through the introduction of an additional neural network. We demonstrate the effectiveness of our approach by learning competitive or strictly improved auctions compared to prior work. Both results together further imply a novel formulation of Auction Design as a two-player game with stationary utility functions. \ No newline at end of file diff --git a/data/2021/iclr/Augmenting Physical Models with Deep Networks for Complex Dynamics Forecasting b/data/2021/iclr/Augmenting Physical Models with Deep Networks for Complex Dynamics Forecasting new file mode 100644 index 0000000000..23ca96629c --- /dev/null +++ b/data/2021/iclr/Augmenting Physical Models with Deep Networks for Complex Dynamics Forecasting @@ -0,0 +1 @@ +Forecasting complex dynamical phenomena in settings where only partial knowledge of their dynamics is available is a prevalent problem across various scientific fields. While purely data-driven approaches are arguably insufficient in this context, standard physical modeling-based approaches tend to be over-simplistic, inducing non-negligible errors. In this work, we introduce the APHYNITY framework, a principled approach for augmenting incomplete physical dynamics described by differential equations with deep data-driven models. It consists of decomposing the dynamics into two components: a physical component accounting for the dynamics for which we have some prior knowledge, and a data-driven component accounting for errors of the physical model. The learning problem is carefully formulated such that the physical model explains as much of the data as possible, while the data-driven component only describes information that cannot be captured by the physical model; no more, no less. 
This not only guarantees the existence and uniqueness of the decomposition, but also ensures interpretability and benefits generalization. Experiments on three important use cases, each representative of a different family of phenomena, i.e. reaction–diffusion equations, wave equations and the non-linear damped pendulum, show that APHYNITY can efficiently leverage approximate physical models to accurately forecast the evolution of the system and correctly identify relevant physical parameters. The code is available at https://github.com/yuan-yin/APHYNITY. \ No newline at end of file diff --git a/data/2021/iclr/Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation b/data/2021/iclr/Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation new file mode 100644 index 0000000000..353664f3ed --- /dev/null +++ b/data/2021/iclr/Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation @@ -0,0 +1 @@ +Designing proper loss functions is essential in training deep networks. Especially in the field of semantic segmentation, various evaluation metrics have been proposed for diverse scenarios. Despite the success of the widely adopted cross-entropy loss and its variants, the misalignment between the loss functions and evaluation metrics degrades the network performance. Meanwhile, manually designing loss functions for each specific metric requires expertise and significant manpower. In this paper, we propose to automate the design of metric-specific loss functions by searching differentiable surrogate losses for each metric. We substitute the non-differentiable operations in the metrics with parameterized functions, and conduct parameter search to optimize the shape of loss surfaces. Two constraints are introduced to regularize the search space and make the search efficient. Extensive experiments on PASCAL VOC and Cityscapes demonstrate that the searched surrogate losses outperform the manually designed loss functions consistently. 
The searched losses can generalize well to other datasets and networks. Code shall be released. \ No newline at end of file diff --git a/data/2021/iclr/AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly b/data/2021/iclr/AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly new file mode 100644 index 0000000000..f2abf722aa --- /dev/null +++ b/data/2021/iclr/AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on the Fly @@ -0,0 +1 @@ +The learning rate (LR) schedule is one of the most important hyper-parameters needing careful tuning in training DNNs. However, it is also one of the least automated parts of machine learning systems and usually costs significant manual effort and computing. Though there are pre-defined LR schedules and optimizers with adaptive LR, they introduce new hyperparameters that need to be tuned separately for different tasks/datasets. In this paper, we consider the question: Can we automatically tune the LR over the course of training without human involvement? We propose an efficient method, AutoLRS, which automatically optimizes the LR for each training stage by modeling training dynamics. AutoLRS aims to find an LR applied to every $\tau$ steps that minimizes the resulted validation loss. We solve this black-box optimization on the fly by Bayesian optimization (BO). However, collecting training instances for BO requires a system to evaluate each LR queried by BO's acquisition function for $\tau$ steps, which is prohibitively expensive in practice. Instead, we apply each candidate LR for only $\tau'\ll\tau$ steps and train an exponential model to predict the validation loss after $\tau$ steps. This mutual-training process between BO and the loss-prediction model allows us to limit the training steps invested in the BO search. 
We demonstrate the advantages and the generality of AutoLRS through extensive experiments of training DNNs for tasks from diverse domains using different optimizers. The LR schedules auto-generated by AutoLRS lead to a speedup of $1.22\times$, $1.43\times$, and $1.5\times$ when training ResNet-50, Transformer, and BERT, respectively, compared to the LR schedules in their original papers, and an average speedup of $1.31\times$ over state-of-the-art heavily-tuned LR schedules. \ No newline at end of file diff --git a/data/2021/iclr/Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization b/data/2021/iclr/Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization new file mode 100644 index 0000000000..9ec5b84504 --- /dev/null +++ b/data/2021/iclr/Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization @@ -0,0 +1 @@ +Standard dynamics models for continuous control make use of feedforward computation to predict the conditional distribution of next state and reward given current state and action using a multivariate Gaussian with a diagonal covariance structure. This modeling choice assumes that different dimensions of the next state and reward are conditionally independent given the current state and action and may be driven by the fact that fully observable physics-based simulation environments entail deterministic transition dynamics. In this paper, we challenge this conditional independence assumption and propose a family of expressive autoregressive dynamics models that generate different dimensions of the next state and reward sequentially conditioned on previous dimensions. We demonstrate that autoregressive dynamics models indeed outperform standard feedforward models in log-likelihood on heldout transitions. 
Furthermore, we compare different model-based and model-free off-policy evaluation (OPE) methods on RL Unplugged, a suite of offline MuJoCo datasets, and find that autoregressive dynamics models consistently outperform all baselines, achieving a new state-of-the-art. Finally, we show that autoregressive dynamics models are useful for offline policy optimization by serving as a way to enrich the replay buffer through data augmentation and improving performance using model-based planning. \ No newline at end of file diff --git a/data/2021/iclr/Autoregressive Entity Retrieval b/data/2021/iclr/Autoregressive Entity Retrieval new file mode 100644 index 0000000000..e334930b23 --- /dev/null +++ b/data/2021/iclr/Autoregressive Entity Retrieval @@ -0,0 +1 @@ +Entities are at the center of how we represent and aggregate knowledge. For instance, Encyclopedias such as Wikipedia are structured by entities (e.g., one per article). The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering. One way to understand current approaches is as classifiers among atomic labels, one for each entity. Their weight vectors are dense entity representations produced by encoding entity information such as descriptions. This approach leads to several shortcomings: i) context and entity affinity is mainly captured through a vector dot product, potentially missing fine-grained interactions between the two; ii) a large memory footprint is needed to store dense representations when considering large entity sets; iii) an appropriately hard set of negative data has to be subsampled at training time. We propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion, and conditioned on the context. 
This enables us to mitigate the aforementioned technical issues: i) the autoregressive formulation allows us to directly capture relations between context and entity name, effectively cross-encoding both; ii) the memory footprint is greatly reduced because the parameters of our encoder-decoder architecture scale with vocabulary size, not entity count; iii) the exact softmax loss can be efficiently computed without the need to subsample negative data. We show the efficacy of the approach with more than 20 datasets on entity disambiguation, end-to-end entity linking and document retrieval tasks, achieving new state-of-the-art or very competitive results while using a tiny fraction of the memory of competing systems. Finally, we demonstrate that new entities can be added by simply specifying their unambiguous name. \ No newline at end of file diff --git a/data/2021/iclr/Auxiliary Learning by Implicit Differentiation b/data/2021/iclr/Auxiliary Learning by Implicit Differentiation new file mode 100644 index 0000000000..01712b5739 --- /dev/null +++ b/data/2021/iclr/Auxiliary Learning by Implicit Differentiation @@ -0,0 +1 @@ +Training with multiple auxiliary tasks is a common practice used in deep learning for improving the performance on the main task of interest. Two main challenges arise in this multi-task learning setting: (i) Designing useful auxiliary tasks; and (ii) Combining auxiliary tasks into a single coherent loss. We propose a novel framework, \textit{AuxiLearn}, that targets both challenges, based on implicit differentiation. First, when useful auxiliaries are known, we propose learning a network that combines all losses into a single coherent objective function. This network can learn \textit{non-linear} interactions between auxiliary tasks. Second, when no useful auxiliary task is known, we describe how to learn a network that generates a meaningful, novel auxiliary task.
We evaluate AuxiLearn in a series of tasks and domains, including image segmentation and learning with attributes. We find that AuxiLearn consistently improves accuracy compared with competing methods. \ No newline at end of file diff --git a/data/2021/iclr/Auxiliary Task Update Decomposition: the Good, the Bad and the neutral b/data/2021/iclr/Auxiliary Task Update Decomposition: the Good, the Bad and the neutral new file mode 100644 index 0000000000..aa1f7e95da --- /dev/null +++ b/data/2021/iclr/Auxiliary Task Update Decomposition: the Good, the Bad and the neutral @@ -0,0 +1 @@ +While deep learning has been very beneficial in data-rich settings, tasks with smaller training sets often resort to pre-training or multitask learning to leverage data from other tasks. In this case, careful consideration is needed to select tasks and model parameterizations such that updates from the auxiliary tasks actually help the primary task. We seek to alleviate this burden by formulating a model-agnostic framework that performs fine-grained manipulation of the auxiliary task gradients. We propose to decompose auxiliary updates into directions which help, damage or leave the primary task loss unchanged. This allows weighting the update directions differently depending on their impact on the problem of interest. We present a novel and efficient algorithm for that purpose and show its advantage in practice. Our method leverages efficient automatic differentiation procedures and randomized singular value decomposition for scalability. We show that our framework is generic and encompasses some prior work as particular cases. Our approach consistently outperforms strong and widely used baselines when leveraging out-of-distribution data for text and image classification tasks.
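The good/bad/neutral decomposition described in the abstract above can be illustrated with a single gradient projection. The paper itself operates on batches of per-example gradients via randomized SVD, so the sketch below is a deliberately simplified single-vector toy; the function name and the good/bad/neutral labels are hypothetical stand-ins:

```python
def decompose_auxiliary(g_aux, g_main):
    # Split an auxiliary-task gradient into a component parallel to the
    # main-task gradient (helpful if aligned, harmful if opposed, by the
    # sign of the dot product) and an orthogonal ("neutral") component.
    # Assumes g_main is nonzero.
    dot = sum(a * m for a, m in zip(g_aux, g_main))
    norm2 = sum(m * m for m in g_main)
    coef = dot / norm2
    parallel = [coef * m for m in g_main]
    orthogonal = [a - p for a, p in zip(g_aux, parallel)]
    label = "good" if dot > 0 else ("bad" if dot < 0 else "neutral")
    return parallel, orthogonal, label

par, orth, label = decompose_auxiliary([2.0, 1.0], [1.0, 0.0])
print(label, par, orth)  # good [2.0, 0.0] [0.0, 1.0]
```

Reweighting the three components differently, e.g. keeping the helpful and neutral parts while discarding the harmful one, recovers the kind of fine-grained manipulation the abstract describes.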
\ No newline at end of file diff --git a/data/2021/iclr/Average-case Acceleration for Bilinear Games and Normal Matrices b/data/2021/iclr/Average-case Acceleration for Bilinear Games and Normal Matrices new file mode 100644 index 0000000000..c3117db967 --- /dev/null +++ b/data/2021/iclr/Average-case Acceleration for Bilinear Games and Normal Matrices @@ -0,0 +1 @@ +Advances in generative modeling and adversarial learning have given rise to renewed interest in smooth games. However, the absence of symmetry in the matrix of second derivatives poses challenges that are not present in the classical minimization framework. While a rich theory of average-case analysis has been developed for minimization problems, little is known in the context of smooth games. In this work we take a first step towards closing this gap by developing average-case optimal first-order methods for a subset of smooth games. We make the following three main contributions. First, we show that for zero-sum bilinear games the average-case optimal method is the optimal method for the minimization of the Hamiltonian. Second, we provide an explicit expression for the optimal method corresponding to normal matrices, potentially non-symmetric. Finally, we specialize it to matrices with eigenvalues located in a disk and show a provable speed-up compared to worst-case optimal algorithms. We illustrate our findings through benchmarks with a varying degree of mismatch with our assumptions. \ No newline at end of file diff --git a/data/2021/iclr/BERTology Meets Biology: Interpreting Attention in Protein Language Models b/data/2021/iclr/BERTology Meets Biology: Interpreting Attention in Protein Language Models new file mode 100644 index 0000000000..73bb9f329f --- /dev/null +++ b/data/2021/iclr/BERTology Meets Biology: Interpreting Attention in Protein Language Models @@ -0,0 +1 @@ +Transformer architectures have proven to learn useful representations for protein classification and generation tasks. 
However, these representations present challenges in interpretability. Through the lens of attention, we analyze the inner workings of the Transformer and explore how the model discerns structural and functional properties of proteins. We show that attention (1) captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure, (2) targets binding sites, a key functional component of proteins, and (3) focuses on progressively more complex biophysical properties with increasing layer depth. We also present a three-dimensional visualization of the interaction between attention and protein structure. Our findings align with known biological processes and provide a tool to aid discovery in protein engineering and synthetic biology. The code for visualization and analysis is available at https://github.com/salesforce/provis. \ No newline at end of file diff --git a/data/2021/iclr/BOIL: Towards Representation Change for Few-shot Learning b/data/2021/iclr/BOIL: Towards Representation Change for Few-shot Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction b/data/2021/iclr/BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction new file mode 100644 index 0000000000..547992b42f --- /dev/null +++ b/data/2021/iclr/BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction @@ -0,0 +1 @@ +We study the challenging task of neural network quantization without end-to-end retraining, called Post-training Quantization (PTQ). PTQ usually requires a small subset of training data but produces less powerful quantized models than Quantization-Aware Training (QAT). In this work, we propose a novel PTQ framework, dubbed BRECQ, which pushes the limits of bitwidth in PTQ down to INT2 for the first time. 
BRECQ leverages the basic building blocks in neural networks and reconstructs them one-by-one. In a comprehensive theoretical study of the second-order error, we show that BRECQ achieves a good balance between cross-layer dependency and generalization error. To further exploit the power of quantization, the mixed-precision technique is incorporated in our framework by approximating the inter-layer and intra-layer sensitivity. Extensive experiments on various handcrafted and searched neural architectures are conducted for both image classification and object detection tasks. For the first time, we prove that, without bells and whistles, PTQ can attain 4-bit ResNet and MobileNetV2 models comparable with QAT and enjoy 240 times faster production of quantized models. Code is available at https://github.com/yhhhli/BRECQ. \ No newline at end of file diff --git a/data/2021/iclr/BREEDS: Benchmarks for Subpopulation Shift b/data/2021/iclr/BREEDS: Benchmarks for Subpopulation Shift new file mode 100644 index 0000000000..97867e272b --- /dev/null +++ b/data/2021/iclr/BREEDS: Benchmarks for Subpopulation Shift @@ -0,0 +1 @@ +We develop a methodology for assessing the robustness of models to subpopulation shift---specifically, their ability to generalize to novel data subpopulations that were not observed during training. Our approach leverages the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions. This enables us to synthesize realistic distribution shifts whose sources can be precisely controlled and characterized, within existing large-scale datasets. Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity. We then validate that the corresponding shifts are tractable by obtaining human baselines for them.
Finally, we utilize these benchmarks to measure the sensitivity of standard model architectures as well as the effectiveness of off-the-shelf train-time robustness interventions. Code and data are available at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization b/data/2021/iclr/BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization new file mode 100644 index 0000000000..8587c6f8a7 --- /dev/null +++ b/data/2021/iclr/BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization @@ -0,0 +1 @@ +Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks, and has thus been widely investigated. However, it lacks a systematic method to determine the exact quantization scheme. Previous methods either examine only a small manually-designed search space or utilize a cumbersome neural architecture search to explore the vast search space. These approaches cannot lead to an optimal quantization scheme efficiently. This work proposes bit-level sparsity quantization (BSQ) to tackle mixed-precision quantization from a new angle of inducing bit-level sparsity. We consider each bit of quantized weights as an independent trainable variable and introduce a differentiable bit-sparsity regularizer. BSQ can induce all-zero bits across a group of weight elements and realize dynamic precision reduction, leading to a mixed-precision quantization scheme of the original model. Our method enables the exploration of the full mixed-precision space with a single gradient-based optimization process, with only one hyperparameter to trade off performance and compression. BSQ achieves both higher accuracy and higher bit reduction on various model architectures on the CIFAR-10 and ImageNet datasets compared to previous methods.
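The bit-level view in the BSQ abstract above is easy to make concrete: treating each bit of a quantized weight group as its own variable, an all-zero high-order bit plane means the group's precision can be reduced without changing any value. The differentiable regularizer that drives bits toward zero during training is omitted here; this toy sketch (helper names are mine) only shows how all-zero bit planes translate into fewer effective bits:

```python
def bit_planes(weights, n_bits):
    # weights: non-negative ints < 2**n_bits.
    # Plane b collects bit b of every weight in the group.
    return [[(w >> b) & 1 for w in weights] for b in range(n_bits)]

def effective_bits(weights, n_bits):
    # BSQ-style dynamic precision reduction: high-order bit planes that are
    # all-zero across the whole group can simply be dropped.
    planes = bit_planes(weights, n_bits)
    bits = n_bits
    for plane in reversed(planes):  # from MSB down
        if any(plane):
            break
        bits -= 1
    return bits

group = [3, 1, 2, 0]  # stored as 8-bit, but only 2 bit planes are non-zero
print(effective_bits(group, 8))  # 2
```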
\ No newline at end of file diff --git a/data/2021/iclr/BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration b/data/2021/iclr/BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration new file mode 100644 index 0000000000..1e383f99fe --- /dev/null +++ b/data/2021/iclr/BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration @@ -0,0 +1 @@ +Program synthesis is challenging largely because of the difficulty of search in a large space of programs. Human programmers routinely tackle the task of writing complex programs by writing sub-programs and then analysing their intermediate results to compose them in appropriate ways. Motivated by this intuition, we present a new synthesis approach that leverages learning to guide a bottom-up search over programs. In particular, we train a model to prioritize compositions of intermediate values during search conditioned on a given set of input-output examples. This is a powerful combination because of several emergent properties: First, in bottom-up search, intermediate programs can be executed, providing semantic information to the neural network. Second, given the concrete values from those executions, we can exploit rich features based on recent work on property signatures. Finally, bottom-up search allows the system substantial flexibility in what order to generate the solution, allowing the synthesizer to build up a program from multiple smaller sub-programs. Overall, our empirical evaluation finds that the combination of learning and bottom-up search is remarkably effective, even with simple supervised learning approaches. We demonstrate the effectiveness of our technique on a new data set for synthesis of string transformation programs. 
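The bottom-up search the BUSTLE abstract describes can be sketched in a few lines for a toy string DSL. The operations (`upper`, `concat`) and the space constant are my stand-ins, and the learned model that prioritizes compositions is omitted, leaving plain enumeration by program size; but the key property survives: every stored entry is a concrete *value* obtained by executing an intermediate program, so larger programs are composed from already-executed sub-programs:

```python
def bottom_up_synthesize(x, target, max_size=6):
    # Map each distinct concrete value to (smallest program producing it, size),
    # then grow the table by composing stored values with the DSL operations.
    values = {x: ("x", 1), " ": ("' '", 1)}  # terminals: input + constant
    for size in range(2, max_size + 1):
        new = {}
        for v, (e, s) in values.items():          # unary op: upper
            if s + 1 == size and v.upper() not in values:
                new[v.upper()] = (f"upper({e})", size)
        for v1, (e1, s1) in values.items():       # binary op: concat
            for v2, (e2, s2) in values.items():
                if s1 + s2 + 1 == size and v1 + v2 not in values:
                    new.setdefault(v1 + v2, (f"concat({e1},{e2})", size))
        values.update(new)
        if target in values:                      # check against the I/O example
            return values[target][0]
    return None

print(bottom_up_synthesize("abc", "ABC abc"))
```

In BUSTLE, a learned model conditioned on the input-output examples scores which stored values to compose first, replacing this exhaustive size-ordered sweep.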
\ No newline at end of file diff --git a/data/2021/iclr/Bag of Tricks for Adversarial Training b/data/2021/iclr/Bag of Tricks for Adversarial Training new file mode 100644 index 0000000000..207c95148a --- /dev/null +++ b/data/2021/iclr/Bag of Tricks for Adversarial Training @@ -0,0 +1 @@ +Adversarial training (AT) is one of the most effective strategies for promoting model robustness. However, recent benchmarks show that most of the proposed improvements on AT are less effective than simply early stopping the training procedure. This counter-intuitive fact motivates us to investigate the implementation details of tens of AT methods. Surprisingly, we find that the basic training settings (e.g., weight decay, learning rate schedule, etc.) used in these methods are highly inconsistent, which could largely affect the model performance as shown in our experiments. For example, a slightly different value of weight decay can reduce the model robust accuracy by more than 7%, which is likely to override the potential improvement induced by the proposed methods. In this work, we provide comprehensive evaluations on the effects of basic training tricks and hyperparameter settings for adversarially trained models. We provide a reasonable baseline setting and re-implement previous defenses to achieve new state-of-the-art results. \ No newline at end of file diff --git a/data/2021/iclr/Balancing Constraints and Rewards with Meta-Gradient D4PG b/data/2021/iclr/Balancing Constraints and Rewards with Meta-Gradient D4PG new file mode 100644 index 0000000000..2839d6aea6 --- /dev/null +++ b/data/2021/iclr/Balancing Constraints and Rewards with Meta-Gradient D4PG @@ -0,0 +1 @@ +Deploying Reinforcement Learning (RL) agents to solve real-world applications often requires satisfying complex system constraints.
Often the constraint thresholds are incorrectly set due to the complex nature of a system or the inability to verify the thresholds offline (e.g., no simulator or reasonable offline evaluation procedure exists). This results in solutions where a task cannot be solved without violating the constraints. However, in many real-world cases, constraint violations are undesirable yet they are not catastrophic, motivating the need for soft-constrained RL approaches. We present two soft-constrained RL approaches that utilize meta-gradients to find a good trade-off between expected return and minimizing constraint violations. We demonstrate the effectiveness of these approaches by showing that they consistently outperform the baselines across four different Mujoco domains. \ No newline at end of file diff --git a/data/2021/iclr/Batch Reinforcement Learning Through Continuation Method b/data/2021/iclr/Batch Reinforcement Learning Through Continuation Method new file mode 100644 index 0000000000..fd0cb58e08 --- /dev/null +++ b/data/2021/iclr/Batch Reinforcement Learning Through Continuation Method @@ -0,0 +1 @@ +Many real-world applications of reinforcement learning (RL) require the agent to learn from a fixed set of trajectories, without collecting new interactions. Policy optimization under this setting is extremely challenging as: 1) the geometry of the objective function is hard to optimize efficiently; 2) the shift of data distributions causes high noise in the value estimation. In this work, we propose a simple yet effective policy iteration approach to batch RL using global optimization techniques known as continuation.
By constraining the difference between the learned policy and the behavior policy that generates the fixed trajectories \ No newline at end of file diff --git "a/data/2021/iclr/Bayesian Few-Shot Classification with One-vs-Each P\303\263lya-Gamma Augmented Gaussian Processes" "b/data/2021/iclr/Bayesian Few-Shot Classification with One-vs-Each P\303\263lya-Gamma Augmented Gaussian Processes" new file mode 100644 index 0000000000..21cdb59086 --- /dev/null +++ "b/data/2021/iclr/Bayesian Few-Shot Classification with One-vs-Each P\303\263lya-Gamma Augmented Gaussian Processes" @@ -0,0 +1 @@ +Few-shot classification (FSC), the task of adapting a classifier to unseen classes given a small labeled dataset, is an important step on the path toward human-like machine learning. Bayesian methods are well-suited to tackling the fundamental issue of overfitting in the few-shot scenario because they allow practitioners to specify prior beliefs and update those beliefs in light of observed data. Contemporary approaches to Bayesian few-shot classification maintain a posterior distribution over model parameters, which is slow and requires storage that scales with model size. Instead, we propose a Gaussian process classifier based on a novel combination of Polya-gamma augmentation and the one-vs-each softmax approximation that allows us to efficiently marginalize over functions rather than model parameters. We demonstrate improved accuracy and uncertainty quantification on both standard few-shot classification benchmarks and few-shot domain transfer tasks. 
\ No newline at end of file diff --git a/data/2021/iclr/Behavioral Cloning from Noisy Demonstrations b/data/2021/iclr/Behavioral Cloning from Noisy Demonstrations new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods b/data/2021/iclr/Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods new file mode 100644 index 0000000000..0185c6f6f2 --- /dev/null +++ b/data/2021/iclr/Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods @@ -0,0 +1 @@ +Establishing a theoretical analysis that explains why deep learning can outperform shallow learning such as kernel methods is one of the biggest issues in the deep learning literature. Towards answering this question, we evaluate excess risk of a deep learning estimator trained by a noisy gradient descent with ridge regularization on a mildly overparameterized neural network, and discuss its superiority to a class of linear estimators that includes neural tangent kernel approach, random feature model, other kernel methods, $k$-NN estimator and so on. We consider a teacher-student regression model, and eventually show that any linear estimator can be outperformed by deep learning in a sense of the minimax optimal rate especially for a high dimension setting. The obtained excess bounds are so-called fast learning rate which is faster than $O(1/\sqrt{n})$ that is obtained by usual Rademacher complexity analysis. This discrepancy is induced by the non-convex geometry of the model and the noisy gradient descent used for neural network training provably reaches a near global optimal solution even though the loss landscape is highly non-convex. 
Although the noisy gradient descent does not employ any explicit or implicit sparsity-inducing regularization, it achieves generalization performance that dominates linear estimators. \ No newline at end of file diff --git a/data/2021/iclr/Better Fine-Tuning by Reducing Representational Collapse b/data/2021/iclr/Better Fine-Tuning by Reducing Representational Collapse new file mode 100644 index 0000000000..c4fb9a5010 --- /dev/null +++ b/data/2021/iclr/Better Fine-Tuning by Reducing Representational Collapse @@ -0,0 +1 @@ +Although widely adopted, existing approaches for fine-tuning pre-trained language models have been shown to be unstable across hyper-parameter settings, motivating recent work on trust region methods. In this paper, we present a simplified and efficient method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise (sampling from either a normal or uniform distribution), thereby discouraging representation change during fine-tuning when possible without hurting performance. We also introduce a new analysis to motivate the use of trust region methods more generally, by studying representational collapse: the degradation of generalizable representations from pre-trained models as they are fine-tuned for a specific end task. Extensive experiments show that our fine-tuning method matches or exceeds the performance of previous trust region methods on a range of understanding and generation tasks (including DailyMail/CNN, Gigaword, Reddit TIFU, and the GLUE benchmark), while also being much faster. We also show that it is less prone to representational collapse: the pre-trained models maintain more generalizable representations every time they are fine-tuned.
\ No newline at end of file diff --git a/data/2021/iclr/Beyond Categorical Label Representations for Image Classification b/data/2021/iclr/Beyond Categorical Label Representations for Image Classification new file mode 100644 index 0000000000..7fa22452c6 --- /dev/null +++ b/data/2021/iclr/Beyond Categorical Label Representations for Image Classification @@ -0,0 +1 @@ +We find that the way we choose to represent data labels can have a profound effect on the quality of trained models. For example, training an image classifier to regress audio labels rather than traditional categorical probabilities produces a more reliable classification. This result is surprising, considering that audio labels are more complex than simpler numerical probabilities or text. We hypothesize that high dimensional, high entropy label representations are generally more useful because they provide a stronger error signal. We support this hypothesis with evidence from various label representations including constant matrices, spectrograms, shuffled spectrograms, Gaussian mixtures, and uniform random matrices of various dimensionalities. Our experiments reveal that high dimensional, high entropy labels achieve comparable accuracy to text (categorical) labels on the standard image classification task, but features learned through our label representations exhibit more robustness under various adversarial attacks and better effectiveness with a limited amount of training data. These results suggest that label representation may play a more important role than previously thought. The project website is at \url{https://www.creativemachineslab.com/label-representation.html}. 
\ No newline at end of file diff --git a/data/2021/iclr/Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with 1 n Parameters b/data/2021/iclr/Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with 1 n Parameters new file mode 100644 index 0000000000..9429065f70 --- /dev/null +++ b/data/2021/iclr/Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with 1 n Parameters @@ -0,0 +1 @@ +Recent works have demonstrated reasonable success of representation learning in hypercomplex space. Specifically, "fully-connected layers with Quaternions" (4D hypercomplex numbers), which replace real-valued matrix multiplications in fully-connected layers with Hamilton products of Quaternions, enjoy parameter savings, using only 1/4 of the learnable parameters, and achieve comparable performance in various applications. However, one key caveat is that hypercomplex space only exists at very few predefined dimensions (4D, 8D, and 16D). This restricts the flexibility of models that leverage hypercomplex multiplications. To this end, we propose parameterizing hypercomplex multiplications, allowing models to learn multiplication rules from data regardless of whether such rules are predefined. As a result, our method not only subsumes the Hamilton product, but also learns to operate on any arbitrary nD hypercomplex space, providing more architectural flexibility using only $1/n$ of the learnable parameters of the fully-connected layer counterpart. Experiments applying the proposed approach to LSTM and Transformer models on natural language inference, machine translation, text style transfer, and subject-verb agreement demonstrate the architectural flexibility and effectiveness of the proposed approach.
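The parameterization the abstract above describes can be sketched as a sum of Kronecker products: n learned n-by-n "rule" matrices A_i combined with n parameter blocks S_i give a weight matrix W = sum_i A_i (x) S_i, which is roughly a 1/n reduction in learned parameters. This is a minimal pure-Python sketch (function names are mine, not the paper's); with n = 2 and fixed rules it recovers complex-number multiplication, just as fixed 4D rules would recover the Hamilton product:

```python
def kron(A, B):
    # Kronecker product of two matrices given as nested lists.
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def add(X, Y):
    # Elementwise matrix sum.
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def phm_weight(rules, blocks):
    # W = sum_i  rules[i] (x) blocks[i] : "multiplication rule" matrices
    # combined with small parameter blocks, as in parameterized
    # hypercomplex layers.
    W = kron(rules[0], blocks[0])
    for A, S in zip(rules[1:], blocks[1:]):
        W = add(W, kron(A, S))
    return W

# n = 2 with these fixed rules and scalar blocks a, b yields the matrix of
# the complex number a + bi, i.e. complex multiplication as a special case.
A1, A2 = [[1, 0], [0, 1]], [[0, -1], [1, 0]]
a, b = 3.0, 4.0
W = phm_weight([A1, A2], [[[a]], [[b]]])
print(W)  # [[3.0, -4.0], [4.0, 3.0]]
```

In the paper the rule matrices are themselves learned from data, which is what lets the layer operate on arbitrary nD hypercomplex spaces rather than only the predefined 4D/8D/16D ones.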
\ No newline at end of file diff --git a/data/2021/iclr/BiPointNet: Binary Neural Network for Point Clouds b/data/2021/iclr/BiPointNet: Binary Neural Network for Point Clouds new file mode 100644 index 0000000000..31b4aad1fe --- /dev/null +++ b/data/2021/iclr/BiPointNet: Binary Neural Network for Point Clouds @@ -0,0 +1 @@ +To alleviate the resource constraint for real-time point cloud applications that run on edge devices, we present BiPointNet, the first model binarization approach for efficient deep learning on point clouds. In this work, we discover that the immense performance drop of binarized models for point clouds is caused by two main challenges: aggregation-induced feature homogenization that leads to a degradation of information entropy, and scale distortion that hinders optimization and invalidates scale-sensitive structures. With theoretical justifications and in-depth analysis, we propose Entropy-Maximizing Aggregation (EMA) to modulate the distribution before aggregation for the maximum information entropy, and Layer-wise Scale Recovery (LSR) to efficiently restore feature scales. Extensive experiments show that our BiPointNet outperforms existing binarization methods by convincing margins, at a level even comparable with the full-precision counterpart. We highlight that our techniques are generic and show significant improvements on various fundamental tasks and mainstream backbones. BiPointNet gives an impressive 14.7 times speedup and 18.9 times storage saving on real-world resource-constrained devices.
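The aggregation-induced homogenization the BiPointNet abstract mentions is easy to reproduce numerically: max-pooling the sign-binarized features of many points almost surely outputs +1, so the pooled feature carries no information (zero entropy). EMA shifts the pre-aggregation distribution to restore entropy; in the paper the shift is derived analytically, whereas the offset of -2.5 below is hand-picked for this toy simulation:

```python
import math
import random

random.seed(0)

def pooled_entropy(offset, n_points=64, trials=2000):
    # Sign-binarize Gaussian pre-activations (shifted by `offset`),
    # max-pool across the point set, and measure the entropy of the
    # pooled binary output over many trials.
    ones = 0
    for _ in range(trials):
        pooled = max(1 if random.gauss(0, 1) + offset > 0 else -1
                     for _ in range(n_points))
        ones += pooled == 1
    p = ones / trials
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

h0 = pooled_entropy(0.0)        # plain max-pool: output is (almost) always +1
h_shift = pooled_entropy(-2.5)  # shifted distribution: pooled output varies
print(h0, h_shift)
```

With no shift the pooled bit is constant (entropy 0); shifting the distribution before aggregation makes the pooled output informative again, which is the effect EMA formalizes.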
\ No newline at end of file diff --git a/data/2021/iclr/Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech b/data/2021/iclr/Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Blending MPC & Value Function Approximation for Efficient Reinforcement Learning b/data/2021/iclr/Blending MPC & Value Function Approximation for Efficient Reinforcement Learning new file mode 100644 index 0000000000..7ab1a4bf48 --- /dev/null +++ b/data/2021/iclr/Blending MPC & Value Function Approximation for Efficient Reinforcement Learning @@ -0,0 +1 @@ +Model-Predictive Control (MPC) is a powerful tool for controlling complex, real-world systems that uses a model to make predictions about future behavior. For each state encountered, MPC solves an online optimization problem to choose a control action that will minimize future cost. This is a surprisingly effective strategy, but real-time performance requirements warrant the use of simple models. If the model is not sufficiently accurate, then the resulting controller can be biased, limiting performance. We present a framework for improving on MPC with model-free reinforcement learning (RL). The key insight is to view MPC as constructing a series of local Q-function approximations. We show that by using a parameter $\lambda$, similar to the trace decay parameter in TD($\lambda$), we can systematically trade-off learned value estimates against the local Q-function approximations. We present a theoretical analysis that shows how error from inaccurate models in MPC and value function estimation in RL can be balanced. We further propose an algorithm that changes $\lambda$ over time to reduce the dependence on MPC as our estimates of the value function improve, and test the efficacy of our approach on challenging high-dimensional manipulation tasks with biased models in simulation.
We demonstrate that our approach can obtain performance comparable with MPC with access to true dynamics even under severe model bias and is more sample efficient as compared to model-free RL. \ No newline at end of file diff --git a/data/2021/iclr/Boost then Convolve: Gradient Boosting Meets Graph Neural Networks b/data/2021/iclr/Boost then Convolve: Gradient Boosting Meets Graph Neural Networks new file mode 100644 index 0000000000..9f7e1627c1 --- /dev/null +++ b/data/2021/iclr/Boost then Convolve: Gradient Boosting Meets Graph Neural Networks @@ -0,0 +1 @@ +Graph neural networks (GNNs) are powerful models that have been successful in various graph representation learning tasks. Meanwhile, gradient boosted decision trees (GBDT) often outperform other machine learning methods when faced with heterogeneous tabular data. But what approach should be used for graphs with tabular node features? Previous GNN models have mostly focused on networks with homogeneous sparse features and, as we show, are suboptimal in the heterogeneous setting. In this work, we propose a novel architecture that trains GBDT and GNN jointly to get the best of both worlds: the GBDT model deals with heterogeneous features, while GNN accounts for the graph structure. Our model benefits from end-to-end optimization by allowing new trees to fit the gradient updates of GNN. With an extensive experimental comparison to the leading GBDT and GNN models, we demonstrate a significant increase in performance on a variety of graphs with tabular features. The code is available: https://github.com/nd7141/bgnn. 
\ No newline at end of file diff --git a/data/2021/iclr/Bowtie Networks: Generative Modeling for Joint Few-Shot Recognition and Novel-View Synthesis b/data/2021/iclr/Bowtie Networks: Generative Modeling for Joint Few-Shot Recognition and Novel-View Synthesis new file mode 100644 index 0000000000..9f7f4541bc --- /dev/null +++ b/data/2021/iclr/Bowtie Networks: Generative Modeling for Joint Few-Shot Recognition and Novel-View Synthesis @@ -0,0 +1 @@ +Generative modeling has recently shown great promise in computer vision, but its success is often limited to separate tasks. In this paper, motivated by multi-task learning of shareable feature representations, we consider a novel problem of learning a shared generative model across various tasks. We instantiate it on the illustrative dual-task of joint few-shot recognition and novel-view synthesis: given only one or few images of a novel object from arbitrary views with only category annotation, we aim to simultaneously learn an object classifier and generate images of the object from new viewpoints. To this end, we propose bowtie networks that jointly learn 3D geometric and semantic representations with feedback in the loop. Experimental evaluation on challenging fine-grained recognition datasets demonstrates that our synthesized images are realistic from multiple viewpoints and significantly improve recognition performance as a form of data augmentation, especially in the low-data regime. We further show that our approach is flexible and can be easily extended to incorporate other tasks, such as style guided synthesis. 
\ No newline at end of file diff --git a/data/2021/iclr/Bypassing the Ambient Dimension: Private SGD with Gradient Subspace Identification b/data/2021/iclr/Bypassing the Ambient Dimension: Private SGD with Gradient Subspace Identification new file mode 100644 index 0000000000..833fd7ab2f --- /dev/null +++ b/data/2021/iclr/Bypassing the Ambient Dimension: Private SGD with Gradient Subspace Identification @@ -0,0 +1 @@ +Differentially private SGD (DP-SGD) is one of the most popular methods for solving differentially private empirical risk minimization (ERM). Due to its noisy perturbation on each gradient update, the error rate of DP-SGD scales with the ambient dimension $p$, the number of parameters in the model. Such dependence can be problematic for over-parameterized models where $p \gg n$, the number of training samples. Existing lower bounds on private ERM show that such dependence on $p$ is inevitable in the worst case. In this paper, we circumvent the dependence on the ambient dimension by leveraging a low-dimensional structure of gradient space in deep networks---that is, the stochastic gradients for deep nets usually stay in a low dimensional subspace in the training process. We propose Projected DP-SGD that performs noise reduction by projecting the noisy gradients to a low-dimensional subspace, which is given by the top gradient eigenspace on a small public dataset. We provide a general sample complexity analysis on the public dataset for the gradient subspace identification problem and demonstrate that under certain low-dimensional assumptions the public sample complexity only grows logarithmically in $p$. Finally, we provide a theoretical analysis and empirical evaluations to show that our method can substantially improve the accuracy of DP-SGD. 
\ No newline at end of file diff --git a/data/2021/iclr/Byzantine-Resilient Non-Convex Stochastic Gradient Descent b/data/2021/iclr/Byzantine-Resilient Non-Convex Stochastic Gradient Descent new file mode 100644 index 0000000000..4782fa88dc --- /dev/null +++ b/data/2021/iclr/Byzantine-Resilient Non-Convex Stochastic Gradient Descent @@ -0,0 +1 @@ +We study adversary-resilient stochastic distributed optimization, in which $m$ machines can independently compute stochastic gradients, and cooperate to jointly optimize over their local objective functions. However, an $\alpha$-fraction of the machines are $\textit{Byzantine}$, in that they may behave in arbitrary, adversarial ways. We consider a variant of this procedure in the challenging $\textit{non-convex}$ case. Our main result is a new algorithm SafeguardSGD which can provably escape saddle points and find approximate local minima of the non-convex objective. The algorithm is based on a new concentration filtering technique, and its sample and time complexity bounds match the best known theoretical bounds in the stochastic, distributed setting when no Byzantine machines are present. Our algorithm is practical: it improves upon the performance of prior methods when training deep neural networks, it is relatively lightweight, and is the first method to withstand two recently-proposed Byzantine attacks. \ No newline at end of file diff --git a/data/2021/iclr/C-Learning: Horizon-Aware Cumulative Accessibility Estimation b/data/2021/iclr/C-Learning: Horizon-Aware Cumulative Accessibility Estimation new file mode 100644 index 0000000000..0ed065d40d --- /dev/null +++ b/data/2021/iclr/C-Learning: Horizon-Aware Cumulative Accessibility Estimation @@ -0,0 +1 @@ +Multi-goal reaching is an important problem in reinforcement learning needed to achieve algorithmic generalization. 
Despite recent advances in this field, current algorithms suffer from three major challenges: high sample complexity, learning only a single way of reaching the goals, and difficulties in solving complex motion planning tasks. In order to address these limitations, we introduce the concept of cumulative accessibility functions, which measure the reachability of a goal from a given state within a specified horizon. We show that these functions obey a recurrence relation, which enables learning from offline interactions. We also prove that optimal cumulative accessibility functions are monotonic in the planning horizon. Additionally, our method can trade off speed and reliability in goal-reaching by suggesting multiple paths to a single goal depending on the provided horizon. We evaluate our approach on a set of multi-goal discrete and continuous control tasks. We show that our method outperforms state-of-the-art goal-reaching algorithms in success rate, sample complexity, and path optimality. Our code is available at this https URL, and additional visualizations can be found at this https URL . \ No newline at end of file diff --git a/data/2021/iclr/C-Learning: Learning to Achieve Goals via Recursive Classification b/data/2021/iclr/C-Learning: Learning to Achieve Goals via Recursive Classification new file mode 100644 index 0000000000..91251b1cc8 --- /dev/null +++ b/data/2021/iclr/C-Learning: Learning to Achieve Goals via Recursive Classification @@ -0,0 +1 @@ +We study the problem of predicting and controlling the future state distribution of an autonomous agent. This problem, which can be viewed as a reframing of goal-conditioned reinforcement learning (RL), is centered around learning a conditional probability density function over future states. Instead of directly estimating this density function, we indirectly estimate this density function by training a classifier to predict whether an observation comes from the future. 
Via Bayes' rule, predictions from our classifier can be transformed into predictions over future states. Importantly, an off-policy variant of our algorithm allows us to predict the future state distribution of a new policy, without collecting new experience. This variant allows us to optimize functionals of a policy's future state distribution, such as the density of reaching a particular goal state. While conceptually similar to Q-learning, our work lays a principled foundation for goal-conditioned RL as density estimation, providing justification for goal-conditioned methods used in prior work. This foundation makes hypotheses about Q-learning, including the optimal goal-sampling ratio, which we confirm experimentally. Moreover, our proposed method is competitive with prior goal-conditioned RL methods. \ No newline at end of file diff --git a/data/2021/iclr/CO2: Consistent Contrast for Unsupervised Visual Representation Learning b/data/2021/iclr/CO2: Consistent Contrast for Unsupervised Visual Representation Learning new file mode 100644 index 0000000000..e59b3b1b7d --- /dev/null +++ b/data/2021/iclr/CO2: Consistent Contrast for Unsupervised Visual Representation Learning @@ -0,0 +1 @@ +Contrastive learning has been adopted as a core method for unsupervised visual representation learning. Without human annotation, the common practice is to perform an instance discrimination task: Given a query image crop, this task labels crops from the same image as positives, and crops from other randomly sampled images as negatives. An important limitation of this label assignment strategy is that it cannot reflect the heterogeneous similarity between the query crop and each crop from other images, taking them as equally negative, while some of them may even belong to the same semantic class as the query. 
To address this issue, inspired by consistency regularization in semi-supervised learning on unlabeled data, we propose Consistent Contrast (CO2), which introduces a consistency regularization term into the current contrastive learning framework. Regarding the similarity of the query crop to each crop from other images as "unlabeled", the consistency term takes the corresponding similarity of a positive crop as a pseudo label, and encourages consistency between these two similarities. Empirically, CO2 improves Momentum Contrast (MoCo) by 2.9% top-1 accuracy on ImageNet linear protocol, 3.8% and 1.1% top-5 accuracy on 1% and 10% labeled semi-supervised settings. It also transfers to image classification, object detection, and semantic segmentation on PASCAL VOC. This shows that CO2 learns better visual representations for these downstream tasks. \ No newline at end of file diff --git a/data/2021/iclr/CPR: Classifier-Projection Regularization for Continual Learning b/data/2021/iclr/CPR: Classifier-Projection Regularization for Continual Learning new file mode 100644 index 0000000000..ee73909378 --- /dev/null +++ b/data/2021/iclr/CPR: Classifier-Projection Regularization for Continual Learning @@ -0,0 +1 @@ +We propose a general, yet simple patch that can be applied to existing regularization-based continual learning methods called classifier-projection regularization (CPR). Inspired by both recent results on neural networks with wide local minima and information theory, CPR adds an additional regularization term that maximizes the entropy of a classifier's output probability. We demonstrate that this additional term can be interpreted as a projection of the conditional probability given by a classifier's output to the uniform distribution. By applying the Pythagorean theorem for KL divergence, we then prove that this projection may (in theory) improve the performance of continual learning methods. 
In our extensive experimental results, we apply CPR to several state-of-the-art regularization-based continual learning methods and benchmark performance on popular image recognition datasets. Our results demonstrate that CPR indeed promotes wide local minima and significantly improves both accuracy and plasticity while simultaneously mitigating the catastrophic forgetting of baseline continual learning methods. \ No newline at end of file diff --git a/data/2021/iclr/CPT: Efficient Deep Neural Network Training via Cyclic Precision b/data/2021/iclr/CPT: Efficient Deep Neural Network Training via Cyclic Precision new file mode 100644 index 0000000000..899325c088 --- /dev/null +++ b/data/2021/iclr/CPT: Efficient Deep Neural Network Training via Cyclic Precision @@ -0,0 +1 @@ +Low-precision deep neural network (DNN) training has gained tremendous attention as reducing precision is one of the most effective knobs for boosting DNNs' training time/energy efficiency. In this paper, we attempt to explore low-precision training from a new perspective as inspired by recent findings in understanding DNN training: we conjecture that DNNs' precision might have a similar effect as the learning rate during DNN training, and advocate dynamic precision along the training trajectory for further boosting the time/energy efficiency of DNN training. Specifically, we propose Cyclic Precision Training (CPT) to cyclically vary the precision between two boundary values which can be identified using a simple precision range test within the first few training epochs. Extensive simulations and ablation studies on five datasets and eleven models demonstrate that CPT's effectiveness is consistent across various models/tasks (including classification and language modeling). 
Furthermore, through experiments and visualization we show that CPT helps to (1) converge to wider minima with lower generalization error and (2) reduce training variance, which we believe opens up a new design knob for simultaneously improving the optimization and efficiency of DNN training. Our codes are available at: this https URL \ No newline at end of file diff --git a/data/2021/iclr/CT-Net: Channel Tensorization Network for Video Classification b/data/2021/iclr/CT-Net: Channel Tensorization Network for Video Classification new file mode 100644 index 0000000000..30306fbca8 --- /dev/null +++ b/data/2021/iclr/CT-Net: Channel Tensorization Network for Video Classification @@ -0,0 +1 @@ +3D convolution is powerful for video classification but often computationally expensive; recent studies mainly focus on decomposing it along spatial-temporal and/or channel dimensions. Unfortunately, most approaches fail to achieve a preferable balance between convolutional efficiency and feature-interaction sufficiency. For this reason, we propose a concise and novel Channel Tensorization Network (CT-Net), by treating the channel dimension of the input feature as a multiplication of K sub-dimensions. On one hand, it naturally factorizes convolution in a multi-dimensional way, leading to a light computation burden. On the other hand, it can effectively enhance feature interaction from different channels, and progressively enlarge the 3D receptive field of such interaction to boost classification accuracy. Furthermore, we equip our CT-Module with a Tensor Excitation (TE) mechanism. It can learn to exploit spatial, temporal and channel attention in a high-dimensional manner, to improve the cooperative power of all the feature dimensions in our CT-Module. Finally, we flexibly adapt ResNet as our CT-Net. Extensive experiments are conducted on several challenging video benchmarks, e.g., Kinetics-400, Something-Something V1 and V2. 
Our CT-Net outperforms a number of recent SOTA approaches, in terms of accuracy and/or efficiency. The codes and models will be available on https://github.com/Andy1621/CT-Net. \ No newline at end of file diff --git a/data/2021/iclr/CaPC Learning: Confidential and Private Collaborative Learning b/data/2021/iclr/CaPC Learning: Confidential and Private Collaborative Learning new file mode 100644 index 0000000000..cd4057f864 --- /dev/null +++ b/data/2021/iclr/CaPC Learning: Confidential and Private Collaborative Learning @@ -0,0 +1 @@ +Machine learning benefits from large training datasets, which may not always be possible to collect by any single entity, especially when using privacy-sensitive data. In many contexts, such as healthcare and finance, separate parties may wish to collaborate and learn from each other's data but are prevented from doing so due to privacy regulations. Some regulations prevent explicit sharing of data between parties by joining datasets in a central location (confidentiality). Others also limit implicit sharing of data, e.g., through model predictions (privacy). There is currently no method that enables machine learning in such a setting, where both confidentiality and privacy need to be preserved, to prevent both explicit and implicit sharing of data. Federated learning only provides confidentiality, not privacy, since gradients shared still contain private information. Differentially private learning assumes unreasonably large datasets. Furthermore, both of these learning paradigms produce a central model whose architecture was previously agreed upon by all parties rather than enabling collaborative learning where each party learns and improves their own local model. We introduce Confidential and Private Collaborative (CaPC) learning, the first method provably achieving both confidentiality and privacy in a collaborative setting. 
We leverage secure multi-party computation (MPC), homomorphic encryption (HE), and other techniques in combination with privately aggregated teacher models. We demonstrate how CaPC allows participants to collaborate without having to explicitly join their training sets or train a central model. Each party is able to improve the accuracy and fairness of their model, even in settings where each party has a model that performs well on their own dataset or when datasets are not IID and model architectures are heterogeneous across parties. \ No newline at end of file diff --git a/data/2021/iclr/Calibration of Neural Networks using Splines b/data/2021/iclr/Calibration of Neural Networks using Splines new file mode 100644 index 0000000000..abd7e2d0bc --- /dev/null +++ b/data/2021/iclr/Calibration of Neural Networks using Splines @@ -0,0 +1 @@ +Calibrating neural networks is of utmost importance when employing them in safety-critical applications where the downstream decision making depends on the predicted probabilities. Measuring calibration error amounts to comparing two empirical distributions. In this work, we introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test in which the main idea is to compare the respective cumulative probability distributions. From this, by approximating the empirical cumulative distribution using a differentiable function via splines, we obtain a recalibration function, which maps the network outputs to actual (calibrated) class assignment probabilities. The spline-fitting is performed using a held-out calibration set and the obtained recalibration function is evaluated on an unseen test set. We tested our method against existing calibration approaches on various image classification datasets and our spline-based recalibration approach consistently outperforms existing methods on KS error as well as other commonly used calibration measures. 
\ No newline at end of file diff --git a/data/2021/iclr/Calibration tests beyond classification b/data/2021/iclr/Calibration tests beyond classification new file mode 100644 index 0000000000..1986939999 --- /dev/null +++ b/data/2021/iclr/Calibration tests beyond classification @@ -0,0 +1 @@ +Most supervised machine learning tasks are subject to irreducible prediction errors. Probabilistic predictive models address this limitation by providing probability distributions that represent a belief over plausible targets, rather than point estimates. Such models can be a valuable tool in decision-making under uncertainty, provided that the model output is meaningful and interpretable. Calibrated models guarantee that the probabilistic predictions are neither over- nor under-confident. In the machine learning literature, different measures and statistical tests have been proposed and studied for evaluating the calibration of classification models. For regression problems, however, research has been focused on a weaker condition of calibration based on predicted quantiles for real-valued targets. In this paper, we propose the first framework that unifies calibration evaluation and tests for probabilistic predictive models. It applies to any such model, including classification and regression models of arbitrary dimension. Furthermore, the framework generalizes existing measures and provides a more intuitive reformulation of a recently proposed framework for calibration in multi-class classification. \ No newline at end of file diff --git a/data/2021/iclr/Can a Fruit Fly Learn Word Embeddings? b/data/2021/iclr/Can a Fruit Fly Learn Word Embeddings? new file mode 100644 index 0000000000..cb72aee6cf --- /dev/null +++ b/data/2021/iclr/Can a Fruit Fly Learn Word Embeddings? @@ -0,0 +1 @@ +The mushroom body of the fruit fly brain is one of the best studied systems in neuroscience. 
At its core it consists of a population of Kenyon cells, which receive inputs from multiple sensory modalities. These cells are inhibited by the anterior paired lateral neuron, thus creating a sparse high dimensional representation of the inputs. In this work we study a mathematical formalization of this network motif and apply it to learning the correlational structure between words and their context in a corpus of unstructured text, a common natural language processing (NLP) task. We show that this network can learn semantic representations of words and can generate both static and context-dependent word embeddings. Unlike conventional methods (e.g., BERT, GloVe) that use dense representations for word embedding, our algorithm encodes semantic meaning of words and their context in the form of sparse binary hash codes. The quality of the learned representations is evaluated on word similarity analysis, word-sense disambiguation, and document classification. It is shown that not only can the fruit fly network motif achieve performance comparable to existing methods in NLP, but, additionally, it uses only a fraction of the computational resources (shorter training time and smaller memory footprint). \ No newline at end of file diff --git a/data/2021/iclr/Capturing Label Characteristics in VAEs b/data/2021/iclr/Capturing Label Characteristics in VAEs new file mode 100644 index 0000000000..ebf81bcd77 --- /dev/null +++ b/data/2021/iclr/Capturing Label Characteristics in VAEs @@ -0,0 +1 @@ +We present a principled approach to incorporating labels in VAEs that captures the rich characteristic information associated with those labels. While prior work has typically conflated these by learning latent variables that directly correspond to label values, we argue this is contrary to the intended effect of supervision in VAEs-capturing rich label characteristics with the latents. 
For example, we may want to capture the characteristics of a face that make it look young, rather than just the age of the person. To this end, we develop the CCVAE, a novel VAE model and concomitant variational objective which captures label characteristics explicitly in the latent space, eschewing direct correspondences between label values and latents. Through judicious structuring of mappings between such characteristic latents and labels, we show that the CCVAE can effectively learn meaningful representations of the characteristics of interest across a variety of supervision schemes. In particular, we show that the CCVAE allows for more effective and more general interventions to be performed, such as smooth traversals within the characteristics for a given label, diverse conditional generation, and transferring characteristics across datapoints. \ No newline at end of file diff --git a/data/2021/iclr/Categorical Normalizing Flows via Continuous Transformations b/data/2021/iclr/Categorical Normalizing Flows via Continuous Transformations new file mode 100644 index 0000000000..0fed0bf9ad --- /dev/null +++ b/data/2021/iclr/Categorical Normalizing Flows via Continuous Transformations @@ -0,0 +1 @@ +Despite their popularity, to date, the application of normalizing flows to categorical data remains limited. The current practice of using dequantization to map discrete data to a continuous space is inapplicable as categorical data has no intrinsic order. Instead, categorical data have complex and latent relations that must be inferred, like the synonymy between words. In this paper, we investigate Categorical Normalizing Flows, that is, normalizing flows for categorical data. By casting the encoding of categorical data in continuous space as a variational inference problem, we jointly optimize the continuous representation and the model likelihood. To maintain unique decoding, we learn a partitioning of the latent space by factorizing the posterior. 
Meanwhile, the complex relations between the categorical variables are learned by the ensuing normalizing flow, thus maintaining a close-to-exact likelihood estimate and making it possible to scale up to a large number of categories. Based on Categorical Normalizing Flows, we propose GraphCNF, a permutation-invariant generative model on graphs, outperforming both one-shot and autoregressive flow-based state-of-the-art on molecule generation. \ No newline at end of file diff --git a/data/2021/iclr/CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning b/data/2021/iclr/CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning new file mode 100644 index 0000000000..e0c608d2b6 --- /dev/null +++ b/data/2021/iclr/CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning @@ -0,0 +1 @@ +Despite recent successes of reinforcement learning (RL), it remains a challenge for agents to transfer learned skills to related environments. To facilitate research addressing this problem, we propose CausalWorld, a benchmark for causal structure and transfer learning in a robotic manipulation environment. The environment is a simulation of an open-source robotic platform, hence offering the possibility of sim-to-real transfer. Tasks consist of constructing 3D shapes from a given set of blocks - inspired by how children learn to build complex structures. The key strength of CausalWorld is that it provides a combinatorial family of such tasks with common causal structure and underlying factors (including, e.g., robot and object masses, colors, sizes). The user (or the agent) may intervene on all causal variables, which allows for fine-grained control over how similar different tasks (or task distributions) are. 
One can thus easily define training and evaluation distributions of a desired difficulty level, targeting a specific form of generalization (e.g., only changes in appearance or object mass). Further, this common parametrization facilitates defining curricula by interpolating between an initial and a target task. While users may define their own task distributions, we present eight meaningful distributions as concrete benchmarks, ranging from simple to very challenging, all of which require long-horizon planning as well as precise low-level motor control. Finally, we provide baseline results for a subset of these tasks on distinct training curricula and corresponding evaluation protocols, verifying the feasibility of the tasks in this benchmark. \ No newline at end of file diff --git a/data/2021/iclr/CcGAN: Continuous Conditional Generative Adversarial Networks for Image Generation b/data/2021/iclr/CcGAN: Continuous Conditional Generative Adversarial Networks for Image Generation new file mode 100644 index 0000000000..0db0b6a81b --- /dev/null +++ b/data/2021/iclr/CcGAN: Continuous Conditional Generative Adversarial Networks for Image Generation @@ -0,0 +1 @@ +This work proposes the continuous conditional generative adversarial network (CcGAN), the first generative model for image generation conditional on continuous, scalar conditions (termed regression labels). Existing conditional GANs (cGANs) are mainly designed for categorical conditions (e.g., class labels); conditioning on regression labels is mathematically distinct and raises two fundamental problems: (P1) Since there may be very few (even zero) real images for some regression labels, minimizing existing empirical versions of cGAN losses (a.k.a. empirical cGAN losses) often fails in practice; (P2) Since regression labels are scalar and infinitely many, conventional label input methods are not applicable. 
The proposed CcGAN solves the above problems, respectively, by (S1) reformulating existing empirical cGAN losses to be appropriate for the continuous scenario; and (S2) proposing a naive label input (NLI) method and an improved label input (ILI) method to incorporate regression labels into the generator and the discriminator. The reformulation in (S1) leads to two novel empirical discriminator losses, termed the hard vicinal discriminator loss (HVDL) and the soft vicinal discriminator loss (SVDL) respectively, and a novel empirical generator loss. The error bounds of a discriminator trained with HVDL and SVDL are derived under mild assumptions in this work. Two new benchmark datasets (RC-49 and Cell-200) and a novel evaluation metric (Sliding Frechet Inception Distance) are also proposed for this continuous scenario. Our experiments on the Circular 2-D Gaussians, RC-49, UTKFace, Cell-200, and Steering Angle datasets show that CcGAN can generate diverse, high-quality samples from the image distribution conditional on a given regression label. Moreover, in these experiments, CcGAN substantially outperforms cGAN both visually and quantitatively. \ No newline at end of file diff --git a/data/2021/iclr/Certify or Predict: Boosting Certified Robustness with Compositional Architectures b/data/2021/iclr/Certify or Predict: Boosting Certified Robustness with Compositional Architectures new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Chaos of Learning Beyond Zero-sum and Coordination via Game Decompositions b/data/2021/iclr/Chaos of Learning Beyond Zero-sum and Coordination via Game Decompositions new file mode 100644 index 0000000000..7a0beb2ece --- /dev/null +++ b/data/2021/iclr/Chaos of Learning Beyond Zero-sum and Coordination via Game Decompositions @@ -0,0 +1 @@ +Machine learning processes, e.g. ''learning in games'', can be viewed as non-linear dynamical systems. 
In general, such systems exhibit a wide spectrum of behaviors, ranging from stability/recurrence to the undesirable phenomenon of chaos (or ''butterfly effect''). Chaos captures sensitivity to round-off errors and can severely affect the predictability and reproducibility of ML systems, but the AI/ML community's understanding of it remains rudimentary, and much awaits exploration. Recently, Cheung and Piliouras employed a volume-expansion argument to show that Lyapunov chaos occurs in the cumulative payoff space, when some popular learning algorithms, including Multiplicative Weights Update (MWU), Follow-the-Regularized-Leader (FTRL) and Optimistic MWU (OMWU), are used in several subspaces of games, e.g. zero-sum, coordination or graphical constant-sum games. It is natural to ask: can these results generalize to much broader families of games? We take a game decomposition approach and answer the question affirmatively. Among other results, we propose a notion of ''matrix domination'' and design a linear program, and use them to characterize bimatrix games where MWU is Lyapunov chaotic almost everywhere. This family of games has positive Lebesgue measure in the bimatrix game space, indicating that chaos is a substantial issue of learning in games. For multi-player games, we present a local equivalence of volume change between general games and graphical games, which is used to perform volume and chaos analyses of MWU and OMWU in potential games. 
\ No newline at end of file diff --git a/data/2021/iclr/Characterizing signal propagation to close the performance gap in unnormalized ResNets b/data/2021/iclr/Characterizing signal propagation to close the performance gap in unnormalized ResNets new file mode 100644 index 0000000000..bcb1538010 --- /dev/null +++ b/data/2021/iclr/Characterizing signal propagation to close the performance gap in unnormalized ResNets @@ -0,0 +1 @@ +Batch Normalization is a key component in almost all state-of-the-art image classifiers, but it also introduces practical challenges: it breaks the independence between training examples within a batch, can incur compute and memory overhead, and often results in unexpected bugs. Building on recent theoretical analyses of deep ResNets at initialization, we propose a simple set of analysis tools to characterize signal propagation on the forward pass, and leverage these tools to design highly performant ResNets without activation normalization layers. Crucial to our success is an adapted version of the recently proposed Weight Standardization. Our analysis tools show how this technique preserves the signal in networks with ReLU or Swish activation functions by ensuring that the per-channel activation means do not grow with depth. Across a range of FLOP budgets, our networks attain performance competitive with the state-of-the-art EfficientNets on ImageNet. \ No newline at end of file diff --git a/data/2021/iclr/ChipNet: Budget-Aware Pruning with Heaviside Continuous Approximations b/data/2021/iclr/ChipNet: Budget-Aware Pruning with Heaviside Continuous Approximations new file mode 100644 index 0000000000..bebe33ecaa --- /dev/null +++ b/data/2021/iclr/ChipNet: Budget-Aware Pruning with Heaviside Continuous Approximations @@ -0,0 +1 @@ +Structured pruning methods are among the effective strategies for extracting small resource-efficient convolutional neural networks from their dense counterparts with minimal loss in accuracy. 
However, most existing methods still suffer from one or more limitations, including 1) the need for training the dense model from scratch with pruning-related parameters embedded in the architecture, 2) requiring model-specific hyperparameter settings, 3) inability to include budget-related constraints in the training process, and 4) instability under scenarios of extreme pruning. In this paper, we present ChipNet, a deterministic pruning strategy that employs a continuous Heaviside function and a novel crispness loss to identify a highly sparse network out of an existing dense network. Our choice of a continuous Heaviside function is inspired by the field of design optimization, where the material distribution task is posed as a continuous optimization problem, but only discrete values (0 or 1) are practically feasible and expected as final outcomes. Our approach's flexible design facilitates its use with different choices of budget constraints while maintaining stability for very low target budgets. Experimental results show that ChipNet outperforms state-of-the-art structured pruning methods by remarkable margins of up to 16.1% in terms of accuracy. Further, we show that the masks obtained with ChipNet are transferable across datasets. In certain cases, we observed that masks transferred from a model trained on a feature-rich teacher dataset provide better performance on the student dataset than those obtained by directly pruning on the student data itself.
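The core trick is that a continuous approximation of the Heaviside step lets gradients flow through what is ultimately a binary keep/drop mask. A minimal sketch (a common sigmoid-based approximation; the sharpness parameter and score values are illustrative, not ChipNet's exact formulation):

```python
import numpy as np

def soft_heaviside(x, beta=10.0):
    """Continuous approximation of the Heaviside step: sigmoid(beta * x).
    As beta grows, outputs sharpen toward the discrete 0/1 mask values."""
    return 1.0 / (1.0 + np.exp(-beta * x))

scores = np.array([-0.5, -0.01, 0.01, 0.5])    # learnable per-channel pruning scores
soft_mask = soft_heaviside(scores, beta=2.0)    # smooth: gradients flow during training
hard_mask = soft_heaviside(scores, beta=200.0)  # near-binary at high beta
assert np.allclose(np.round(hard_mask), [0.0, 0.0, 1.0, 1.0])
```

Annealing beta (or adding a crispness-style loss, as the paper does) pushes the soft mask toward the discrete values that are actually feasible at deployment.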
\ No newline at end of file diff --git a/data/2021/iclr/Clairvoyance: A Pipeline Toolkit for Medical Time Series b/data/2021/iclr/Clairvoyance: A Pipeline Toolkit for Medical Time Series new file mode 100644 index 0000000000..43d6abeeb7 --- /dev/null +++ b/data/2021/iclr/Clairvoyance: A Pipeline Toolkit for Medical Time Series @@ -0,0 +1 @@ +Time-series learning is the bread and butter of data-driven *clinical decision support*, and the recent explosion in ML research has demonstrated great potential in various healthcare settings. At the same time, medical time-series problems in the wild are challenging due to their highly *composite* nature: They entail design choices and interactions among components that preprocess data, impute missing values, select features, issue predictions, estimate uncertainty, and interpret models. Despite exponential growth in electronic patient data, there is a remarkable gap between the potential and realized utilization of ML for clinical research and decision support. In particular, orchestrating a real-world project lifecycle poses challenges in engineering (i.e. hard to build), evaluation (i.e. hard to assess), and efficiency (i.e. hard to optimize). Designed to address these issues simultaneously, Clairvoyance proposes a unified, end-to-end, autoML-friendly pipeline that serves as a (i) software toolkit, (ii) empirical standard, and (iii) interface for optimization. Our ultimate goal lies in facilitating transparent and reproducible experimentation with complex inference workflows, providing integrated pathways for (1) personalized prediction, (2) treatment-effect estimation, and (3) information acquisition. Through illustrative examples on real-world data in outpatient, general-ward, and intensive-care settings, we demonstrate the applicability of the pipeline paradigm on core tasks in the healthcare journey.
To the best of our knowledge, Clairvoyance is the first to demonstrate the viability of a comprehensive and automatable pipeline for clinical time-series ML. \ No newline at end of file diff --git a/data/2021/iclr/Class Normalization for (Continual)? Generalized Zero-Shot Learning b/data/2021/iclr/Class Normalization for (Continual)? Generalized Zero-Shot Learning new file mode 100644 index 0000000000..33ad44472c --- /dev/null +++ b/data/2021/iclr/Class Normalization for (Continual)? Generalized Zero-Shot Learning @@ -0,0 +1 @@ +Normalization techniques have proved to be a crucial ingredient of successful training in a traditional supervised learning regime. However, in the zero-shot learning (ZSL) world, these ideas have received only marginal attention. This work studies normalization in the ZSL scenario from both theoretical and practical perspectives. First, we give a theoretical explanation for two popular tricks used in zero-shot learning, normalize+scale and attributes normalization, and show that they help training by preserving variance during a forward pass. Next, we demonstrate that they are insufficient to normalize a deep ZSL model and propose Class Normalization (CN): a normalization scheme, which alleviates this issue both provably and in practice. Third, we show that ZSL models typically have a more irregular loss surface compared to traditional classifiers and that the proposed method partially remedies this problem. Then, we test our approach on 4 standard ZSL datasets and outperform sophisticated modern SotA methods with a simple MLP that is optimized without any bells and whistles and trains ≈50 times faster. Finally, we generalize ZSL to a broader problem, continual ZSL, and introduce some principled metrics and rigorous baselines for this new setup. The source code is available at https://github.com/universome/class-norm.
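The normalize+scale trick mentioned above is easy to state concretely. A minimal sketch (illustrative shapes and scale value, not the paper's exact setup): L2-normalize features and class attribute vectors, then rescale, so logits are bounded cosine similarities regardless of raw embedding norms:

```python
import numpy as np

def normalize_scale(v, scale=5.0):
    """normalize+scale: project rows onto the unit sphere, then rescale.
    This keeps logit magnitudes controlled irrespective of raw norms."""
    return scale * v / np.linalg.norm(v, axis=-1, keepdims=True)

feats = np.random.default_rng(0).standard_normal((4, 16))    # image features (toy)
attrs = np.random.default_rng(1).standard_normal((10, 16))   # class attribute vectors (toy)
logits = normalize_scale(feats) @ normalize_scale(attrs, scale=1.0).T
assert logits.shape == (4, 10)
assert np.all(np.abs(logits) <= 5.0 + 1e-9)  # cosine in [-1, 1] times the scale
```

Bounding the logits this way is what preserves variance through the forward pass, which is the property the paper's analysis builds on.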
\ No newline at end of file diff --git a/data/2021/iclr/Clustering-friendly Representation Learning via Instance Discrimination and Feature Decorrelation b/data/2021/iclr/Clustering-friendly Representation Learning via Instance Discrimination and Feature Decorrelation new file mode 100644 index 0000000000..b6d1611d96 --- /dev/null +++ b/data/2021/iclr/Clustering-friendly Representation Learning via Instance Discrimination and Feature Decorrelation @@ -0,0 +1 @@ +Clustering is one of the most fundamental tasks in machine learning. Recently, deep clustering has become a major trend in clustering techniques. Representation learning often plays an important role in the effectiveness of deep clustering, and thus can be a principal cause of performance degradation. In this paper, we propose a clustering-friendly representation learning method using instance discrimination and feature decorrelation. Our deep-learning-based representation learning method is motivated by the properties of classical spectral clustering. Instance discrimination learns similarities among data and feature decorrelation removes redundant correlation among features. We utilize an instance discrimination method in which learning individual instance classes leads to learning similarity among instances. Through detailed experiments and examination, we show that the approach can be adapted to learning a latent space for clustering. We design novel softmax-formulated decorrelation constraints for learning. In evaluations of image clustering using CIFAR-10 and ImageNet-10, our method achieves accuracy of 81.5% and 95.4%, respectively. We also show that the softmax-formulated constraints are compatible with various neural networks. 
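The feature-decorrelation idea above can be made concrete with a simple penalty. This is a plain off-diagonal correlation loss for illustration only (the paper uses softmax-formulated constraints, which are not reproduced here):

```python
import numpy as np

def decorrelation_loss(features):
    """Penalize redundant correlation among feature dimensions:
    sum of squared off-diagonal entries of the feature correlation matrix."""
    z = features - features.mean(axis=0)
    z = z / (z.std(axis=0) + 1e-8)          # standardize each dimension
    corr = (z.T @ z) / len(z)               # empirical correlation matrix
    off_diag = corr - np.diag(np.diag(corr))
    return np.sum(off_diag ** 2)

rng = np.random.default_rng(0)
independent = rng.standard_normal((1024, 8))               # decorrelated features
redundant = np.repeat(rng.standard_normal((1024, 1)), 8, axis=1)  # fully redundant
assert decorrelation_loss(independent) < decorrelation_loss(redundant)
```

Minimizing such a term alongside an instance-discrimination objective is what pushes the learned space toward the spectral-clustering-like structure the paper targets.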
\ No newline at end of file diff --git a/data/2021/iclr/Co-Mixup: Saliency Guided Joint Mixup with Supermodular Diversity b/data/2021/iclr/Co-Mixup: Saliency Guided Joint Mixup with Supermodular Diversity new file mode 100644 index 0000000000..1f0b6d55f1 --- /dev/null +++ b/data/2021/iclr/Co-Mixup: Saliency Guided Joint Mixup with Supermodular Diversity @@ -0,0 +1 @@ +While deep neural networks show great performance at fitting the training distribution, improving their generalization to the test distribution and their robustness to input perturbations still remains a challenge. Although a number of mixup-based augmentation strategies have been proposed to partially address them, it remains unclear how best to utilize the supervisory signal within each input data for mixup from the optimization perspective. We propose a new perspective on batch mixup and formulate the optimal construction of a batch of mixup data maximizing the data saliency measure of each individual mixup data and encouraging the supermodular diversity among the constructed mixup data. This leads to a novel discrete optimization problem minimizing the difference between submodular functions. We also propose an efficient modular-approximation-based iterative submodular minimization algorithm for efficient mixup computation per minibatch, suitable for minibatch-based neural network training. Our experiments show the proposed method achieves state-of-the-art generalization, calibration, and weakly supervised localization results compared to other mixup methods. The source code is available at https://github.com/snu-mllab/Co-Mixup.
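Co-Mixup's batch-level submodular optimization is well beyond a snippet, but the underlying saliency-guided mixing of a single pair can be sketched. This is a heavily simplified, hypothetical two-input version for intuition only, not the paper's algorithm:

```python
import numpy as np

def saliency_mixup(x1, y1, x2, y2, s1, s2):
    """Mix two inputs with per-location weights proportional to saliency,
    and mix labels by the average contribution of each input."""
    w = s1 / (s1 + s2 + 1e-8)       # per-pixel weight favoring salient regions of x1
    x_mix = w * x1 + (1 - w) * x2
    lam = float(w.mean())           # label weight = average pixel contribution of x1
    y_mix = lam * y1 + (1 - lam) * y2
    return x_mix, y_mix

rng = np.random.default_rng(0)
x1, x2 = rng.random((8, 8)), rng.random((8, 8))   # toy "images"
s1, s2 = rng.random((8, 8)), rng.random((8, 8))   # toy saliency maps
y1, y2 = np.eye(10)[3], np.eye(10)[7]             # one-hot labels
x_mix, y_mix = saliency_mixup(x1, y1, x2, y2, s1, s2)
assert abs(y_mix.sum() - 1.0) < 1e-9              # mixed label is still a distribution
```

The paper generalizes this pairwise idea to jointly choosing the whole batch of mixups, trading saliency against diversity via submodular minimization.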
\ No newline at end of file diff --git a/data/2021/iclr/CoCo: Controllable Counterfactuals for Evaluating Dialogue State Trackers b/data/2021/iclr/CoCo: Controllable Counterfactuals for Evaluating Dialogue State Trackers new file mode 100644 index 0000000000..e927a2017c --- /dev/null +++ b/data/2021/iclr/CoCo: Controllable Counterfactuals for Evaluating Dialogue State Trackers @@ -0,0 +1 @@ +Dialogue state trackers have made significant progress on benchmark datasets, but their generalization capability to novel and realistic scenarios beyond the held-out conversations is less understood. We propose controllable counterfactuals (CoCo) to bridge this gap and evaluate dialogue state tracking (DST) models on novel scenarios, i.e., would the system successfully tackle the request if the user responded differently but still consistently with the dialogue flow? CoCo leverages turn-level belief states as counterfactual conditionals to produce novel conversation scenarios in two steps: (i) counterfactual goal generation at turn-level by dropping and adding slots followed by replacing slot values, (ii) counterfactual conversation generation that is conditioned on (i) and consistent with the dialogue flow. Evaluating state-of-the-art DST models on MultiWOZ dataset with CoCo-generated counterfactuals results in a significant performance drop of up to 30.8% (from 49.4% to 18.6%) in absolute joint goal accuracy. In comparison, widely used techniques like paraphrasing only affect the accuracy by at most 2%. Human evaluations show that CoCo-generated conversations perfectly reflect the underlying user goal with more than 95% accuracy and are as human-like as the original conversations, further strengthening its reliability and promise to be adopted as part of the robustness evaluation of DST models. 
\ No newline at end of file diff --git a/data/2021/iclr/CoCon: A Self-Supervised Approach for Controlled Text Generation b/data/2021/iclr/CoCon: A Self-Supervised Approach for Controlled Text Generation new file mode 100644 index 0000000000..71b323b407 --- /dev/null +++ b/data/2021/iclr/CoCon: A Self-Supervised Approach for Controlled Text Generation @@ -0,0 +1 @@ +Pretrained Transformer-based language models (LMs) display remarkable natural language generation capabilities. Given their immense potential, controlling the text generation of such LMs has attracted growing attention. While there are studies that seek to control high-level attributes (such as sentiment and topic) of generated text, there is still a lack of more precise control over its content at the word- and phrase-level. Here, we propose Content-Conditioner (CoCon) to control an LM's output text with a target content, at a fine-grained level. In our self-supervised approach, the CoCon block learns to help the LM complete a partially-observed text sequence by conditioning with content inputs that are withheld from the LM. Through experiments, we show that CoCon can naturally incorporate target content into generated texts and control high-level text attributes in a zero-shot manner. \ No newline at end of file diff --git a/data/2021/iclr/CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding b/data/2021/iclr/CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding new file mode 100644 index 0000000000..5afc873b25 --- /dev/null +++ b/data/2021/iclr/CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding @@ -0,0 +1 @@ +Data augmentation has been demonstrated as an effective strategy for improving model generalization and data efficiency. However, due to the discrete nature of natural language, designing label-preserving transformations for text data tends to be more challenging.
In this paper, we propose a novel data augmentation framework dubbed CoDA, which synthesizes diverse and informative augmented examples by integrating multiple transformations organically. Moreover, a contrastive regularization objective is introduced to capture the global relationship among all the data samples. A momentum encoder along with a memory bank is further leveraged to better estimate the contrastive loss. To verify the effectiveness of the proposed framework, we apply CoDA to Transformer-based models on a wide range of natural language understanding tasks. On the GLUE benchmark, CoDA gives rise to an average improvement of 2.2% when applied to the RoBERTa-large model. More importantly, it consistently exhibits stronger results relative to several competitive data augmentation and adversarial training baselines (including in low-resource settings). Extensive experiments show that the proposed contrastive objective can be flexibly combined with various data augmentation approaches to further boost their performance, highlighting the wide applicability of the CoDA framework. \ No newline at end of file diff --git a/data/2021/iclr/Collective Robustness Certificates: Exploiting Interdependence in Graph Neural Networks b/data/2021/iclr/Collective Robustness Certificates: Exploiting Interdependence in Graph Neural Networks new file mode 100644 index 0000000000..058105fd2b --- /dev/null +++ b/data/2021/iclr/Collective Robustness Certificates: Exploiting Interdependence in Graph Neural Networks @@ -0,0 +1 @@ +In tasks like node classification, image segmentation, and named-entity recognition we have a classifier that simultaneously outputs multiple predictions (a vector of labels) based on a single input, i.e. a single graph, image, or document respectively. Existing adversarial robustness certificates consider each prediction independently and are thus overly pessimistic for such tasks.
They implicitly assume that an adversary can use different perturbed inputs to attack different predictions, ignoring the fact that we have a single shared input. We propose the first collective robustness certificate which computes the number of predictions that are simultaneously guaranteed to remain stable under perturbation, i.e. cannot be attacked. We focus on Graph Neural Networks and leverage their locality property (perturbations only affect the predictions in a close neighborhood) to fuse multiple single-node certificates into a drastically stronger collective certificate. For example, on the Citeseer dataset our collective certificate for node classification increases the average number of certifiable feature perturbations from $7$ to $351$. \ No newline at end of file diff --git a/data/2021/iclr/Colorization Transformer b/data/2021/iclr/Colorization Transformer new file mode 100644 index 0000000000..f5cc09fae5 --- /dev/null +++ b/data/2021/iclr/Colorization Transformer @@ -0,0 +1 @@ +We present the Colorization Transformer, a novel approach for diverse high-fidelity image colorization based on self-attention. Given a grayscale image, the colorization proceeds in three steps. We first use a conditional autoregressive transformer to produce a low-resolution coarse coloring of the grayscale image. Our architecture adopts conditional transformer layers to effectively condition on the grayscale input. Two subsequent fully parallel networks upsample the coarsely colored low-resolution image into a finely colored high-resolution image. Sampling from the Colorization Transformer produces diverse colorings whose fidelity outperforms the previous state-of-the-art on colorizing ImageNet, based on FID results and on a human evaluation in a Mechanical Turk test. Remarkably, in more than 60% of cases human evaluators prefer the highest-rated among three generated colorings over the ground truth.
The code and pre-trained checkpoints for Colorization Transformer are publicly available at https://github.com/google-research/google-research/tree/master/coltran \ No newline at end of file diff --git a/data/2021/iclr/Combining Ensembles and Data Augmentation Can Harm Your Calibration b/data/2021/iclr/Combining Ensembles and Data Augmentation Can Harm Your Calibration new file mode 100644 index 0000000000..b9113d285e --- /dev/null +++ b/data/2021/iclr/Combining Ensembles and Data Augmentation Can Harm Your Calibration @@ -0,0 +1 @@ +Ensemble methods, which average over multiple neural network predictions, are a simple approach to improve a model's calibration and robustness. Similarly, data augmentation techniques, which encode prior information in the form of invariant feature transformations, are effective for improving calibration and robustness. In this paper, we show a surprising pathology: combining ensembles and data augmentation can harm model calibration. This leads to a trade-off in practice, whereby the improved accuracy from combining the two techniques comes at the expense of calibration. On the other hand, selecting only one of the techniques ensures good uncertainty estimates at the expense of accuracy. We investigate this pathology and identify a compounding under-confidence among methods which marginalize over sets of weights and data augmentation techniques which soften labels. Finally, we propose a simple correction, achieving the best of both worlds with significant accuracy and calibration gains over using only ensembles or data augmentation individually. Applying the correction produces new state-of-the-art results in uncertainty calibration across CIFAR-10, CIFAR-100, and ImageNet.
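Calibration claims like the ones above are typically measured with expected calibration error (ECE). A minimal binned-ECE sketch (standard equal-width binning; the toy predictor below is made up for illustration):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: average |accuracy - confidence| over equal-width confidence
    bins, weighted by the fraction of samples falling in each bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# A perfectly calibrated toy predictor (75% confident, right ~75% of the time)
# should have ECE near zero; under-confident models score much higher.
conf = np.full(1000, 0.75)
rng = np.random.default_rng(0)
correct = rng.random(1000) < 0.75
assert expected_calibration_error(conf, correct) < 0.1
```

The compounding under-confidence the paper describes would show up here as confidences systematically below the bin accuracies.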
\ No newline at end of file diff --git a/data/2021/iclr/Combining Label Propagation and Simple Models out-performs Graph Neural Networks b/data/2021/iclr/Combining Label Propagation and Simple Models out-performs Graph Neural Networks new file mode 100644 index 0000000000..afea870d86 --- /dev/null +++ b/data/2021/iclr/Combining Label Propagation and Simple Models out-performs Graph Neural Networks @@ -0,0 +1 @@ +Graph Neural Networks (GNNs) are the predominant technique for learning over graphs. However, there is relatively little understanding of why GNNs are successful in practice and whether they are necessary for good performance. Here, we show that for many standard transductive node classification benchmarks, we can exceed or match the performance of state-of-the-art GNNs by combining shallow models that ignore the graph structure with two simple post-processing steps that exploit correlation in the label structure: (i) an "error correlation" that spreads residual errors in training data to correct errors in test data and (ii) a "prediction correlation" that smooths the predictions on the test data. We call this overall procedure Correct and Smooth (C&S), and the post-processing steps are implemented via simple modifications to standard label propagation techniques from early graph-based semi-supervised learning methods. Our approach exceeds or nearly matches the performance of state-of-the-art GNNs on a wide variety of benchmarks, with just a small fraction of the parameters and orders of magnitude faster runtime. For instance, we exceed the best known GNN performance on the OGB-Products dataset with 137 times fewer parameters and greater than 100 times less training time. The performance of our methods highlights how directly incorporating label information into the learning algorithm (as was done in traditional techniques) yields easy and substantial performance gains. We can also incorporate our techniques into big GNN models, providing modest gains. 
Our code for the OGB results is at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/Combining Physics and Machine Learning for Network Flow Estimation b/data/2021/iclr/Combining Physics and Machine Learning for Network Flow Estimation new file mode 100644 index 0000000000..945c9b46d6 --- /dev/null +++ b/data/2021/iclr/Combining Physics and Machine Learning for Network Flow Estimation @@ -0,0 +1 @@ +. \ No newline at end of file diff --git a/data/2021/iclr/Communication in Multi-Agent Reinforcement Learning: Intention Sharing b/data/2021/iclr/Communication in Multi-Agent Reinforcement Learning: Intention Sharing new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/CompOFA - Compound Once-For-All Networks for Faster Multi-Platform Deployment b/data/2021/iclr/CompOFA - Compound Once-For-All Networks for Faster Multi-Platform Deployment new file mode 100644 index 0000000000..18080675e0 --- /dev/null +++ b/data/2021/iclr/CompOFA - Compound Once-For-All Networks for Faster Multi-Platform Deployment @@ -0,0 +1 @@ +The emergence of CNNs in mainstream deployment has necessitated methods to design and train efficient architectures tailored to maximize the accuracy under diverse hardware and latency constraints. To scale these resource-intensive tasks with an increasing number of deployment targets, Once-For-All (OFA) proposed an approach to jointly train several models at once with a constant training cost. However, this cost remains as high as 40-50 GPU days and also suffers from a combinatorial explosion of sub-optimal model configurations. We seek to reduce this search space -- and hence the training budget -- by constraining search to models close to the accuracy-latency Pareto frontier. We incorporate insights about compound relationships between model dimensions to build CompOFA, a design space smaller by several orders of magnitude.
Through experiments on ImageNet, we demonstrate that even with simple heuristics we can achieve a 2x reduction in training time and a 216x speedup in model search/extraction time compared to the state of the art, without loss of Pareto optimality! We also show that this smaller design space is dense enough to support equally accurate models for a similar diversity of hardware and latency targets, while also reducing the complexity of the training and subsequent extraction algorithms. \ No newline at end of file diff --git a/data/2021/iclr/Complex Query Answering with Neural Link Predictors b/data/2021/iclr/Complex Query Answering with Neural Link Predictors new file mode 100644 index 0000000000..7f96b02837 --- /dev/null +++ b/data/2021/iclr/Complex Query Answering with Neural Link Predictors @@ -0,0 +1 @@ +Neural link predictors are useful for identifying missing edges in large-scale Knowledge Graphs. However, it is still not clear how to use these models for answering more complex queries containing logical conjunctions (∧), disjunctions (∨), and existential quantifiers (∃). We propose a framework for efficiently answering complex queries on incomplete Knowledge Graphs. We translate each query into an end-to-end differentiable objective, where the truth value of each atom is computed by a pre-trained neural link predictor. We then analyse two solutions to the optimisation problem, including gradient-based and combinatorial search. In our experiments, the proposed approach produces more accurate results than state-of-the-art methods (black-box models trained on millions of generated queries) without the need for training on a large and diverse set of complex queries. Using orders of magnitude less training data, we obtain relative improvements ranging from 8% up to 40% in Hits@3 across multiple knowledge graphs. We find that it is possible to explain the outcome of our model in terms of the intermediate solutions identified for each of the complex query atoms.
All our source code and datasets are available online (https://github.com/uclnlp/cqd). \ No newline at end of file diff --git a/data/2021/iclr/Computational Separation Between Convolutional and Fully-Connected Networks b/data/2021/iclr/Computational Separation Between Convolutional and Fully-Connected Networks new file mode 100644 index 0000000000..e064caf691 --- /dev/null +++ b/data/2021/iclr/Computational Separation Between Convolutional and Fully-Connected Networks @@ -0,0 +1 @@ +Convolutional neural networks (CNN) exhibit unmatched performance in a multitude of computer vision tasks. However, the advantage of using convolutional networks over fully-connected networks is not understood from a theoretical perspective. In this work, we show how convolutional networks can leverage locality in the data, and thus achieve a computational advantage over fully-connected networks. Specifically, we show a class of problems that can be efficiently solved using convolutional networks trained with gradient-descent, but at the same time is hard to learn using a polynomial-size fully-connected network. \ No newline at end of file diff --git a/data/2021/iclr/Concept Learners for Few-Shot Learning b/data/2021/iclr/Concept Learners for Few-Shot Learning new file mode 100644 index 0000000000..2b52109c22 --- /dev/null +++ b/data/2021/iclr/Concept Learners for Few-Shot Learning @@ -0,0 +1 @@ +Developing algorithms that are able to generalize to a novel task given only a few labeled examples represents a fundamental challenge in closing the gap between machine- and human-level performance. The core of human cognition lies in the structured, reusable concepts that help us to rapidly adapt to new tasks and provide reasoning behind our decisions. However, existing meta-learning methods learn complex representations across prior labeled tasks without imposing any structure on the learned representations. 
Here we propose COMET, a meta-learning method that improves generalization ability by learning to learn along human-interpretable concept dimensions. Instead of learning a joint unstructured metric space, COMET learns mappings of high-level concepts into semi-structured metric spaces, and effectively combines the outputs of independent concept learners. We evaluate our model on few-shot tasks from diverse domains, including fine-grained image classification, document categorization and cell type annotation on a novel dataset from a biological domain developed in our work. COMET significantly outperforms strong meta-learning baselines, achieving a 6-15% relative improvement on the most challenging 1-shot learning tasks, while, unlike existing methods, providing interpretations of the model's predictions. \ No newline at end of file diff --git a/data/2021/iclr/Conditional Generative Modeling via Learning the Latent Space b/data/2021/iclr/Conditional Generative Modeling via Learning the Latent Space new file mode 100644 index 0000000000..f9d2c8003d --- /dev/null +++ b/data/2021/iclr/Conditional Generative Modeling via Learning the Latent Space @@ -0,0 +1 @@ +Although deep learning has achieved appealing results on several machine learning tasks, most of the models are deterministic at inference, limiting their application to single-modal settings. We propose a novel general-purpose framework for conditional generation in multimodal spaces that uses latent variables to model generalizable learning patterns while minimizing a family of regression cost functions. At inference, the latent variables are optimized to find optimal solutions corresponding to multiple output modes. Compared to existing generative solutions, in multimodal spaces, our approach demonstrates faster and more stable convergence, and can learn better representations for downstream tasks.
Importantly, it provides a simple generic model that can beat highly engineered pipelines tailored using domain expertise on a variety of tasks, while generating diverse outputs. Our code will be released. \ No newline at end of file diff --git a/data/2021/iclr/Conditional Negative Sampling for Contrastive Learning of Visual Representations b/data/2021/iclr/Conditional Negative Sampling for Contrastive Learning of Visual Representations new file mode 100644 index 0000000000..c0f7e004ce --- /dev/null +++ b/data/2021/iclr/Conditional Negative Sampling for Contrastive Learning of Visual Representations @@ -0,0 +1 @@ +Recent methods for learning unsupervised visual representations, dubbed contrastive learning, optimize the noise-contrastive estimation (NCE) bound on mutual information between two views of an image. NCE uses randomly sampled negative examples to normalize the objective. In this paper, we show that choosing difficult negatives, or those more similar to the current instance, can yield stronger representations. To do this, we introduce a family of mutual information estimators that sample negatives conditionally -- in a "ring" around each positive. We prove that these estimators lower-bound mutual information, with higher bias but lower variance than NCE. Experimentally, we find that our approach, applied on top of existing models (IR, CMC, and MoCo), improves accuracy by 2-5 percentage points in each case, measured by linear evaluation on four standard image datasets. Moreover, we find continued benefits when transferring features to a variety of new image distributions from the Meta-Dataset collection and to a variety of downstream tasks such as object detection, instance segmentation, and keypoint detection.
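The "ring" of conditional negatives described above can be sketched directly: keep only candidates whose similarity to the anchor falls inside a percentile band (hard, but not the very hardest), then sample from that band. A toy illustration with made-up percentile thresholds, not the paper's exact estimator:

```python
import numpy as np

def ring_negatives(anchor, candidates, k, lower=0.5, upper=0.9):
    """Conditional negative sampling: restrict to candidates whose cosine
    similarity to the anchor lies in a percentile 'ring', then draw k of them."""
    sims = candidates @ anchor / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(anchor) + 1e-8)
    lo, hi = np.quantile(sims, [lower, upper])
    ring = np.where((sims >= lo) & (sims <= hi))[0]   # indices inside the ring
    return np.random.default_rng(0).choice(ring, size=k, replace=False)

rng = np.random.default_rng(1)
anchor = rng.standard_normal(32)        # embedding of the current instance (toy)
cands = rng.standard_normal((500, 32))  # candidate negative embeddings (toy)
idx = ring_negatives(anchor, cands, k=16)
assert len(idx) == 16
```

Excluding the top percentile avoids negatives so similar they are likely semantic duplicates of the anchor, which is the intuition behind the ring rather than a simple top-k selection.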
\ No newline at end of file diff --git a/data/2021/iclr/Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data b/data/2021/iclr/Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data new file mode 100644 index 0000000000..c0c880b675 --- /dev/null +++ b/data/2021/iclr/Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data @@ -0,0 +1 @@ +Multi-Task Learning (MTL) has emerged as a promising approach for transferring learned knowledge across different tasks. However, multi-task learning must deal with challenges such as: overfitting to low resource tasks, catastrophic forgetting, and negative task transfer, or learning interference. Additionally, in Natural Language Processing (NLP), MTL alone has typically not reached the performance level possible through per-task fine-tuning of pretrained models. However, many fine-tuning approaches are both parameter inefficient, e.g. potentially involving one new model per task, and highly susceptible to losing knowledge acquired during pretraining. We propose a novel transformer based architecture consisting of a new conditional attention mechanism as well as a set of task conditioned modules that facilitate weight sharing. Through this construction we achieve more efficient parameter sharing and mitigate forgetting by keeping half of the weights of a pretrained model fixed. We also use a new multi-task data sampling strategy to mitigate the negative effects of data imbalance across tasks. Using this approach we are able to surpass single-task fine-tuning methods while being parameter and data efficient. 
With our base model, we attain 2.2% higher performance compared to a fully fine-tuned BERT-large model on the GLUE benchmark, adding only 5.6% more trained parameters per task (whereas naive fine-tuning potentially adds 100% of the trained parameters per task) and needing only 64.6% of the data. We show that a larger variant of our single multi-task model approach performs competitively across 26 NLP tasks and yields state-of-the-art results on a number of test and development sets. \ No newline at end of file diff --git a/data/2021/iclr/Conformation-Guided Molecular Representation with Hamiltonian Neural Networks b/data/2021/iclr/Conformation-Guided Molecular Representation with Hamiltonian Neural Networks new file mode 100644 index 0000000000..83df078b7d --- /dev/null +++ b/data/2021/iclr/Conformation-Guided Molecular Representation with Hamiltonian Neural Networks @@ -0,0 +1 @@ +Well-designed molecular representations (fingerprints) are vital for combining medicinal chemistry and deep learning. While incorporating the 3D geometry of molecules (i.e., conformations) in their representations seems beneficial, current 3D algorithms are still in their infancy. In this paper, we propose a novel molecular representation algorithm which preserves 3D conformations of molecules with a Molecular Hamiltonian Network (HamNet). In HamNet, implicit positions and momenta of atoms in a molecule interact in the Hamiltonian Engine following the discretized Hamiltonian equations. These implicit coordinates are supervised with real conformations via translation- and rotation-invariant losses, and are further used as inputs to the Fingerprint Generator, a message-passing neural network. Experiments show that the Hamiltonian Engine can well preserve molecular conformations, and that the fingerprints generated by HamNet achieve state-of-the-art performance on MoleculeNet, a standard molecular machine learning benchmark. 
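The discretized Hamiltonian dynamics the HamNet abstract describes can be illustrated with a standard leapfrog integrator for Hamilton's equations dq/dt = ∂H/∂p, dp/dt = -∂H/∂q. The harmonic potential below is a toy stand-in for the learned Hamiltonian, not the paper's model:

```python
import numpy as np

def leapfrog_step(q, p, grad_potential, dt=0.1, mass=1.0):
    """One leapfrog (kick-drift-kick) step for positions q and momenta p."""
    p_half = p - 0.5 * dt * grad_potential(q)          # half kick
    q_new = q + dt * p_half / mass                     # drift
    p_new = p_half - 0.5 * dt * grad_potential(q_new)  # half kick
    return q_new, p_new

# toy potential U(q) = 0.5*||q||^2: harmonic attraction of "atoms" to the origin
grad_U = lambda q: q

q = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # two atoms in 3D
p = np.zeros_like(q)
for _ in range(100):
    q, p = leapfrog_step(q, p, grad_U)
# the energy H = 0.5*||p||^2 + 0.5*||q||^2 stays approximately conserved
```

Leapfrog is symplectic, so the total energy oscillates within a small band instead of drifting, which is the property that makes Hamiltonian dynamics attractive for modeling conformations.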
\ No newline at end of file diff --git a/data/2021/iclr/Conservative Safety Critics for Exploration b/data/2021/iclr/Conservative Safety Critics for Exploration new file mode 100644 index 0000000000..abe26a3656 --- /dev/null +++ b/data/2021/iclr/Conservative Safety Critics for Exploration @@ -0,0 +1 @@ +Safe exploration presents a major challenge in reinforcement learning (RL): when active data collection requires deploying partially trained policies, we must ensure that these policies avoid catastrophically unsafe regions, while still enabling trial-and-error learning. In this paper, we target the problem of safe exploration in RL by learning a conservative safety estimate of environment states through a critic, and provably upper bound the likelihood of catastrophic failures at every training iteration. We theoretically characterize the tradeoff between safety and policy improvement, show that the safety constraints are likely to be satisfied with high probability during training, derive provable convergence guarantees for our approach, which is no worse asymptotically than standard RL, and demonstrate the efficacy of the proposed approach on a suite of challenging navigation, manipulation, and locomotion tasks. Empirically, we show that the proposed approach can achieve competitive task performance while incurring significantly lower catastrophic failure rates during training than prior methods. Videos are available at this https URL \ No newline at end of file diff --git a/data/2021/iclr/Contemplating Real-World Object Classification b/data/2021/iclr/Contemplating Real-World Object Classification new file mode 100644 index 0000000000..1c37baaaf1 --- /dev/null +++ b/data/2021/iclr/Contemplating Real-World Object Classification @@ -0,0 +1 @@ +Deep object recognition models have been very successful on benchmark datasets such as ImageNet. How accurate and robust are they under distribution shifts arising from natural and synthetic variations in datasets? 
Prior research on this problem has primarily focused on ImageNet variations (e.g., ImageNetV2, ImageNet-A). To avoid potential inherited biases in these studies, we take a different approach. Specifically, we reanalyze the ObjectNet dataset recently proposed by Barbu et al. containing objects in daily life situations. They showed a dramatic performance drop of state-of-the-art object recognition models on this dataset. Due to the importance and implications of their results regarding the generalization ability of deep models, we take a second look at their analysis. We find that applying deep models to the isolated objects, rather than the entire scene as is done in the original paper, results in around 20-30% performance improvement. Relative to the numbers reported in Barbu et al., around 10-15% of the performance loss is recovered, without any test time data augmentation. Despite this gain, however, we conclude that deep models still suffer drastically on the ObjectNet dataset. We also investigate the robustness of models against synthetic image perturbations such as geometric transformations (e.g., scale, rotation, translation), natural image distortions (e.g., impulse noise, blur) as well as adversarial attacks (e.g., FGSM and PGD-5). Our results indicate that limiting the object area as much as possible (i.e., from the entire image to the bounding box to the segmentation mask) leads to consistent improvement in accuracy and robustness. 
\ No newline at end of file diff --git a/data/2021/iclr/Contextual Dropout: An Efficient Sample-Dependent Dropout Module b/data/2021/iclr/Contextual Dropout: An Efficient Sample-Dependent Dropout Module new file mode 100644 index 0000000000..8684ca1b94 --- /dev/null +++ b/data/2021/iclr/Contextual Dropout: An Efficient Sample-Dependent Dropout Module @@ -0,0 +1 @@ +Dropout has been demonstrated as a simple and effective module to not only regularize the training process of deep neural networks, but also provide uncertainty estimation for prediction. However, the quality of uncertainty estimation is highly dependent on the dropout probabilities. Most current models use the same dropout distributions across all data samples for simplicity. Sample-dependent dropout, despite its potential gains in the flexibility of modeling uncertainty, is less explored, as it often encounters scalability issues or involves non-trivial model changes. In this paper, we propose contextual dropout with an efficient structural design as a simple and scalable sample-dependent dropout module, which can be applied to a wide range of models at the expense of only slightly increased memory and computational cost. We learn the dropout probabilities with a variational objective, compatible with both Bernoulli dropout and Gaussian dropout. We apply the contextual dropout module to various models with applications to image classification and visual question answering and demonstrate the scalability of the method with large-scale datasets, such as ImageNet and VQA 2.0. Our experimental results show that the proposed method outperforms baseline methods in terms of both accuracy and quality of uncertainty estimation. 
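The core idea of sample-dependent dropout can be conveyed with a toy NumPy sketch: a small "context" mapping turns each input into its own per-unit keep probabilities, which then gate the hidden layer. The sigmoid parameterization and shapes here are illustrative assumptions; the paper learns these probabilities with a variational objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contextual_dropout_layer(x, W_hidden, W_context, training=True):
    h = np.maximum(0.0, x @ W_hidden)        # ordinary hidden layer (ReLU)
    keep_prob = sigmoid(x @ W_context)       # per-sample, per-unit keep probabilities
    if not training:
        return h * keep_prob                 # use the expected mask at test time
    mask = rng.random(h.shape) < keep_prob   # Bernoulli dropout, sample-dependent
    return h * mask / np.clip(keep_prob, 1e-6, None)  # inverted-dropout rescaling

x = rng.normal(size=(4, 8))
W_h = 0.1 * rng.normal(size=(8, 16))
W_c = 0.1 * rng.normal(size=(8, 16))
out = contextual_dropout_layer(x, W_h, W_c)
```

Because `keep_prob` depends on `x`, each sample gets its own dropout distribution, in contrast to standard dropout's single global rate.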
\ No newline at end of file diff --git a/data/2021/iclr/Contextual Transformation Networks for Online Continual Learning b/data/2021/iclr/Contextual Transformation Networks for Online Continual Learning new file mode 100644 index 0000000000..8034aa81ca --- /dev/null +++ b/data/2021/iclr/Contextual Transformation Networks for Online Continual Learning @@ -0,0 +1 @@ +The results show that the behavioural cloning strategy is more suitable for alleviating forgetting in ER, while incurring lower memory overhead or faster running time than other alternatives. \ No newline at end of file diff --git a/data/2021/iclr/Continual learning in recurrent neural networks b/data/2021/iclr/Continual learning in recurrent neural networks new file mode 100644 index 0000000000..6fedf7f463 --- /dev/null +++ b/data/2021/iclr/Continual learning in recurrent neural networks @@ -0,0 +1 @@ +While a diverse collection of continual learning (CL) methods has been proposed to prevent catastrophic forgetting, a thorough investigation of their effectiveness for processing sequential data with recurrent neural networks (RNNs) is lacking. Here, we provide the first comprehensive evaluation of established CL methods on a variety of sequential data benchmarks. Specifically, we shed light on the particularities that arise when applying weight-importance methods, such as elastic weight consolidation, to RNNs. In contrast to feedforward networks, RNNs iteratively reuse a shared set of weights and require working memory to process input samples. We show that the performance of weight-importance methods is not directly affected by the length of the processed sequences, but rather by high working memory requirements, which lead to an increased need for stability at the cost of decreased plasticity for learning subsequent tasks. We additionally provide theoretical arguments supporting this interpretation by studying linear RNNs. 
Our study shows that established CL methods can be successfully ported to the recurrent case, and that a recent regularization approach based on hypernetworks outperforms weight-importance methods, thus emerging as a promising candidate for CL in RNNs. Overall, we provide insights on the differences between CL in feedforward networks and RNNs, while guiding towards effective solutions to tackle CL on sequential data. \ No newline at end of file diff --git a/data/2021/iclr/Continuous Wasserstein-2 Barycenter Estimation without Minimax Optimization b/data/2021/iclr/Continuous Wasserstein-2 Barycenter Estimation without Minimax Optimization new file mode 100644 index 0000000000..07f7364105 --- /dev/null +++ b/data/2021/iclr/Continuous Wasserstein-2 Barycenter Estimation without Minimax Optimization @@ -0,0 +1 @@ +Wasserstein barycenters provide a geometric notion of the weighted average of probability measures based on optimal transport. In this paper, we present a scalable algorithm to compute Wasserstein-2 barycenters given sample access to the input measures, which are not restricted to being discrete. While past approaches rely on entropic or quadratic regularization, we employ input convex neural networks and cycle-consistency regularization to avoid introducing bias. As a result, our approach does not resort to minimax optimization. We provide theoretical analysis on error bounds as well as empirical evidence of the effectiveness of the proposed approach in low-dimensional qualitative scenarios and high-dimensional quantitative experiments. 
\ No newline at end of file diff --git a/data/2021/iclr/Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning b/data/2021/iclr/Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning new file mode 100644 index 0000000000..5f67035f41 --- /dev/null +++ b/data/2021/iclr/Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning @@ -0,0 +1 @@ +Reinforcement learning methods trained on few environments rarely learn policies that generalize to unseen environments. To improve generalization, we incorporate the inherent sequential structure in reinforcement learning into the representation learning process. This approach is orthogonal to recent approaches, which rarely exploit this structure explicitly. Specifically, we introduce a theoretically motivated policy similarity metric (PSM) for measuring behavioral similarity between states. PSM assigns high similarity to states for which the optimal policies in those states as well as in future states are similar. We also present a contrastive representation learning procedure to embed any state similarity metric, which we instantiate with PSM to obtain policy similarity embeddings (PSEs). We demonstrate that PSEs improve generalization on diverse benchmarks, including LQR with spurious correlations, a jumping task from pixels, and Distracting DM Control Suite. \ No newline at end of file diff --git a/data/2021/iclr/Contrastive Divergence Learning is a Time Reversal Adversarial Game b/data/2021/iclr/Contrastive Divergence Learning is a Time Reversal Adversarial Game new file mode 100644 index 0000000000..713734420c --- /dev/null +++ b/data/2021/iclr/Contrastive Divergence Learning is a Time Reversal Adversarial Game @@ -0,0 +1 @@ +Contrastive divergence (CD) learning is a classical method for fitting unnormalized statistical models to data samples. 
Despite its widespread use, the convergence properties of this algorithm are still not well understood. The main source of difficulty is an unjustified approximation which has been used to derive the gradient of the loss. In this paper, we present an alternative derivation of CD that does not require any approximation and sheds new light on the objective that is actually being optimized by the algorithm. Specifically, we show that CD is an adversarial learning procedure, where a discriminator attempts to classify whether a Markov chain generated from the model has been time-reversed. Thus, although predating generative adversarial networks (GANs) by more than a decade, CD is, in fact, closely related to these techniques. Our derivation is consistent with previous observations, which have concluded that CD's update steps cannot be expressed as the gradients of any fixed objective function. In addition, as a byproduct, our derivation reveals a simple correction that can be used as an alternative to Metropolis-Hastings rejection, which is required when the underlying Markov chain is inexact (e.g., when using Langevin dynamics with a large step size). \ No newline at end of file diff --git a/data/2021/iclr/Contrastive Explanations for Reinforcement Learning via Embedded Self Predictions b/data/2021/iclr/Contrastive Explanations for Reinforcement Learning via Embedded Self Predictions new file mode 100644 index 0000000000..cacad5a583 --- /dev/null +++ b/data/2021/iclr/Contrastive Explanations for Reinforcement Learning via Embedded Self Predictions @@ -0,0 +1 @@ +We investigate a deep reinforcement learning (RL) architecture that supports explaining why a learned agent prefers one action over another. The key idea is to learn action-values that are directly represented via human-understandable properties of expected futures. This is realized via the embedded self-prediction (ESP) model, which learns said properties in terms of human-provided features. 
Action preferences can then be explained by contrasting the future properties predicted for each action. To address cases where there are a large number of features, we develop a novel method for computing minimal sufficient explanations from an ESP model. Our case studies in three domains, including a complex strategy game, show that ESP models can be effectively learned and support insightful explanations. \ No newline at end of file diff --git a/data/2021/iclr/Contrastive Learning with Adversarial Perturbations for Conditional Text Generation b/data/2021/iclr/Contrastive Learning with Adversarial Perturbations for Conditional Text Generation new file mode 100644 index 0000000000..08778e8263 --- /dev/null +++ b/data/2021/iclr/Contrastive Learning with Adversarial Perturbations for Conditional Text Generation @@ -0,0 +1 @@ +Recently, sequence-to-sequence (seq2seq) models with the Transformer architecture have achieved remarkable performance on various conditional text generation tasks, such as machine translation. However, most of them are trained with teacher forcing with the ground truth label given at each time step, without being exposed to incorrectly generated tokens during training, which hurts their generalization to unseen inputs; this is known as the ``exposure bias'' problem. In this work, we propose to mitigate the exposure bias problem in conditional text generation by contrasting positive pairs with negative pairs, such that the model is exposed to various valid or incorrect perturbations of the inputs, for improved generalization. However, training the model with a naive contrastive learning framework using random non-target sequences as negative examples is suboptimal, since they are easily distinguishable from the correct output, especially so with models pretrained with large text corpora. Also, generating positive examples requires domain-specific augmentation heuristics which may not generalize over diverse domains. 
To tackle this problem, we propose a principled method to generate positive and negative samples for contrastive learning of seq2seq models. Specifically, we generate negative examples by adding small perturbations to the input sequence to minimize its conditional likelihood, and positive examples by adding large perturbations while enforcing it to have a high conditional likelihood. Such ``hard'' positive and negative pairs generated using our method guide the model to better distinguish correct outputs from incorrect ones. We empirically show that our proposed method significantly improves the generalization of seq2seq models on three text generation tasks: machine translation, text summarization, and question generation. \ No newline at end of file diff --git a/data/2021/iclr/Contrastive Learning with Hard Negative Samples b/data/2021/iclr/Contrastive Learning with Hard Negative Samples new file mode 100644 index 0000000000..6b8601a0b9 --- /dev/null +++ b/data/2021/iclr/Contrastive Learning with Hard Negative Samples @@ -0,0 +1 @@ +How can you sample good negative examples for contrastive learning? We argue that, as with metric learning, contrastive learning of representations benefits from hard negative samples (i.e., points that are difficult to distinguish from an anchor point). The key challenge toward using hard negatives is that contrastive methods must remain unsupervised, making it infeasible to adopt existing negative sampling strategies that use true similarity information. In response, we develop a new family of unsupervised sampling methods for selecting hard negative samples where the user can control the hardness. A limiting case of this sampling results in a representation that tightly clusters each class, and pushes different classes as far apart as possible. The proposed method improves downstream performance across multiple modalities, requires only a few additional lines of code to implement, and introduces no computational overhead. 
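One way to realize the "user can control the hardness" idea from the hard-negatives abstract is to importance-weight negatives in proportion to exp(beta * similarity), so anchor-like negatives contribute more to the contrastive loss. This NumPy sketch conveys that mechanism under assumed parameter names; the paper's actual estimator includes debiasing details omitted here.

```python
import numpy as np

def hard_negative_nce(anchor, positive, negatives, beta=1.0, temperature=0.1):
    """NCE-style loss with hardness-weighted negatives; beta=0 recovers plain NCE."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    a, p, n = norm(anchor), norm(positive), norm(negatives)
    pos = np.exp((a @ p) / temperature)
    neg = np.exp((n @ a) / temperature)
    w = np.exp(beta * (n @ a))        # hardness weights grow with anchor similarity
    w = w / w.sum() * len(neg)        # normalize to mean one
    return -np.log(pos / (pos + (w * neg).sum()))

rng = np.random.default_rng(0)
a, p = rng.normal(size=16), rng.normal(size=16)
negs = rng.normal(size=(128, 16))
plain = hard_negative_nce(a, p, negs, beta=0.0)
hard = hard_negative_nce(a, p, negs, beta=2.0)  # harder negatives -> larger loss
```

Increasing `beta` tilts the effective negative distribution toward hard examples while the sampling itself stays unsupervised.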
\ No newline at end of file diff --git a/data/2021/iclr/Contrastive Syn-to-Real Generalization b/data/2021/iclr/Contrastive Syn-to-Real Generalization new file mode 100644 index 0000000000..9c4f3ce223 --- /dev/null +++ b/data/2021/iclr/Contrastive Syn-to-Real Generalization @@ -0,0 +1 @@ +Training on synthetic data can be beneficial for label- or data-scarce scenarios. However, synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work, we make a key observation that the diversity of the learned feature embeddings plays an important role in the generalization performance. To this end, we propose contrastive synthetic-to-real generalization (CSG), a novel framework that leverages the pre-trained ImageNet knowledge to prevent overfitting to the synthetic domain, while promoting the diversity of feature embeddings as an inductive bias to improve generalization. In addition, we enhance the proposed CSG framework with attentional pooling (A-pool) to let the model focus on semantically important regions and further improve its generalization. We demonstrate the effectiveness of CSG on various synthetic training tasks, exhibiting state-of-the-art performance on zero-shot domain generalization. \ No newline at end of file diff --git a/data/2021/iclr/Control-Aware Representations for Model-based Reinforcement Learning b/data/2021/iclr/Control-Aware Representations for Model-based Reinforcement Learning new file mode 100644 index 0000000000..3674287b86 --- /dev/null +++ b/data/2021/iclr/Control-Aware Representations for Model-based Reinforcement Learning @@ -0,0 +1 @@ +A major challenge in modern reinforcement learning (RL) is efficient control of dynamical systems from high-dimensional sensory observations. 
Learning controllable embedding (LCE) is a promising approach that addresses this challenge by embedding the observations into a lower-dimensional latent space, estimating the latent dynamics, and utilizing it to perform control in the latent space. Two important questions in this area are how to learn a representation that is amenable to the control problem at hand, and how to achieve an end-to-end framework for representation learning and control. In this paper, we take a few steps towards addressing these questions. We first formulate an LCE model to learn representations that are suitable for use by a policy-iteration-style algorithm in the latent space. We call this model control-aware representation learning (CARL). We derive a loss function for CARL that has a close connection to the prediction, consistency, and curvature (PCC) principle for representation learning. We derive three implementations of CARL. In the offline implementation, we replace the locally-linear control algorithm (e.g., iLQR) used by the existing LCE methods with an RL algorithm, namely model-based soft actor-critic, and show that it results in significant improvement. In online CARL, we interleave representation learning and control, and demonstrate a further gain in performance. Finally, we propose value-guided CARL, a variation in which we optimize a weighted version of the CARL loss function, where the weights depend on the TD-error of the current policy. We evaluate the proposed algorithms by extensive experiments on benchmark tasks and compare them with several LCE baselines. 
\ No newline at end of file diff --git a/data/2021/iclr/Convex Potential Flows: Universal Probability Distributions with Optimal Transport and Convex Optimization b/data/2021/iclr/Convex Potential Flows: Universal Probability Distributions with Optimal Transport and Convex Optimization new file mode 100644 index 0000000000..8c99b336e9 --- /dev/null +++ b/data/2021/iclr/Convex Potential Flows: Universal Probability Distributions with Optimal Transport and Convex Optimization @@ -0,0 +1 @@ +Flow-based models are powerful tools for designing probabilistic models with tractable density. This paper introduces Convex Potential Flows (CP-Flow), a natural and efficient parameterization of invertible models inspired by the optimal transport (OT) theory. CP-Flows are the gradient map of a strongly convex neural potential function. The convexity implies invertibility and allows us to resort to convex optimization to solve the convex conjugate for efficient inversion. To enable maximum likelihood training, we derive a new gradient estimator of the log-determinant of the Jacobian, which involves solving an inverse-Hessian vector product using the conjugate gradient method. The gradient estimator has constant-memory cost, and can be made effectively unbiased by reducing the error tolerance level of the convex optimization routine. Theoretically, we prove that CP-Flows are universal density approximators and are optimal in the OT sense. Our empirical results show that CP-Flow performs competitively on standard benchmarks of density estimation and variational inference. 
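The core CP-Flow mechanism, an invertible map given by the gradient of a strictly convex potential and inverted by solving a convex problem, can be demonstrated with a hand-written potential standing in for the paper's input-convex neural network. The potential and optimizer settings below are illustrative assumptions.

```python
import numpy as np

def grad_potential(x, a=0.5):
    # gradient of f(x) = 0.5*||x||^2 + a*sum(softplus(x)), a strictly convex potential,
    # so T(x) = grad f(x) is an invertible map
    return x + a / (1.0 + np.exp(-x))

def invert(y, steps=500, lr=0.2):
    # invert the gradient map by convex optimization:
    # minimize f(x) - <y, x>, whose unique minimizer satisfies grad f(x) = y
    x = np.zeros_like(y)
    for _ in range(steps):
        x -= lr * (grad_potential(x) - y)
    return x

x = np.array([0.3, -1.2, 2.0])
y = grad_potential(x)   # forward map T(x) = grad f(x)
x_rec = invert(y)       # inverse recovered without any explicit inverse formula
```

Strong convexity makes the inversion problem well-conditioned, which is why CP-Flow can rely on standard convex solvers (conjugate gradient in the paper) rather than architectural invertibility constraints.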
\ No newline at end of file diff --git a/data/2021/iclr/Convex Regularization behind Neural Reconstruction b/data/2021/iclr/Convex Regularization behind Neural Reconstruction new file mode 100644 index 0000000000..727272a14d --- /dev/null +++ b/data/2021/iclr/Convex Regularization behind Neural Reconstruction @@ -0,0 +1 @@ +Neural networks have shown tremendous potential for reconstructing high-resolution images in inverse problems. The non-convex and opaque nature of neural networks, however, hinders their utility in sensitive applications such as medical imaging. To cope with this challenge, this paper advocates a convex duality framework that makes a two-layer fully-convolutional ReLU denoising network amenable to convex optimization. The convex dual network not only offers the optimum training with convex solvers, but also facilitates interpreting training and prediction. In particular, it implies that training neural networks with weight-decay regularization induces path sparsity, while prediction amounts to piecewise linear filtering. A range of experiments with MNIST and fastMRI datasets confirm the efficacy of the dual network optimization problem. \ No newline at end of file diff --git a/data/2021/iclr/Coping with Label Shift via Distributionally Robust Optimisation b/data/2021/iclr/Coping with Label Shift via Distributionally Robust Optimisation new file mode 100644 index 0000000000..6781008af9 --- /dev/null +++ b/data/2021/iclr/Coping with Label Shift via Distributionally Robust Optimisation @@ -0,0 +1 @@ +The label shift problem refers to the supervised learning setting where the train and test label distributions do not match. Existing work addressing label shift usually assumes access to an \emph{unlabelled} test sample. This sample may be used to estimate the test label distribution, and to then train a suitably re-weighted classifier. 
While approaches using this idea have proven effective, their scope is limited, as it is not always feasible to access the target domain; further, they require repeated retraining if the model is to be deployed in \emph{multiple} test environments. Can one instead learn a \emph{single} classifier that is robust to arbitrary label shifts from a broad family? In this paper, we answer this question by proposing a model that minimises an objective based on distributionally robust optimisation (DRO). We then design and analyse a gradient descent-proximal mirror ascent algorithm tailored for large-scale problems to optimise the proposed objective. Finally, through experiments on CIFAR-100 and ImageNet, we show that our technique can significantly improve performance over a number of baselines in settings where label shift is present. \ No newline at end of file diff --git a/data/2021/iclr/CopulaGNN: Towards Integrating Representational and Correlational Roles of Graphs in Graph Neural Networks b/data/2021/iclr/CopulaGNN: Towards Integrating Representational and Correlational Roles of Graphs in Graph Neural Networks new file mode 100644 index 0000000000..12b3d629e9 --- /dev/null +++ b/data/2021/iclr/CopulaGNN: Towards Integrating Representational and Correlational Roles of Graphs in Graph Neural Networks @@ -0,0 +1 @@ +Graph-structured data are ubiquitous. However, graphs encode diverse types of information and thus play different roles in data representation. In this paper, we distinguish the \textit{representational} and the \textit{correlational} roles played by the graphs in node-level prediction tasks, and we investigate how Graph Neural Network (GNN) models can effectively leverage both types of information. Conceptually, the representational information provides guidance for the model to construct better node features; while the correlational information indicates the correlation between node outcomes conditional on node features. 
Through a simulation study, we find that many popular GNN models are incapable of effectively utilizing the correlational information. By leveraging the idea of the copula, a principled way to describe the dependence among multivariate random variables, we offer a general solution. The proposed Copula Graph Neural Network (CopulaGNN) can take a wide range of GNN models as base models and utilize both representational and correlational information stored in the graphs. Experimental results on two types of regression tasks verify the effectiveness of the proposed method. \ No newline at end of file diff --git a/data/2021/iclr/Correcting experience replay for multi-agent communication b/data/2021/iclr/Correcting experience replay for multi-agent communication new file mode 100644 index 0000000000..38c2eb717f --- /dev/null +++ b/data/2021/iclr/Correcting experience replay for multi-agent communication @@ -0,0 +1 @@ +We consider the problem of learning to communicate using multi-agent reinforcement learning (MARL). A common approach is to learn off-policy, using data sampled from a replay buffer. However, messages received in the past may not accurately reflect the current communication policy of each agent, and this complicates learning. We therefore introduce a 'communication correction' which accounts for the non-stationarity of observed communication induced by multi-agent learning. It works by relabelling the received message to make it likely under the communicator's current policy, and thus be a better reflection of the receiver's current environment. To account for cases in which agents are both senders and receivers, we introduce an ordered relabelling scheme. Our correction is computationally efficient and can be integrated with a range of off-policy algorithms. It substantially improves the ability of communicating MARL systems to learn across a variety of cooperative and competitive tasks. 
\ No newline at end of file diff --git a/data/2021/iclr/Counterfactual Generative Networks b/data/2021/iclr/Counterfactual Generative Networks new file mode 100644 index 0000000000..b6cd7022e9 --- /dev/null +++ b/data/2021/iclr/Counterfactual Generative Networks @@ -0,0 +1 @@ +Neural networks are prone to learning shortcuts -- they often model simple correlations, ignoring more complex ones that potentially generalize better. Prior works on image classification show that instead of learning a connection to object shape, deep classifiers tend to exploit spurious correlations with low-level texture or the background for solving the classification task. In this work, we take a step towards more robust and interpretable classifiers that explicitly expose the task's causal structure. Building on current advances in deep generative modeling, we propose to decompose the image generation process into independent causal mechanisms that we train without direct supervision. By exploiting appropriate inductive biases, these mechanisms disentangle object shape, object texture, and background; hence, they allow for generating counterfactual images. We demonstrate the ability of our model to generate such images on MNIST and ImageNet. Further, we show that the counterfactual images can improve out-of-distribution robustness with a marginal drop in performance on the original classification task, despite being synthetic. Lastly, our generative model can be trained efficiently on a single GPU, exploiting common pre-trained models as inductive biases. 
\ No newline at end of file diff --git a/data/2021/iclr/Coupled Oscillatory Recurrent Neural Network (coRNN): An accurate and (gradient) stable architecture for learning long time dependencies b/data/2021/iclr/Coupled Oscillatory Recurrent Neural Network (coRNN): An accurate and (gradient) stable architecture for learning long time dependencies new file mode 100644 index 0000000000..dc5ad8d507 --- /dev/null +++ b/data/2021/iclr/Coupled Oscillatory Recurrent Neural Network (coRNN): An accurate and (gradient) stable architecture for learning long time dependencies @@ -0,0 +1 @@ +Circuits of biological neurons, such as in the functional parts of the brain, can be modeled as networks of coupled oscillators. Inspired by the ability of these systems to express a rich set of outputs while keeping (gradients of) state variables bounded, we propose a novel architecture for recurrent neural networks. Our proposed RNN is based on a time-discretization of a system of second-order ordinary differential equations, modeling networks of controlled nonlinear oscillators. We prove precise bounds on the gradients of the hidden states, leading to the mitigation of the exploding and vanishing gradient problem for this RNN. Experiments show that the proposed RNN is comparable in performance to the state of the art on a variety of benchmarks, demonstrating the potential of this architecture to provide stable and accurate RNNs for processing complex sequential data. \ No newline at end of file diff --git a/data/2021/iclr/Creative Sketch Generation b/data/2021/iclr/Creative Sketch Generation new file mode 100644 index 0000000000..23402410f8 --- /dev/null +++ b/data/2021/iclr/Creative Sketch Generation @@ -0,0 +1 @@ +Sketching or doodling is a popular creative activity that people engage in. However, most existing work in automatic sketch understanding or generation has focused on sketches that are quite mundane. 
In this work, we introduce two datasets of creative sketches -- Creative Birds and Creative Creatures -- containing 10k sketches each along with part annotations. We propose DoodlerGAN -- a part-based Generative Adversarial Network (GAN) -- to generate unseen compositions of novel part appearances. Quantitative evaluations as well as human studies demonstrate that sketches generated by our approach are more creative and of higher quality than existing approaches. In fact, in Creative Birds, subjects prefer sketches generated by DoodlerGAN over those drawn by humans! Our code can be found at this https URL and a demo can be found at this http URL. \ No newline at end of file diff --git a/data/2021/iclr/Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization b/data/2021/iclr/Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization new file mode 100644 index 0000000000..2853167736 --- /dev/null +++ b/data/2021/iclr/Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization @@ -0,0 +1 @@ +Temporally localizing actions in videos is one of the key components for video understanding. Learning from weakly-labeled data is seen as a potential solution towards avoiding expensive frame-level annotations. Different from other works which only depend on the visual modality, we propose to learn a richer audio-visual representation for weakly-supervised action localization. First, we propose a multi-stage cross-attention mechanism to collaboratively fuse audio and visual features, which preserves the intra-modal characteristics. Second, to model both foreground and background frames, we construct an open-max classifier which treats the background class as an open-set. Third, for precise action localization, we design consistency losses to enforce temporal continuity for the action-class prediction, and also help with foreground-prediction reliability.
Extensive experiments on two publicly available video datasets (AVE and ActivityNet1.2) show that the proposed method effectively fuses audio and visual modalities, and achieves state-of-the-art results for weakly-supervised action localization. \ No newline at end of file diff --git a/data/2021/iclr/Cut out the annotator, keep the cutout: better segmentation with weak supervision b/data/2021/iclr/Cut out the annotator, keep the cutout: better segmentation with weak supervision new file mode 100644 index 0000000000..945c9b46d6 --- /dev/null +++ b/data/2021/iclr/Cut out the annotator, keep the cutout: better segmentation with weak supervision @@ -0,0 +1 @@ +. \ No newline at end of file diff --git a/data/2021/iclr/DARTS-: Robustly Stepping out of Performance Collapse Without Indicators b/data/2021/iclr/DARTS-: Robustly Stepping out of Performance Collapse Without Indicators new file mode 100644 index 0000000000..fd3de30396 --- /dev/null +++ b/data/2021/iclr/DARTS-: Robustly Stepping out of Performance Collapse Without Indicators @@ -0,0 +1 @@ +Despite the fast development of differentiable architecture search (DARTS), it suffers from a long-standing instability issue in search performance, which severely limits its applicability. Existing robustifying methods draw clues from the outcome instead of identifying the underlying cause. Various indicators such as Hessian eigenvalues are proposed as a signal of performance collapse, and the search is stopped once an indicator reaches a preset threshold. However, these methods tend to easily reject good architectures if thresholds are inappropriately set, especially since the search is intrinsically noisy. In this paper, we undertake a more subtle and direct approach to resolve the collapse. We first demonstrate that skip connections with a learnable architectural coefficient can easily recover from a disadvantageous state and become dominant.
We conjecture that skip connections profit too much from this privilege, hence causing the collapse of the derived model. Therefore, we propose to factor out this benefit with an auxiliary skip connection, ensuring a fairer competition for all operations. Extensive experiments on various datasets verify that our approach can substantially improve the robustness of DARTS. \ No newline at end of file diff --git a/data/2021/iclr/DC3: A learning method for optimization with hard constraints b/data/2021/iclr/DC3: A learning method for optimization with hard constraints new file mode 100644 index 0000000000..722d650fda --- /dev/null +++ b/data/2021/iclr/DC3: A learning method for optimization with hard constraints @@ -0,0 +1 @@ +Large optimization problems with hard constraints arise in many settings, yet classical solvers are often prohibitively slow, motivating the use of deep networks as cheap "approximate solvers." Unfortunately, naive deep learning approaches typically cannot enforce the hard constraints of such problems, leading to infeasible solutions. In this work, we present Deep Constraint Completion and Correction (DC3), an algorithm to address this challenge. Specifically, this method enforces feasibility via a differentiable procedure, which implicitly completes partial solutions to satisfy equality constraints and unrolls gradient-based corrections to satisfy inequality constraints. We demonstrate the effectiveness of DC3 in both synthetic optimization tasks and the real-world setting of AC optimal power flow, where hard constraints encode the physics of the electrical grid. In both cases, DC3 achieves near-optimal objective values while preserving feasibility.
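The inequality-correction step can be illustrated with a minimal sketch (ours, not the authors' code; it assumes linear constraints A x <= b and plain NumPy, whereas DC3 differentiates through the unrolled procedure and also completes equality constraints):

```python
import numpy as np

def correct_inequalities(x, A, b, lr=0.1, steps=200):
    """Drive x toward feasibility of A x <= b by descending the
    squared constraint violation 0.5 * ||max(A x - b, 0)||^2."""
    for _ in range(steps):
        viol = np.maximum(A @ x - b, 0.0)   # componentwise violation
        if not viol.any():
            break                           # already feasible
        x = x - lr * (A.T @ viol)           # gradient step on the violation
    return x

# One constraint x1 + x2 <= 1, starting from an infeasible point.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x = correct_inequalities(np.array([2.0, 2.0]), A, b)
```

Starting at (2, 2), the iterates move along the constraint normal and converge to the nearest feasible point (0.5, 0.5); in DC3 the same unrolled steps are applied to a network's output, so gradients flow through them at training time.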
\ No newline at end of file diff --git a/data/2021/iclr/DDPNOpt: Differential Dynamic Programming Neural Optimizer b/data/2021/iclr/DDPNOpt: Differential Dynamic Programming Neural Optimizer new file mode 100644 index 0000000000..f0e9aea20e --- /dev/null +++ b/data/2021/iclr/DDPNOpt: Differential Dynamic Programming Neural Optimizer @@ -0,0 +1 @@ +The interpretation of Deep Neural Network (DNN) training as an optimal control problem with nonlinear dynamical systems has received considerable attention recently, yet the algorithmic development remains relatively limited. In this work, we make an attempt along this line by reformulating the training procedure from the trajectory optimization perspective. We first show that most widely-used algorithms for training DNNs can be linked to Differential Dynamic Programming (DDP), a celebrated second-order trajectory optimization algorithm rooted in Approximate Dynamic Programming. In this vein, we propose a new variant of DDP that can accept batch optimization for training feedforward networks, while integrating naturally with the recent progress in curvature approximation. The resulting algorithm features layer-wise feedback policies which improve convergence rate and reduce sensitivity to hyper-parameters relative to existing methods. We show that the algorithm is competitive against state-of-the-art first- and second-order methods. Our work opens up new avenues for principled algorithmic design built upon optimal control theory.
\ No newline at end of file diff --git a/data/2021/iclr/DICE: Diversity in Deep Ensembles via Conditional Redundancy Adversarial Estimation b/data/2021/iclr/DICE: Diversity in Deep Ensembles via Conditional Redundancy Adversarial Estimation new file mode 100644 index 0000000000..fe1defa2c3 --- /dev/null +++ b/data/2021/iclr/DICE: Diversity in Deep Ensembles via Conditional Redundancy Adversarial Estimation @@ -0,0 +1 @@ +Deep ensembles perform better than a single network thanks to the diversity among their members. Recent approaches regularize predictions to increase diversity; however, they also drastically decrease individual members' performances. In this paper, we argue that learning strategies for deep ensembles need to tackle the trade-off between ensemble diversity and individual accuracies. Motivated by arguments from information theory and leveraging recent advances in neural estimation of conditional mutual information, we introduce a novel training criterion called DICE: it increases diversity by reducing spurious correlations among features. The main idea is that features extracted from pairs of members should only share information useful for target class prediction without being conditionally redundant. Therefore, besides the classification loss with information bottleneck, we adversarially prevent features from being conditionally predictable from each other. We manage to reduce simultaneous errors while protecting class information. We obtain state-of-the-art accuracy results on CIFAR-10/100: for example, an ensemble of 5 networks trained with DICE matches an ensemble of 7 networks trained independently. We further analyze the consequences on calibration, uncertainty estimation, out-of-distribution detection and online co-distillation. 
\ No newline at end of file diff --git a/data/2021/iclr/DINO: A Conditional Energy-Based GAN for Domain Translation b/data/2021/iclr/DINO: A Conditional Energy-Based GAN for Domain Translation new file mode 100644 index 0000000000..e2b4dc7f0d --- /dev/null +++ b/data/2021/iclr/DINO: A Conditional Energy-Based GAN for Domain Translation @@ -0,0 +1 @@ +Domain translation is the process of transforming data from one domain to another while preserving the common semantics. Some of the most popular domain translation systems are based on conditional generative adversarial networks, which use source domain data to drive the generator and as an input to the discriminator. However, this approach does not enforce the preservation of shared semantics since the conditional input can often be ignored by the discriminator. We propose an alternative method for conditioning and present a new framework, where two networks are simultaneously trained, in a supervised manner, to perform domain translation in opposite directions. Our method is not only better at capturing the shared information between two domains but is more generic and can be applied to a broader range of problems. The proposed framework performs well even in challenging cross-modal translations, such as video-driven speech reconstruction, for which other systems struggle to maintain correspondence. \ No newline at end of file diff --git a/data/2021/iclr/DOP: Off-Policy Multi-Agent Decomposed Policy Gradients b/data/2021/iclr/DOP: Off-Policy Multi-Agent Decomposed Policy Gradients new file mode 100644 index 0000000000..31b7049e3f --- /dev/null +++ b/data/2021/iclr/DOP: Off-Policy Multi-Agent Decomposed Policy Gradients @@ -0,0 +1 @@ +Multi-agent policy gradient (MAPG) methods have recently witnessed vigorous progress. However, there is a significant performance discrepancy between MAPG methods and state-of-the-art multi-agent value-based approaches.
In this paper, we investigate causes that hinder the performance of MAPG algorithms and present a multi-agent decomposed policy gradient method (DOP). This method introduces the idea of value function decomposition into the multi-agent actor-critic framework. Based on this idea, DOP supports efficient off-policy learning and addresses the issue of centralized-decentralized mismatch and credit assignment in both discrete and continuous action spaces. We formally show that DOP critics have sufficient representational capability to guarantee convergence. In addition, empirical evaluations on the StarCraft II micromanagement benchmark and multi-agent particle environments demonstrate that DOP significantly outperforms both state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms. Demonstrative videos are available at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning b/data/2021/iclr/Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning new file mode 100644 index 0000000000..864767c7ca --- /dev/null +++ b/data/2021/iclr/Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning @@ -0,0 +1 @@ +Dancing to music has been one of humans' innate abilities since ancient times. In machine learning research, however, synthesizing dance movements from music is a challenging problem. Recently, researchers have synthesized human motion sequences through autoregressive models like recurrent neural networks (RNNs). Such an approach often generates short sequences due to an accumulation of prediction errors that are fed back into the neural network. This problem becomes even more severe in long motion sequence generation. Moreover, the consistency between dance and music in terms of style, rhythm and beat is yet to be taken into account during modeling.
In this paper, we formalize music-conditioned dance generation as a sequence-to-sequence learning problem and devise a novel seq2seq architecture to efficiently process long sequences of music features and capture the fine-grained correspondence between music and dance. Furthermore, we propose a novel curriculum learning strategy to alleviate error accumulation of autoregressive models in long motion sequence generation, which gently changes the training process from a fully guided teacher-forcing scheme using the previous ground-truth movements, towards a less guided autoregressive scheme mostly using the generated movements instead. Extensive experiments show that our approach significantly outperforms the existing state of the art on both automatic metrics and human evaluation. We also make a demo video to demonstrate the superior performance of our proposed approach at https://www.youtube.com/watch?v=lmE20MEheZ8. \ No newline at end of file diff --git a/data/2021/iclr/Data-Efficient Reinforcement Learning with Self-Predictive Representations b/data/2021/iclr/Data-Efficient Reinforcement Learning with Self-Predictive Representations new file mode 100644 index 0000000000..ffc4c39fdd --- /dev/null +++ b/data/2021/iclr/Data-Efficient Reinforcement Learning with Self-Predictive Representations @@ -0,0 +1 @@ +While deep reinforcement learning excels at solving tasks where large amounts of data can be collected through virtually unlimited interaction with the environment, learning from limited interaction remains a key challenge. We posit that an agent can learn more efficiently if we augment reward maximization with self-supervised objectives based on structure in its visual input and sequential interaction with the environment. Our method, Self-Predictive Representations (SPR), trains an agent to predict its own latent state representations multiple steps into the future.
We compute target representations for future states using an encoder which is an exponential moving average of the agent's parameters, and we make predictions using a learned transition model. On its own, this future prediction objective outperforms prior methods for sample-efficient deep RL from pixels. We further improve performance by adding data augmentation to the future prediction loss, which forces the agent's representations to be consistent across multiple views of an observation. Our full self-supervised objective, which combines future prediction and data augmentation, achieves a median human-normalized score of 0.415 on Atari in a setting limited to 100k steps of environment interaction, which represents a 55% relative improvement over the previous state-of-the-art. Notably, even in this limited data regime, SPR exceeds expert human scores on 7 out of 26 games. The code associated with this work is available at https://github.com/mila-iqia/spr \ No newline at end of file diff --git a/data/2021/iclr/Dataset Condensation with Gradient Matching b/data/2021/iclr/Dataset Condensation with Gradient Matching new file mode 100644 index 0000000000..020419beec --- /dev/null +++ b/data/2021/iclr/Dataset Condensation with Gradient Matching @@ -0,0 +1 @@ +Efficient training of deep neural networks is an increasingly important problem in the era of sophisticated architectures and large-scale datasets. This paper proposes a training set synthesis technique, called Dataset Condensation, that learns to produce a small set of informative samples for training deep neural networks from scratch in a small fraction of the required computational cost on the original data while achieving comparable results. We rigorously evaluate its performance on several computer vision benchmarks and show that it significantly outperforms the state-of-the-art methods. Finally, we show promising applications of our method in continual learning and domain adaptation.
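A hedged toy rendition of the gradient-matching idea behind Dataset Condensation (ours, not the paper's code: a linear least-squares "network" with fixed weights and fixed synthetic labels, so the matching objective has a closed-form gradient): adjust the synthetic points until the training gradient they induce matches the one induced by the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: 100 points from a linear model y = X @ w_true.
w_true = np.array([1.0, -2.0])
X_real = rng.normal(size=(100, 2))
y_real = X_real @ w_true

# Condensed set: only 2 learnable points (labels kept fixed).
X_syn = rng.normal(size=(2, 2))
y_syn = np.array([1.0, -1.0])

w = np.zeros(2)  # weights at which the two gradients are compared

def grad_w(X, y, w):
    """Gradient of the loss 0.5 * ||X w - y||^2 with respect to w."""
    return X.T @ (X @ w - y)

g_real = grad_w(X_real, y_real, w)
for _ in range(1000):
    r = X_syn @ w - y_syn
    d = grad_w(X_syn, y_syn, w) - g_real      # gradient mismatch
    # analytic gradient of ||d||^2 w.r.t. X_syn: 2 (r d^T + (X_syn d) w^T)
    X_syn -= 1e-2 * 2.0 * (np.outer(r, d) + np.outer(X_syn @ d, w))
```

After the loop, training on the two synthetic points produces (at w) essentially the same gradient as training on all 100 real points; the paper applies the same principle with deep networks, iterating over many weight configurations along a training trajectory.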
\ No newline at end of file diff --git a/data/2021/iclr/Dataset Inference: Ownership Resolution in Machine Learning b/data/2021/iclr/Dataset Inference: Ownership Resolution in Machine Learning new file mode 100644 index 0000000000..b594e6910c --- /dev/null +++ b/data/2021/iclr/Dataset Inference: Ownership Resolution in Machine Learning @@ -0,0 +1 @@ +With increasingly more data and computation involved in their training, machine learning models constitute valuable intellectual property. This has spurred interest in model stealing, which is made more practical by advances in learning with partial, little, or no supervision. Existing defenses focus on inserting unique watermarks in a model's decision surface, but this is insufficient: the watermarks are not sampled from the training distribution and thus are not always preserved during model stealing. In this paper, we make the key observation that knowledge contained in the stolen model's training set is what is common to all stolen copies. The adversary's goal, irrespective of the attack employed, is always to extract this knowledge or its by-products. This gives the original model's owner a strong advantage over the adversary: model owners have access to the original training data. We thus introduce $dataset$ $inference$, the process of identifying whether a suspected model copy has private knowledge from the original model's dataset, as a defense against model stealing. We develop an approach for dataset inference that combines statistical testing with the ability to estimate the distance of multiple data points to the decision boundary. Our experiments on CIFAR10, SVHN, CIFAR100 and ImageNet show that model owners can claim with confidence greater than 99% that their model (or dataset as a matter of fact) was stolen, despite only exposing 50 of the stolen model's training points. Dataset inference defends against state-of-the-art attacks even when the adversary is adaptive. 
Unlike prior work, it does not require retraining or overfitting the defended model. \ No newline at end of file diff --git a/data/2021/iclr/Dataset Meta-Learning from Kernel Ridge-Regression b/data/2021/iclr/Dataset Meta-Learning from Kernel Ridge-Regression new file mode 100644 index 0000000000..650b6fdc3f --- /dev/null +++ b/data/2021/iclr/Dataset Meta-Learning from Kernel Ridge-Regression @@ -0,0 +1 @@ +One of the most fundamental aspects of any machine learning algorithm is the training data used by the algorithm. We introduce the novel concept of $\epsilon$-approximation of datasets, obtaining datasets which are much smaller than or are significant corruptions of the original training data while maintaining similar model performance. We introduce a meta-learning algorithm called Kernel Inducing Points (KIP) for obtaining such remarkable datasets, inspired by the recent developments in the correspondence between infinitely-wide neural networks and kernel ridge-regression (KRR). For KRR tasks, we demonstrate that KIP can compress datasets by one or two orders of magnitude, significantly improving previous dataset distillation and subset selection methods while obtaining state-of-the-art results for MNIST and CIFAR-10 classification. Furthermore, our KIP-learned datasets are transferable to the training of finite-width neural networks even beyond the lazy-training regime, which leads to state-of-the-art results for neural network dataset distillation with potential applications to privacy-preservation.
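The inner solver that KIP meta-learns against is plain kernel ridge-regression, which has a closed form; a self-contained sketch (with a generic RBF kernel standing in for the paper's infinite-width network kernels):

```python
import numpy as np

def rbf(X, Z, gamma=1.0):
    """RBF kernel matrix: k(x, z) = exp(-gamma * ||x - z||^2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_predict(X_support, y_support, X_query, reg=1e-6):
    """Closed-form kernel ridge-regression: alpha = (K + reg I)^{-1} y,
    then predictions are K_query_support @ alpha."""
    K = rbf(X_support, X_support)
    alpha = np.linalg.solve(K + reg * np.eye(len(X_support)), y_support)
    return rbf(X_query, X_support) @ alpha

# Two well-separated support points; KRR should interpolate their labels.
X_support = np.array([[0.0], [5.0]])
y_support = np.array([0.0, 1.0])
preds = krr_predict(X_support, y_support, X_support)
```

Roughly speaking, KIP treats the support set itself as the learnable object and optimizes it so that this closed-form predictor performs well on real data; the closed form is what makes the meta-objective differentiable end to end.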
\ No newline at end of file diff --git a/data/2021/iclr/DeLighT: Deep and Light-weight Transformer b/data/2021/iclr/DeLighT: Deep and Light-weight Transformer new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Deberta: decoding-Enhanced Bert with Disentangled Attention b/data/2021/iclr/Deberta: decoding-Enhanced Bert with Disentangled Attention new file mode 100644 index 0000000000..8ad1f6ef02 --- /dev/null +++ b/data/2021/iclr/Deberta: decoding-Enhanced Bert with Disentangled Attention @@ -0,0 +1 @@ +Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency of model pre-training and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and pre-trained models will be made publicly available at this https URL. 
\ No newline at end of file diff --git a/data/2021/iclr/Debiasing Concept-based Explanations with Causal Analysis b/data/2021/iclr/Debiasing Concept-based Explanations with Causal Analysis new file mode 100644 index 0000000000..907f4213c0 --- /dev/null +++ b/data/2021/iclr/Debiasing Concept-based Explanations with Causal Analysis @@ -0,0 +1 @@ +The concept-based explanation approach is a popular model-interpretability tool because it expresses the reasons for a model's predictions in terms of concepts that are meaningful to domain experts. In this work, we study the problem of the concepts being correlated with confounding information in the features. We propose a new causal prior graph for modeling the impacts of unobserved variables and a method to remove the impact of confounding information and noise using a two-stage regression technique borrowed from the instrumental variable literature. We also model the completeness of the concept set and show that our debiasing method works when the concepts are not complete. Our synthetic and real-world experiments demonstrate the success of our method in removing biases and improving the ranking of the concepts in terms of their contribution to the explanation of the predictions. \ No newline at end of file diff --git a/data/2021/iclr/Decentralized Attribution of Generative Models b/data/2021/iclr/Decentralized Attribution of Generative Models new file mode 100644 index 0000000000..ca2ba82bd4 --- /dev/null +++ b/data/2021/iclr/Decentralized Attribution of Generative Models @@ -0,0 +1 @@ +There have been growing concerns regarding the fabrication of content through generative models. This paper investigates the feasibility of decentralized attribution of such models. Given a set of generative models learned from the same dataset, attributability is achieved when a public verification service exists to correctly identify the source models for generated content.
Attribution allows tracing of machine-generated content back to its source model, thus facilitating IP protection and content regulation. Existing attribution methods are non-scalable with respect to the number of models and lack theoretical bounds on attributability. This paper studies decentralized attribution, where provable attributability can be achieved by only requiring each model to be distinguishable from the authentic data. Our major contributions are the derivation of the sufficient conditions for decentralized attribution and the design of keys following these conditions. Specifically, we show that decentralized attribution can be achieved when keys (1) are orthogonal to each other, and (2) belong to a subspace determined by the data distribution. This result is validated on MNIST and CelebA. Lastly, we use these datasets to examine the trade-off between generation quality and robust attributability against adversarial post-processing. \ No newline at end of file diff --git a/data/2021/iclr/Deciphering and Optimizing Multi-Task Learning: a Random Matrix Approach b/data/2021/iclr/Deciphering and Optimizing Multi-Task Learning: a Random Matrix Approach new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Deconstructing the Regularization of BatchNorm b/data/2021/iclr/Deconstructing the Regularization of BatchNorm new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Decoupling Global and Local Representations via Invertible Generative Flows b/data/2021/iclr/Decoupling Global and Local Representations via Invertible Generative Flows new file mode 100644 index 0000000000..15bbf0b6cf --- /dev/null +++ b/data/2021/iclr/Decoupling Global and Local Representations via Invertible Generative Flows @@ -0,0 +1 @@ +In this work, we propose a new generative model that is capable of automatically decoupling global and local representations of images in an entirely unsupervised setting, by embedding a generative flow
in the VAE framework to model the decoder. Specifically, the proposed model utilizes the variational auto-encoding framework to learn a (low-dimensional) vector of latent variables to capture the global information of an image, which is fed as a conditional input to a flow-based invertible decoder with architecture borrowed from style transfer literature. Experimental results on standard image benchmarks demonstrate the effectiveness of our model in terms of density estimation, image generation and unsupervised representation learning. Importantly, this work demonstrates that with only architectural inductive biases, a generative model with a likelihood-based objective is capable of learning decoupled representations, requiring no explicit supervision. The code for our model is available at https://github.com/XuezheMax/wolf . \ No newline at end of file diff --git a/data/2021/iclr/Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation b/data/2021/iclr/Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation new file mode 100644 index 0000000000..74a3cfb7b3 --- /dev/null +++ b/data/2021/iclr/Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation @@ -0,0 +1 @@ +Much recent effort has been invested in non-autoregressive neural machine translation, which appears to be an efficient alternative to state-of-the-art autoregressive machine translation on modern GPUs. In contrast to the latter, where generation is sequential, the former allows generation to be parallelized across target token positions. Some of the latest non-autoregressive models have achieved impressive translation quality-speed tradeoffs compared to autoregressive baselines. In this work, we reexamine this tradeoff and argue that autoregressive baselines can be substantially sped up without loss in accuracy. Specifically, we study autoregressive models with encoders and decoders of varied depths. 
Our extensive experiments show that given a sufficiently deep encoder, a single-layer autoregressive decoder can substantially outperform strong non-autoregressive models with comparable inference speed. We show that the speed disadvantage for autoregressive baselines compared to non-autoregressive methods has been overestimated in three aspects: suboptimal layer allocation, insufficient speed measurement, and lack of knowledge distillation. Our results establish a new protocol for future research toward fast, accurate machine translation. Our code is available at https://github.com/jungokasai/deep-shallow. \ No newline at end of file diff --git a/data/2021/iclr/Deep Equals Shallow for ReLU Networks in Kernel Regimes b/data/2021/iclr/Deep Equals Shallow for ReLU Networks in Kernel Regimes new file mode 100644 index 0000000000..93f0dec64c --- /dev/null +++ b/data/2021/iclr/Deep Equals Shallow for ReLU Networks in Kernel Regimes @@ -0,0 +1 @@ +Deep networks are often considered to be more expressive than shallow ones in terms of approximation. Indeed, certain functions can be approximated by deep networks provably more efficiently than by shallow ones, however, no tractable algorithms are known for learning such deep models. Separately, a recent line of work has shown that deep networks trained with gradient descent may behave like (tractable) kernel methods in a certain over-parameterized regime, where the kernel is determined by the architecture and initialization, and this paper focuses on approximation for such kernels. We show that for ReLU activations, the kernels derived from deep fully-connected networks have essentially the same approximation properties as their "shallow" two-layer counterpart, namely the same eigenvalue decay for the corresponding integral operator. This highlights the limitations of the kernel framework for understanding the benefits of such deep architectures. 
Our main theoretical result relies on characterizing such eigenvalue decays through differentiability properties of the kernel function, which also easily applies to the study of other kernels defined on the sphere. \ No newline at end of file diff --git a/data/2021/iclr/Deep Learning meets Projective Clustering b/data/2021/iclr/Deep Learning meets Projective Clustering new file mode 100644 index 0000000000..596685d7b1 --- /dev/null +++ b/data/2021/iclr/Deep Learning meets Projective Clustering @@ -0,0 +1,3 @@ +A common approach for compressing NLP networks is to encode the embedding layer as a matrix $A\in\mathbb{R}^{n\times d}$, compute its rank-$j$ approximation $A_j$ via SVD, and then factor $A_j$ into a pair of matrices that correspond to smaller fully-connected layers to replace the original embedding layer. Geometrically, the rows of $A$ represent points in $\mathbb{R}^d$, and the rows of $A_j$ represent their projections onto the $j$-dimensional subspace that minimizes the sum of squared distances ("errors") to the points. In practice, these rows of $A$ may be spread around $k>1$ subspaces, so factoring $A$ based on a single subspace may lead to large errors that turn into large drops in accuracy. +Inspired by \emph{projective clustering} from computational geometry, we suggest replacing this subspace by a set of $k$ subspaces, each of dimension $j$, that minimizes the sum of squared distances over every point (row in $A$) to its \emph{closest} subspace. Based on this approach, we provide a novel architecture that replaces the original embedding layer by a set of $k$ small layers that operate in parallel and are then recombined with a single fully-connected layer. +Extensive experimental results on the GLUE benchmark yield networks that are both more accurate and smaller compared to the standard matrix factorization (SVD). 
For example, we further compress DistilBERT by reducing the size of the embedding layer by $40\%$ while incurring only a $0.5\%$ average drop in accuracy over all nine GLUE tasks, compared to a $2.8\%$ drop using the existing SVD approach. On RoBERTa we achieve $43\%$ compression of the embedding layer with less than a $0.8\%$ average drop in accuracy as compared to a $3\%$ drop previously. Open code for reproducing and extending our results is provided. \ No newline at end of file diff --git a/data/2021/iclr/Deep Networks and the Multiple Manifold Problem b/data/2021/iclr/Deep Networks and the Multiple Manifold Problem new file mode 100644 index 0000000000..7e02c55b25 --- /dev/null +++ b/data/2021/iclr/Deep Networks and the Multiple Manifold Problem @@ -0,0 +1 @@ +We study the multiple manifold problem, a binary classification task modeled on applications in machine vision, in which a deep fully-connected neural network is trained to separate two low-dimensional submanifolds of the unit sphere. We provide an analysis of the one-dimensional case, proving for a simple manifold configuration that when the network depth $L$ is large relative to certain geometric and statistical properties of the data, the network width $n$ grows as a sufficiently large polynomial in $L$, and the number of i.i.d. samples from the manifolds is polynomial in $L$, randomly-initialized gradient descent rapidly learns to classify the two manifolds perfectly with high probability. Our analysis demonstrates concrete benefits of depth and width in the context of a practically-motivated model problem: the depth acts as a fitting resource, with larger depths corresponding to smoother networks that can more readily separate the class manifolds, and the width acts as a statistical resource, enabling concentration of the randomly-initialized network and its gradients. 
The argument centers around the neural tangent kernel and its role in the nonasymptotic analysis of training overparameterized neural networks; to this literature, we contribute essentially optimal rates of concentration for the neural tangent kernel of deep fully-connected networks, requiring width $n \gtrsim L\,\mathrm{poly}(d_0)$ to achieve uniform concentration of the initial kernel over a $d_0$-dimensional submanifold of the unit sphere $\mathbb{S}^{n_0-1}$, and a nonasymptotic framework for establishing generalization of networks trained in the NTK regime with structured data. The proof makes heavy use of martingale concentration to optimally treat statistical dependencies across layers of the initial random network. This approach should be of use in establishing similar results for other network architectures. \ No newline at end of file diff --git a/data/2021/iclr/Deep Neural Network Fingerprinting by Conferrable Adversarial Examples b/data/2021/iclr/Deep Neural Network Fingerprinting by Conferrable Adversarial Examples new file mode 100644 index 0000000000..af0629cfc4 --- /dev/null +++ b/data/2021/iclr/Deep Neural Network Fingerprinting by Conferrable Adversarial Examples @@ -0,0 +1 @@ +In Machine Learning as a Service, a provider trains a deep neural network and provides many users access. The hosted (source) model is susceptible to model stealing attacks, where an adversary derives a \emph{surrogate model} from API access to the source model. For post hoc detection of such attacks, the provider needs a robust method to determine whether a suspect model is a surrogate of their model. We propose a fingerprinting method for deep neural network classifiers that extracts a set of inputs from the source model so that only surrogates agree with the source model on the classification of such inputs. 
These inputs are a subclass of transferable adversarial examples which we call \emph{conferrable} adversarial examples that exclusively transfer with a target label from a source model to its surrogates. We propose a new method to generate these conferrable adversarial examples. We present an extensive study on the unremovability of our fingerprint against fine-tuning, weight pruning, retraining, retraining with different architectures, three model extraction attacks from related work, transfer learning, adversarial training, and two new adaptive attacks. Our fingerprint is robust against distillation, related model extraction attacks, and even transfer learning when the attacker has no access to the model provider's dataset. Our fingerprint is the first method that reaches an AUC of 1.0 in verifying surrogates, compared to an AUC of 0.63 by previous fingerprints. \ No newline at end of file diff --git a/data/2021/iclr/Deep Neural Tangent Kernel and Laplace Kernel Have the Same RKHS b/data/2021/iclr/Deep Neural Tangent Kernel and Laplace Kernel Have the Same RKHS new file mode 100644 index 0000000000..610cc2dc61 --- /dev/null +++ b/data/2021/iclr/Deep Neural Tangent Kernel and Laplace Kernel Have the Same RKHS @@ -0,0 +1 @@ +We prove that the reproducing kernel Hilbert spaces (RKHS) of a deep neural tangent kernel and the Laplace kernel include the same set of functions, when both kernels are restricted to the sphere $\mathbb{S}^{d-1}$. Additionally, we prove that the exponential power kernel with a smaller power (making the kernel more non-smooth) leads to a larger RKHS, when it is restricted to the sphere $\mathbb{S}^{d-1}$ and when it is defined on the entire $\mathbb{R}^d$. 
\ No newline at end of file diff --git a/data/2021/iclr/Deep Partition Aggregation: Provable Defenses against General Poisoning Attacks b/data/2021/iclr/Deep Partition Aggregation: Provable Defenses against General Poisoning Attacks new file mode 100644 index 0000000000..37cce9a7ed --- /dev/null +++ b/data/2021/iclr/Deep Partition Aggregation: Provable Defenses against General Poisoning Attacks @@ -0,0 +1 @@ +Adversarial poisoning attacks distort training data in order to corrupt the test-time behavior of a classifier. A provable defense provides a certificate for each test sample, which is a lower bound on the magnitude of any adversarial distortion of the training set that can corrupt the test sample's classification. We propose two provable defenses against poisoning attacks: (i) Deep Partition Aggregation (DPA), a certified defense against a general poisoning threat model, defined as the insertion or deletion of a bounded number of samples to the training set -- by implication, this threat model also includes arbitrary distortions to a bounded number of images and/or labels; and (ii) Semi-Supervised DPA (SS-DPA), a certified defense against label-flipping poisoning attacks. DPA is an ensemble method where base models are trained on partitions of the training set determined by a hash function. DPA is related to subset aggregation, a well-studied ensemble method in classical machine learning. DPA can also be viewed as an extension of randomized ablation (Levine & Feizi, 2020a), a certified defense against sparse evasion attacks, to the poisoning domain. Our label-flipping defense, SS-DPA, uses a semi-supervised learning algorithm as its base classifier model: we train each base classifier using the entire unlabeled training set in addition to the labels for a partition. SS-DPA outperforms the existing certified defense for label-flipping attacks (Rosenfeld et al., 2020). SS-DPA certifies >= 50% of test images against 675 label flips (vs. 
fewer label flips with the existing defense). Against general poisoning attacks, DPA certifies >= 50% of test images against > 500 poison image insertions on MNIST, and nine insertions on CIFAR-10. These results establish new state-of-the-art provable defenses against poisoning attacks. \ No newline at end of file diff --git a/data/2021/iclr/Deep Repulsive Clustering of Ordered Data Based on Order-Identity Decomposition b/data/2021/iclr/Deep Repulsive Clustering of Ordered Data Based on Order-Identity Decomposition new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients b/data/2021/iclr/Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients new file mode 100644 index 0000000000..7b24b2210a --- /dev/null +++ b/data/2021/iclr/Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients @@ -0,0 +1 @@ +Discovering the underlying mathematical expressions describing a dataset is a core challenge for artificial intelligence. This is the problem of $\textit{symbolic}$ $\textit{regression.}$ Despite recent advances in training neural networks to solve complex tasks, deep learning approaches to symbolic regression are underexplored. We propose a framework that combines deep learning with symbolic regression via a simple idea: use a large model to search the space of small models. More specifically, we use a recurrent neural network to emit a distribution over tractable mathematical expressions, and employ reinforcement learning to train the network to generate better-fitting expressions. Our algorithm significantly outperforms standard genetic programming-based symbolic regression in its ability to exactly recover symbolic expressions on a series of benchmark problems, both with and without added noise.
More broadly, our contributions include a framework that can be applied to optimize hierarchical, variable-length objects under a black-box performance metric, with the ability to incorporate a priori constraints in situ, and a risk-seeking policy gradient formulation that optimizes for best-case performance instead of expected performance. \ No newline at end of file diff --git a/data/2021/iclr/DeepAveragers: Offline Reinforcement Learning By Solving Derived Non-Parametric MDPs b/data/2021/iclr/DeepAveragers: Offline Reinforcement Learning By Solving Derived Non-Parametric MDPs new file mode 100644 index 0000000000..b4b595b127 --- /dev/null +++ b/data/2021/iclr/DeepAveragers: Offline Reinforcement Learning By Solving Derived Non-Parametric MDPs @@ -0,0 +1 @@ +We study an approach to offline reinforcement learning (RL) based on optimally solving finitely-represented MDPs derived from a static dataset of experience. This approach can be applied on top of any learned representation and has the potential to easily support multiple solution objectives as well as zero-shot adjustment to changing environments and goals. Our main contribution is to introduce the Deep Averagers with Costs MDP (DAC-MDP) and to investigate its solutions for offline RL. DAC-MDPs are a non-parametric model that can leverage deep representations and account for limited data by introducing costs for exploiting under-represented parts of the model. In theory, we show conditions that allow for lower-bounding the performance of DAC-MDP solutions. We also investigate the empirical behavior in a number of environments, including those with image-based observations. Overall, the experiments demonstrate that the framework can work in practice and scale to large complex offline RL problems. 
\ No newline at end of file diff --git a/data/2021/iclr/Deformable DETR: Deformable Transformers for End-to-End Object Detection b/data/2021/iclr/Deformable DETR: Deformable Transformers for End-to-End Object Detection new file mode 100644 index 0000000000..0a62c06a4c --- /dev/null +++ b/data/2021/iclr/Deformable DETR: Deformable Transformers for End-to-End Object Detection @@ -0,0 +1 @@ +DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10$\times$ fewer training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code shall be released. \ No newline at end of file diff --git a/data/2021/iclr/Degree-Quant: Quantization-Aware Training for Graph Neural Networks b/data/2021/iclr/Degree-Quant: Quantization-Aware Training for Graph Neural Networks new file mode 100644 index 0000000000..fa8de3c60e --- /dev/null +++ b/data/2021/iclr/Degree-Quant: Quantization-Aware Training for Graph Neural Networks @@ -0,0 +1 @@ +Graph neural networks (GNNs) have demonstrated strong performance on a wide variety of tasks due to their ability to model non-uniform structured data. Despite their promise, there exists little research exploring methods to make them more efficient at inference time. In this work, we explore the viability of training quantized GNNs, enabling the usage of low precision integer arithmetic during inference.
We identify the sources of error that uniquely arise when attempting to quantize GNNs, and propose an architecturally-agnostic method, Degree-Quant, to improve performance over existing quantization-aware training baselines commonly used on other architectures, such as CNNs. We validate our method on six datasets and show, unlike previous attempts, that models generalize to unseen graphs. Models trained with Degree-Quant for INT8 quantization perform as well as FP32 models in most cases; for INT4 models, we obtain up to 26% gains over the baselines. Our work enables up to 4.7x speedups on CPU when using INT8 arithmetic. \ No newline at end of file diff --git a/data/2021/iclr/Denoising Diffusion Implicit Models b/data/2021/iclr/Denoising Diffusion Implicit Models new file mode 100644 index 0000000000..2798c70a91 --- /dev/null +++ b/data/2021/iclr/Denoising Diffusion Implicit Models @@ -0,0 +1 @@ +Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a Markovian diffusion process. We construct a class of non-Markovian diffusion processes that lead to the same training objective, but whose reverse process can be much faster to sample from. We empirically demonstrate that DDIMs can produce high quality samples $10 \times$ to $50 \times$ faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space. 
\ No newline at end of file diff --git a/data/2021/iclr/Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization b/data/2021/iclr/Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization new file mode 100644 index 0000000000..a949bb61bb --- /dev/null +++ b/data/2021/iclr/Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization @@ -0,0 +1 @@ +Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become prohibitive to update the data-collection policy more than a few times during learning. With this view, we propose a novel concept of deployment efficiency, measuring the number of distinct data-collection policies that are used during policy learning. We observe that naively applying existing model-free offline RL algorithms recursively does not lead to a practical deployment-efficient and sample-efficient algorithm. We propose a novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN) that can effectively optimize a policy offline using 10-20 times fewer data than prior works. Furthermore, the recursive application of BREMEN is able to achieve impressive deployment efficiency while maintaining the same or better sample efficiency, learning successful policies from scratch on simulated robotic environments with only 5-10 deployments, compared to typical values of hundreds to millions in standard RL baselines. Codes and pre-trained models are available at this https URL . 
\ No newline at end of file diff --git a/data/2021/iclr/DialoGraph: Incorporating Interpretable Strategy-Graph Networks into Negotiation Dialogues b/data/2021/iclr/DialoGraph: Incorporating Interpretable Strategy-Graph Networks into Negotiation Dialogues new file mode 100644 index 0000000000..fc172dea3d --- /dev/null +++ b/data/2021/iclr/DialoGraph: Incorporating Interpretable Strategy-Graph Networks into Negotiation Dialogues @@ -0,0 +1 @@ +To successfully negotiate a deal, it is not enough to communicate fluently: pragmatic planning of persuasive negotiation strategies is essential. While modern dialogue agents excel at generating fluent sentences, they still lack pragmatic grounding and cannot reason strategically. We present DialoGraph, a negotiation system that incorporates pragmatic strategies in a negotiation dialogue using graph neural networks. DialoGraph explicitly incorporates dependencies between sequences of strategies to enable improved and interpretable prediction of next optimal strategies, given the dialogue context. Our graph-based method outperforms prior state-of-the-art negotiation models both in the accuracy of strategy/dialogue act prediction and in the quality of downstream dialogue response generation. We qualitatively show further benefits of learned strategy-graphs in providing explicit associations between effective negotiation strategies over the course of the dialogue, leading to interpretable and strategic dialogues. \ No newline at end of file diff --git a/data/2021/iclr/DiffWave: A Versatile Diffusion Model for Audio Synthesis b/data/2021/iclr/DiffWave: A Versatile Diffusion Model for Audio Synthesis new file mode 100644 index 0000000000..4013c626d3 --- /dev/null +++ b/data/2021/iclr/DiffWave: A Versatile Diffusion Model for Audio Synthesis @@ -0,0 +1 @@ +In this work, we propose DiffWave, a versatile Diffusion probabilistic model for conditional and unconditional Waveform generation. 
The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audios in Different Waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality~(MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations. \ No newline at end of file diff --git a/data/2021/iclr/Differentiable Segmentation of Sequences b/data/2021/iclr/Differentiable Segmentation of Sequences new file mode 100644 index 0000000000..75ff5185c3 --- /dev/null +++ b/data/2021/iclr/Differentiable Segmentation of Sequences @@ -0,0 +1 @@ +Segmented models are widely used to describe non-stationary sequential data with discrete change points. Their estimation usually requires solving a mixed discrete-continuous optimization problem, where the segmentation is the discrete part and all other model parameters are continuous. A number of estimation algorithms have been developed that are highly specialized for their specific model assumptions. The dependence on non-standard algorithms makes it hard to integrate segmented models in state-of-the-art deep learning architectures that critically depend on gradient-based optimization techniques. In this work, we formulate a relaxed variant of segmented models that enables joint estimation of all model parameters, including the segmentation, with gradient descent. 
We build on recent advances in learning continuous warping functions and propose a novel family of warping functions based on the two-sided power (TSP) distribution. TSP-based warping functions are differentiable, have simple closed-form expressions, and can represent segmentation functions exactly. Our formulation includes the important class of segmented generalized linear models as a special case, which makes it highly versatile. We use our approach to model the spread of COVID-19 by segmented Poisson regression, perform logistic regression on Fashion-MNIST with artificial concept drift, and demonstrate its capacities for phoneme segmentation. \ No newline at end of file diff --git a/data/2021/iclr/Differentiable Trust Region Layers for Deep Reinforcement Learning b/data/2021/iclr/Differentiable Trust Region Layers for Deep Reinforcement Learning new file mode 100644 index 0000000000..7f99893960 --- /dev/null +++ b/data/2021/iclr/Differentiable Trust Region Layers for Deep Reinforcement Learning @@ -0,0 +1 @@ +Trust region methods are a popular tool in reinforcement learning as they yield robust policy updates in continuous and discrete action spaces. However, enforcing such trust regions in deep reinforcement learning is difficult. Hence, many approaches, such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), are based on approximations. Due to those approximations, they violate the constraints or fail to find the optimal solution within the trust region. Moreover, they are difficult to implement, lack sufficient exploration, and have been shown to depend on seemingly unrelated implementation choices. In this work, we propose differentiable neural network layers to enforce trust regions for deep Gaussian policies via closed-form projections. Unlike existing methods, those layers formalize trust regions for each state individually and can complement existing reinforcement learning algorithms. 
We derive trust region projections based on the Kullback-Leibler divergence, the Wasserstein L2 distance, and the Frobenius norm for Gaussian distributions. We empirically demonstrate that those projection layers achieve similar or better results than existing methods while being almost agnostic to specific implementation choices. (Code: https://git.io/Jt3go) \ No newline at end of file diff --git a/data/2021/iclr/Differentially Private Learning Needs Better Features (or Much More Data) b/data/2021/iclr/Differentially Private Learning Needs Better Features (or Much More Data) new file mode 100644 index 0000000000..7d3ad9e49f --- /dev/null +++ b/data/2021/iclr/Differentially Private Learning Needs Better Features (or Much More Data) @@ -0,0 +1 @@ +We demonstrate that differentially private machine learning has not yet reached its "AlexNet moment" on many canonical vision tasks: linear models trained on handcrafted features significantly outperform end-to-end deep neural networks for moderate privacy budgets. To exceed the performance of handcrafted features, we show that private learning requires either much more private data, or access to features learned on public data from a similar domain. Our work introduces simple yet strong baselines for differentially private learning that can inform the evaluation of future progress in this area. \ No newline at end of file diff --git a/data/2021/iclr/Directed Acyclic Graph Neural Networks b/data/2021/iclr/Directed Acyclic Graph Neural Networks new file mode 100644 index 0000000000..a2498ce3b5 --- /dev/null +++ b/data/2021/iclr/Directed Acyclic Graph Neural Networks @@ -0,0 +1 @@ +Graph-structured data ubiquitously appears in science and engineering. Graph neural networks (GNNs) are designed to exploit the relational inductive bias exhibited in graphs; they have been shown to outperform other forms of neural networks in scenarios where structure information supplements node features.
The most common GNN architecture aggregates information from neighborhoods based on message passing. Its generality has made it broadly applicable. In this paper, we focus on a special, yet widely used, type of graphs -- DAGs -- and inject a stronger inductive bias -- partial ordering -- into the neural network design. We propose the \emph{directed acyclic graph neural network}, DAGNN, an architecture that processes information according to the flow defined by the partial order. DAGNN can be considered a framework that entails earlier works as special cases (e.g., models for trees and models updating node representations recurrently), but we identify several crucial components that prior architectures lack. We perform comprehensive experiments, including ablation studies, on representative DAG datasets (i.e., source code, neural architectures, and probabilistic graphical models) and demonstrate the superiority of DAGNN over simpler DAG architectures as well as general graph architectures. \ No newline at end of file diff --git a/data/2021/iclr/Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate b/data/2021/iclr/Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate new file mode 100644 index 0000000000..a98ed26947 --- /dev/null +++ b/data/2021/iclr/Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate @@ -0,0 +1 @@ +Understanding the algorithmic regularization effect of stochastic gradient descent (SGD) is one of the key challenges in modern machine learning and deep learning theory. Most of the existing works, however, focus on very small or even infinitesimal learning rate regime, and fail to cover practical scenarios where the learning rate is moderate and annealing. 
In this paper, we make an initial attempt to characterize the particular regularization effect of SGD in the moderate learning rate regime by studying its behavior for optimizing an overparameterized linear regression problem. In this case, SGD and GD are known to converge to the unique minimum-norm solution; however, with the moderate and annealing learning rate, we show that they exhibit different directional bias: SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions. Furthermore, we show that such directional bias does matter when early stopping is adopted, where the SGD output is nearly optimal but the GD output is suboptimal. Finally, our theory explains several folk arts in practice used for SGD hyperparameter tuning, such as (1) linearly scaling the initial learning rate with batch size; and (2) overrunning SGD with high learning rate even when the loss stops decreasing. \ No newline at end of file diff --git a/data/2021/iclr/Disambiguating Symbolic Expressions in Informal Documents b/data/2021/iclr/Disambiguating Symbolic Expressions in Informal Documents new file mode 100644 index 0000000000..35b863eda0 --- /dev/null +++ b/data/2021/iclr/Disambiguating Symbolic Expressions in Informal Documents @@ -0,0 +1 @@ +We propose the task of disambiguating symbolic expressions in informal STEM documents in the form of LaTeX files - that is, determining their precise semantics and abstract syntax tree - as a neural machine translation task. We discuss the distinct challenges involved and present a dataset with roughly 33,000 entries. We evaluated several baseline models on this dataset, which failed to yield even syntactically valid LaTeX before overfitting. Consequently, we describe a methodology using a transformer language model pre-trained on sources obtained from arxiv.org, which yields promising results despite the small size of the dataset. 
We evaluate our model using a plurality of dedicated techniques, taking the syntax and semantics of symbolic expressions into account. \ No newline at end of file diff --git a/data/2021/iclr/Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization b/data/2021/iclr/Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization new file mode 100644 index 0000000000..1f3f7e0ea8 --- /dev/null +++ b/data/2021/iclr/Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization @@ -0,0 +1 @@ +We propose a simple, general and effective technique, Reward Randomization for discovering diverse strategic policies in complex multi-agent games. Combining reward randomization and policy gradient, we derive a new algorithm, Reward-Randomized Policy Gradient (RPG). RPG is able to discover multiple distinctive human-interpretable strategies in challenging temporal trust dilemmas, including grid-world games and a real-world game Agar.io, where multiple equilibria exist but standard multi-agent policy gradient algorithms always converge to a fixed one with a sub-optimal payoff for every player even using state-of-the-art exploration techniques. Furthermore, with the set of diverse strategies from RPG, we can (1) achieve higher payoffs by fine-tuning the best policy from the set; and (2) obtain an adaptive agent by using this set of strategies as its training opponents. The source code and example videos can be found in our website: https://sites.google.com/view/staghuntrpg. 
\ No newline at end of file diff --git a/data/2021/iclr/Discovering Non-monotonic Autoregressive Orderings with Variational Inference b/data/2021/iclr/Discovering Non-monotonic Autoregressive Orderings with Variational Inference new file mode 100644 index 0000000000..e5169e2cbd --- /dev/null +++ b/data/2021/iclr/Discovering Non-monotonic Autoregressive Orderings with Variational Inference @@ -0,0 +1 @@ +The predominant approach for language modeling is to process sequences from left to right, but this eliminates a source of information: the order by which the sequence was generated. One strategy to recover this information is to decode both the content and ordering of tokens. Existing approaches supervise content and ordering by designing problem-specific loss functions and pre-training with an ordering pre-selected. Other recent works use iterative search to discover problem-specific orderings for training, but suffer from high time complexity and cannot be efficiently parallelized. We address these limitations with an unsupervised parallelizable learner that discovers high-quality generation orders purely from training data -- no domain knowledge required. The learner contains an encoder network and decoder language model that perform variational inference with autoregressive orders (represented as permutation matrices) as latent variables. The corresponding ELBO is not differentiable, so we develop a practical algorithm for end-to-end optimization using policy gradients. We implement the encoder as a Transformer with non-causal attention that outputs permutations in one forward pass. Permutations then serve as target generation orders for training an insertion-based Transformer language model. Empirical results in language modeling tasks demonstrate that our method is context-aware and discovers orderings that are competitive with or even better than fixed orders. 
\ No newline at end of file diff --git a/data/2021/iclr/Discovering a set of policies for the worst case reward b/data/2021/iclr/Discovering a set of policies for the worst case reward new file mode 100644 index 0000000000..86afc9a2ba --- /dev/null +++ b/data/2021/iclr/Discovering a set of policies for the worst case reward @@ -0,0 +1 @@ +We study the problem of how to construct a set of policies that can be composed together to solve a collection of reinforcement learning tasks. Each task is a different reward function defined as a linear combination of known features. We consider a specific class of policy compositions which we call set improving policies (SIPs): given a set of policies and a set of tasks, a SIP is any composition of the former whose performance is at least as good as that of its constituents across all the tasks. We focus on the most conservative instantiation of SIPs, set-max policies (SMPs), so our analysis extends to any SIP. This includes known policy-composition operators like generalized policy improvement. Our main contribution is a policy iteration algorithm that builds a set of policies in order to maximize the worst-case performance of the resulting SMP on the set of tasks. The algorithm works by successively adding new policies to the set. We show that the worst-case performance of the resulting SMP strictly improves at each iteration, and the algorithm only stops when there does not exist a policy that leads to improved performance. We empirically evaluate our algorithm on a grid world and also on a set of domains from the DeepMind control suite. We confirm our theoretical results regarding the monotonically improving performance of our algorithm. Interestingly, we also show empirically that the sets of policies computed by the algorithm are diverse, leading to different trajectories in the grid world and very distinct locomotion skills in the control suite. 
\ No newline at end of file diff --git a/data/2021/iclr/Discrete Graph Structure Learning for Forecasting Multiple Time Series b/data/2021/iclr/Discrete Graph Structure Learning for Forecasting Multiple Time Series new file mode 100644 index 0000000000..122af341ee --- /dev/null +++ b/data/2021/iclr/Discrete Graph Structure Learning for Forecasting Multiple Time Series @@ -0,0 +1 @@ +Time series forecasting is an extensively studied subject in statistics, economics, and computer science. Exploration of the correlation and causation among the variables in a multivariate time series shows promise in enhancing the performance of a time series model. When using deep neural networks as forecasting models, we hypothesize that exploiting the pairwise information among multiple (multivariate) time series also improves their forecast. If an explicit graph structure is known, graph neural networks (GNNs) have been demonstrated as powerful tools to exploit the structure. In this work, we propose learning the structure simultaneously with the GNN if the graph is unknown. We cast the problem as learning a probabilistic graph model through optimizing the mean performance over the graph distribution. The distribution is parameterized by a neural network so that discrete graphs can be sampled differentiably through reparameterization. Empirical evaluations show that our method is simpler, more efficient, and better performing than a recently proposed bilevel learning approach for graph structure learning, as well as a broad array of forecasting models, either deep or non-deep learning based, and graph or non-graph based. 
\ No newline at end of file diff --git a/data/2021/iclr/Disentangled Recurrent Wasserstein Autoencoder b/data/2021/iclr/Disentangled Recurrent Wasserstein Autoencoder new file mode 100644 index 0000000000..c5d946f9a0 --- /dev/null +++ b/data/2021/iclr/Disentangled Recurrent Wasserstein Autoencoder @@ -0,0 +1 @@ +Learning disentangled representations leads to interpretable models and facilitates data generation with style transfer, which has been extensively studied on static data such as images in an unsupervised learning framework. However, only a few works have explored unsupervised disentangled sequential representation learning due to challenges of generating sequential data. In this paper, we propose the Recurrent Wasserstein Autoencoder (R-WAE), a new framework for generative modeling of sequential data. R-WAE disentangles the representation of an input sequence into static and dynamic factors (i.e., time-invariant and time-varying parts). Our theoretical analysis shows that R-WAE minimizes an upper bound of a penalized form of the Wasserstein distance between the model distribution and the sequential data distribution, and simultaneously maximizes the mutual information between the input data and the different disentangled latent factors. This is superior to a (recurrent) VAE, which does not explicitly enforce mutual information maximization between input data and disentangled latent representations. When the number of actions in sequential data is available as weak supervision information, R-WAE is extended to learn a categorical latent representation of actions to improve its disentanglement. Experiments on a variety of datasets show that our models outperform other baselines with the same settings in terms of disentanglement and unconditional video generation, both quantitatively and qualitatively. 
\ No newline at end of file diff --git a/data/2021/iclr/Disentangling 3D Prototypical Networks for Few-Shot Concept Learning b/data/2021/iclr/Disentangling 3D Prototypical Networks for Few-Shot Concept Learning new file mode 100644 index 0000000000..239c012375 --- /dev/null +++ b/data/2021/iclr/Disentangling 3D Prototypical Networks for Few-Shot Concept Learning @@ -0,0 +1 @@ +We present neural architectures that disentangle RGB-D images into objects' shapes and styles and a map of the background scene, and explore their applications for few-shot 3D object detection and few-shot concept classification. Our networks incorporate architectural biases that reflect the image formation process, 3D geometry of the world scene, and shape-style interplay. They are trained end-to-end self-supervised by predicting views in static scenes, alongside a small number of 3D object boxes. Objects and scenes are represented in terms of 3D feature grids in the bottleneck of the network. We show that the proposed 3D neural representations are compositional: they can generate novel 3D scene feature maps by mixing object shapes and styles, resizing and adding the resulting object 3D feature maps over background scene feature maps. We show that classifiers for object categories, color, materials, and spatial relationships trained over the disentangled 3D feature sub-spaces generalize better with dramatically fewer examples than the current state-of-the-art, and enable a visual question answering system that uses them as its modules to generalize one-shot to novel objects in the scene. 
\ No newline at end of file diff --git a/data/2021/iclr/Distance-Based Regularisation of Deep Networks for Fine-Tuning b/data/2021/iclr/Distance-Based Regularisation of Deep Networks for Fine-Tuning new file mode 100644 index 0000000000..0081981f55 --- /dev/null +++ b/data/2021/iclr/Distance-Based Regularisation of Deep Networks for Fine-Tuning @@ -0,0 +1 @@ +We investigate approaches to regularisation during fine-tuning of deep neural networks. First we provide a neural network generalisation bound based on Rademacher complexity that uses the distance the weights have moved from their initial values. This bound has no direct dependence on the number of weights and compares favourably to other bounds when applied to convolutional networks. Our bound is highly relevant for fine-tuning, because providing a network with a good initialisation based on transfer learning means that learning can modify the weights less, and hence achieve tighter generalisation. Inspired by this, we develop a simple yet effective fine-tuning algorithm that constrains the hypothesis class to a small sphere centred on the initial pre-trained weights, thus obtaining provably better generalisation performance than conventional transfer learning. Empirical evaluation shows that our algorithm works well, corroborating our theoretical results. It outperforms both state of the art fine-tuning competitors, and penalty-based alternatives that we show do not directly constrain the radius of the search space. 
\ No newline at end of file diff --git a/data/2021/iclr/Distilling Knowledge from Reader to Retriever for Question Answering b/data/2021/iclr/Distilling Knowledge from Reader to Retriever for Question Answering new file mode 100644 index 0000000000..3a16de031e --- /dev/null +++ b/data/2021/iclr/Distilling Knowledge from Reader to Retriever for Question Answering @@ -0,0 +1 @@ +The task of information retrieval is an important component of many natural language processing systems, such as open domain question answering. While traditional methods were based on hand-crafted features, continuous representations based on neural networks recently obtained competitive results. A challenge of using such methods is to obtain supervised data to train the retriever model, corresponding to pairs of query and support documents. In this paper, we propose a technique to learn retriever models for downstream tasks, inspired by knowledge distillation, and which does not require annotated pairs of query and documents. Our approach leverages attention scores of a reader model, used to solve the task based on retrieved documents, to obtain synthetic labels for the retriever. We evaluate our method on question answering, obtaining state-of-the-art results. \ No newline at end of file diff --git a/data/2021/iclr/Distributed Momentum for Byzantine-resilient Stochastic Gradient Descent b/data/2021/iclr/Distributed Momentum for Byzantine-resilient Stochastic Gradient Descent new file mode 100644 index 0000000000..9faeca14b3 --- /dev/null +++ b/data/2021/iclr/Distributed Momentum for Byzantine-resilient Stochastic Gradient Descent @@ -0,0 +1 @@ +Byzantine-resilient Stochastic Gradient Descent (SGD) aims at shielding model training from Byzantine faults, be they ill-labeled training datapoints, exploited software/hardware vulnerabilities, or malicious worker nodes in a distributed setting. 
Two recent attacks, however, have been challenging state-of-the-art defenses, often successfully precluding the model from even fitting the training set. The main identified weakness in current defenses is their requirement of a sufficiently low variance-norm ratio for the stochastic gradients. We propose a practical method which, despite increasing the variance, reduces the variance-norm ratio, mitigating the identified weakness. We assess the effectiveness of our method over 736 different training configurations, comprising the 2 state-of-the-art attacks and 6 defenses. For confidence and reproducibility purposes, each configuration is run 5 times with specified seeds (1 to 5), totalling 3680 runs. In our experiments, when the attack is effective enough to decrease the highest observed top-1 cross-accuracy by at least 20% compared to the unattacked run, our technique systematically increases the highest observed accuracy back, and is able to recover at least 20% in more than 60% of the cases. \ No newline at end of file diff --git a/data/2021/iclr/Distributional Sliced-Wasserstein and Applications to Generative Modeling b/data/2021/iclr/Distributional Sliced-Wasserstein and Applications to Generative Modeling new file mode 100644 index 0000000000..5f7aca8e01 --- /dev/null +++ b/data/2021/iclr/Distributional Sliced-Wasserstein and Applications to Generative Modeling @@ -0,0 +1 @@ +Sliced-Wasserstein distance (SWD) and its variation, Max Sliced-Wasserstein distance (Max-SWD), have been widely used in recent years due to their fast computation and scalability when the probability measures lie in very high dimension. However, these distances still have weaknesses: SWD requires a large number of projection samples because it uses the uniform distribution to sample projecting directions, while Max-SWD uses only one projection, causing it to lose a large amount of information. 
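The Byzantine-resilient SGD abstract above proposes smoothing each worker's stochastic gradient before it reaches the robust aggregator, trading extra variance for a lower variance-norm ratio. A toy sketch of that pipeline; the coordinate-wise median aggregator and all names here are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: workers apply momentum locally and send the smoothed vector;
# the server combines submissions with a robust rule (here, a
# coordinate-wise median) instead of a plain average.

def worker_momentum(state, grad, beta=0.9):
    """Update one worker's momentum buffer; the result is what it sends."""
    return [beta * s + (1 - beta) * g for s, g in zip(state, grad)]

def coordinate_median(vectors):
    """Robust aggregation: per-coordinate median across workers."""
    agg = []
    for coords in zip(*vectors):
        s = sorted(coords)
        n, mid = len(coords), len(coords) // 2
        agg.append(s[mid] if n % 2 else 0.5 * (s[mid - 1] + s[mid]))
    return agg

honest = [[1.0, 1.0], [1.1, 0.9]]
byzantine = [[100.0, -100.0]]                    # one malicious submission
update = coordinate_median(honest + byzantine)   # outlier is ignored
```

With an honest majority, the median tracks the honest workers' (momentum-smoothed) gradients even when a Byzantine worker submits arbitrary values.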
In this paper, we propose a novel distance that finds an optimal penalized probability measure over the slices, named the Distributional Sliced-Wasserstein distance (DSWD). We show that the DSWD is a generalization of both SWD and Max-SWD, and that the proposed distance can be found by searching for the push-forward measure over a set of measures satisfying certain constraints. Moreover, similar to SWD, we can extend the Generalized Sliced-Wasserstein distance (GSWD) to a Distributional Generalized Sliced-Wasserstein distance (DGSWD). Finally, we carry out extensive experiments to demonstrate the favorable generative modeling performance of our distances over the previous sliced-based distances on large-scale real datasets. \ No newline at end of file diff --git a/data/2021/iclr/Diverse Video Generation using a Gaussian Process Trigger b/data/2021/iclr/Diverse Video Generation using a Gaussian Process Trigger new file mode 100644 index 0000000000..fb92f04ad7 --- /dev/null +++ b/data/2021/iclr/Diverse Video Generation using a Gaussian Process Trigger @@ -0,0 +1 @@ +Generating future frames given a few context (or past) frames is a challenging task. It requires modeling the temporal coherence of videos and multi-modality in terms of diversity in the potential future states. Current variational approaches for video generation tend to marginalize over multi-modal future outcomes. Instead, we propose to explicitly model the multi-modality in the future outcomes and leverage it to sample diverse futures. Our approach, Diverse Video Generator, uses a Gaussian Process (GP) to learn priors on future states given the past and maintains a probability distribution over possible futures given a particular sample. In addition, we leverage the changes in this distribution over time to control the sampling of diverse future states by estimating the end of ongoing sequences. That is, we use the variance of the GP over the output function space to trigger a change in an action sequence. 
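The sliced-Wasserstein building block that DSWD generalizes can be illustrated in a few lines: project both sample sets onto a direction, where the 1-D Wasserstein-1 distance reduces to comparing sorted projections, then average over directions. This is a plain-SWD Monte Carlo sketch under simplifying assumptions (equal sample counts, Gaussian-sampled directions), not the paper's implementation.

```python
# Sketch of sliced Wasserstein-1 between two finite sample sets.
import math
import random

def w1_1d(xs, ys):
    """1-D W1 between equal-size samples: mean gap of sorted values."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

def sliced_w1(X, Y, n_dirs=64, rng=random):
    """Average 1-D W1 over random unit projection directions."""
    total = 0.0
    for _ in range(n_dirs):
        theta = [rng.gauss(0.0, 1.0) for _ in X[0]]
        norm = math.sqrt(sum(t * t for t in theta)) or 1.0
        theta = [t / norm for t in theta]
        px = [sum(a * t for a, t in zip(x, theta)) for x in X]
        py = [sum(a * t for a, t in zip(y, theta)) for y in Y]
        total += w1_1d(px, py)
    return total / n_dirs

random.seed(0)
X = [[0.0, 0.0], [1.0, 1.0]]
d_same = sliced_w1(X, X)                           # identical sets -> 0
d_shift = sliced_w1(X, [[5.0, 5.0], [6.0, 6.0]])   # clearly positive
```

Max-SWD would replace the average over directions with a maximum; DSWD instead learns a distribution over directions.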
We achieve state-of-the-art results on diverse future frame generation in terms of reconstruction quality and diversity of the generated sequences. \ No newline at end of file diff --git a/data/2021/iclr/Do 2D GANs Know 3D Shape? Unsupervised 3D Shape Reconstruction from 2D Image GANs b/data/2021/iclr/Do 2D GANs Know 3D Shape? Unsupervised 3D Shape Reconstruction from 2D Image GANs new file mode 100644 index 0000000000..0d22759cad --- /dev/null +++ b/data/2021/iclr/Do 2D GANs Know 3D Shape? Unsupervised 3D Shape Reconstruction from 2D Image GANs @@ -0,0 +1 @@ +Natural images are projections of 3D objects on a 2D image plane. While state-of-the-art 2D generative models like GANs show unprecedented quality in modeling the natural image manifold, it is unclear whether they implicitly capture the underlying 3D object structures. And if so, how could we exploit such knowledge to recover the 3D shapes of objects in the images? To answer these questions, in this work, we present the first attempt to directly mine 3D geometric clues from an off-the-shelf 2D GAN that is trained on RGB images only. Through our investigation, we found that such a pre-trained GAN indeed contains rich 3D knowledge and thus can be used to recover 3D shape from a single 2D image in an unsupervised manner. The core of our framework is an iterative strategy that explores and exploits diverse viewpoint and lighting variations in the GAN image manifold. The framework does not require 2D keypoint or 3D annotations, or strong assumptions on object shapes (e.g. shapes are symmetric), yet it successfully recovers 3D shapes with high precision for human faces, cats, cars, and buildings. The recovered 3D shapes immediately allow high-quality image editing like relighting and object rotation. We quantitatively demonstrate the effectiveness of our approach compared to previous methods in both 3D shape reconstruction and face rotation. Our code and models will be released at this https URL. 
\ No newline at end of file diff --git a/data/2021/iclr/Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth b/data/2021/iclr/Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth new file mode 100644 index 0000000000..8a6762cacf --- /dev/null +++ b/data/2021/iclr/Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth @@ -0,0 +1 @@ +A key factor in the success of deep neural networks is the ability to scale models to improve performance by varying the architecture depth and width. This simple property of neural network design has resulted in highly effective architectures for a variety of tasks. Nevertheless, there is limited understanding of effects of depth and width on the learned representations. In this paper, we study this fundamental question. We begin by investigating how varying depth and width affects model hidden representations, finding a characteristic block structure in the hidden representations of larger capacity (wider or deeper) models. We demonstrate that this block structure arises when model capacity is large relative to the size of the training set, and is indicative of the underlying layers preserving and propagating the dominant principal component of their representations. This discovery has important ramifications for features learned by different models, namely, representations outside the block structure are often similar across architectures with varying widths and depths, but the block structure is unique to each model. We analyze the output predictions of different model architectures, finding that even when the overall accuracy is similar, wide and deep models exhibit distinctive error patterns and variations across classes. 
\ No newline at end of file diff --git a/data/2021/iclr/Do not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning b/data/2021/iclr/Do not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning new file mode 100644 index 0000000000..faced88ad9 --- /dev/null +++ b/data/2021/iclr/Do not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning @@ -0,0 +1 @@ +The privacy leakage of the model about the training data can be bounded in the differential privacy mechanism. However, for meaningful privacy parameters, a differentially private model degrades the utility drastically when the model comprises a large number of trainable parameters. In this paper, we propose an algorithm \emph{Gradient Embedding Perturbation (GEP)} towards training differentially private deep models with decent accuracy. Specifically, in each gradient descent step, GEP first projects individual private gradient into a non-sensitive anchor subspace, producing a low-dimensional gradient embedding and a small-norm residual gradient. Then, GEP perturbs the low-dimensional embedding and the residual gradient separately according to the privacy budget. Such a decomposition permits a small perturbation variance, which greatly helps to break the dimensional barrier of private learning. With GEP, we achieve decent accuracy with reasonable computational cost and modest privacy guarantee for deep models. Especially, with privacy bound $\epsilon=8$, we achieve $74.9\%$ test accuracy on CIFAR10 and $95.1\%$ test accuracy on SVHN, significantly improving over existing results. \ No newline at end of file diff --git a/data/2021/iclr/Does enhanced shape bias improve neural network robustness to common corruptions? b/data/2021/iclr/Does enhanced shape bias improve neural network robustness to common corruptions? 
new file mode 100644 index 0000000000..d77bd88bfa --- /dev/null +++ b/data/2021/iclr/Does enhanced shape bias improve neural network robustness to common corruptions? @@ -0,0 +1 @@ +Convolutional neural networks (CNNs) learn to extract representations of complex features, such as object shapes and textures to solve image recognition tasks. Recent work indicates that CNNs trained on ImageNet are biased towards features that encode textures and that these alone are sufficient to generalize to unseen test data from the same distribution as the training data but often fail to generalize to out-of-distribution data. It has been shown that augmenting the training data with different image styles decreases this texture bias in favor of increased shape bias while at the same time improving robustness to common corruptions, such as noise and blur. Commonly, this is interpreted as shape bias increasing corruption robustness. However, this relationship is only hypothesized. We perform a systematic study of different ways of composing inputs based on natural images, explicit edge information, and stylization. While stylization is essential for achieving high corruption robustness, we do not find a clear correlation between shape bias and robustness. We conclude that the data augmentation caused by style-variation accounts for the improved corruption robustness and increased shape bias is only a byproduct. \ No newline at end of file diff --git a/data/2021/iclr/Domain Generalization with MixStyle b/data/2021/iclr/Domain Generalization with MixStyle new file mode 100644 index 0000000000..8a88a4625a --- /dev/null +++ b/data/2021/iclr/Domain Generalization with MixStyle @@ -0,0 +1 @@ +Though convolutional neural networks (CNNs) have demonstrated remarkable ability in learning discriminative features, they often generalize poorly to unseen domains. 
Domain generalization aims to address this problem by learning from a set of source domains a model that is generalizable to any unseen domain. In this paper, a novel approach is proposed based on probabilistically mixing instance-level feature statistics of training samples across source domains. Our method, termed MixStyle, is motivated by the observation that visual domain is closely related to image style (e.g., photo vs.~sketch images). Such style information is captured by the bottom layers of a CNN where our proposed style-mixing takes place. Mixing styles of training instances results in novel domains being synthesized implicitly, which increase the domain diversity of the source domains, and hence the generalizability of the trained model. MixStyle fits into mini-batch training perfectly and is extremely easy to implement. The effectiveness of MixStyle is demonstrated on a wide range of tasks including category classification, instance retrieval and reinforcement learning. \ No newline at end of file diff --git a/data/2021/iclr/Domain-Robust Visual Imitation Learning with Mutual Information Constraints b/data/2021/iclr/Domain-Robust Visual Imitation Learning with Mutual Information Constraints new file mode 100644 index 0000000000..bd1bd60fbf --- /dev/null +++ b/data/2021/iclr/Domain-Robust Visual Imitation Learning with Mutual Information Constraints @@ -0,0 +1 @@ +Human beings are able to understand objectives and learn by simply observing others perform a task. Imitation learning methods aim to replicate such capabilities, however, they generally depend on access to a full set of optimal states and actions taken with the agent's actuators and from the agent's point of view. In this paper, we introduce a new algorithm - called Disentangling Generative Adversarial Imitation Learning (DisentanGAIL) - with the purpose of bypassing such constraints. 
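The style-mixing operation at the core of MixStyle, described above, can be sketched in one dimension. This toy version is illustrative only: the real method operates on per-channel mean/std of CNN feature maps for instances in a mini-batch, with the mixing weight drawn randomly during training.

```python
# Sketch: normalise one instance's features, then re-scale/shift them
# with statistics interpolated between that instance and another one.
import math

def stats(x, eps=1e-6):
    m = sum(x) / len(x)
    v = sum((xi - m) ** 2 for xi in x) / len(x)
    return m, math.sqrt(v + eps)

def mixstyle(x, y, lam):
    """Content of x, restyled with a lam-blend of x's and y's statistics."""
    mx, sx = stats(x)
    my, sy = stats(y)
    m_mix = lam * mx + (1 - lam) * my
    s_mix = lam * sx + (1 - lam) * sy
    return [((xi - mx) / sx) * s_mix + m_mix for xi in x]

x = [0.0, 2.0]               # mean 1, std 1
y = [10.0, 14.0]             # mean 12, std 2
z = mixstyle(x, y, lam=0.5)  # mixed mean 6.5, mixed std 1.5
```

Because only the statistics change, the instance's content is preserved while its "style" is shifted toward the other instance, implicitly synthesizing a novel domain.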
Our algorithm enables autonomous agents to learn directly from high-dimensional observations of an expert performing a task, by making use of adversarial learning with a latent representation inside the discriminator network. This latent representation is regularized through mutual information constraints to incentivize learning only features that encode information about the completion levels of the task being demonstrated. This makes it possible to obtain a shared feature space in which to successfully perform imitation while disregarding the differences between the expert's and the agent's domains. Empirically, our algorithm is able to efficiently imitate in a diverse range of control problems including balancing, manipulation and locomotion tasks, while being robust to various domain differences in terms of both environment appearance and agent embodiment. \ No newline at end of file diff --git a/data/2021/iclr/DrNAS: Dirichlet Neural Architecture Search b/data/2021/iclr/DrNAS: Dirichlet Neural Architecture Search new file mode 100644 index 0000000000..97d458e55c --- /dev/null +++ b/data/2021/iclr/DrNAS: Dirichlet Neural Architecture Search @@ -0,0 +1 @@ +This paper proposes a novel differentiable architecture search method by formulating it as a distribution learning problem. We treat the continuously relaxed architecture mixing weights as random variables modeled by a Dirichlet distribution. With recently developed pathwise derivatives, the Dirichlet parameters can be easily optimized with a gradient-based optimizer in an end-to-end manner. This formulation improves the generalization ability and induces stochasticity that naturally encourages exploration in the search space. Furthermore, to alleviate the large memory consumption of differentiable NAS, we propose a simple yet effective progressive learning scheme that enables searching directly on large-scale tasks, eliminating the gap between the search and evaluation phases. 
Extensive experiments demonstrate the effectiveness of our method. Specifically, we obtain a test error of 2.46% for CIFAR-10, 23.7% for ImageNet under the mobile setting. On NAS-Bench-201, we also achieve state-of-the-art results on all three datasets and provide insights for the effective design of neural architecture search algorithms. \ No newline at end of file diff --git a/data/2021/iclr/Drop-Bottleneck: Learning Discrete Compressed Representation for Noise-Robust Exploration b/data/2021/iclr/Drop-Bottleneck: Learning Discrete Compressed Representation for Noise-Robust Exploration new file mode 100644 index 0000000000..4a2c76fb20 --- /dev/null +++ b/data/2021/iclr/Drop-Bottleneck: Learning Discrete Compressed Representation for Noise-Robust Exploration @@ -0,0 +1 @@ +We propose a novel information bottleneck (IB) method named Drop-Bottleneck, which discretely drops features that are irrelevant to the target variable. Drop-Bottleneck not only enjoys a simple and tractable compression objective but also additionally provides a deterministic compressed representation of the input variable, which is useful for inference tasks that require consistent representation. Moreover, it can jointly learn a feature extractor and select features considering each feature dimension's relevance to the target task, which is unattainable by most neural network-based IB methods. We propose an exploration method based on Drop-Bottleneck for reinforcement learning tasks. In a multitude of noisy and reward sparse maze navigation tasks in VizDoom (Kempka et al., 2016) and DMLab (Beattie et al., 2016), our exploration method achieves state-of-the-art performance. As a new IB framework, we demonstrate that Drop-Bottleneck outperforms Variational Information Bottleneck (VIB) (Alemi et al., 2017) in multiple aspects including adversarial robustness and dimensionality reduction. 
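The deterministic compressed representation that Drop-Bottleneck provides at inference time can be sketched as follows. The drop probabilities, threshold, and data here are hypothetical; in the actual method the per-dimension drop probabilities are learned jointly with the feature extractor.

```python
# Sketch: each feature dimension has a drop probability. Training uses a
# sampled binary mask; inference deterministically keeps only dimensions
# with a high keep-probability, giving a consistent representation.
import random

def train_mask(drop_probs, rng=random):
    """Stochastic training-time mask: 0 drops a dimension, 1 keeps it."""
    return [0.0 if rng.random() < p else 1.0 for p in drop_probs]

def deterministic_compress(features, drop_probs, threshold=0.5):
    """Keep dimension i iff its keep-probability 1 - p_i exceeds threshold."""
    return [f for f, p in zip(features, drop_probs) if 1.0 - p > threshold]

feats = [0.3, 1.2, -0.7, 0.9]
p_drop = [0.9, 0.1, 0.8, 0.2]   # dims 0 and 2 judged irrelevant
z = deterministic_compress(feats, p_drop)   # -> [1.2, 0.9]
```

Dropping whole dimensions (rather than adding noise, as in VIB) is what makes the inference-time representation both discrete and deterministic.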
\ No newline at end of file diff --git a/data/2021/iclr/Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling b/data/2021/iclr/Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling new file mode 100644 index 0000000000..28b1aed5b6 --- /dev/null +++ b/data/2021/iclr/Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling @@ -0,0 +1 @@ +Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting completed hypotheses. In this work, we propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR, especially with inplace knowledge distillation during the training. The Dual-mode ASR framework can be applied to recent state-of-the-art convolution-based and transformer-based ASR networks. We present extensive experiments with two state-of-the-art ASR networks, ContextNet and Conformer, on two datasets, a widely used public dataset LibriSpeech and a large-scale dataset MultiDomain. Experiments and ablation studies demonstrate that Dual-mode ASR not only simplifies the workflow of training and deploying streaming and full-context ASR models, but also significantly improves both emission latency and recognition accuracy of streaming ASR. With Dual-mode ASR, we achieve new state-of-the-art streaming ASR results on both LibriSpeech and MultiDomain in terms of accuracy and latency. 
\ No newline at end of file diff --git a/data/2021/iclr/DynaTune: Dynamic Tensor Program Optimization in Deep Neural Network Compilation b/data/2021/iclr/DynaTune: Dynamic Tensor Program Optimization in Deep Neural Network Compilation new file mode 100644 index 0000000000..9485ab27b9 --- /dev/null +++ b/data/2021/iclr/DynaTune: Dynamic Tensor Program Optimization in Deep Neural Network Compilation @@ -0,0 +1 @@ +Recently, the DL compiler, together with Learning to Compile has proven to be a powerful technique for optimizing deep learning models. However, existing methods focus on accelerating the convergence speed of the individual tensor operator rather than the convergence speed of the entire model, which results in long optimization time to obtain a desired latency. In this paper, we present a new method called DynaTune, which provides significantly faster convergence speed to optimize a DNN model. In particular, we consider a Multi-Armed Bandit (MAB) model for the tensor program optimization problem. We use UCB to handle the decision-making of time-slot-based optimization, and we devise a Bayesian belief model that allows predicting the potential performance gain of each operator with uncertainty quantification, which guides the optimization process. We evaluate and compare DynaTune with the state-of-the-art DL compiler. The experiment results show that DynaTune is 1.2–2.4 times faster to achieve the same optimization quality for a range of models across different hardware architectures. \ No newline at end of file diff --git a/data/2021/iclr/Dynamic Tensor Rematerialization b/data/2021/iclr/Dynamic Tensor Rematerialization new file mode 100644 index 0000000000..bfa6feeb55 --- /dev/null +++ b/data/2021/iclr/Dynamic Tensor Rematerialization @@ -0,0 +1 @@ +Checkpointing enables training larger models by freeing intermediate activations and recomputing them on demand. 
Previous checkpointing techniques are difficult to generalize to dynamic models because they statically plan recomputations offline. We present Dynamic Tensor Rematerialization (DTR), a greedy online algorithm for heuristically checkpointing arbitrary models. DTR is extensible and general: it is parameterized by an eviction policy and only collects lightweight metadata on tensors and operators. Though DTR has no advance knowledge of the model or training task, we prove it can train an $N$-layer feedforward network on an $\Omega(\sqrt{N})$ memory budget with only $\mathcal{O}(N)$ tensor operations. Moreover, we identify a general eviction heuristic and show how it allows DTR to automatically provide favorable checkpointing performance across a variety of models and memory budgets. \ No newline at end of file diff --git a/data/2021/iclr/EEC: Learning to Encode and Regenerate Images for Continual Learning b/data/2021/iclr/EEC: Learning to Encode and Regenerate Images for Continual Learning new file mode 100644 index 0000000000..3e35970436 --- /dev/null +++ b/data/2021/iclr/EEC: Learning to Encode and Regenerate Images for Continual Learning @@ -0,0 +1 @@ +The two main impediments to continual learning are catastrophic forgetting and memory limitations on the storage of data. To cope with these challenges, we propose a novel, cognitively-inspired approach which trains autoencoders with Neural Style Transfer to encode and store images. During training on a new task, reconstructed images from encoded episodes are replayed in order to avoid catastrophic forgetting. The loss function for the reconstructed images is weighted to reduce its effect during classifier training to cope with image degradation. When the system runs out of memory the encoded episodes are converted into centroids and covariance matrices, which are used to generate pseudo-images during classifier training, keeping classifier performance stable while using less memory. 
Our approach increases classification accuracy by 13-17% over state-of-the-art methods on benchmark datasets, while requiring 78% less storage space. \ No newline at end of file diff --git a/data/2021/iclr/Early Stopping in Deep Networks: Double Descent and How to Eliminate it b/data/2021/iclr/Early Stopping in Deep Networks: Double Descent and How to Eliminate it new file mode 100644 index 0000000000..e206e17363 --- /dev/null +++ b/data/2021/iclr/Early Stopping in Deep Networks: Double Descent and How to Eliminate it @@ -0,0 +1 @@ +Over-parameterized models, such as large deep networks, often exhibit a double descent phenomenon, where, as a function of model size, the error first decreases, then increases, and then decreases again. This intriguing double descent behavior also occurs as a function of training epochs and has been conjectured to arise because training epochs control the model complexity. In this paper, we show that such epoch-wise double descent arises for a different reason: it is caused by a superposition of two or more bias-variance tradeoffs that arise because different parts of the network are learned at different epochs, and eliminating this by proper scaling of stepsizes can significantly improve the early stopping performance. We show this analytically for i) linear regression, where differently scaled features give rise to a superposition of bias-variance tradeoffs, and for ii) a two-layer neural network, where the first and second layer each govern a bias-variance tradeoff. Inspired by this theory, we study two standard convolutional networks empirically and show that eliminating epoch-wise double descent through adjusting stepsizes of different layers improves the early stopping performance significantly. 
\ No newline at end of file diff --git a/data/2021/iclr/Economic Hyperparameter Optimization with Blended Search Strategy b/data/2021/iclr/Economic Hyperparameter Optimization with Blended Search Strategy new file mode 100644 index 0000000000..eb6d78e95e --- /dev/null +++ b/data/2021/iclr/Economic Hyperparameter Optimization with Blended Search Strategy @@ -0,0 +1 @@ +This article presents a new approach to modeling and optimizing individual decision-making strategies in multi-agent socio-economic systems (MSES). This approach is based on the synthesis of agent-based modeling methods, machine learning and genetic optimization algorithms. A procedure for the synthesis and training of artificial neural networks (ANNs) that simulate the functionality of MSES and provide an approximation of the values of its objective characteristics has been developed. The feature of the two-step procedure is the combined use of particle swarm optimization methods (to determine the optimal values of hyperparameters) and the Adam machine learning algorithm (to compute weight coefficients of the ANN). The use of such ANN-based surrogate models in parallel multi-agent real-coded genetic algorithms (MA-RCGA) makes it possible to raise substantially the time-efficiency of the evolutionary search for optimal solutions. We have conducted numerical experiments that confirm a significant improvement in the performance of MA-RCGA, which periodically uses the ANN-based surrogate-model to approximate the values of the objective and fitness functions. A software framework has been designed that consists of the original (reference) agent-based model of trade interactions, the ANN-based surrogate model and the MA-RCGA genetic algorithm. At the same time, the software libraries FLAME GPU, OpenNN (Open Neural Networks Library), etc., agent-based modeling and machine learning methods are used. The system we developed can be used by responsible managers. 
\ No newline at end of file diff --git a/data/2021/iclr/Effective Abstract Reasoning with Dual-Contrast Network b/data/2021/iclr/Effective Abstract Reasoning with Dual-Contrast Network new file mode 100644 index 0000000000..4146e43061 --- /dev/null +++ b/data/2021/iclr/Effective Abstract Reasoning with Dual-Contrast Network @@ -0,0 +1 @@ +As a step towards improving the abstract reasoning capability of machines, we aim to solve Raven's Progressive Matrices (RPM) with neural networks, since solving RPM puzzles is highly correlated with human intelligence. Unlike previous methods that use auxiliary annotations or assume hidden rules to produce appropriate feature representation, we only use the ground truth answer of each question for model learning, aiming for an intelligent agent to have a strong learning capability with a small amount of supervision. Based on the RPM problem formulation, the correct answer filled into the missing entry of the third row/column has to best satisfy the same rules shared between the first two rows/columns. Thus we design a simple yet effective Dual-Contrast Network (DCNet) to exploit the inherent structure of RPM puzzles. Specifically, a rule contrast module is designed to compare the latent rules between the filled row/column and the first two rows/columns; a choice contrast module is designed to increase the relative differences between candidate choices. Experimental results on the RAVEN and PGM datasets show that DCNet outperforms the state-of-the-art methods by a large margin of 5.77%. Further experiments on few training samples and model generalization also show the effectiveness of DCNet. Code is available at https://github.com/visiontao/dcnet. 
\ No newline at end of file diff --git a/data/2021/iclr/Effective Distributed Learning with Random Features: Improved Bounds and Algorithms b/data/2021/iclr/Effective Distributed Learning with Random Features: Improved Bounds and Algorithms new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Effective and Efficient Vote Attack on Capsule Networks b/data/2021/iclr/Effective and Efficient Vote Attack on Capsule Networks new file mode 100644 index 0000000000..968d549025 --- /dev/null +++ b/data/2021/iclr/Effective and Efficient Vote Attack on Capsule Networks @@ -0,0 +1 @@ +Standard Convolutional Neural Networks (CNNs) can be easily fooled by images with small quasi-imperceptible artificial perturbations. As alternatives to CNNs, the recently proposed Capsule Networks (CapsNets) are shown to be more robust to white-box attacks than CNNs under popular attack protocols. Besides, the class-conditional reconstruction part of CapsNets is also used to detect adversarial examples. In this work, we investigate the adversarial robustness of CapsNets, especially how the inner workings of CapsNets change when the output capsules are attacked. The first observation is that adversarial examples mislead CapsNets by manipulating the votes from primary capsules. Another observation is the high computational cost when we directly apply multi-step attack methods designed for CNNs to CapsNets, due to the computationally expensive routing mechanism. Motivated by these two observations, we propose a novel vote attack where we attack the votes of CapsNets directly. Our vote attack is not only effective but also efficient by circumventing the routing process. Furthermore, we integrate our vote attack into the detection-aware attack paradigm, which can successfully bypass the class-conditional reconstruction-based detection method. Extensive experiments demonstrate the superior attack performance of our vote attack on CapsNets.
\ No newline at end of file diff --git a/data/2021/iclr/Efficient Certified Defenses Against Patch Attacks on Image Classifiers b/data/2021/iclr/Efficient Certified Defenses Against Patch Attacks on Image Classifiers new file mode 100644 index 0000000000..3a6a823d48 --- /dev/null +++ b/data/2021/iclr/Efficient Certified Defenses Against Patch Attacks on Image Classifiers @@ -0,0 +1 @@ +Adversarial patches pose a realistic threat model for physical world attacks on autonomous systems via their perception component. Autonomous systems in safety-critical domains such as automated driving should thus contain a fail-safe fallback component that combines certifiable robustness against patches with efficient inference while maintaining high performance on clean inputs. We propose BagCert, a novel combination of model architecture and certification procedure that allows efficient certification. We derive a loss that enables end-to-end optimization of certified robustness against patches of different sizes and locations. On CIFAR10, BagCert certifies 10,000 examples in 43 seconds on a single GPU and obtains 86% clean and 60% certified accuracy against 5x5 patches. \ No newline at end of file diff --git a/data/2021/iclr/Efficient Conformal Prediction via Cascaded Inference with Expanded Admission b/data/2021/iclr/Efficient Conformal Prediction via Cascaded Inference with Expanded Admission new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Efficient Continual Learning with Modular Networks and Task-Driven Priors b/data/2021/iclr/Efficient Continual Learning with Modular Networks and Task-Driven Priors new file mode 100644 index 0000000000..6c3ad3e213 --- /dev/null +++ b/data/2021/iclr/Efficient Continual Learning with Modular Networks and Task-Driven Priors @@ -0,0 +1 @@ +Existing literature in Continual Learning (CL) has focused on overcoming catastrophic forgetting, the inability of the learner to recall how to perform tasks observed in the past.
There are however other desirable properties of a CL system, such as the ability to transfer knowledge from previous tasks and to scale memory and compute sub-linearly with the number of tasks. Since most current benchmarks focus only on forgetting using short streams of tasks, we first propose a new suite of benchmarks to probe CL algorithms across these new axes. Finally, we introduce a new modular architecture, whose modules represent atomic skills that can be composed to perform a certain task. Learning a task reduces to figuring out which past modules to re-use, and which new modules to instantiate to solve the current task. Our learning algorithm leverages a task-driven prior over the exponential search space of all possible ways to combine modules, enabling efficient learning on long streams of tasks. Our experiments show that this modular architecture and learning algorithm perform competitively on widely used CL benchmarks while yielding superior performance on the more challenging benchmarks we introduce in this work. \ No newline at end of file diff --git a/data/2021/iclr/Efficient Empowerment Estimation for Unsupervised Stabilization b/data/2021/iclr/Efficient Empowerment Estimation for Unsupervised Stabilization new file mode 100644 index 0000000000..55d99139e8 --- /dev/null +++ b/data/2021/iclr/Efficient Empowerment Estimation for Unsupervised Stabilization @@ -0,0 +1 @@ +Intrinsically motivated artificial agents learn advantageous behavior without externally-provided rewards. Previously, it was shown that maximizing mutual information between agent actuators and future states, known as the empowerment principle, enables unsupervised stabilization of dynamical systems at upright positions, which is a prototypical intrinsically motivated behavior for upright standing and walking. This follows from the coincidence between the objective of stabilization and the objective of empowerment. 
Unfortunately, sample-based estimation of this kind of mutual information is challenging. Recently, various variational lower bounds (VLBs) on empowerment have been proposed as solutions; however, they are often biased, unstable in training, and have high sample complexity. In this work, we propose an alternative solution based on a trainable representation of a dynamical system as a Gaussian channel, which allows us to efficiently calculate an unbiased estimator of empowerment by convex optimization. We demonstrate our solution for sample-based unsupervised stabilization on different dynamical control systems and show the advantages of our method by comparing it to the existing VLB approaches. Specifically, we show that our method has a lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images. Consequently, our method opens a path to wider and easier adoption of empowerment for various applications. \ No newline at end of file diff --git a/data/2021/iclr/Efficient Generalized Spherical CNNs b/data/2021/iclr/Efficient Generalized Spherical CNNs new file mode 100644 index 0000000000..67aeba3a32 --- /dev/null +++ b/data/2021/iclr/Efficient Generalized Spherical CNNs @@ -0,0 +1 @@ +Many problems across computer vision and the natural sciences require the analysis of spherical data, for which representations may be learned efficiently by encoding equivariance to rotational symmetries. We present a generalized spherical CNN framework that encompasses various existing approaches and allows them to be leveraged alongside each other. The only existing non-linear spherical CNN layer that is strictly equivariant has complexity $\mathcal{O}(C^2L^5)$, where $C$ is a measure of representational capacity and $L$ the spherical harmonic bandlimit. Such a high computational cost often prohibits the use of strictly equivariant spherical CNNs.
We develop two new strictly equivariant layers with reduced complexity $\mathcal{O}(CL^4)$ and $\mathcal{O}(CL^3 \log L)$, making larger, more expressive models computationally feasible. Moreover, we adopt efficient sampling theory to achieve further computational savings. We show that these developments allow the construction of more expressive hybrid models that achieve state-of-the-art accuracy and parameter efficiency on spherical benchmark problems. \ No newline at end of file diff --git a/data/2021/iclr/Efficient Inference of Flexible Interaction in Spiking-neuron Networks b/data/2021/iclr/Efficient Inference of Flexible Interaction in Spiking-neuron Networks new file mode 100644 index 0000000000..0e717bfce9 --- /dev/null +++ b/data/2021/iclr/Efficient Inference of Flexible Interaction in Spiking-neuron Networks @@ -0,0 +1 @@ +Hawkes process provides an effective statistical framework for analyzing the time-dependent interaction of neuronal spiking activities. Although utilized in many real applications, the classic Hawkes process is incapable of modelling inhibitory interactions among neurons. Instead, the nonlinear Hawkes process allows for a more flexible influence pattern with excitatory or inhibitory interactions. In this paper, three sets of auxiliary latent variables (Polya-Gamma variables, latent marked Poisson processes and sparsity variables) are augmented to make functional connection weights in a Gaussian form, which allows for a simple iterative algorithm with analytical updates. As a result, an efficient expectation-maximization (EM) algorithm is derived to obtain the maximum a posteriori (MAP) estimate. We demonstrate the accuracy and efficiency performance of our algorithm on synthetic and real data. For real neural recordings, we show our algorithm can estimate the temporal dynamics of interaction and reveal the interpretable functional connectivity underlying neural spike trains. 
\ No newline at end of file diff --git a/data/2021/iclr/Efficient Reinforcement Learning in Factored MDPs with Application to Constrained RL b/data/2021/iclr/Efficient Reinforcement Learning in Factored MDPs with Application to Constrained RL new file mode 100644 index 0000000000..99f96c9624 --- /dev/null +++ b/data/2021/iclr/Efficient Reinforcement Learning in Factored MDPs with Application to Constrained RL @@ -0,0 +1 @@ +Reinforcement learning (RL) in episodic, factored Markov decision processes (FMDPs) is studied. We propose an algorithm called FMDP-BF, which leverages the factorization structure of FMDP. The regret of FMDP-BF is shown to be exponentially smaller than that of optimal algorithms designed for non-factored MDPs, and improves on the best previous result for FMDPs~\citep{osband2014near} by a factor of $\sqrt{H|\mathcal{S}_i|}$, where $|\mathcal{S}_i|$ is the cardinality of the factored state subspace and $H$ is the planning horizon. To show the optimality of our bounds, we also provide a lower bound for FMDP, which indicates that our algorithm is near-optimal w.r.t. timestep $T$, horizon $H$ and factored state-action subspace cardinality. Finally, as an application, we study a new formulation of constrained RL, known as RL with knapsack constraints (RLwK), and provide the first sample-efficient algorithm based on FMDP-BF. \ No newline at end of file diff --git a/data/2021/iclr/Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation b/data/2021/iclr/Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation new file mode 100644 index 0000000000..aefb8f804f --- /dev/null +++ b/data/2021/iclr/Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation @@ -0,0 +1 @@ +Many real-world applications such as robotics provide hard constraints on power and compute that limit the viable model complexity of Reinforcement Learning (RL) agents.
Similarly, in many distributed RL settings, acting is done on un-accelerated hardware such as CPUs, which likewise restricts model size to prevent intractable experiment run times. These "actor-latency" constrained settings present a major obstruction to the scaling up of model complexity that has recently been extremely successful in supervised learning. To be able to utilize large model capacity while still operating within the limits imposed by the system during acting, we develop an "Actor-Learner Distillation" (ALD) procedure that leverages a continual form of distillation that transfers learning progress from a large capacity learner model to a small capacity actor model. As a case study, we develop this procedure in the context of partially-observable environments, where transformer models have had large improvements over LSTMs recently, at the cost of significantly higher computational complexity. With transformer models as the learner and LSTMs as the actor, we demonstrate in several challenging memory environments that using Actor-Learner Distillation recovers the clear sample-efficiency gains of the transformer learner model while maintaining the fast inference and reduced total training time of the LSTM actor model. \ No newline at end of file diff --git a/data/2021/iclr/Efficient Wasserstein Natural Gradients for Reinforcement Learning b/data/2021/iclr/Efficient Wasserstein Natural Gradients for Reinforcement Learning new file mode 100644 index 0000000000..b014c80db4 --- /dev/null +++ b/data/2021/iclr/Efficient Wasserstein Natural Gradients for Reinforcement Learning @@ -0,0 +1 @@ +A novel optimization approach is proposed for application to policy gradient methods and evolution strategies for reinforcement learning (RL). The procedure uses a computationally efficient Wasserstein natural gradient (WNG) descent that takes advantage of the geometry induced by a Wasserstein penalty to speed optimization.
This method follows the recent theme in RL of including a divergence penalty in the objective to establish a trust region. Experiments on challenging tasks demonstrate improvements in both computational cost and performance over advanced baselines. \ No newline at end of file diff --git a/data/2021/iclr/EigenGame: PCA as a Nash Equilibrium b/data/2021/iclr/EigenGame: PCA as a Nash Equilibrium new file mode 100644 index 0000000000..f0e89b0223 --- /dev/null +++ b/data/2021/iclr/EigenGame: PCA as a Nash Equilibrium @@ -0,0 +1 @@ +We present a novel view on principal component analysis (PCA) as a competitive game in which each approximate eigenvector is controlled by a player whose goal is to maximize their own utility function. We analyze the properties of this PCA game and the behavior of its gradient based updates. The resulting algorithm which combines elements from Oja's rule with a generalized Gram-Schmidt orthogonalization is naturally decentralized and hence parallelizable through message passing. We demonstrate the scalability of the algorithm with experiments on large image datasets and neural network activations. We discuss how this new view of PCA as a differentiable game can lead to further algorithmic developments and insights. \ No newline at end of file diff --git a/data/2021/iclr/Emergent Road Rules In Multi-Agent Driving Environments b/data/2021/iclr/Emergent Road Rules In Multi-Agent Driving Environments new file mode 100644 index 0000000000..642e16c637 --- /dev/null +++ b/data/2021/iclr/Emergent Road Rules In Multi-Agent Driving Environments @@ -0,0 +1 @@ +For autonomous vehicles to safely share the road with human drivers, autonomous vehicles must abide by specific "road rules" that human drivers have agreed to follow. 
"Road rules" include rules that drivers are required to follow by law -- such as the requirement that vehicles stop at red lights -- as well as more subtle social rules -- such as the implicit designation of fast lanes on the highway. In this paper, we provide empirical evidence that suggests that -- instead of hard-coding road rules into self-driving algorithms -- a scalable alternative may be to design multi-agent environments in which road rules emerge as optimal solutions to the problem of maximizing traffic flow. We analyze what ingredients in driving environments cause the emergence of these road rules and find that two crucial factors are noisy perception and agents' spatial density. We provide qualitative and quantitative evidence of the emergence of seven social driving behaviors, ranging from obeying traffic signals to following lanes, all of which emerge from training agents to drive quickly to destinations without colliding. Our results add empirical support for the social road rules that countries worldwide have agreed on for safe, efficient driving. \ No newline at end of file diff --git a/data/2021/iclr/Emergent Symbols through Binding in External Memory b/data/2021/iclr/Emergent Symbols through Binding in External Memory new file mode 100644 index 0000000000..7638d764e2 --- /dev/null +++ b/data/2021/iclr/Emergent Symbols through Binding in External Memory @@ -0,0 +1 @@ +A key aspect of human intelligence is the ability to infer abstract rules directly from high-dimensional sensory data, and to do so given only a limited amount of training experience. Deep neural network algorithms have proven to be a powerful tool for learning directly from high-dimensional data, but currently lack this capacity for data-efficient induction of abstract rules, leading some to argue that symbol-processing mechanisms will be necessary to account for this capacity. 
In this work, we take a step toward bridging this gap by introducing the Emergent Symbol Binding Network (ESBN), a recurrent network augmented with an external memory that enables a form of variable-binding and indirection. This binding mechanism allows symbol-like representations to emerge through the learning process without the need to explicitly incorporate symbol-processing machinery, enabling the ESBN to learn rules in a manner that is abstracted away from the particular entities to which those rules apply. Across a series of tasks, we show that this architecture displays nearly perfect generalization of learned rules to novel entities given only a limited number of training examples, and outperforms a number of other competitive neural network architectures. \ No newline at end of file diff --git a/data/2021/iclr/Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition b/data/2021/iclr/Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition new file mode 100644 index 0000000000..f12d336156 --- /dev/null +++ b/data/2021/iclr/Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition @@ -0,0 +1 @@ +In many scenarios, named entity recognition (NER) models severely suffer from the unlabeled entity problem, where the entities of a sentence may not be fully annotated. Through empirical studies performed on synthetic datasets, we find two causes of the performance degradation. One is the reduction of annotated entities and the other is treating unlabeled entities as negative instances. The first cause has less impact than the second one and can be mitigated by adopting pretrained language models. The second cause seriously misguides a model in training and greatly affects its performance. Based on the above observations, we propose a general approach that is capable of eliminating the misguidance brought by unlabeled entities.
The core idea is using negative sampling to keep the probability of training with unlabeled entities at a very low level. Experiments on synthetic datasets and real-world datasets show that our model is robust to the unlabeled entity problem and surpasses prior baselines. On well-annotated datasets, our model is competitive with state-of-the-art methods. \ No newline at end of file diff --git a/data/2021/iclr/Empirical or Invariant Risk Minimization? A Sample Complexity Perspective b/data/2021/iclr/Empirical or Invariant Risk Minimization? A Sample Complexity Perspective new file mode 100644 index 0000000000..96ee975bb0 --- /dev/null +++ b/data/2021/iclr/Empirical or Invariant Risk Minimization? A Sample Complexity Perspective @@ -0,0 +1 @@ +Recently, invariant risk minimization (IRM) was proposed as a promising solution to address out-of-distribution (OOD) generalization. However, it is unclear when IRM should be preferred over the widely-employed empirical risk minimization (ERM) framework. In this work, we analyze both these frameworks from the perspective of sample complexity, thus taking a firm step towards answering this important question. We find that depending on the type of data generation mechanism, the two approaches might have very different finite sample and asymptotic behavior. For example, in the covariate shift setting we see that the two approaches not only arrive at the same asymptotic solution, but also have similar finite sample behavior with no clear winner. For other distribution shifts such as those involving confounders or anti-causal variables, however, the two approaches arrive at different asymptotic solutions where IRM is guaranteed to be close to the desired OOD solutions in the finite sample regime, while ERM is biased even asymptotically.
We further investigate how different factors -- the number of environments, complexity of the model, and IRM penalty weight -- impact the sample complexity of IRM in relation to its distance from the OOD solutions. \ No newline at end of file diff --git a/data/2021/iclr/End-to-End Egospheric Spatial Memory b/data/2021/iclr/End-to-End Egospheric Spatial Memory new file mode 100644 index 0000000000..8edfa4bd38 --- /dev/null +++ b/data/2021/iclr/End-to-End Egospheric Spatial Memory @@ -0,0 +1 @@ +Spatial memory, or the ability to remember and recall specific locations and objects, is central to autonomous agents' ability to carry out tasks in real environments. However, most existing artificial memory modules are not very adept at storing spatial information. We propose a parameter-free module, Egospheric Spatial Memory (ESM), which encodes the memory in an ego-sphere around the agent, enabling expressive 3D representations. ESM can be trained end-to-end via either imitation or reinforcement learning, and improves both training efficiency and final performance against other memory baselines on both drone and manipulator visuomotor control tasks. The explicit egocentric geometry also enables us to seamlessly combine the learned controller with other non-learned modalities, such as local obstacle avoidance. We further show applications to semantic segmentation on the ScanNet dataset, where ESM naturally combines image-level and map-level inference modalities. Through our broad set of experiments, we show that ESM provides a general computation graph for embodied spatial reasoning, and the module forms a bridge between real-time mapping systems and differentiable memory architectures. Implementation at: https://github.com/ivy-dl/memory.
\ No newline at end of file diff --git a/data/2021/iclr/End-to-end Adversarial Text-to-Speech b/data/2021/iclr/End-to-end Adversarial Text-to-Speech new file mode 100644 index 0000000000..6741cc1187 --- /dev/null +++ b/data/2021/iclr/End-to-end Adversarial Text-to-Speech @@ -0,0 +1 @@ +Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs. Our proposed generator is feed-forward and thus efficient for both training and inference, using a differentiable alignment scheme based on token length prediction. It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses constraining the generated audio to roughly match the ground truth in terms of its total duration and mel-spectrogram. To allow the model to capture temporal variation in the generated audio, we employ soft dynamic time warping in the spectrogram-based prediction loss. The resulting model achieves a mean opinion score exceeding 4 on a 5 point scale, which is comparable to the state-of-the-art models relying on multi-stage training and additional supervision. \ No newline at end of file diff --git a/data/2021/iclr/Enforcing robust control guarantees within neural network policies b/data/2021/iclr/Enforcing robust control guarantees within neural network policies new file mode 100644 index 0000000000..84a09e24d3 --- /dev/null +++ b/data/2021/iclr/Enforcing robust control guarantees within neural network policies @@ -0,0 +1 @@ +When designing controllers for safety-critical systems, practitioners often face a challenging tradeoff between robustness and performance. 
While robust control methods provide rigorous guarantees on system stability under certain worst-case disturbances, they often result in simple controllers that perform poorly in the average (non-worst) case. In contrast, nonlinear control methods trained using deep learning have achieved state-of-the-art performance on many control tasks, but often lack robustness guarantees. We propose a technique that combines the strengths of these two approaches: a generic nonlinear control policy class, parameterized by neural networks, that nonetheless enforces the same provable robustness criteria as robust control. Specifically, we show that by integrating custom convex-optimization-based projection layers into a nonlinear policy, we can construct a provably robust neural network policy class that outperforms robust control methods in the average (non-adversarial) setting. We demonstrate the power of this approach on several domains, improving in performance over existing robust control methods and in stability over (non-robust) RL methods. \ No newline at end of file diff --git a/data/2021/iclr/Enjoy Your Editing: Controllable GANs for Image Editing via Latent Space Navigation b/data/2021/iclr/Enjoy Your Editing: Controllable GANs for Image Editing via Latent Space Navigation new file mode 100644 index 0000000000..8a35b213ca --- /dev/null +++ b/data/2021/iclr/Enjoy Your Editing: Controllable GANs for Image Editing via Latent Space Navigation @@ -0,0 +1 @@ +Controllable semantic image editing enables a user to change entire image attributes with few clicks, e.g., gradually making a summer scene look like it was taken in winter. Classic approaches for this task use a Generative Adversarial Net (GAN) to learn a latent space and suitable latent-space transformations. However, current approaches often suffer from attribute edits that are entangled, global image identity changes, and diminished photo-realism. 
To address these concerns, we learn multiple attribute transformations simultaneously, integrate attribute regression into the training of the transformation functions, and apply a content loss and an adversarial loss that encourage the maintenance of image identity and photo-realism. We propose quantitative evaluation strategies for measuring controllable editing performance, unlike prior work which primarily focuses on qualitative evaluation. Our model permits better control for both single- and multiple-attribute editing, while also preserving image identity and realism during transformation. We provide empirical results for both real and synthetic images, highlighting that our model achieves state-of-the-art performance for targeted image manipulation. \ No newline at end of file diff --git a/data/2021/iclr/Entropic gradient descent algorithms and wide flat minima b/data/2021/iclr/Entropic gradient descent algorithms and wide flat minima new file mode 100644 index 0000000000..5bb25bf6e5 --- /dev/null +++ b/data/2021/iclr/Entropic gradient descent algorithms and wide flat minima @@ -0,0 +1 @@ +The properties of flat minima in the empirical risk landscape of neural networks have been debated for some time. Increasing evidence suggests they possess better generalization capabilities with respect to sharp ones. In this work we first discuss the relationship between alternative measures of flatness: the local entropy, which is useful for analysis and algorithm development, and the local energy, which is easier to compute and was shown empirically in extensive tests on state-of-the-art networks to be the best predictor of generalization capabilities. We show semi-analytically in simple controlled scenarios that these two measures correlate strongly with each other and with generalization. Then, we extend the analysis to the deep learning scenario by extensive numerical validations.
We study two algorithms, entropy-stochastic gradient descent and replicated-stochastic gradient descent, that explicitly include the local entropy in the optimization objective. We devise a training schedule by which we consistently find flatter minima (using both flatness measures), and improve the generalization error for common architectures (e.g. ResNet, EfficientNet). \ No newline at end of file diff --git a/data/2021/iclr/Estimating Lipschitz constants of monotone deep equilibrium models b/data/2021/iclr/Estimating Lipschitz constants of monotone deep equilibrium models new file mode 100644 index 0000000000..595d9f7557 --- /dev/null +++ b/data/2021/iclr/Estimating Lipschitz constants of monotone deep equilibrium models @@ -0,0 +1 @@ +Several methods have been proposed in recent years to provide bounds on the Lipschitz constants of deep networks, which can be used to provide robustness guarantees, generalization bounds, and characterize the smoothness of decision boundaries. However, existing bounds get substantially weaker with increasing depth of the network, which makes it unclear how to apply such bounds to recently proposed models such as the deep equilibrium (DEQ) model, which can be viewed as representing an infinitely-deep network. In this paper, we show that monotone DEQs, a recently-proposed subclass of DEQs, have Lipschitz constants that can be bounded as a simple function of the strong monotonicity parameter of the network. We derive simple-yet-tight bounds on both the input-output mapping and the weight-output mapping defined by these networks, and demonstrate that they are small relative to those for comparable standard DNNs. We show that one can use these bounds to design monotone DEQ models, even with e.g. multi-scale convolutional structure, that still have constraints on the Lipschitz constant. 
We also highlight how to use these bounds to develop PAC-Bayes generalization bounds that do not depend on the depth of the network, and which avoid the exponential depth-dependence of comparable DNN bounds. \ No newline at end of file diff --git a/data/2021/iclr/Estimating and Evaluating Regression Predictive Uncertainty in Deep Object Detectors b/data/2021/iclr/Estimating and Evaluating Regression Predictive Uncertainty in Deep Object Detectors new file mode 100644 index 0000000000..c1e20089db --- /dev/null +++ b/data/2021/iclr/Estimating and Evaluating Regression Predictive Uncertainty in Deep Object Detectors @@ -0,0 +1 @@ +Predictive uncertainty estimation is an essential next step for the reliable deployment of deep object detectors in safety-critical tasks. In this work, we focus on estimating predictive distributions for bounding box regression output with variance networks. We show that in the context of object detection, training variance networks with negative log likelihood (NLL) can lead to high entropy predictive distributions regardless of the correctness of the output mean. We propose to use the energy score as a non-local proper scoring rule and find that when used for training, the energy score leads to better calibrated and lower entropy predictive distributions than NLL. We also address the widespread use of non-proper scoring metrics for evaluating predictive distributions from deep object detectors by proposing an alternate evaluation approach founded on proper scoring rules. Using the proposed evaluation tools, we show that although variance networks can be used to produce high quality predictive distributions, ad-hoc approaches used by seminal object detectors for choosing regression targets during training do not provide wide enough data support for reliable variance learning.
We hope that our work helps shift evaluation in probabilistic object detection to better align with predictive uncertainty evaluation in other machine learning domains. Code for all models, evaluation, and datasets is available at: https://github.com/asharakeh/probdet.git. \ No newline at end of file diff --git a/data/2021/iclr/Estimating informativeness of samples with Smooth Unique Information b/data/2021/iclr/Estimating informativeness of samples with Smooth Unique Information new file mode 100644 index 0000000000..5602cb385b --- /dev/null +++ b/data/2021/iclr/Estimating informativeness of samples with Smooth Unique Information @@ -0,0 +1 @@ +We define a notion of information that an individual sample provides to the training of a neural network, and we specialize it to measure both how much a sample informs the final weights and how much it informs the function computed by the weights. Though related, we show that these quantities have a qualitatively different behavior. We give efficient approximations of these quantities using a linearized network and demonstrate empirically that the approximation is accurate for real-world architectures, such as pre-trained ResNets. We apply these measures to several problems, such as dataset summarization, analysis of under-sampled classes, comparison of informativeness of different data sources, and detection of adversarial and corrupted examples. Our work generalizes existing frameworks but enjoys better computational properties for heavily over-parametrized models, which makes it possible to apply it to real-world networks. 
\ No newline at end of file diff --git a/data/2021/iclr/Evaluating the Disentanglement of Deep Generative Models through Manifold Topology b/data/2021/iclr/Evaluating the Disentanglement of Deep Generative Models through Manifold Topology new file mode 100644 index 0000000000..da2f3889f4 --- /dev/null +++ b/data/2021/iclr/Evaluating the Disentanglement of Deep Generative Models through Manifold Topology @@ -0,0 +1 @@ +Learning disentangled representations is regarded as a fundamental task for improving the generalization, robustness, and interpretability of generative models. However, measuring disentanglement has been challenging and inconsistent, often dependent on an ad-hoc external model or specific to a certain dataset. To address this, we present a method for quantifying disentanglement that only uses the generative model, by measuring the topological similarity of conditional submanifolds in the learned representation. This method showcases both unsupervised and supervised variants. To illustrate the effectiveness and applicability of our method, we empirically evaluate several state-of-the-art models across multiple datasets. We find that our method ranks models similarly to existing methods. \ No newline at end of file diff --git a/data/2021/iclr/Evaluation of Neural Architectures trained with square Loss vs Cross-Entropy in Classification Tasks b/data/2021/iclr/Evaluation of Neural Architectures trained with square Loss vs Cross-Entropy in Classification Tasks new file mode 100644 index 0000000000..226873180f --- /dev/null +++ b/data/2021/iclr/Evaluation of Neural Architectures trained with square Loss vs Cross-Entropy in Classification Tasks @@ -0,0 +1,2 @@ +Modern neural architectures for classification tasks are trained using the cross-entropy loss, which is widely believed to be empirically superior to the square loss. In this work we provide evidence indicating that this belief may not be well-founded. 
We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in the literature, perform comparably or better when trained with the square loss, even after equalizing computational resources. Indeed, we observe that the square loss produces better results in the dominant majority of NLP and ASR experiments. Cross-entropy appears to have a slight edge on computer vision tasks. +We argue that there is little compelling empirical or theoretical evidence indicating a clear-cut advantage to the cross-entropy loss. Indeed, in our experiments, performance on nearly all non-vision tasks can be improved, sometimes significantly, by switching to the square loss. Furthermore, training with square loss appears to be less sensitive to the randomness in initialization. We posit that training using the square loss for classification needs to be a part of best practices of modern deep learning on equal footing with cross-entropy. \ No newline at end of file diff --git a/data/2021/iclr/Evaluation of Similarity-based Explanations b/data/2021/iclr/Evaluation of Similarity-based Explanations new file mode 100644 index 0000000000..092d918be9 --- /dev/null +++ b/data/2021/iclr/Evaluation of Similarity-based Explanations @@ -0,0 +1 @@ +Explaining the predictions made by complex machine learning models helps users to understand and accept the predicted outputs with confidence. One promising way is to use similarity-based explanation that provides similar instances as evidence to support model predictions. Several relevance metrics are used for this purpose. In this study, we investigated relevance metrics that can provide reasonable explanations to users. Specifically, we adopted three tests to evaluate whether the relevance metrics satisfy the minimal requirements for similarity-based explanation. 
Our experiments revealed that the cosine similarity of the gradients of the loss performs best, which would be a recommended choice in practice. In addition, we showed that some metrics perform poorly in our tests and analyzed the reasons for their failure. We expect our insights to help practitioners in selecting appropriate relevance metrics and also aid further research on designing better relevance metrics for explanations. \ No newline at end of file diff --git a/data/2021/iclr/Evaluations and Methods for Explanation through Robustness Analysis b/data/2021/iclr/Evaluations and Methods for Explanation through Robustness Analysis new file mode 100644 index 0000000000..10beeea41c --- /dev/null +++ b/data/2021/iclr/Evaluations and Methods for Explanation through Robustness Analysis @@ -0,0 +1 @@ +Among multiple ways of interpreting a machine learning model, measuring the importance of a set of features tied to a prediction is probably one of the most intuitive ways to explain a model. In this paper, we establish the link between a set of features and a prediction with a new evaluation criterion, robustness analysis, which measures the minimum distortion distance of adversarial perturbation. By measuring the tolerance level for an adversarial attack, we can extract a set of features that provides the most robust support for a prediction, and also can extract a set of features that contrasts the current prediction to a target class by setting a targeted adversarial attack. By applying this methodology to various prediction tasks across multiple domains, we observe that the derived explanations indeed capture the significant feature set qualitatively and quantitatively.
\ No newline at end of file diff --git a/data/2021/iclr/Evolving Reinforcement Learning Algorithms b/data/2021/iclr/Evolving Reinforcement Learning Algorithms new file mode 100644 index 0000000000..bb4153c983 --- /dev/null +++ b/data/2021/iclr/Evolving Reinforcement Learning Algorithms @@ -0,0 +1 @@ +We propose a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs which compute the loss function for a value-based model-free RL agent to optimize. The learned algorithms are domain-agnostic and can generalize to new environments not seen during training. Our method can both learn from scratch and bootstrap off known existing algorithms, like DQN, enabling interpretable modifications which improve performance. Learning from scratch on simple classical control and gridworld tasks, our method rediscovers the temporal-difference (TD) algorithm. Bootstrapped from DQN, we highlight two learned algorithms which obtain good generalization performance over other classical control tasks, gridworld type tasks, and Atari games. The analysis of the learned algorithm behavior shows resemblance to recently proposed RL algorithms that address overestimation in value-based methods. \ No newline at end of file diff --git a/data/2021/iclr/Exemplary Natural Images Explain CNN Activations Better than State-of-the-Art Feature Visualization b/data/2021/iclr/Exemplary Natural Images Explain CNN Activations Better than State-of-the-Art Feature Visualization new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Explainable Deep One-Class Classification b/data/2021/iclr/Explainable Deep One-Class Classification new file mode 100644 index 0000000000..37ec205b4c --- /dev/null +++ b/data/2021/iclr/Explainable Deep One-Class Classification @@ -0,0 +1 @@ +Deep one-class classification variants for anomaly detection learn a mapping that concentrates nominal samples in feature space causing anomalies to be mapped away. 
Because this transformation is highly non-linear, finding interpretations poses a significant challenge. In this paper we present an explainable deep one-class classification method, Fully Convolutional Data Description (FCDD), where the mapped samples are themselves also an explanation heatmap. FCDD yields competitive detection performance and provides reasonable explanations on common anomaly detection benchmarks with CIFAR-10 and ImageNet. On MVTec-AD, a recent manufacturing dataset offering ground-truth anomaly maps, FCDD meets the state of the art in an unsupervised setting, and outperforms its competitors in a semi-supervised setting. Finally, using FCDD's explanations we demonstrate the vulnerability of deep one-class classification models to spurious image features such as image watermarks. \ No newline at end of file diff --git a/data/2021/iclr/Explainable Subgraph Reasoning for Forecasting on Temporal Knowledge Graphs b/data/2021/iclr/Explainable Subgraph Reasoning for Forecasting on Temporal Knowledge Graphs new file mode 100644 index 0000000000..866acf4f54 --- /dev/null +++ b/data/2021/iclr/Explainable Subgraph Reasoning for Forecasting on Temporal Knowledge Graphs @@ -0,0 +1 @@ +Modeling time-evolving knowledge graphs (KGs) has recently gained increasing interest. Here, graph representation learning has become the dominant paradigm for link prediction on temporal KGs. However, the embedding-based approaches largely operate in a black-box fashion, lacking the ability to interpret their predictions. This paper provides a link forecasting framework that reasons over query-relevant subgraphs of temporal KGs and jointly models the structural dependencies and the temporal dynamics. In particular, we propose a temporal relational attention mechanism and a novel reverse representation update scheme to guide the extraction of an enclosing subgraph around the query.
The subgraph is expanded by an iterative sampling of temporal neighbors and by attention propagation. Our approach provides human-understandable evidence explaining the forecast. We evaluate our model on four benchmark temporal knowledge graphs for the link forecasting task. While being more explainable, our model obtains a relative improvement of up to 20 % on Hits@1 compared to the previous best temporal KG forecasting method. We also conduct a survey with 53 respondents, and the results show that the evidence extracted by the model for link forecasting is aligned with human understanding. \ No newline at end of file diff --git a/data/2021/iclr/Explaining by Imitating: Understanding Decisions by Interpretable Policy Learning b/data/2021/iclr/Explaining by Imitating: Understanding Decisions by Interpretable Policy Learning new file mode 100644 index 0000000000..6fcd45874e --- /dev/null +++ b/data/2021/iclr/Explaining by Imitating: Understanding Decisions by Interpretable Policy Learning @@ -0,0 +1 @@ +Understanding human behavior from observed data is critical for transparency and accountability in decision-making. Consider real-world settings such as healthcare, in which modeling a decision-maker's policy is challenging -- with no access to underlying states, no knowledge of environment dynamics, and no allowance for live experimentation. We desire learning a data-driven representation of decision-making behavior that (1) inheres transparency by design, (2) accommodates partial observability, and (3) operates completely offline. To satisfy these key criteria, we propose a novel model-based Bayesian method for interpretable policy learning ("Interpole") that jointly estimates an agent's (possibly biased) belief-update process together with their (possibly suboptimal) belief-action mapping. 
Through experiments on both simulated and real-world data for the problem of Alzheimer's disease diagnosis, we illustrate the potential of our approach as an investigative device for auditing, quantifying, and understanding human decision-making behavior. \ No newline at end of file diff --git a/data/2021/iclr/Explaining the Efficacy of Counterfactually Augmented Data b/data/2021/iclr/Explaining the Efficacy of Counterfactually Augmented Data new file mode 100644 index 0000000000..6ebb12d459 --- /dev/null +++ b/data/2021/iclr/Explaining the Efficacy of Counterfactually Augmented Data @@ -0,0 +1 @@ +In attempts to produce machine learning models less reliant on spurious patterns in training data, researchers have recently proposed a human-in-the-loop process for generating counterfactually augmented datasets. As applied in NLP, given some documents and their (initial) labels, humans are tasked with revising the text to make a (given) counterfactual label applicable. Importantly, the instructions prohibit edits that are not necessary to flip the applicable label. Models trained on the augmented (original and revised) data have been shown to rely less on semantically irrelevant words and to generalize better out-of-domain. While this work draws on causal thinking, casting edits as interventions and relying on human understanding to assess outcomes, the underlying causal model is not clear nor are the principles underlying the observed improvements in out-of-domain evaluation. In this paper, we explore a toy analog, using linear Gaussian models. Our analysis reveals interesting relationships between causal models, measurement noise, out-of-domain generalization, and reliance on spurious signals. Interestingly our analysis suggests that data corrupted by adding noise to causal features will degrade out-of-domain performance, while noise added to non-causal features may make models more robust out-of-domain. 
This analysis yields interesting insights that help to explain the efficacy of counterfactually augmented data. Finally, we present a large-scale empirical study that supports this hypothesis. \ No newline at end of file diff --git a/data/2021/iclr/Exploring Balanced Feature Spaces for Representation Learning b/data/2021/iclr/Exploring Balanced Feature Spaces for Representation Learning new file mode 100644 index 0000000000..109f8cc83d --- /dev/null +++ b/data/2021/iclr/Exploring Balanced Feature Spaces for Representation Learning @@ -0,0 +1 @@ +Existing self-supervised learning (SSL) methods are mostly applied for training representation models from artificially balanced datasets ( e.g . ImageNet). It is unclear how well they will perform in the practical scenarios where datasets are often imbalanced w.r.t. the classes. Motivated by this question, we conduct a series of studies on the performance of self-supervised contrastive learning and supervised learning methods over multiple datasets where training instance distributions vary from a balanced one to a long-tailed one. Our findings are quite intriguing. Different from supervised methods with large performance drop, the self-supervised contrastive learning methods perform stably well even when the datasets are heavily imbalanced. This motivates us to explore the balanced feature spaces learned by contrastive learning, where the feature representations present similar linear separability w.r.t. all the classes. Our further experiments reveal that a representation model generating a balanced feature space can generalize better than that yielding an imbalanced one across multiple settings. Inspired by these insights, we develop a novel representation learning method, called k -positive contrastive learning. It effectively combines strengths of the supervised method and the contrastive learning method to learn representations that are both discriminative and balanced. 
Extensive experiments demonstrate its superiority on multiple recognition tasks, including both long-tailed ones and normal balanced ones. Code is available at https://github.com/bingykang/BalFeat . \ No newline at end of file diff --git a/data/2021/iclr/Exploring the Uncertainty Properties of Neural Networks' Implicit Priors in the Infinite-Width Limit b/data/2021/iclr/Exploring the Uncertainty Properties of Neural Networks' Implicit Priors in the Infinite-Width Limit new file mode 100644 index 0000000000..f25260fd4a --- /dev/null +++ b/data/2021/iclr/Exploring the Uncertainty Properties of Neural Networks' Implicit Priors in the Infinite-Width Limit @@ -0,0 +1 @@ +Modern deep learning models have achieved great success in predictive accuracy for many data modalities. However, their application to many real-world tasks is restricted by poor uncertainty estimates, such as overconfidence on out-of-distribution (OOD) data and ungraceful failing under distributional shift. Previous benchmarks have found that ensembles of neural networks (NNs) are typically the best calibrated models on OOD data. Inspired by this, we leverage recent theoretical advances that characterize the function-space prior of an ensemble of infinitely-wide NNs as a Gaussian process, termed the neural network Gaussian process (NNGP). We use the NNGP with a softmax link function to build a probabilistic model for multi-class classification and marginalize over the latent Gaussian outputs to sample from the posterior. This gives us a better understanding of the implicit prior NNs place on function space and allows a direct comparison of the calibration of the NNGP and its finite-width analogue. We also examine the calibration of previous approaches to classification with the NNGP, which treat classification problems as regression to the one-hot labels. In this case the Bayesian posterior is exact, and we compare several heuristics to generate a categorical distribution over classes. 
We find that these methods are well calibrated under distributional shift. Finally, we consider an infinite-width final layer in conjunction with a pre-trained embedding. This replicates the important practical use case of transfer learning and allows scaling to significantly larger datasets. As well as achieving competitive predictive accuracy, this approach is better calibrated than its finite-width analogue. \ No newline at end of file diff --git a/data/2021/iclr/Expressive Power of Invariant and Equivariant Graph Neural Networks b/data/2021/iclr/Expressive Power of Invariant and Equivariant Graph Neural Networks new file mode 100644 index 0000000000..3136aee157 --- /dev/null +++ b/data/2021/iclr/Expressive Power of Invariant and Equivariant Graph Neural Networks @@ -0,0 +1 @@ +Various classes of Graph Neural Networks (GNN) have been proposed and shown to be successful in a wide range of applications with graph-structured data. In this paper, we propose a theoretical framework able to compare the expressive power of these GNN architectures. The current universality theorems only apply to intractable classes of GNNs. Here, we prove the first approximation guarantees for practical GNNs, paving the way for a better understanding of their generalization. Our theoretical results are proved for invariant GNNs computing a graph embedding (permutation of the nodes of the input graph does not affect the output) and equivariant GNNs computing an embedding of the nodes (permutation of the input permutes the output). We show that Folklore Graph Neural Networks (FGNN), which are tensor-based GNNs augmented with matrix multiplication, are the most expressive architectures proposed so far for a given tensor order.
We illustrate our results on the Quadratic Assignment Problem (an NP-hard combinatorial problem) by showing that FGNNs are able to learn how to solve the problem, leading to much better average performance than existing algorithms (based on spectral, SDP or other GNN architectures). On the practical side, we also implement masked tensors to handle batches of graphs of varying sizes. \ No newline at end of file diff --git a/data/2021/iclr/Extracting Strong Policies for Robotics Tasks from Zero-Order Trajectory Optimizers b/data/2021/iclr/Extracting Strong Policies for Robotics Tasks from Zero-Order Trajectory Optimizers new file mode 100644 index 0000000000..63e2afa809 --- /dev/null +++ b/data/2021/iclr/Extracting Strong Policies for Robotics Tasks from Zero-Order Trajectory Optimizers @@ -0,0 +1 @@ +Solving high-dimensional, continuous robotic tasks is a challenging optimization problem. Model-based methods that rely on zero-order optimizers like the cross-entropy method (CEM) have so far shown strong performance and are considered state-of-the-art in the model-based reinforcement learning community. However, this success comes at the cost of high computational complexity, making it unsuitable for real-time control. In this paper, we propose a technique to jointly optimize the trajectory and distill a policy, which is essential for fast execution in real robotic systems. Our method builds upon standard approaches, like guidance cost and dataset aggregation, and introduces a novel adaptive factor which prevents the optimizer from collapsing to the learner’s behavior at the beginning of the training. The extracted policies reach unprecedented performance on challenging tasks like making a humanoid stand up and opening a door without reward shaping. Figure 1: Environments and exemplary behaviors of the learned policy using APEX. From left to right: FETCH PICK&PLACE (sparse reward), DOOR (sparse reward), and HUMANOID STANDUP.
\ No newline at end of file diff --git a/data/2021/iclr/Extreme Memorization via Scale of Initialization b/data/2021/iclr/Extreme Memorization via Scale of Initialization new file mode 100644 index 0000000000..7d01471856 --- /dev/null +++ b/data/2021/iclr/Extreme Memorization via Scale of Initialization @@ -0,0 +1 @@ +We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD, interpolating from good generalization performance to completely memorizing the training set while making little progress on the test set. Moreover, we find that the extent and manner in which generalization ability is affected depends on the activation and loss function used, with $\sin$ activation being the most extreme. In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function. Our empirical investigation reveals that increasing the scale of initialization could cause the representations and gradients to be increasingly misaligned across examples in the same class. We further demonstrate that a similar misalignment phenomenon occurs in other scenarios affecting generalization performance, such as changes to the architecture or data distribution. 
\ No newline at end of file diff --git a/data/2021/iclr/FOCAL: Efficient Fully-Offline Meta-Reinforcement Learning via Distance Metric Learning and Behavior Regularization b/data/2021/iclr/FOCAL: Efficient Fully-Offline Meta-Reinforcement Learning via Distance Metric Learning and Behavior Regularization new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Factorizing Declarative and Procedural Knowledge in Structured, Dynamical Environments b/data/2021/iclr/Factorizing Declarative and Procedural Knowledge in Structured, Dynamical Environments new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Fair Mixup: Fairness via Interpolation b/data/2021/iclr/Fair Mixup: Fairness via Interpolation new file mode 100644 index 0000000000..e8db8c4c02 --- /dev/null +++ b/data/2021/iclr/Fair Mixup: Fairness via Interpolation @@ -0,0 +1 @@ +Training classifiers under fairness constraints, such as group fairness, regularizes the disparities of predictions between the groups. Nevertheless, even though the constraints are satisfied during training, they might not generalize at evaluation time. To improve the generalizability of fair classifiers, we propose fair mixup, a new data augmentation strategy for imposing the fairness constraint. In particular, we show that fairness can be achieved by regularizing the models on paths of interpolated samples between the groups. We use mixup, a powerful data augmentation strategy, to generate these interpolates. We analyze fair mixup and empirically show that it ensures better generalization for both accuracy and fairness measures in tabular, vision, and language benchmarks.
\ No newline at end of file diff --git a/data/2021/iclr/FairBatch: Batch Selection for Model Fairness b/data/2021/iclr/FairBatch: Batch Selection for Model Fairness new file mode 100644 index 0000000000..c8d1f7431c --- /dev/null +++ b/data/2021/iclr/FairBatch: Batch Selection for Model Fairness @@ -0,0 +1 @@ +Training a fair machine learning model is essential to prevent demographic disparity. Existing techniques for improving model fairness require broad changes in either data preprocessing or model training, rendering them difficult to adopt for potentially already complex machine learning systems. We address this problem through the lens of bilevel optimization. While keeping the standard training algorithm as an inner optimizer, we incorporate an outer optimizer so as to equip the inner problem with an additional functionality: adaptively selecting minibatch sizes for the purpose of improving model fairness. Our batch selection algorithm, which we call FairBatch, implements this optimization and supports prominent fairness measures: equal opportunity, equalized odds, and demographic parity. FairBatch comes with a significant implementation benefit -- it does not require any modification to data preprocessing or model training. For instance, a single-line change of PyTorch code replacing the batch selection part of model training suffices to employ FairBatch. Our experiments conducted both on synthetic and benchmark real data demonstrate that FairBatch can provide such functionalities while achieving comparable (or even better) performance relative to the state of the art. Furthermore, FairBatch can readily improve the fairness of any pre-trained model simply via fine-tuning. It is also compatible with existing batch selection techniques intended for different purposes, such as faster convergence, thus gracefully achieving multiple purposes.
\ No newline at end of file diff --git a/data/2021/iclr/FairFil: Contrastive Neural Debiasing Method for Pretrained Text Encoders b/data/2021/iclr/FairFil: Contrastive Neural Debiasing Method for Pretrained Text Encoders new file mode 100644 index 0000000000..1a6eaa1b84 --- /dev/null +++ b/data/2021/iclr/FairFil: Contrastive Neural Debiasing Method for Pretrained Text Encoders @@ -0,0 +1 @@ +Pretrained text encoders, such as BERT, have been applied increasingly in various natural language processing (NLP) tasks, and have recently demonstrated significant performance gains. However, recent studies have demonstrated the existence of social bias in these pretrained NLP models. Although prior works have made progress on word-level debiasing, sentence-level fairness of pretrained encoders remains underexplored. In this paper, we propose the first neural debiasing method for a pretrained sentence encoder, which transforms the pretrained encoder outputs into debiased representations via a fair filter (FairFil) network. To learn the FairFil, we introduce a contrastive learning framework that not only minimizes the correlation between filtered embeddings and bias words but also preserves rich semantic information of the original sentences. On real-world datasets, our FairFil effectively reduces the bias degree of pretrained text encoders, while consistently maintaining desirable performance on downstream tasks. Moreover, our post-hoc method does not require any retraining of the text encoders, further enlarging FairFil's application space.
\ No newline at end of file diff --git a/data/2021/iclr/Fantastic Four: Differentiable and Efficient Bounds on Singular Values of Convolution Layers b/data/2021/iclr/Fantastic Four: Differentiable and Efficient Bounds on Singular Values of Convolution Layers new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Fast And Slow Learning Of Recurrent Independent Mechanisms b/data/2021/iclr/Fast And Slow Learning Of Recurrent Independent Mechanisms new file mode 100644 index 0000000000..59c53f357b --- /dev/null +++ b/data/2021/iclr/Fast And Slow Learning Of Recurrent Independent Mechanisms @@ -0,0 +1 @@ +Decomposing knowledge into interchangeable pieces promises a generalization advantage when there are changes in distribution. A learning agent interacting with its environment is likely to be faced with situations requiring novel combinations of existing pieces of knowledge. We hypothesize that such a decomposition of knowledge is particularly relevant for being able to generalize in a systematic manner to out-of-distribution changes. To study these ideas, we propose a particular training framework in which we assume that the pieces of knowledge an agent needs and its reward function are stationary and can be re-used across tasks. An attention mechanism dynamically selects which modules can be adapted to the current task, and the parameters of the selected modules are allowed to change quickly as the learner is confronted with variations in what it experiences, while the parameters of the attention mechanisms act as stable, slowly changing, meta-parameters. We focus on pieces of knowledge captured by an ensemble of modules sparsely communicating with each other via a bottleneck of attention. We find that meta-learning the modular aspects of the proposed system greatly helps in achieving faster adaptation in a reinforcement learning setup involving navigation in a partially observed grid world with image-level input. 
We also find that reversing the role of parameters and meta-parameters does not work nearly as well, suggesting a particular role for fast adaptation of the dynamically selected modules. \ No newline at end of file diff --git a/data/2021/iclr/Fast Geometric Projections for Local Robustness Certification b/data/2021/iclr/Fast Geometric Projections for Local Robustness Certification new file mode 100644 index 0000000000..edf81cf345 --- /dev/null +++ b/data/2021/iclr/Fast Geometric Projections for Local Robustness Certification @@ -0,0 +1 @@ +Local robustness ensures that a model classifies all inputs within an $\epsilon$-ball consistently, which precludes various forms of adversarial inputs. In this paper, we present a fast procedure for checking local robustness in feed-forward neural networks with piecewise linear activation functions. The key insight is that such networks partition the input space into a polyhedral complex such that the network is linear inside each polyhedral region; hence, a systematic search for decision boundaries within the regions around a given input is sufficient for assessing robustness. Crucially, we show how these regions can be analyzed using geometric projections instead of expensive constraint solving, thus admitting an efficient, highly-parallel GPU implementation at the price of incompleteness, which can be addressed by falling back on prior approaches. Empirically, we find that incompleteness is not often an issue, and that our method performs one to two orders of magnitude faster than existing robustness-certification techniques based on constraint solving. 
\ No newline at end of file diff --git a/data/2021/iclr/Fast and Complete: Enabling Complete Neural Network Verification with Rapid and Massively Parallel Incomplete Verifiers b/data/2021/iclr/Fast and Complete: Enabling Complete Neural Network Verification with Rapid and Massively Parallel Incomplete Verifiers new file mode 100644 index 0000000000..b78e4950c3 --- /dev/null +++ b/data/2021/iclr/Fast and Complete: Enabling Complete Neural Network Verification with Rapid and Massively Parallel Incomplete Verifiers @@ -0,0 +1 @@ +Formal verification of neural networks (NNs) is a challenging and important problem. Existing efficient complete solvers typically require the branch-and-bound (BaB) process, which splits the problem domain into sub-domains and solves each sub-domain using faster but weaker incomplete verifiers, such as Linear Programming (LP) on linearly relaxed sub-domains. In this paper, we propose to use the backward mode linear relaxation based perturbation analysis (LiRPA) to replace LP during the BaB process, which can be efficiently implemented on typical machine learning accelerators such as GPUs and TPUs. However, unlike LP, LiRPA, when applied naively, can produce much weaker bounds and cannot even check certain conflicts of sub-domains during splitting, making the entire procedure incomplete after BaB. To address these challenges, we apply a fast gradient-based bound tightening procedure combined with batch splits and a design that minimizes usage of the LP bound procedure, enabling us to effectively use LiRPA on the accelerator hardware for the challenging complete NN verification problem and significantly outperform LP-based approaches. On a single GPU, we demonstrate an order of magnitude speedup compared to existing LP-based approaches.
\ No newline at end of file diff --git a/data/2021/iclr/Fast convergence of stochastic subgradient method under interpolation b/data/2021/iclr/Fast convergence of stochastic subgradient method under interpolation new file mode 100644 index 0000000000..83c06c059d --- /dev/null +++ b/data/2021/iclr/Fast convergence of stochastic subgradient method under interpolation @@ -0,0 +1 @@ +This paper studies the behaviour of the stochastic subgradient descent (SSGD) method applied to over-parameterized nonsmooth optimization problems that satisfy an interpolation condition. By leveraging the composite structure of the empirical risk minimization problems, we prove that SSGD converges, respectively, with rates $O(1/\epsilon)$ and $O(\log(1/\epsilon))$ for convex and strongly-convex objectives when interpolation holds. These rates coincide with established rates for the stochastic gradient descent (SGD) method applied to smooth problems that also satisfy an interpolation condition. Our analysis provides a partial explanation for the empirical observation that sometimes SGD and SSGD behave similarly for training smooth and nonsmooth machine learning models. We also prove that the rate $O(1/\epsilon)$ is optimal for the subgradient method in the convex and interpolation setting. \ No newline at end of file diff --git a/data/2021/iclr/FastSpeech 2: Fast and High-Quality End-to-End Text to Speech b/data/2021/iclr/FastSpeech 2: Fast and High-Quality End-to-End Text to Speech new file mode 100644 index 0000000000..518ce67e92 --- /dev/null +++ b/data/2021/iclr/FastSpeech 2: Fast and High-Quality End-to-End Text to Speech @@ -0,0 +1 @@ +Advanced text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality.
The training of the FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated, 2) the duration extracted from the teacher model is not accurate enough, and the target mel-spectrograms distilled from the teacher model suffer from information loss due to data simplification, both of which limit the voice quality. In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with the ground-truth target instead of the simplified output from the teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. Specifically, we extract duration, pitch and energy from the speech waveform and directly take them as conditional inputs during training and use predicted values during inference. We further design FastSpeech 2s, which is the first attempt to directly generate speech waveform from text in parallel, enjoying the benefit of full end-to-end training and even faster inference than FastSpeech. Experimental results show that 1) FastSpeech 2 and 2s outperform FastSpeech in voice quality with a much simplified training pipeline and reduced training time; 2) FastSpeech 2 and 2s can match the voice quality of autoregressive models while enjoying much faster inference speed.
\ No newline at end of file diff --git a/data/2021/iclr/Faster Binary Embeddings for Preserving Euclidean Distances b/data/2021/iclr/Faster Binary Embeddings for Preserving Euclidean Distances new file mode 100644 index 0000000000..588520121e --- /dev/null +++ b/data/2021/iclr/Faster Binary Embeddings for Preserving Euclidean Distances @@ -0,0 +1 @@ +We propose a fast, distance-preserving, binary embedding algorithm to transform a high-dimensional dataset $\mathcal{T}\subseteq\mathbb{R}^n$ into binary sequences in the cube $\{\pm 1\}^m$. When $\mathcal{T}$ consists of well-spread (i.e., non-sparse) vectors, our embedding method applies a stable noise-shaping quantization scheme to $A x$ where $A\in\mathbb{R}^{m\times n}$ is a sparse Gaussian random matrix. This contrasts with most binary embedding methods, which usually use $x\mapsto \mathrm{sign}(Ax)$ for the embedding. Moreover, we show that Euclidean distances among the elements of $\mathcal{T}$ are approximated by the $\ell_1$ norm on the images of $\{\pm 1\}^m$ under a fast linear transformation. This again contrasts with standard methods, where the Hamming distance is used instead. Our method is both fast and memory efficient, with time complexity $O(m)$ and space complexity $O(m)$. Further, we prove that the method is accurate and its associated error is comparable to that of a continuous valued Johnson-Lindenstrauss embedding plus a quantization error that admits a polynomial decay as the embedding dimension $m$ increases. Thus the length of the binary codes required to achieve a desired accuracy is quite small, and we show it can even be compressed further without compromising the accuracy. To illustrate our results, we test the proposed method on natural images and show that it achieves strong performance. 
\ No newline at end of file diff --git a/data/2021/iclr/FedBE: Making Bayesian Model Ensemble Applicable to Federated Learning b/data/2021/iclr/FedBE: Making Bayesian Model Ensemble Applicable to Federated Learning new file mode 100644 index 0000000000..7a8aafc7d0 --- /dev/null +++ b/data/2021/iclr/FedBE: Making Bayesian Model Ensemble Applicable to Federated Learning @@ -0,0 +1 @@ +Federated learning aims to collaboratively train a strong global model by accessing users' locally trained models but not their own data. A crucial step is therefore to aggregate local models into a global model, which has been shown to be challenging when users have non-i.i.d. data. In this paper, we propose a novel aggregation algorithm named FedBE, which takes a Bayesian inference perspective by sampling higher-quality global models and combining them via Bayesian model Ensemble, leading to much more robust aggregation. We show that an effective model distribution can be constructed by simply fitting a Gaussian or Dirichlet distribution to the local models. Our empirical studies validate FedBE's superior performance, especially when users' data are not i.i.d. and when the neural networks go deeper. Moreover, FedBE is compatible with recent efforts in regularizing users' model training, making it an easily applicable module: you only need to replace the aggregation method but leave other parts of your federated learning algorithm intact. Our code is publicly available at https://github.com/hongyouc/FedBE.
\ No newline at end of file diff --git a/data/2021/iclr/FedBN: Federated Learning on Non-IID Features via Local Batch Normalization b/data/2021/iclr/FedBN: Federated Learning on Non-IID Features via Local Batch Normalization new file mode 100644 index 0000000000..22ce36bc75 --- /dev/null +++ b/data/2021/iclr/FedBN: Federated Learning on Non-IID Features via Local Batch Normalization @@ -0,0 +1 @@ +The emerging paradigm of federated learning (FL) strives to enable collaborative training of deep models on the network edge without centrally aggregating raw data, thereby improving data privacy. In most cases, the assumption of independent and identically distributed samples across local clients does not hold for federated learning setups. Under this setting, neural network training performance may vary significantly according to the data distribution and even hurt training convergence. Most of the previous work has focused on a difference in the distribution of labels or client shifts. Unlike those settings, we address an important problem of FL in which local clients store examples whose distributions differ from those of other clients (e.g., different scanners/sensors in medical imaging, or different scenery distributions in autonomous driving, highway vs. city), which we denote as feature shift non-iid. In this work, we propose an effective method that uses local batch normalization to alleviate the feature shift before averaging models. The resulting scheme, called FedBN, outperforms both classical FedAvg and the state-of-the-art for non-iid data (FedProx) in our extensive experiments. These empirical results are supported by a convergence analysis that shows, in a simplified setting, that FedBN has a faster convergence rate than FedAvg. Code is available at https://github.com/med-air/FedBN.
\ No newline at end of file diff --git a/data/2021/iclr/FedMix: Approximation of Mixup under Mean Augmented Federated Learning b/data/2021/iclr/FedMix: Approximation of Mixup under Mean Augmented Federated Learning new file mode 100644 index 0000000000..c72ec84bc9 --- /dev/null +++ b/data/2021/iclr/FedMix: Approximation of Mixup under Mean Augmented Federated Learning @@ -0,0 +1 @@ +Federated learning (FL) allows edge devices to collectively learn a model without directly sharing data within each device, thus preserving privacy and eliminating the need to store data globally. While there are promising results under the assumption of independent and identically distributed (iid) local data, current state-of-the-art algorithms suffer from performance degradation as the heterogeneity of local data across clients increases. To resolve this issue, we propose a simple framework, Mean Augmented Federated Learning (MAFL), where clients send and receive averaged local data, subject to the privacy requirements of target applications. Under our framework, we propose a new augmentation algorithm, named FedMix, which is inspired by a phenomenal yet simple data augmentation method, Mixup, but does not require local raw data to be directly shared among devices. Our method shows greatly improved performance in the standard benchmark datasets of FL, under highly non-iid federated settings, compared to conventional algorithms. \ No newline at end of file diff --git a/data/2021/iclr/Federated Learning Based on Dynamic Regularization b/data/2021/iclr/Federated Learning Based on Dynamic Regularization new file mode 100644 index 0000000000..3d13c97ca3 --- /dev/null +++ b/data/2021/iclr/Federated Learning Based on Dynamic Regularization @@ -0,0 +1 @@ +We propose a novel federated learning method for distributively training neural network models, where the server orchestrates cooperation between a subset of randomly chosen devices in each round. 
We view the federated learning problem primarily from a communication perspective and allow more device-level computation to save transmission costs. We point out a fundamental dilemma, in that the minima of the local-device level empirical loss are inconsistent with those of the global empirical loss. Different from recent prior works that either attempt inexact minimization or utilize devices for parallelizing gradient computation, we propose a dynamic regularizer for each device at each round, so that in the limit the global and device solutions are aligned. We demonstrate, both through empirical results on real and synthetic data as well as analytical results, that our scheme leads to efficient training, in both convex and non-convex settings, while being fully agnostic to device heterogeneity and robust to a large number of devices, partial participation, and unbalanced data. \ No newline at end of file diff --git a/data/2021/iclr/Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms b/data/2021/iclr/Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms new file mode 100644 index 0000000000..5efcaa495f --- /dev/null +++ b/data/2021/iclr/Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms @@ -0,0 +1 @@ +Federated learning is typically approached as an optimization problem, where the goal is to minimize a global loss function by distributing computation across client devices that possess local data and specify different parts of the global objective. We present an alternative perspective and formulate federated learning as a posterior inference problem, where the goal is to infer a global posterior distribution by having client devices each infer the posterior of their local data. While exact inference is often intractable, this perspective provides a principled way to search for global optima in federated settings.
Further, starting with the analysis of federated quadratic objectives, we develop a computation- and communication-efficient approximate posterior inference algorithm -- federated posterior averaging (FedPA). Our algorithm uses MCMC for approximate inference of local posteriors on the clients and efficiently communicates their statistics to the server, where the latter uses them to refine a global estimate of the posterior mode. Finally, we show that FedPA generalizes federated averaging (FedAvg), can similarly benefit from adaptive optimizers, and yields state-of-the-art results on four realistic and challenging benchmarks, converging faster to better optima. \ No newline at end of file diff --git a/data/2021/iclr/Federated Semi-Supervised Learning with Inter-Client Consistency & Disjoint Learning b/data/2021/iclr/Federated Semi-Supervised Learning with Inter-Client Consistency & Disjoint Learning new file mode 100644 index 0000000000..94c57ca265 --- /dev/null +++ b/data/2021/iclr/Federated Semi-Supervised Learning with Inter-Client Consistency & Disjoint Learning @@ -0,0 +1 @@ +While existing federated learning approaches mostly require that clients have fully-labeled data to train on, in realistic settings, data obtained at the client side often comes without any accompanying labels. Such a deficiency of labels may result from either high labeling cost or the difficulty of annotation due to the required expert knowledge. Thus the private data at each client may be only partly labeled, or completely unlabeled with labeled data being available only at the server, which leads us to a new problem of Federated Semi-Supervised Learning (FSSL). In this work, we study this new problem of semi-supervised learning under the federated learning framework, and propose a novel method to tackle it, which we refer to as Federated Matching (FedMatch).
FedMatch improves upon naive federated semi-supervised learning approaches with a new inter-client consistency loss and decomposition of the parameters into parameters for labeled and unlabeled data. Through extensive experimental validation of our method in two different scenarios, we show that our method outperforms both local semi-supervised learning and baselines which naively combine federated learning with semi-supervised learning. \ No newline at end of file diff --git a/data/2021/iclr/Few-Shot Bayesian Optimization with Deep Kernel Surrogates b/data/2021/iclr/Few-Shot Bayesian Optimization with Deep Kernel Surrogates new file mode 100644 index 0000000000..d67a619ddd --- /dev/null +++ b/data/2021/iclr/Few-Shot Bayesian Optimization with Deep Kernel Surrogates @@ -0,0 +1 @@ +Hyperparameter optimization (HPO) is a central pillar in the automation of machine learning solutions and is mainly performed via Bayesian optimization, where a parametric surrogate is learned to approximate the black box response function (e.g. validation error). Unfortunately, evaluating the response function is computationally intensive. As a remedy, earlier work emphasizes the need for transfer learning surrogates which learn to optimize hyperparameters for an algorithm from other tasks. In contrast to previous work, we propose to rethink HPO as a few-shot learning problem in which we train a shared deep surrogate model to quickly adapt (with few response evaluations) to the response function of a new task. We propose the use of a deep kernel network for a Gaussian process surrogate that is meta-learned in an end-to-end fashion in order to jointly approximate the response functions of a collection of training data sets. As a result, the novel few-shot optimization of our deep kernel surrogate leads to new state-of-the-art results at HPO compared to several recent methods on diverse metadata sets. 
\ No newline at end of file diff --git a/data/2021/iclr/Few-Shot Learning via Learning the Representation, Provably b/data/2021/iclr/Few-Shot Learning via Learning the Representation, Provably new file mode 100644 index 0000000000..df6b20c87b --- /dev/null +++ b/data/2021/iclr/Few-Shot Learning via Learning the Representation, Provably @@ -0,0 +1 @@ +This paper studies few-shot learning via representation learning, where one uses $T$ source tasks with $n_1$ data per task to learn a representation in order to reduce the sample complexity of a target task for which there is only $n_2 (\ll n_1)$ data. Specifically, we focus on the setting where there exists a good \emph{common representation} between source and target, and our goal is to understand how much of a sample size reduction is possible. First, we study the setting where this common representation is low-dimensional and provide a fast rate of $O\left(\frac{\mathcal{C}\left(\Phi\right)}{n_1T} + \frac{k}{n_2}\right)$; here, $\Phi$ is the representation function class, $\mathcal{C}\left(\Phi\right)$ is its complexity measure, and $k$ is the dimension of the representation. When specialized to linear representation functions, this rate becomes $O\left(\frac{dk}{n_1T} + \frac{k}{n_2}\right)$ where $d (\gg k)$ is the ambient input dimension, which is a substantial improvement over the rate without using representation learning, i.e. over the rate of $O\left(\frac{d}{n_2}\right)$. Second, we consider the setting where the common representation may be high-dimensional but is capacity-constrained (say in norm); here, we again demonstrate the advantage of representation learning in both high-dimensional linear regression and neural network learning. Our results demonstrate representation learning can fully utilize all $n_1T$ samples from source tasks. 
\ No newline at end of file diff --git a/data/2021/iclr/Fidelity-based Deep Adiabatic Scheduling b/data/2021/iclr/Fidelity-based Deep Adiabatic Scheduling new file mode 100644 index 0000000000..6dd519af42 --- /dev/null +++ b/data/2021/iclr/Fidelity-based Deep Adiabatic Scheduling @@ -0,0 +1 @@ +Adiabatic quantum computation is a form of computation that acts by slowly interpolating a quantum system between an easy-to-prepare initial state and a final state that represents a solution to a given computational problem. The choice of the interpolation schedule is critical to the performance: if, at a certain time point, the evolution is too rapid, the system has a high probability of transferring to a higher energy state, which does not represent a solution to the problem. On the other hand, an evolution that is too slow leads to a loss of computation time and increases the probability of failure due to decoherence. In this work, we train deep neural models to produce optimal schedules that are conditioned on the problem at hand. We consider two types of problem representation: the Hamiltonian form \ No newline at end of file diff --git a/data/2021/iclr/Filtered Inner Product Projection for Crosslingual Embedding Alignment b/data/2021/iclr/Filtered Inner Product Projection for Crosslingual Embedding Alignment new file mode 100644 index 0000000000..cd44469eff --- /dev/null +++ b/data/2021/iclr/Filtered Inner Product Projection for Crosslingual Embedding Alignment @@ -0,0 +1 @@ +Due to widespread interest in machine translation and transfer learning, there are numerous algorithms for mapping multiple embeddings to a shared representation space. Recently, these algorithms have been studied in the setting of bilingual lexicon induction where one seeks to align the embeddings of a source and a target language such that translated word pairs lie close to one another in a common representation space.
In this paper, we propose a method, Filtered Inner Product Projection (FIPP), for mapping embeddings to a common representation space. As semantic shifts are pervasive across languages and domains, FIPP first identifies the common geometric structure in both embeddings and then, only on the common structure, aligns the Gram matrices of these embeddings. FIPP aligns embeddings to isomorphic vector spaces even when the source and target embeddings are of differing dimensionalities. Additionally, FIPP provides computational benefits in ease of implementation and is faster to compute than current approaches. Following the baselines in Glavaš et al. (2019), we evaluate FIPP in the context of bilingual lexicon induction and downstream language tasks. We show that FIPP outperforms existing methods on the XLING (5K) BLI dataset and the XLING (1K) BLI dataset, when using a self-learning approach, while also providing robust performance across downstream tasks. \ No newline at end of file diff --git a/data/2021/iclr/Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis b/data/2021/iclr/Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis new file mode 100644 index 0000000000..5a2bc4966c --- /dev/null +++ b/data/2021/iclr/Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis @@ -0,0 +1 @@ +In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer. Flowtron borrows insights from IAF and revamps Tacotron in order to provide high-quality and expressive mel-spectrogram synthesis. Flowtron is optimized by maximizing the likelihood of the training data, which makes training simple and stable. Flowtron learns an invertible mapping of data to a latent space that can be manipulated to control many aspects of speech synthesis (pitch, tone, speech rate, cadence, accent). 
Our mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech quality. In addition, we provide results on control of speech variation, interpolation between samples and style transfer between speakers seen and unseen during training. Code and pre-trained models will be made publicly available at this https URL \ No newline at end of file diff --git a/data/2021/iclr/Fooling a Complete Neural Network Verifier b/data/2021/iclr/Fooling a Complete Neural Network Verifier new file mode 100644 index 0000000000..068b52175e --- /dev/null +++ b/data/2021/iclr/Fooling a Complete Neural Network Verifier @@ -0,0 +1 @@ +The efficient and accurate characterization of the robustness of neural networks to input perturbation is an important open problem. Many approaches exist including heuristic and exact (or complete) methods. Complete methods are expensive but their mathematical formulation guarantees that they provide exact robustness metrics. However, this guarantee is valid only if we assume that the verified network applies arbitrary-precision arithmetic and the verifier is reliable. In practice, however, both the networks and the verifiers apply limited-precision floating point arithmetic. In this paper, we show that numerical roundoff errors can be exploited to craft adversarial networks, in which the actual robustness and the robustness computed by a state-of-the-art complete verifier radically differ. We also show that such adversarial networks can be used to insert a backdoor into any network in such a way that the backdoor is completely missed by the verifier. The attack is easy to detect in its naive form but, as we show, the adversarial network can be transformed to make its detection less trivial. We offer a simple defense against our particular attack based on adding a very small perturbation to the network weights. 
However, our conjecture is that other numerical attacks are possible, and exact verification has to take into account all the details of the computation executed by the verified networks, which makes the problem significantly harder. \ No newline at end of file diff --git a/data/2021/iclr/For self-supervised learning, Rationality implies generalization, provably b/data/2021/iclr/For self-supervised learning, Rationality implies generalization, provably new file mode 100644 index 0000000000..8605259506 --- /dev/null +++ b/data/2021/iclr/For self-supervised learning, Rationality implies generalization, provably @@ -0,0 +1 @@ +We prove a new upper bound on the generalization gap of classifiers that are obtained by first using self-supervision to learn a representation $r$ of the training data, and then fitting a simple (e.g., linear) classifier $g$ to the labels. Specifically, we show that (under the assumptions described below) the generalization gap of such classifiers tends to zero if $\mathsf{C}(g) \ll n$, where $\mathsf{C}(g)$ is an appropriately-defined measure of the simple classifier $g$'s complexity, and $n$ is the number of training samples. We stress that our bound is independent of the complexity of the representation $r$. We do not make any structural or conditional-independence assumptions on the representation-learning task, which can use the same training dataset that is later used for classification. Rather, we assume that the training procedure satisfies certain natural noise-robustness (adding small amount of label noise causes small degradation in performance) and rationality (getting the wrong label is not better than getting no label at all) conditions that widely hold across many standard architectures. We show that our bound is non-vacuous for many popular representation-learning based classifiers on CIFAR-10 and ImageNet, including SimCLR, AMDIM and MoCo. 
\ No newline at end of file diff --git a/data/2021/iclr/Fourier Neural Operator for Parametric Partial Differential Equations b/data/2021/iclr/Fourier Neural Operator for Parametric Partial Differential Equations new file mode 100644 index 0000000000..65cecc1b44 --- /dev/null +++ b/data/2021/iclr/Fourier Neural Operator for Parametric Partial Differential Equations @@ -0,0 +1 @@ +The classical development of neural networks has primarily focused on learning mappings between finite-dimensional Euclidean spaces. Recently, this has been generalized to neural operators that learn mappings between function spaces. For partial differential equations (PDEs), neural operators directly learn the mapping from any functional parametric dependence to the solution. Thus, they learn an entire family of PDEs, in contrast to classical methods which solve one instance of the equation. In this work, we formulate a new neural operator by parameterizing the integral kernel directly in Fourier space, allowing for an expressive and efficient architecture. We perform experiments on Burgers' equation, Darcy flow, and the Navier-Stokes equation (including the turbulent regime). Our Fourier neural operator shows state-of-the-art performance compared to existing neural network methodologies and it is up to three orders of magnitude faster compared to traditional PDE solvers. \ No newline at end of file diff --git a/data/2021/iclr/Free Lunch for Few-shot Learning: Distribution Calibration b/data/2021/iclr/Free Lunch for Few-shot Learning: Distribution Calibration new file mode 100644 index 0000000000..a72f1e26a9 --- /dev/null +++ b/data/2021/iclr/Free Lunch for Few-shot Learning: Distribution Calibration @@ -0,0 +1 @@ +Learning from a limited number of samples is challenging since the learned model can easily become overfitted based on the biased distribution formed by only a few training examples. 
In this paper, we calibrate the distribution of these few-sample classes by transferring statistics from the classes with sufficient examples; an adequate number of examples can then be sampled from the calibrated distribution to expand the inputs to the classifier. We assume every dimension in the feature representation follows a Gaussian distribution, so that the mean and the variance of the distribution can be borrowed from those of similar classes whose statistics are better estimated with an adequate number of samples. Our method can be built on top of off-the-shelf pretrained feature extractors and classification models without extra parameters. We show that a simple logistic regression classifier trained using the features sampled from our calibrated distribution can outperform state-of-the-art accuracy on two datasets (~5% improvement on miniImageNet compared to the next best). The visualization of these generated features demonstrates that our calibrated distribution is an accurate estimation. \ No newline at end of file diff --git a/data/2021/iclr/Fully Unsupervised Diversity Denoising with Convolutional Variational Autoencoders b/data/2021/iclr/Fully Unsupervised Diversity Denoising with Convolutional Variational Autoencoders new file mode 100644 index 0000000000..83e80c157e --- /dev/null +++ b/data/2021/iclr/Fully Unsupervised Diversity Denoising with Convolutional Variational Autoencoders @@ -0,0 +1 @@ +Deep Learning based methods have emerged as the indisputable leaders for virtually all image restoration tasks. Especially in the domain of microscopy images, various content-aware image restoration (CARE) approaches are now used to improve the interpretability of acquired data. Naturally, there are limitations to what can be restored in corrupted images, and like for all inverse problems, many potential solutions exist, and one of them must be chosen.
Here, we propose DIVNOISING, a denoising approach based on fully convolutional variational autoencoders (VAEs), overcoming the problem of having to choose a single solution by predicting a whole distribution of denoised images. First, we introduce a principled way of formulating the unsupervised denoising problem within the VAE framework by explicitly incorporating imaging noise models into the decoder. Our approach is fully unsupervised, only requiring noisy images and a suitable description of the imaging noise distribution. We show that such a noise model can either be measured, bootstrapped from noisy data, or co-learned during training. If desired, consensus predictions can be inferred from a set of DIVNOISING predictions, leading to competitive results with other unsupervised methods and, on occasion, even with the supervised state-of-the-art. DIVNOISING samples from the posterior enable a plethora of useful applications. We are (i) showing denoising results for 13 datasets, (ii) discussing how optical character recognition (OCR) applications can benefit from diverse predictions, and (iii) demonstrating how instance cell segmentation improves when using diverse DIVNOISING predictions. \ No newline at end of file diff --git a/data/2021/iclr/Fuzzy Tiling Activations: A Simple Approach to Learning Sparse Representations Online b/data/2021/iclr/Fuzzy Tiling Activations: A Simple Approach to Learning Sparse Representations Online new file mode 100644 index 0000000000..cdfa1c70f3 --- /dev/null +++ b/data/2021/iclr/Fuzzy Tiling Activations: A Simple Approach to Learning Sparse Representations Online @@ -0,0 +1 @@ +Recent work has shown that sparse representations—where only a small percentage of units are active—can significantly reduce interference. Those works, however, relied on relatively complex regularization or meta-learning approaches that have only been used offline in a pre-training phase.
In this work, we pursue a direction that achieves sparsity by design, rather than by learning. Specifically, we design an activation function that produces sparse representations deterministically by construction, and so is more amenable to online training. The idea relies on the simple approach of binning, but overcomes the two key limitations of binning: zero gradients for the flat regions almost everywhere, and lost precision—reduced discrimination—due to coarse aggregation. We introduce a Fuzzy Tiling Activation (FTA) that provides non-negligible gradients and produces overlap between bins that improves discrimination. We first show that FTA is robust under covariate shift in a synthetic online supervised learning problem, where we can vary the level of correlation and drift. Then we move to the deep reinforcement learning setting and investigate both value-based and policy gradient algorithms that use neural networks with FTAs, in classic discrete control and Mujoco continuous control environments. We show that algorithms equipped with FTAs are able to learn a stable policy faster without needing target networks on most domains. \ No newline at end of file diff --git "a/data/2021/iclr/GAN \"Steerability\" without optimization" "b/data/2021/iclr/GAN \"Steerability\" without optimization" new file mode 100644 index 0000000000..913149ca40 --- /dev/null +++ "b/data/2021/iclr/GAN \"Steerability\" without optimization" @@ -0,0 +1 @@ +Recent research has shown remarkable success in revealing "steering" directions in the latent spaces of pre-trained GANs. These directions correspond to semantically meaningful image transformations (e.g., shift, zoom, color manipulations), and have similar interpretable effects across all categories that the GAN can generate. Some methods focus on user-specified transformations, while others discover transformations in an unsupervised manner.
However, all existing techniques rely on an optimization procedure to expose those directions, and offer no control over the degree of allowed interaction between different transformations. In this paper, we show that "steering" trajectories can be computed in closed form directly from the generator's weights without any form of training or optimization. This applies to user-prescribed geometric transformations, as well as to unsupervised discovery of more complex effects. Our approach allows determining both linear and nonlinear trajectories, and has many advantages over previous methods. In particular, we can control whether one transformation is allowed to come at the expense of another (e.g. zoom-in with or without allowing translation to keep the object centered). Moreover, we can determine the natural end-point of the trajectory, which corresponds to the largest extent to which a transformation can be applied without incurring degradation. Finally, we show how transferring attributes between images can be achieved without optimization, even across different categories. \ No newline at end of file diff --git a/data/2021/iclr/GAN2GAN: Generative Noise Learning for Blind Denoising with Single Noisy Images b/data/2021/iclr/GAN2GAN: Generative Noise Learning for Blind Denoising with Single Noisy Images new file mode 100644 index 0000000000..129eca470c --- /dev/null +++ b/data/2021/iclr/GAN2GAN: Generative Noise Learning for Blind Denoising with Single Noisy Images @@ -0,0 +1 @@ +We tackle a challenging blind image denoising problem, in which only single distinct noisy images are available for training a denoiser, and no information about noise is known, except for it being zero-mean, additive, and independent of the clean image.
In such a setting, which often occurs in practice, it is not possible to train a denoiser with the standard discriminative training or with the recently developed Noise2Noise (N2N) training; the former requires the underlying clean image for the given noisy image, and the latter requires an independently realized pair of noisy images for each clean image. To that end, we propose the GAN2GAN (Generated-Artificial-Noise to Generated-Artificial-Noise) method, which first learns a generative model that can 1) simulate the noise in the given noisy images and 2) generate rough, noisy estimates of the clean images, then 3) iteratively trains a denoiser with subsequently synthesized noisy image pairs (as in N2N) obtained from the generative model. Our results show that the denoiser trained with GAN2GAN achieves an impressive denoising performance on both synthetic and real-world datasets for the blind denoising setting; it almost approaches the performance of the standard discriminatively-trained or N2N-trained models that have more information than ours, and it significantly outperforms the recent baseline for the same setting, e.g., Noise2Void, and a more conventional yet strong one, BM3D. The official code of our method is available at https://github.com/csm9493/GAN2GAN. \ No newline at end of file diff --git a/data/2021/iclr/GANs Can Play Lottery Tickets Too b/data/2021/iclr/GANs Can Play Lottery Tickets Too new file mode 100644 index 0000000000..13d129f264 --- /dev/null +++ b/data/2021/iclr/GANs Can Play Lottery Tickets Too @@ -0,0 +1 @@ +Deep generative adversarial networks (GANs) have gained growing popularity in numerous scenarios, while usually suffering from high parameter complexity for resource-constrained real-world applications. However, the compression of GANs has been less explored. A few works show that heuristically applying compression techniques normally leads to unsatisfactory results, due to the notorious training instability of GANs.
In parallel, the lottery ticket hypothesis has shown prevailing success on discriminative models, locating sparse matching subnetworks capable of training in isolation to full model performance. In this work, we for the first time study the existence of such trainable matching subnetworks in deep GANs. For a range of GANs, we consistently find matching subnetworks at 67%-74% sparsity. We observe that pruning the discriminator has only a minor effect on the existence and quality of matching subnetworks, whereas the initialization weights used in the discriminator play a significant role. We then show the powerful transferability of these subnetworks to unseen tasks. Furthermore, extensive experimental results demonstrate that the subnetworks we find substantially outperform previous state-of-the-art GAN compression approaches in both image generation (e.g. SNGAN) and image-to-image translation GANs (e.g. CycleGAN). Code is available at https://github.com/VITA-Group/GAN-LTH. \ No newline at end of file diff --git a/data/2021/iclr/GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding b/data/2021/iclr/GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding new file mode 100644 index 0000000000..dc8c6af331 --- /dev/null +++ b/data/2021/iclr/GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding @@ -0,0 +1 @@ +Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler.
It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art. \ No newline at end of file diff --git a/data/2021/iclr/Gauge Equivariant Mesh CNNs: Anisotropic convolutions on geometric graphs b/data/2021/iclr/Gauge Equivariant Mesh CNNs: Anisotropic convolutions on geometric graphs new file mode 100644 index 0000000000..d0a53d775d --- /dev/null +++ b/data/2021/iclr/Gauge Equivariant Mesh CNNs: Anisotropic convolutions on geometric graphs @@ -0,0 +1 @@ +A common approach to define convolutions on meshes is to interpret them as a graph and apply graph convolutional networks (GCNs). Such GCNs utilize isotropic kernels and are therefore insensitive to the relative orientation of vertices and thus to the geometry of the mesh as a whole. We propose Gauge Equivariant Mesh CNNs which generalize GCNs to apply anisotropic gauge equivariant kernels. Since the resulting features carry orientation information, we introduce a geometric message passing scheme defined by parallel transporting features over mesh edges. Our experiments validate the significantly improved expressivity of the proposed model over conventional GCNs and other methods.
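The Gauge Equivariant Mesh CNN abstract above hinges on the contrast between isotropic kernels (blind to neighbour orientation) and anisotropic, direction-aware ones. A minimal numpy sketch of that contrast follows; the direction-binning scheme and all names are our illustration, not the paper's code, and it omits the gauge-equivariance and parallel-transport machinery:

```python
import numpy as np

# Toy contrast: isotropic aggregation shares one kernel across all neighbours;
# anisotropic aggregation picks a kernel based on the direction of each edge.

rng = np.random.default_rng(0)
pos = rng.normal(size=(5, 2))              # 2-D vertex positions
feat = rng.normal(size=(5, 3))             # per-vertex input features
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
edges += [(j, i) for i, j in edges]        # make the graph symmetric

W_iso = rng.normal(size=(3, 3))            # one kernel, blind to orientation
W_dir = rng.normal(size=(4, 3, 3))         # one kernel per angular sector

def direction_bin(i, j, n_bins=4):
    """Quantise the direction of edge i -> j into one of n_bins sectors."""
    d = pos[j] - pos[i]
    angle = np.arctan2(d[1], d[0]) % (2 * np.pi)
    return int(angle // (2 * np.pi / n_bins)) % n_bins

def aggregate(anisotropic):
    out = np.zeros_like(feat)
    for i, j in edges:
        W = W_dir[direction_bin(i, j)] if anisotropic else W_iso
        out[i] += feat[j] @ W.T
    return out

iso, aniso = aggregate(False), aggregate(True)
print(iso.shape, aniso.shape)              # same shapes, different sensitivity
```

The isotropic output is unchanged under any relabelling of neighbour directions, while the anisotropic one responds to where each neighbour sits, which is the sensitivity the paper's kernels provide in a gauge-consistent way.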
\ No newline at end of file diff --git a/data/2021/iclr/Generalization bounds via distillation b/data/2021/iclr/Generalization bounds via distillation new file mode 100644 index 0000000000..178553b1c1 --- /dev/null +++ b/data/2021/iclr/Generalization bounds via distillation @@ -0,0 +1 @@ +This paper theoretically investigates the following empirical phenomenon: given a high-complexity network with poor generalization bounds, one can distill it into a network with nearly identical predictions but low complexity and vastly smaller generalization bounds. The main contribution is an analysis showing that the original network inherits this good generalization bound from its distillation, assuming the use of well-behaved data augmentation. This bound is presented both in an abstract and in a concrete form, the latter complemented by a reduction technique to handle modern computation graphs featuring convolutional layers, fully-connected layers, and skip connections, to name a few. To round out the story, a (looser) classical uniform convergence analysis of compression is also presented, as well as a variety of experiments on cifar and mnist demonstrating similar generalization performance between the original network and its distillation. \ No newline at end of file diff --git a/data/2021/iclr/Generalization in data-driven models of primary visual cortex b/data/2021/iclr/Generalization in data-driven models of primary visual cortex new file mode 100644 index 0000000000..121aaa86af --- /dev/null +++ b/data/2021/iclr/Generalization in data-driven models of primary visual cortex @@ -0,0 +1 @@ +Deep neural networks (DNN) have set new standards at predicting responses of neural populations to visual input. Most such DNNs consist of a convolutional network (core) shared across all neurons which learns a representation of neural computation in visual cortex and a neuron-specific readout that linearly combines the relevant features in this representation. 
The goal of this paper is to test whether such a representation is indeed generally characteristic for visual cortex, i.e. generalizes between animals of a species, and what factors contribute to obtaining such a generalizing core. To push all non-linear computations into the core where the generalizing cortical features should be learned, we devise a novel readout that reduces the number of parameters per neuron in the readout by up to two orders of magnitude compared to the previous state-of-the-art. It does so by taking advantage of retinotopy and learns a Gaussian distribution over the neuron’s receptive field position. With this new readout we train our network on neural responses from mouse primary visual cortex (V1) and obtain a gain in performance of 7% compared to the previous state-of-the-art network. We then investigate whether the convolutional core indeed captures general cortical features by using the core in transfer learning to a different animal. When transferring a core trained on thousands of neurons from various animals and scans we exceed the performance of training directly on that animal by 12%, and outperform a commonly used VGG16 core pre-trained on imagenet by 33%. In addition, transfer learning with our data-driven core is more data-efficient than direct training, achieving the same performance with only 40% of the data. Our model with its novel readout thus sets a new state-of-the-art for neural response prediction in mouse visual cortex from natural images, generalizes between animals, and captures better characteristic cortical features than current task-driven pre-training approaches such as VGG16. 
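The parameter saving behind the retinotopic readout described above can be made concrete with a toy sketch: each neuron keeps only a 2-D receptive-field position plus one weight per feature channel, instead of a full spatial weight map. All names and shapes below are invented for illustration, and the paper's readout learns a Gaussian distribution over positions rather than the single point used here:

```python
import numpy as np

# Read a shared convolutional core at one learned retinotopic position per
# neuron, then combine channels with per-neuron feature weights.

rng = np.random.default_rng(0)
C, H, W, N = 8, 16, 16, 4                 # channels, spatial dims, neurons
core = rng.normal(size=(C, H, W))         # output of a shared conv core
mu = rng.uniform(-1.0, 1.0, size=(N, 2))  # per-neuron RF centre in [-1, 1]^2
w = rng.normal(size=(N, C))               # per-neuron feature weights

def bilinear_read(fmap, xy):
    """Read the C-dim feature vector at a continuous (x, y) location."""
    x = (xy[0] + 1) / 2 * (W - 1)
    y = (xy[1] + 1) / 2 * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * fmap[:, y0, x0] + fx * fmap[:, y0, x1]
    bot = (1 - fx) * fmap[:, y1, x0] + fx * fmap[:, y1, x1]
    return (1 - fy) * top + fy * bot

responses = np.array([w[n] @ bilinear_read(core, mu[n]) for n in range(N)])

# Per-neuron parameters: 2 + C here versus C * H * W for a dense spatial
# readout -- the orders-of-magnitude reduction the abstract refers to.
dense, sparse = C * H * W, 2 + C
print(responses.shape, dense // sparse)
```

Even at these toy sizes the dense readout needs over 200 times more parameters per neuron, which is why the shared core carries almost all of the learned computation.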
\ No newline at end of file diff --git a/data/2021/iclr/Generalized Energy Based Models b/data/2021/iclr/Generalized Energy Based Models new file mode 100644 index 0000000000..ae423985a6 --- /dev/null +++ b/data/2021/iclr/Generalized Energy Based Models @@ -0,0 +1 @@ +We introduce the Generalized Energy Based Model (GEBM) for generative modelling. These models combine two trained components: a base distribution (generally an implicit model), which can learn the support of data with low intrinsic dimension in a high dimensional space; and an energy function, to refine the probability mass on the learned support. Both the energy function and base jointly constitute the final model, unlike GANs, which retain only the base distribution (the "generator"). GEBMs are trained by alternating between learning the energy and the base. We show that both training stages are well-defined: the energy is learned by maximising a generalized likelihood, and the resulting energy-based loss provides informative gradients for learning the base. Samples from the posterior on the latent space of the trained model can be obtained via MCMC, thus finding regions in this space that produce better quality samples. Empirically, the GEBM samples on image-generation tasks are of much better quality than those from the learned generator alone, indicating that all else being equal, the GEBM will outperform a GAN of the same complexity. GEBMs also return state-of-the-art performance on density modelling tasks when using base measures with an explicit form.
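The GEBM sampling step above (MCMC on the latent space guided by the energy) can be caricatured in a few lines. The generator, energy, step size, and temperature below are invented stand-ins; in particular the energy here is hand-crafted rather than learned, and the paper's samplers are more sophisticated than plain Langevin dynamics:

```python
import numpy as np

# Keep a fixed base "generator" g(z) and refine latents by Langevin dynamics
# under an energy E(g(z)), so samples concentrate where the energy is low.

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2))

def g(z):                                  # base generator: a linear map
    return A @ z

def energy(x):                             # low energy near the unit circle
    return (np.linalg.norm(x) - 1.0) ** 2

def latent_energy(z):
    return energy(g(z))

def langevin(z, steps=200, lr=0.05, temp=0.01):
    for _ in range(steps):
        grad = np.zeros_like(z)            # finite-difference grad of E(g(z))
        for k in range(z.size):
            e = np.zeros_like(z)
            e[k] = 1e-4
            grad[k] = (latent_energy(z + e) - latent_energy(z - e)) / 2e-4
        z = z - lr * grad + np.sqrt(2 * lr * temp) * rng.normal(size=z.shape)
    return z

z0 = rng.normal(size=2)
z_star = langevin(z0.copy())
# Refined latents map to samples near the energy's preferred region.
print(latent_energy(z0), latent_energy(z_star))
```

The base alone would scatter samples according to its own (here Gaussian) latent prior; the energy-guided chain pulls them onto the learned support's high-quality regions, which is the mechanism behind the quality gap the abstract reports.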
\ No newline at end of file diff --git a/data/2021/iclr/Generalized Multimodal ELBO b/data/2021/iclr/Generalized Multimodal ELBO new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Generalized Variational Continual Learning b/data/2021/iclr/Generalized Variational Continual Learning new file mode 100644 index 0000000000..88d67fcba1 --- /dev/null +++ b/data/2021/iclr/Generalized Variational Continual Learning @@ -0,0 +1 @@ +Continual learning deals with training models on new tasks and datasets in an online fashion. One strand of research has used probabilistic regularization for continual learning, with two of the main approaches in this vein being Online Elastic Weight Consolidation (Online EWC) and Variational Continual Learning (VCL). VCL employs variational inference, which in other settings has been improved empirically by applying likelihood-tempering. We show that applying this modification to VCL recovers Online EWC as a limiting case, allowing for interpolation between the two approaches. We term the general algorithm Generalized VCL (GVCL). In order to mitigate the observed overpruning effect of VI, we take inspiration from a common multi-task architecture, neural networks with task-specific FiLM layers, and find that this addition leads to significant performance gains, specifically for variational methods. In the small-data regime, GVCL strongly outperforms existing baselines. In larger datasets, GVCL with FiLM layers outperforms or is competitive with existing baselines in terms of accuracy, whilst also providing significantly better calibration. 
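The likelihood-tempering that GVCL builds on can be sketched for diagonal Gaussians: a scalar beta rescales the KL term of the ELBO, interpolating between full VCL (beta = 1) and an EWC-like heavily-downweighted-KL regime (beta -> 0). The function names and toy numbers below are ours, and this omits the FiLM layers and the full GVCL derivation:

```python
import numpy as np

# Tempered ELBO for diagonal Gaussians: expected log-likelihood minus a
# beta-scaled KL to the previous task's posterior (used as the new prior).

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between diagonal Gaussians."""
    return 0.5 * np.sum(var_q / var_p + (mu_q - mu_p) ** 2 / var_p
                        - 1.0 + np.log(var_p / var_q))

def tempered_elbo(exp_log_lik, mu_q, var_q, mu_p, var_p, beta):
    return exp_log_lik - beta * kl_diag_gauss(mu_q, var_q, mu_p, var_p)

mu_p, var_p = np.zeros(3), np.ones(3)      # posterior carried over from task t-1
mu_q, var_q = np.array([0.5, -0.2, 0.1]), np.full(3, 0.8)
for beta in (1.0, 0.1):
    print(beta, tempered_elbo(-1.2, mu_q, var_q, mu_p, var_p, beta))
```

Shrinking beta loosens the pull toward the old posterior, trading remembering for plasticity; GVCL's analysis shows the beta -> 0 limit recovers Online EWC's quadratic penalty.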
\ No newline at end of file diff --git a/data/2021/iclr/Generating Adversarial Computer Programs using Optimized Obfuscations b/data/2021/iclr/Generating Adversarial Computer Programs using Optimized Obfuscations new file mode 100644 index 0000000000..4a561df496 --- /dev/null +++ b/data/2021/iclr/Generating Adversarial Computer Programs using Optimized Obfuscations @@ -0,0 +1 @@ +Machine learning (ML) models that learn and predict properties of computer programs are increasingly being adopted and deployed. These models have demonstrated success in applications such as auto-completing code, summarizing large programs, and detecting bugs and malware in programs. In this work, we investigate principled ways to adversarially perturb a computer program to fool such learned models, and thus determine their adversarial robustness. We use program obfuscations, which have conventionally been used to avoid attempts at reverse engineering programs, as adversarial perturbations. These perturbations modify programs in ways that do not alter their functionality but can be crafted to deceive an ML model when making a decision. We provide a general formulation for an adversarial program that allows applying multiple obfuscation transformations to a program in any language. We develop first-order optimization algorithms to efficiently determine two key aspects -- which parts of the program to transform, and what transformations to use. We show that it is important to optimize both these aspects to generate the best adversarially perturbed program. Due to the discrete nature of this problem, we also propose using randomized smoothing to improve the attack loss landscape to ease optimization. We evaluate our work on Python and Java programs on the problem of program summarization. We show that our best attack proposal achieves a 52% improvement over a state-of-the-art attack generation approach for programs trained on a seq2seq model.
We further show that our formulation is better at training models that are robust to adversarial attacks. \ No newline at end of file diff --git a/data/2021/iclr/Generating Furry Cars: Disentangling Object Shape and Appearance across Multiple Domains b/data/2021/iclr/Generating Furry Cars: Disentangling Object Shape and Appearance across Multiple Domains new file mode 100644 index 0000000000..af99e33f79 --- /dev/null +++ b/data/2021/iclr/Generating Furry Cars: Disentangling Object Shape and Appearance across Multiple Domains @@ -0,0 +1 @@ +• University of California, Davis (Fall, 2015 – Spring, 2020) PhD in Computer Science GPA: 3.93 Advisor: Prof. Yong Jae Lee • Robotics Institute, Carnegie Mellon University, USA (August 2013 – December 2014) Masters in Robotics QPA: 4.05 Advisors: Prof. Alexei Efros, Prof. Kayvon Fatahalian • International Institute of Information Technology (IIIT), Hyderabad, India (August 2009 – May 2013) B.Tech ( Honours ) in Computer Science and Engineering GPA: 9.07/10 Advisor: Prof. P. J. Narayanan \ No newline at end of file diff --git a/data/2021/iclr/Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule b/data/2021/iclr/Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule new file mode 100644 index 0000000000..6936c42ded --- /dev/null +++ b/data/2021/iclr/Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule @@ -0,0 +1 @@ +Vision-and-language navigation (VLN) is a task in which an agent is embodied in a realistic 3D environment and follows an instruction to reach the goal node. While most of the previous studies have built and investigated a discriminative approach, we notice that there are in fact two possible approaches to building such a VLN agent: discriminative \textit{and} generative. 
In this paper, we design and investigate a generative language-grounded policy which uses a language model to compute the distribution over all possible instructions, i.e., all possible sequences of vocabulary tokens, given the action and transition history. In experiments, we show that the proposed generative approach outperforms the discriminative approach in the Room-2-Room (R2R) and Room-4-Room (R4R) datasets, especially in unseen environments. We further show that the combination of the generative and discriminative policies achieves close to state-of-the-art results in the R2R dataset, demonstrating that the generative and discriminative policies capture different aspects of VLN. \ No newline at end of file diff --git a/data/2021/iclr/Generative Scene Graph Networks b/data/2021/iclr/Generative Scene Graph Networks new file mode 100644 index 0000000000..ba0deb2489 --- /dev/null +++ b/data/2021/iclr/Generative Scene Graph Networks @@ -0,0 +1 @@ +Human perception excels at building compositional hierarchies of parts and objects from unlabeled scenes that help systematic generalization. Yet most work on generative scene modeling either ignores the part-whole relationship or assumes access to predefined part labels. In this paper, we propose Generative Scene Graph Networks (GSGNs), the first deep generative model that learns to discover the primitive parts and infer the part-whole relationship jointly from multi-object scenes without supervision and in an end-to-end trainable way. We formulate GSGN as a variational autoencoder in which the latent representation is a tree-structured probabilistic scene graph. The leaf nodes in the latent tree correspond to primitive parts, and the edges represent the symbolic pose variables required for recursively composing the parts into whole objects and then the full scene.
This allows novel objects and scenes to be generated both by sampling from the prior and by manual configuration of the pose variables, as we do with graphics engines. We evaluate GSGN on datasets of scenes containing multiple compositional objects, including a challenging Compositional CLEVR dataset that we have developed. We show that GSGN is able to infer the latent scene graph, generalize out of the training regime, and improve data efficiency in downstream tasks. \ No newline at end of file diff --git a/data/2021/iclr/Generative Time-series Modeling with Fourier Flows b/data/2021/iclr/Generative Time-series Modeling with Fourier Flows new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Genetic Soft Updates for Policy Evolution in Deep Reinforcement Learning b/data/2021/iclr/Genetic Soft Updates for Policy Evolution in Deep Reinforcement Learning new file mode 100644 index 0000000000..15e21e49fa --- /dev/null +++ b/data/2021/iclr/Genetic Soft Updates for Policy Evolution in Deep Reinforcement Learning @@ -0,0 +1 @@ +The combination of Evolutionary Algorithms (EAs) and Deep Reinforcement Learning (DRL) has been recently proposed to merge the benefits of both solutions. Existing mixed approaches, however, have been successfully applied only to actor-critic methods and present significant overhead. We address these issues by introducing a novel mixed framework that exploits a periodical genetic evaluation to soft update the weights of a DRL agent. The resulting approach is applicable with any DRL method and, in a worst-case scenario, it does not exhibit detrimental behaviours. Experiments in robotic applications and continuous control benchmarks demonstrate the versatility of our approach that significantly outperforms prior DRL, EAs, and mixed approaches. 
Finally, we employ formal verification to confirm the policy improvement, mitigating the inefficient exploration and hyper-parameter sensitivity of DRL. \ No newline at end of file diff --git a/data/2021/iclr/Geometry-Aware Gradient Algorithms for Neural Architecture Search b/data/2021/iclr/Geometry-Aware Gradient Algorithms for Neural Architecture Search new file mode 100644 index 0000000000..189c383b2c --- /dev/null +++ b/data/2021/iclr/Geometry-Aware Gradient Algorithms for Neural Architecture Search @@ -0,0 +1 @@ +Recent state-of-the-art methods for neural architecture search (NAS) exploit gradient-based optimization by relaxing the problem into continuous optimization over architectures and shared-weights, a noisy process that remains poorly understood. We argue for the study of single-level empirical risk minimization to understand NAS with weight-sharing, reducing the design of NAS methods to devising optimizers and regularizers that can quickly obtain high-quality solutions to this problem. Invoking the theory of mirror descent, we present a geometry-aware framework that exploits the underlying structure of this optimization to return sparse architectural parameters, leading to simple yet novel algorithms that enjoy fast convergence guarantees and achieve state-of-the-art accuracy on the latest NAS benchmarks in computer vision. Notably, we exceed the best published results for both CIFAR and ImageNet on both the DARTS search space and NAS-Bench-201; on the latter we achieve near-oracle-optimal performance on CIFAR-10 and CIFAR-100. Together, our theory and experiments demonstrate a principled way to co-design optimizers and continuous relaxations of discrete NAS search spaces.
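The mirror-descent view in the NAS abstract above has a simple concrete instance: with an entropic mirror map, mirror descent on simplex-constrained architecture weights becomes a multiplicative (exponentiated-gradient) update, which drives the weights toward a sparse solution. The names and numbers below are our toy illustration, not the paper's algorithm:

```python
import numpy as np

# Exponentiated-gradient step: mirror descent with the entropic regulariser
# on the probability simplex over candidate operations.

def exp_grad_step(theta, grad, lr):
    theta = theta * np.exp(-lr * grad)
    return theta / theta.sum()             # re-normalise onto the simplex

theta = np.full(4, 0.25)                   # uniform over 4 candidate operations
grad = np.array([1.0, 0.5, -0.5, -1.0])    # the last op has the best loss slope
for _ in range(50):
    theta = exp_grad_step(theta, grad, lr=0.1)
print(theta.round(3))                      # mass concentrates on the last op
```

Compared with additive SGD followed by a softmax, the multiplicative update concentrates mass on few operations quickly, which is the sparsity property the geometry-aware framework exploits.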
\ No newline at end of file diff --git a/data/2021/iclr/Geometry-aware Instance-reweighted Adversarial Training b/data/2021/iclr/Geometry-aware Instance-reweighted Adversarial Training new file mode 100644 index 0000000000..7522f7503f --- /dev/null +++ b/data/2021/iclr/Geometry-aware Instance-reweighted Adversarial Training @@ -0,0 +1 @@ +In adversarial machine learning, there was a common belief that robustness and accuracy hurt each other. The belief was challenged by recent studies where we can maintain the robustness and improve the accuracy. However, the other direction, whether we can keep the accuracy while improving the robustness, is conceptually and practically more interesting, since robust accuracy should be lower than standard accuracy for any model. In this paper, we show this direction is also promising. Firstly, we find even over-parameterized deep networks may still have insufficient model capacity, because adversarial training has an overwhelming smoothing effect. Secondly, given limited model capacity, we argue adversarial data should have unequal importance: geometrically speaking, a natural data point closer to/farther from the class boundary is less/more robust, and the corresponding adversarial data point should be assigned with larger/smaller weight. Finally, to implement the idea, we propose geometry-aware instance-reweighted adversarial training, where the weights are based on how difficult it is to attack a natural data point. Experiments show that our proposal boosts the robustness of standard adversarial training; combining two directions, we improve both robustness and accuracy of standard adversarial training. 
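The geometric reweighting idea above can be sketched with a toy rule: an example that a fixed-step attack flips in fewer steps is treated as closer to the class boundary (less robust) and receives a larger training weight. Both the pretend attack and the weighting function below are stand-ins we invented; the paper derives the step count from PGD on a real network and uses its own weighting function:

```python
import numpy as np

# Map a distance-to-boundary proxy to an attack-step count kappa, then to a
# weight that is larger for easier-to-attack (small-kappa) examples.

def attack_steps_to_flip(margin, step_size=0.1, max_steps=10):
    """Pretend attack: count fixed-size steps needed to erode a positive margin."""
    steps = int(np.ceil(margin / step_size))
    return min(max(steps, 0), max_steps)

def instance_weight(kappa, max_steps=10):
    """Smaller kappa (easier to attack) -> larger weight, in (0, 1]."""
    return (1 + max_steps - kappa) / (1 + max_steps)

margins = [0.05, 0.35, 0.95]               # proxies for distance to the boundary
kappas = [attack_steps_to_flip(m) for m in margins]
weights = [instance_weight(k) for k in kappas]
print(kappas, [round(w, 2) for w in weights])
```

Weighting the adversarial loss this way spends the limited model capacity on the borderline examples, which is the mechanism behind the robustness boost the abstract reports.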
\ No newline at end of file diff --git a/data/2021/iclr/Getting a CLUE: A Method for Explaining Uncertainty Estimates b/data/2021/iclr/Getting a CLUE: A Method for Explaining Uncertainty Estimates new file mode 100644 index 0000000000..6146f7bf5b --- /dev/null +++ b/data/2021/iclr/Getting a CLUE: A Method for Explaining Uncertainty Estimates @@ -0,0 +1 @@ +Both uncertainty estimation and interpretability are important factors for trustworthy machine learning systems. However, there is little work at the intersection of these two areas. We address this gap by proposing a novel method for interpreting uncertainty estimates from differentiable probabilistic models, like Bayesian Neural Networks (BNNs). Our method, Counterfactual Latent Uncertainty Explanations (CLUE), indicates how to change an input, while keeping it on the data manifold, such that a BNN becomes more confident about the input's prediction. We validate CLUE through 1) a novel framework for evaluating counterfactual explanations of uncertainty, 2) a series of ablation experiments, and 3) a user study. Our experiments show that CLUE outperforms baselines and enables practitioners to better understand which input patterns are responsible for predictive uncertainty. \ No newline at end of file diff --git a/data/2021/iclr/Global Convergence of Three-layer Neural Networks in the Mean Field Regime b/data/2021/iclr/Global Convergence of Three-layer Neural Networks in the Mean Field Regime new file mode 100644 index 0000000000..44a703af1a --- /dev/null +++ b/data/2021/iclr/Global Convergence of Three-layer Neural Networks in the Mean Field Regime @@ -0,0 +1 @@ +In the mean field regime, neural networks are appropriately scaled so that as the width tends to infinity, the learning dynamics tends to a nonlinear and nontrivial dynamical limit, known as the mean field limit. This lends a way to study large-width neural networks via analyzing the mean field limit. 
Recent works have successfully applied such analysis to two-layer networks and provided global convergence guarantees. The extension to multilayer networks, however, has been a highly challenging puzzle, and little is known about the optimization efficiency in the mean field regime when there are more than two layers. In this work, we prove a global convergence result for unregularized feedforward three-layer networks in the mean field regime. We first develop a rigorous framework to establish the mean field limit of three-layer networks under stochastic gradient descent training. To that end, we propose the idea of a neuronal embedding, which comprises a fixed probability space that encapsulates neural networks of arbitrary sizes. The identified mean field limit is then used to prove a global convergence guarantee under suitable regularity and convergence mode assumptions, which -- unlike previous works on two-layer networks -- does not rely critically on convexity. Underlying the result is a universal approximation property, natural to neural networks, which importantly is shown to hold at any finite training time (not necessarily at convergence) via an algebraic topology argument. \ No newline at end of file diff --git a/data/2021/iclr/Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime b/data/2021/iclr/Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime new file mode 100644 index 0000000000..98224ca692 --- /dev/null +++ b/data/2021/iclr/Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime @@ -0,0 +1 @@ +We study the problem of policy optimization for infinite-horizon discounted Markov Decision Processes with softmax policy and nonlinear function approximation trained with policy gradient algorithms.
We concentrate on the training dynamics in the mean-field regime, modeling e.g., the behavior of wide single hidden layer neural networks, when exploration is encouraged through entropy regularization. The dynamics of these models is established as a Wasserstein gradient flow of distributions in parameter space. We further prove global optimality of the fixed points of this dynamics under mild conditions on their initialization. \ No newline at end of file diff --git a/data/2021/iclr/Go with the flow: Adaptive control for Neural ODEs b/data/2021/iclr/Go with the flow: Adaptive control for Neural ODEs new file mode 100644 index 0000000000..7b67b8dd3e --- /dev/null +++ b/data/2021/iclr/Go with the flow: Adaptive control for Neural ODEs @@ -0,0 +1 @@ +Despite their elegant formulation and lightweight memory cost, neural ordinary differential equations (NODEs) suffer from known representational limitations. In particular, the single flow learned by NODEs cannot express all homeomorphisms from a given data space to itself, and their static weight parametrization restricts the type of functions they can learn compared to discrete architectures with layer-dependent weights. Here, we describe a new module called neurally-controlled ODE (N-CODE) designed to improve the expressivity of NODEs. The parameters of N-CODE modules are dynamic variables governed by a trainable map from initial or current activation state, resulting in forms of open-loop and closed-loop control, respectively. A single module is sufficient for learning a distribution on non-autonomous flows that adaptively drive neural representations. We provide theoretical and empirical evidence that N-CODE circumvents limitations of previous models and show how increased model expressivity manifests in several domains. In supervised learning, we demonstrate that our framework achieves better performance than NODEs as measured by both training speed and testing accuracy. 
In unsupervised learning, we apply this control perspective to an image autoencoder endowed with a latent transformation flow, greatly improving representational power over a vanilla model and leading to state-of-the-art image reconstruction on CIFAR-10. \ No newline at end of file diff --git a/data/2021/iclr/GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing b/data/2021/iclr/GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing new file mode 100644 index 0000000000..ea9b1edefc --- /dev/null +++ b/data/2021/iclr/GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing @@ -0,0 +1 @@ +We present GraPPa, an effective pre-training approach for table semantic parsing that learns a compositional inductive bias in the joint representations of textual and tabular data. We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar (SCFG) induced from existing text-to-SQL datasets. We pre-train our model on the synthetic data using a novel text-schema linking objective that predicts the syntactic role of a table field in the SQL for each question-SQL pair. To maintain the model's ability to represent real-world data, we also include masked language modeling (MLM) over several existing table-and-language datasets to regularize the pre-training process. On four popular fully supervised and weakly supervised table semantic parsing benchmarks, GraPPa significantly outperforms RoBERTa-large when used as the feature representation layer and establishes new state-of-the-art results on all of them.
\ No newline at end of file diff --git a/data/2021/iclr/Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability b/data/2021/iclr/Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability new file mode 100644 index 0000000000..a2b1138762 --- /dev/null +++ b/data/2021/iclr/Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability @@ -0,0 +1 @@ +We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2 / \text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to whether these presumptions are relevant to neural network training. We hope that our findings will inspire future efforts aimed at rigorously understanding optimization at the Edge of Stability. Code is available at https://github.com/locuslab/edge-of-stability. \ No newline at end of file diff --git a/data/2021/iclr/Gradient Projection Memory for Continual Learning b/data/2021/iclr/Gradient Projection Memory for Continual Learning new file mode 100644 index 0000000000..d8c71ad6c4 --- /dev/null +++ b/data/2021/iclr/Gradient Projection Memory for Continual Learning @@ -0,0 +1 @@ +The ability to learn continually without forgetting the past tasks is a desired attribute for artificial learning systems. Existing approaches to enable such learning in artificial neural networks usually rely on network growth, importance based weight update or replay of old data from the memory. 
In contrast, we propose a novel approach where a neural network learns new tasks by taking gradient steps in the orthogonal direction to the gradient subspaces deemed important for the past tasks. We find the bases of these subspaces by analyzing network representations (activations) after learning each task with Singular Value Decomposition (SVD) in a single-shot manner and store them in the memory as Gradient Projection Memory (GPM). With qualitative and quantitative analyses, we show that such orthogonal gradient descent induces minimum to no interference with the past tasks, thereby mitigating forgetting. We evaluate our algorithm on diverse image classification datasets with short and long sequences of tasks and report better or on-par performance compared to the state-of-the-art approaches. \ No newline at end of file diff --git a/data/2021/iclr/Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models b/data/2021/iclr/Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models new file mode 100644 index 0000000000..0b4922a513 --- /dev/null +++ b/data/2021/iclr/Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models @@ -0,0 +1 @@ +Massively multilingual models subsuming tens or even hundreds of languages pose great challenges to multi-task optimization. While it is a common practice to apply a language-agnostic procedure optimizing a joint multilingual task objective, how to properly characterize and take advantage of its underlying problem structure for improving optimization efficiency remains under-explored. In this paper, we attempt to peek into the black-box of multilingual optimization through the lens of loss function geometry. We find that gradient similarity measured along the optimization trajectory is an important signal, which correlates well with not only language proximity but also the overall model performance.
This observation helps us identify a critical limitation of existing gradient-based multi-task learning methods, and thus we derive a simple and scalable optimization procedure, named Gradient Vaccine, which encourages more geometrically aligned parameter updates for close tasks. Empirically, our method obtains significant model performance gains on multilingual machine translation and XTREME benchmark tasks for multilingual language models. Our work reveals the importance of properly measuring and utilizing language proximity in multilingual optimization, and has broader implications for multi-task learning beyond multilingual modeling. \ No newline at end of file diff --git a/data/2021/iclr/Graph Coarsening with Neural Networks b/data/2021/iclr/Graph Coarsening with Neural Networks new file mode 100644 index 0000000000..e51c79f0a5 --- /dev/null +++ b/data/2021/iclr/Graph Coarsening with Neural Networks @@ -0,0 +1 @@ +As large-scale graphs become increasingly prevalent, processing, extracting, and analyzing large graph data poses significant computational challenges. Graph coarsening is one popular technique to reduce the size of a graph while maintaining essential properties. Despite rich graph coarsening literature, there is only limited exploration of data-driven methods in the field. In this work, we leverage the recent progress of deep learning on graphs for graph coarsening. We first propose a framework for measuring the quality of a coarsening algorithm and show that, depending on the goal, we need to carefully choose the Laplace operator on the coarse graph and associated projection/lift operators. Motivated by the observation that the current choice of edge weight for the coarse graph may be sub-optimal, we parametrize the weight assignment map with graph neural networks and train it to improve the coarsening quality in an unsupervised way.
Through extensive experiments on both synthetic and real networks, we demonstrate that our method significantly improves common graph coarsening methods under various metrics, reduction ratios, graph sizes, and graph types. It generalizes to graphs of larger size ($25\times$ of training graphs), is adaptive to different losses (differentiable and non-differentiable), and scales to much larger graphs than previous work. \ No newline at end of file diff --git a/data/2021/iclr/Graph Convolution with Low-rank Learnable Local Filters b/data/2021/iclr/Graph Convolution with Low-rank Learnable Local Filters new file mode 100644 index 0000000000..0f14045340 --- /dev/null +++ b/data/2021/iclr/Graph Convolution with Low-rank Learnable Local Filters @@ -0,0 +1 @@ +Geometric variations like rotation, scaling, and viewpoint changes pose a significant challenge to visual understanding. One common solution is to directly model certain intrinsic structures, e.g., using landmarks. However, it then becomes non-trivial to build effective deep models, especially when the underlying non-Euclidean grid is irregular and coarse. Recent deep models using graph convolutions provide an appropriate framework to handle such non-Euclidean data, but many of them, particularly those based on global graph Laplacians, lack expressiveness to capture local features required for representation of signals lying on the non-Euclidean grid. The current paper introduces a new type of graph convolution with learnable low-rank local filters, which is provably more expressive than previous spectral graph convolution methods. The model also provides a unified framework for both spectral and spatial graph convolutions. To improve model robustness, regularization by local graph Laplacians is introduced. The representation stability against input graph data perturbation is theoretically proved, making use of the graph filter locality and the local graph regularization. 
Experiments on spherical mesh data, real-world facial expression recognition/skeleton-based action recognition data, and data with simulated graph noise show the empirical advantage of the proposed model. \ No newline at end of file diff --git a/data/2021/iclr/Graph Edit Networks b/data/2021/iclr/Graph Edit Networks new file mode 100644 index 0000000000..2e65efe2a1 --- /dev/null +++ b/data/2021/iclr/Graph Edit Networks @@ -0,0 +1 @@ +a \ No newline at end of file diff --git a/data/2021/iclr/Graph Information Bottleneck for Subgraph Recognition b/data/2021/iclr/Graph Information Bottleneck for Subgraph Recognition new file mode 100644 index 0000000000..5be22746bc --- /dev/null +++ b/data/2021/iclr/Graph Information Bottleneck for Subgraph Recognition @@ -0,0 +1 @@ +Given the input graph and its label/property, several key problems of graph learning, such as finding interpretable subgraphs, graph denoising and graph compression, can be attributed to the fundamental problem of recognizing a subgraph of the original one. This subgraph shall be as informative as possible, yet contain less redundant and noisy structure. This problem setting is closely related to the well-known information bottleneck (IB) principle, which, however, has been less studied for irregular graph data and graph neural networks (GNNs). In this paper, we propose a framework of Graph Information Bottleneck (GIB) for the subgraph recognition problem in deep graph learning. Under this framework, one can recognize the maximally informative yet compressive subgraph, named IB-subgraph. However, the GIB objective is notoriously hard to optimize, mostly due to the intractability of the mutual information of irregular graph data and the unstable optimization process.
In order to tackle these challenges, we propose: i) a GIB objective based on a mutual information estimator for the irregular graph data; ii) a bi-level optimization scheme to maximize the GIB objective; iii) a connectivity loss to stabilize the optimization process. We evaluate the properties of the IB-subgraph in three application scenarios: improvement of graph classification, graph interpretation and graph denoising. Extensive experiments demonstrate that the information-theoretic IB-subgraph enjoys superior graph properties. \ No newline at end of file diff --git a/data/2021/iclr/Graph Traversal with Tensor Functionals: A Meta-Algorithm for Scalable Learning b/data/2021/iclr/Graph Traversal with Tensor Functionals: A Meta-Algorithm for Scalable Learning new file mode 100644 index 0000000000..92e4d31834 --- /dev/null +++ b/data/2021/iclr/Graph Traversal with Tensor Functionals: A Meta-Algorithm for Scalable Learning @@ -0,0 +1 @@ +Graph Representation Learning (GRL) methods have impacted fields from chemistry to social science. However, their algorithmic implementations are specialized to specific use-cases, e.g., message passing methods are run differently from node embedding ones. Despite their apparent differences, all these methods utilize the graph structure, and therefore, their learning can be approximated with stochastic graph traversals. We propose Graph Traversal via Tensor Functionals (GTTF), a unifying meta-algorithm framework for easing the implementation of diverse graph algorithms and enabling transparent and efficient scaling to large graphs. GTTF is founded upon a data structure (stored as a sparse tensor) and a stochastic graph traversal algorithm (described using tensor operations). The algorithm is a functional that accepts two functions, and can be specialized to obtain a variety of GRL models and objectives, simply by changing those two functions.
We show that, for a wide class of methods, our algorithm learns in an unbiased fashion and, in expectation, approximates the learning as if the specialized implementations were run directly. With these capabilities, we scale otherwise non-scalable methods to set state-of-the-art on large graph datasets while being more efficient than existing GRL libraries - with only a handful of lines of code for each method specialization. GTTF and its various GRL implementations are available at: https://github.com/isi-usc-edu/gttf. \ No newline at end of file diff --git a/data/2021/iclr/Graph-Based Continual Learning b/data/2021/iclr/Graph-Based Continual Learning new file mode 100644 index 0000000000..9647235861 --- /dev/null +++ b/data/2021/iclr/Graph-Based Continual Learning @@ -0,0 +1 @@ +Despite significant advances, continual learning models still suffer from catastrophic forgetting when exposed to incrementally available data from non-stationary distributions. Rehearsal approaches alleviate the problem by maintaining and replaying a small episodic memory of previous samples, often implemented as an array of independent memory slots. In this work, we propose to augment such an array with a learnable random graph that captures pairwise similarities between its samples, and use it not only to learn new tasks but also to guard against forgetting. Empirical results on several benchmark datasets show that our model consistently outperforms recently proposed baselines for task-free continual learning.
\ No newline at end of file diff --git a/data/2021/iclr/GraphCodeBERT: Pre-training Code Representations with Data Flow b/data/2021/iclr/GraphCodeBERT: Pre-training Code Representations with Data Flow new file mode 100644 index 0000000000..d1fa0896b0 --- /dev/null +++ b/data/2021/iclr/GraphCodeBERT: Pre-training Code Representations with Data Flow @@ -0,0 +1 @@ +Pre-trained models for programming languages have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code. Instead of taking syntactic-level structure of code like abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables. Such a semantic-level structure is neat and does not bring an unnecessarily deep hierarchy of AST, the property of which makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to using the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model in an efficient way with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement. Results show that code structure and the newly introduced pre-training tasks improve GraphCodeBERT, which achieves state-of-the-art performance on the four downstream tasks.
We further show that the model prefers structure-level attentions over token-level attentions in the task of code search. \ No newline at end of file diff --git a/data/2021/iclr/Greedy-GQ with Variance Reduction: Finite-time Analysis and Improved Complexity b/data/2021/iclr/Greedy-GQ with Variance Reduction: Finite-time Analysis and Improved Complexity new file mode 100644 index 0000000000..5182e8a9fc --- /dev/null +++ b/data/2021/iclr/Greedy-GQ with Variance Reduction: Finite-time Analysis and Improved Complexity @@ -0,0 +1 @@ +Greedy-GQ is a value-based reinforcement learning (RL) algorithm for optimal control. Recently, the finite-time analysis of Greedy-GQ has been developed under linear function approximation and Markovian sampling, and the algorithm is shown to achieve an $\epsilon$-stationary point with a sample complexity in the order of $\mathcal{O}(\epsilon^{-3})$. Such a high sample complexity is due to the large variance induced by the Markovian samples. In this paper, we propose a variance-reduced Greedy-GQ (VR-Greedy-GQ) algorithm for off-policy optimal control. In particular, the algorithm applies the SVRG-based variance reduction scheme to reduce the stochastic variance of the two time-scale updates. We study the finite-time convergence of VR-Greedy-GQ under linear function approximation and Markovian sampling and show that the algorithm achieves a much smaller bias and variance error than the original Greedy-GQ. In particular, we prove that VR-Greedy-GQ achieves an improved sample complexity that is in the order of $\mathcal{O}(\epsilon^{-2})$. We further compare the performance of VR-Greedy-GQ with that of Greedy-GQ in various RL experiments to corroborate our theoretical findings. 
\ No newline at end of file diff --git a/data/2021/iclr/Grounded Language Learning Fast and Slow b/data/2021/iclr/Grounded Language Learning Fast and Slow new file mode 100644 index 0000000000..327068930c --- /dev/null +++ b/data/2021/iclr/Grounded Language Learning Fast and Slow @@ -0,0 +1 @@ +Recent work has shown that large text-based neural language models, trained with conventional supervised learning objectives, acquire a surprising propensity for few- and one-shot learning. Here, we show that an embodied agent situated in a simulated 3D world, and endowed with a novel dual-coding external memory, can exhibit similar one-shot word learning when trained with conventional reinforcement learning algorithms. After a single introduction to a novel object via continuous visual perception and a language prompt ("This is a dax"), the agent can re-identify the object and manipulate it as instructed ("Put the dax on the bed"). In doing so, it seamlessly integrates short-term, within-episode knowledge of the appropriate referent for the word "dax" with long-term lexical and motor knowledge acquired across episodes (i.e. "bed" and "putting"). We find that, under certain training conditions and with a particular memory writing mechanism, the agent's one-shot word-object binding generalizes to novel exemplars within the same ShapeNet category, and is effective in settings with unfamiliar numbers of objects. We further show how dual-coding memory can be exploited as a signal for intrinsic motivation, stimulating the agent to seek names for objects that may be useful for later executing instructions. Together, the results demonstrate that deep neural networks can exploit meta-learning, episodic memory and an explicitly multi-modal environment to account for 'fast-mapping', a fundamental pillar of human cognitive development and a potentially transformative capacity for agents that interact with human users. 
\ No newline at end of file diff --git a/data/2021/iclr/Grounding Language to Autonomously-Acquired Skills via Goal Generation b/data/2021/iclr/Grounding Language to Autonomously-Acquired Skills via Goal Generation new file mode 100644 index 0000000000..05f293c29e --- /dev/null +++ b/data/2021/iclr/Grounding Language to Autonomously-Acquired Skills via Goal Generation @@ -0,0 +1 @@ +We are interested in the autonomous acquisition of repertoires of skills. Language-conditioned reinforcement learning (LC-RL) approaches are great tools in this quest, as they allow abstract goals to be expressed as sets of constraints on the states. However, most LC-RL agents are not autonomous and cannot learn without external instructions and feedback. Besides, their direct language condition cannot account for the goal-directed behavior of pre-verbal infants and strongly limits the expression of behavioral diversity for a given language input. To resolve these issues, we propose a new conceptual approach to language-conditioned RL: the Language-Goal-Behavior architecture (LGB). LGB decouples skill learning and language grounding via an intermediate semantic representation of the world. To showcase the properties of LGB, we present a specific implementation called DECSTR. DECSTR is an intrinsically motivated learning agent endowed with an innate semantic representation describing spatial relations between physical objects. In a first stage (G→B), it freely explores its environment and targets self-generated semantic configurations. In a second stage (L→G), it trains a language-conditioned goal generator to generate semantic goals that match the constraints expressed in language-based inputs. We showcase the additional properties of LGB w.r.t. both an end-to-end LC-RL approach and a similar approach leveraging non-semantic, continuous intermediate representations.
Intermediate semantic representations help satisfy language commands in a diversity of ways, enable strategy switching after a failure and facilitate language grounding. \ No newline at end of file diff --git a/data/2021/iclr/Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning b/data/2021/iclr/Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning new file mode 100644 index 0000000000..423b686820 --- /dev/null +++ b/data/2021/iclr/Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning @@ -0,0 +1 @@ +We study the problem of dynamic visual reasoning on raw videos. This is a challenging problem; currently, state-of-the-art models often require dense supervision on physical object properties and events from simulation, which are impractical to obtain in real life. In this paper, we present the Dynamic Concept Learner (DCL), a unified framework that grounds physical objects and events from video and language. DCL first adopts a trajectory extractor to track each object over time and to represent it as a latent, object-centric feature vector. Building upon this object-centric representation, DCL learns to approximate the dynamic interaction among objects using graph networks. DCL further incorporates a semantic parser to parse questions into semantic programs and, finally, a program executor to run the program to answer the question, leveraging the learned dynamics model. After training, DCL can detect and associate objects across the frames, ground visual properties and physical events, understand the causal relationship between events, make future and counterfactual predictions, and leverage these extracted representations for answering queries. DCL achieves state-of-the-art performance on CLEVRER, a challenging causal video reasoning dataset, even without using ground-truth attributes and collision labels from simulations for training.
We further test DCL on a newly proposed video-retrieval and event localization dataset derived from CLEVRER, showing its strong generalization capacity. \ No newline at end of file diff --git a/data/2021/iclr/Group Equivariant Conditional Neural Processes b/data/2021/iclr/Group Equivariant Conditional Neural Processes new file mode 100644 index 0000000000..f7d752d9d6 --- /dev/null +++ b/data/2021/iclr/Group Equivariant Conditional Neural Processes @@ -0,0 +1 @@ +We present the group equivariant conditional neural process (EquivCNP), a meta-learning method with permutation invariance in a data set, as in conventional conditional neural processes (CNPs), that also has transformation equivariance in data space. Incorporating group equivariance, such as rotation and scaling equivariance, provides a way to consider the symmetry of real-world data. We give a decomposition theorem for permutation-invariant and group-equivariant maps, which leads us to construct EquivCNPs with an infinite-dimensional latent space to handle group symmetries. In this paper, we build the architecture using Lie group convolutional layers for practical implementation. We show that EquivCNP with translation equivariance achieves comparable performance to conventional CNPs in a 1D regression task. Moreover, we demonstrate that, by incorporating an appropriate Lie group equivariance, EquivCNP is capable of zero-shot generalization for an image-completion task. \ No newline at end of file diff --git a/data/2021/iclr/Group Equivariant Generative Adversarial Networks b/data/2021/iclr/Group Equivariant Generative Adversarial Networks new file mode 100644 index 0000000000..68ead5d48e --- /dev/null +++ b/data/2021/iclr/Group Equivariant Generative Adversarial Networks @@ -0,0 +1 @@ +Generative adversarial networks are the state of the art for generative modeling in vision, yet are notoriously unstable in practice.
This instability is further exacerbated with limited training data. However, in the synthesis of domains such as medical or satellite imaging, it is often overlooked that the image label is invariant to global image symmetries (e.g., rotations and reflections). In this work, we improve gradient feedback between generator and discriminator using an inductive symmetry prior via group-equivariant convolutional networks. We replace convolutional layers with equivalent group-convolutional layers in both generator and discriminator, allowing for better optimization steps and increased expressive power with limited samples. In the process, we extend recent GAN developments to the group-equivariant setting. We demonstrate the utility of our methods by improving both sample fidelity and diversity in the class-conditional synthesis of a diverse set of globally-symmetric imaging modalities. \ No newline at end of file diff --git a/data/2021/iclr/Group Equivariant Stand-Alone Self-Attention For Vision b/data/2021/iclr/Group Equivariant Stand-Alone Self-Attention For Vision new file mode 100644 index 0000000000..916d004e37 --- /dev/null +++ b/data/2021/iclr/Group Equivariant Stand-Alone Self-Attention For Vision @@ -0,0 +1 @@ +We provide a general self-attention formulation to impose group equivariance to arbitrary symmetry groups. This is achieved by defining positional encodings that are invariant to the action of the group considered. Since the group acts on the positional encoding directly, group equivariant self-attention networks (GSA-Nets) are steerable by nature. Our experiments on vision benchmarks demonstrate consistent improvements of GSA-Nets over non-equivariant self-attention networks. 
\ No newline at end of file diff --git a/data/2021/iclr/Growing Efficient Deep Networks by Structured Continuous Sparsification b/data/2021/iclr/Growing Efficient Deep Networks by Structured Continuous Sparsification new file mode 100644 index 0000000000..de1ea3c604 --- /dev/null +++ b/data/2021/iclr/Growing Efficient Deep Networks by Structured Continuous Sparsification @@ -0,0 +1 @@ +We develop an approach to training deep networks while dynamically adjusting their architecture, driven by a principled combination of accuracy and sparsity objectives. Unlike conventional pruning approaches, our method adopts a gradual continuous relaxation of discrete network structure optimization and then samples sparse subnetworks, enabling efficient deep networks to be trained in a growing and pruning manner. Extensive experiments across CIFAR-10, ImageNet, PASCAL VOC, and Penn Treebank, with convolutional models for image classification and semantic segmentation, and recurrent models for language modeling, show that our training scheme yields efficient networks that are smaller and more accurate than those produced by competing pruning methods. \ No newline at end of file diff --git a/data/2021/iclr/HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark b/data/2021/iclr/HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark new file mode 100644 index 0000000000..1838872652 --- /dev/null +++ b/data/2021/iclr/HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark @@ -0,0 +1 @@ +HardWare-aware Neural Architecture Search (HW-NAS) has recently gained tremendous attention by automating the design of DNNs deployed in more resource-constrained daily life devices. Despite its promising performance, developing optimal HW-NAS solutions can be prohibitively challenging as it requires cross-disciplinary knowledge in the algorithm, micro-architecture, and device-specific compilation. 
First, to determine the hardware-cost to be incorporated into the NAS process, existing works mostly adopt either pre-collected hardware-cost look-up tables or device-specific hardware-cost models. Both of them limit the development of HW-NAS innovations and impose a barrier-to-entry to non-hardware experts. Second, similar to generic NAS, it can be notoriously difficult to benchmark HW-NAS algorithms due to their significant required computational resources and the differences in adopted search spaces, hyperparameters, and hardware devices. To this end, we develop HW-NAS-Bench, the first public dataset for HW-NAS research which aims to democratize HW-NAS research to non-hardware experts and make HW-NAS research more reproducible and accessible. To design HW-NAS-Bench, we carefully collected the measured/estimated hardware performance of all the networks in the search spaces of both NAS-Bench-201 and FBNet, on six hardware devices that fall into three categories (i.e., commercial edge devices, FPGA, and ASIC). Furthermore, we provide a comprehensive analysis of the collected measurements in HW-NAS-Bench to provide insights for HW-NAS research. Finally, we demonstrate exemplary user cases to (1) show that HW-NAS-Bench allows non-hardware experts to perform HW-NAS by simply querying it and (2) verify that dedicated device-specific HW-NAS can indeed lead to optimal accuracy-cost trade-offs. The codes and all collected data are available at https://github.com/RICE-EIC/HW-NAS-Bench. 
\ No newline at end of file diff --git a/data/2021/iclr/HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents b/data/2021/iclr/HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents new file mode 100644 index 0000000000..617255478f --- /dev/null +++ b/data/2021/iclr/HalentNet: Multimodal Trajectory Forecasting with Hallucinative Intents @@ -0,0 +1 @@ +of \ No newline at end of file diff --git a/data/2021/iclr/Heating up decision boundaries: isocapacitory saturation, adversarial scenarios and generalization bounds b/data/2021/iclr/Heating up decision boundaries: isocapacitory saturation, adversarial scenarios and generalization bounds new file mode 100644 index 0000000000..aaac9e41b9 --- /dev/null +++ b/data/2021/iclr/Heating up decision boundaries: isocapacitory saturation, adversarial scenarios and generalization bounds @@ -0,0 +1 @@ +In the present work we study classifiers' decision boundaries via Brownian motion processes in ambient data space and associated probabilistic techniques. Intuitively, our ideas correspond to placing a heat source at the decision boundary and observing how effectively the sample points warm up. We are largely motivated by the search for a soft measure that sheds further light on the decision boundary's geometry. En route, we bridge aspects of potential theory and geometric analysis (Mazya, 2011, Grigoryan-Saloff-Coste, 2002) with active fields of ML research such as adversarial examples and generalization bounds. First, we focus on the geometric behavior of decision boundaries in the light of adversarial attack/defense mechanisms. 
Experimentally, we observe a certain capacitory trend over different adversarial defense strategies: decision boundaries locally become flatter as measured by isoperimetric inequalities (Ford et al., 2019); however, our more sensitive heat-diffusion metrics extend this analysis and further reveal that some non-trivial geometry invisible to plain distance-based methods is still preserved. Intuitively, we provide evidence that the decision boundaries nevertheless retain many persistent "wiggly and fuzzy" regions on a finer scale. Second, we show how Brownian hitting probabilities translate to soft generalization bounds, which are in turn connected to compression and noise stability (Arora et al., 2018), and these bounds are significantly stronger if the decision boundary has controlled geometric features. \ No newline at end of file diff --git a/data/2021/iclr/HeteroFL: Computation and Communication Efficient Federated Learning for Heterogeneous Clients b/data/2021/iclr/HeteroFL: Computation and Communication Efficient Federated Learning for Heterogeneous Clients new file mode 100644 index 0000000000..6ba56b089f --- /dev/null +++ b/data/2021/iclr/HeteroFL: Computation and Communication Efficient Federated Learning for Heterogeneous Clients @@ -0,0 +1 @@ +Federated Learning (FL) is a method of training machine learning models on private data distributed over a large number of possibly heterogeneous clients such as mobile phones and IoT devices. In this work, we propose a new federated learning framework named HeteroFL to address heterogeneous clients equipped with very different computation and communication capabilities. Our solution can enable the training of heterogeneous local models with varying computation complexities and still produce a single global inference model. For the first time, our method challenges the underlying assumption of existing work that local models have to share the same architecture as the global model.
We demonstrate several strategies to enhance FL training and conduct extensive empirical evaluations, including five computation complexity levels of three model architectures on three datasets. We show that adaptively distributing subnetworks according to clients' capabilities is both computation and communication efficient. \ No newline at end of file diff --git a/data/2021/iclr/Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization b/data/2021/iclr/Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization new file mode 100644 index 0000000000..4a1e13c133 --- /dev/null +++ b/data/2021/iclr/Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization @@ -0,0 +1 @@ +Real-world large-scale datasets are heteroskedastic and imbalanced -- labels have varying levels of uncertainty and label distributions are long-tailed. Heteroskedasticity and imbalance challenge deep learning algorithms due to the difficulty of distinguishing among mislabeled, ambiguous, and rare examples. Addressing heteroskedasticity and imbalance simultaneously is under-explored. We propose a data-dependent regularization technique for heteroskedastic datasets that regularizes different regions of the input space differently. Inspired by the theoretical derivation of the optimal regularization strength in a one-dimensional nonparametric classification setting, our approach adaptively regularizes the data points in higher-uncertainty, lower-density regions more heavily. We test our method on several benchmark tasks, including a real-world heteroskedastic and imbalanced dataset, WebVision. Our experiments corroborate our theory and demonstrate a significant improvement over other methods in noise-robust deep learning.
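As an illustrative sketch only (the functional form below is an assumption, not the paper's derived formula), the adaptive idea amounts to scaling a per-example regularization weight up where estimated data density is low and label uncertainty is high:

```python
import numpy as np

def adaptive_reg_strength(density, uncertainty, base=1.0, eps=1e-8):
    """Per-example regularization weight: grows as local data density falls
    or estimated label uncertainty rises (illustrative functional form)."""
    return base * uncertainty / (density + eps)

density = np.array([1.0, 0.1, 0.5])      # estimated local data density
uncertainty = np.array([0.2, 0.9, 0.5])  # estimated label uncertainty
weights = adaptive_reg_strength(density, uncertainty)
# The rare, noisy second example receives the largest regularization weight.
```

These weights could then multiply a standard per-example penalty, regularizing different regions of input space differently as the abstract describes.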
\ No newline at end of file diff --git a/data/2021/iclr/Hierarchical Autoregressive Modeling for Neural Video Compression b/data/2021/iclr/Hierarchical Autoregressive Modeling for Neural Video Compression new file mode 100644 index 0000000000..360aec1c06 --- /dev/null +++ b/data/2021/iclr/Hierarchical Autoregressive Modeling for Neural Video Compression @@ -0,0 +1 @@ +Recent work by Marino et al. (2020) showed improved performance in sequential density estimation by combining masked autoregressive flows with hierarchical latent variable models. We draw a connection between such autoregressive generative models and the task of lossy video compression. Specifically, we view recent neural video compression methods (Lu et al., 2019; Yang et al., 2020b; Agustsson et al., 2020) as instances of a generalized stochastic temporal autoregressive transform, and propose avenues for enhancement based on this insight. Comprehensive evaluations on large-scale video data show improved rate-distortion performance over both state-of-the-art neural and conventional video compression methods. \ No newline at end of file diff --git a/data/2021/iclr/Hierarchical Reinforcement Learning by Discovering Intrinsic Options b/data/2021/iclr/Hierarchical Reinforcement Learning by Discovering Intrinsic Options new file mode 100644 index 0000000000..ce68782f8a --- /dev/null +++ b/data/2021/iclr/Hierarchical Reinforcement Learning by Discovering Intrinsic Options @@ -0,0 +1 @@ +We propose a hierarchical reinforcement learning method, HIDIO, that can learn task-agnostic options in a self-supervised manner while jointly learning to utilize them to solve sparse-reward tasks. Unlike current hierarchical RL approaches that tend to formulate goal-reaching low-level tasks or pre-define ad hoc lower-level policies, HIDIO encourages lower-level option learning that is independent of the task at hand, requiring few assumptions or little knowledge about the task structure.
These options are learned through an intrinsic entropy minimization objective conditioned on the option sub-trajectories. The learned options are diverse and task-agnostic. In experiments on sparse-reward robotic manipulation and navigation tasks, HIDIO achieves higher success rates with greater sample efficiency than regular RL baselines and two state-of-the-art hierarchical RL methods. \ No newline at end of file diff --git a/data/2021/iclr/High-Capacity Expert Binary Networks b/data/2021/iclr/High-Capacity Expert Binary Networks new file mode 100644 index 0000000000..3cd200d47b --- /dev/null +++ b/data/2021/iclr/High-Capacity Expert Binary Networks @@ -0,0 +1 @@ +Network binarization is a promising hardware-aware direction for creating efficient deep models. Despite its memory and computational advantages, reducing the accuracy gap between such models and their real-valued counterparts remains a challenging, unsolved research problem. To this end, we make the following 3 contributions: (a) To increase model capacity, we propose Expert Binary Convolution, which, for the first time, tailors conditional computing to binary networks by learning to select one data-specific expert binary filter at a time conditioned on input features. (b) To increase representation capacity, we propose to address the inherent information bottleneck in binary networks by introducing an efficient width expansion mechanism which keeps the binary operations within the same budget. (c) To improve network design, we propose a principled binary network growth mechanism that unveils a set of network topologies of favorable properties. Overall, our method improves upon prior work by ~6%, with no increase in computational cost, reaching a groundbreaking ~71% on ImageNet classification.
\ No newline at end of file diff --git a/data/2021/iclr/Hopfield Networks is All You Need b/data/2021/iclr/Hopfield Networks is All You Need new file mode 100644 index 0000000000..bbb89c8c2e --- /dev/null +++ b/data/2021/iclr/Hopfield Networks is All You Need @@ -0,0 +1 @@ +We show that the transformer attention mechanism is the update rule of a modern Hopfield network with continuous states. This new Hopfield network can store exponentially (with the dimension) many patterns, converges with one update, and has exponentially small retrieval errors. The number of stored patterns is traded off against convergence speed and retrieval error. The new Hopfield network has three types of energy minima (fixed points of the update): (1) global fixed point averaging over all patterns, (2) metastable states averaging over a subset of patterns, and (3) fixed points which store a single pattern. Transformer and BERT models operate in their first layers preferably in the global averaging regime, while they operate in higher layers in metastable states. The gradient in transformers is maximal for metastable states, is uniformly distributed for global averaging, and vanishes for a fixed point near a stored pattern. Using the Hopfield network interpretation, we analyzed learning of transformer and BERT models. Learning starts with attention heads that average and then most of them switch to metastable states. However, the majority of heads in the first layers still averages and can be replaced by averaging, e.g. our proposed Gaussian weighting. In contrast, heads in the last layers steadily learn and seem to use metastable states to collect information created in lower layers. These heads seem to be a promising target for improving transformers. Neural networks with Hopfield networks outperform other methods on immune repertoire classification, where the Hopfield net stores several hundreds of thousands of patterns. 
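The claimed equivalence is easy to check directly: one retrieval step of the continuous modern Hopfield network is softmax attention over the stored patterns. A minimal numpy sketch (beta and dimensions chosen arbitrarily for illustration):

```python
import numpy as np

def hopfield_update(X, xi, beta=8.0):
    """One modern-Hopfield update: xi_new = X @ softmax(beta * X.T @ xi).
    Columns of X are the stored patterns; this is exactly the attention update."""
    logits = beta * (X.T @ xi)
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p /= p.sum()
    return X @ p

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))                  # 5 stored 64-dimensional patterns
query = X[:, 2] + 0.1 * rng.normal(size=64)   # noisy version of pattern 2
retrieved = hopfield_update(X, query)
# One update already lands (numerically) on the stored pattern, illustrating
# the one-step convergence and small retrieval error described above.
```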
We provide a new PyTorch layer called "Hopfield", which makes it possible to equip deep learning architectures with modern Hopfield networks as a new powerful concept comprising pooling, memory, and attention. GitHub: this https URL \ No newline at end of file diff --git a/data/2021/iclr/Hopper: Multi-hop Transformer for Spatiotemporal Reasoning b/data/2021/iclr/Hopper: Multi-hop Transformer for Spatiotemporal Reasoning new file mode 100644 index 0000000000..f960bec79b --- /dev/null +++ b/data/2021/iclr/Hopper: Multi-hop Transformer for Spatiotemporal Reasoning @@ -0,0 +1 @@ +This paper considers the problem of spatiotemporal object-centric reasoning in videos. Central to our approach is the notion of object permanence, i.e., the ability to reason about the location of objects as they move through the video while being occluded, contained or carried by other objects. Existing deep learning based approaches often suffer from spatiotemporal biases when applied to video reasoning problems. We propose Hopper, which uses a Multi-hop Transformer for reasoning about object permanence in videos. Given a video and a localization query, Hopper reasons over image and object tracks to automatically hop over critical frames in an iterative fashion to predict the final position of the object of interest. We demonstrate the effectiveness of using a contrastive loss to reduce spatiotemporal biases. We evaluate on the CATER dataset and find that Hopper achieves 73.2% Top-1 accuracy using just 1 FPS by hopping through a few critical frames. We also demonstrate Hopper can perform long-term reasoning by building a CATER-h dataset that requires multi-step reasoning to localize objects of interest correctly. \ No newline at end of file diff --git a/data/2021/iclr/How Benign is Benign Overfitting ? b/data/2021/iclr/How Benign is Benign Overfitting ?
@@ -0,0 +1 @@ +We investigate two causes for adversarial vulnerability in deep neural networks: bad data and (poorly) trained models. When trained with SGD, deep neural networks essentially achieve zero training error, even in the presence of label noise, while also exhibiting good generalization on natural test data, something referred to as benign overfitting [2, 10]. However, these models are vulnerable to adversarial attacks. We identify label noise as one of the causes for adversarial vulnerability, and provide theoretical and empirical evidence in support of this. Surprisingly, we find several instances of label noise in datasets such as MNIST and CIFAR, and that robustly trained models incur training error on some of these, i.e. they don't fit the noise. However, removing noisy labels alone does not suffice to achieve adversarial robustness. Standard training procedures bias neural networks towards learning "simple" classification boundaries, which may be less robust than more complex ones. We observe that adversarial training does produce more complex decision boundaries. We conjecture that in part the need for complex decision boundaries arises from sub-optimal representation learning. By means of simple toy examples, we show theoretically how the choice of representation can drastically affect adversarial robustness. \ No newline at end of file diff --git a/data/2021/iclr/How Does Mixup Help With Robustness and Generalization? b/data/2021/iclr/How Does Mixup Help With Robustness and Generalization? new file mode 100644 index 0000000000..4787d6cd01 --- /dev/null +++ b/data/2021/iclr/How Does Mixup Help With Robustness and Generalization? @@ -0,0 +1 @@ +Mixup is a popular data augmentation technique based on taking convex combinations of pairs of examples and their labels. This simple technique has been shown to substantially improve both the robustness and the generalization of the trained model. 
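Concretely, one mixup step can be sketched as follows (a minimal numpy version; the Beta concentration alpha is the usual tunable hyperparameter):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Return a convex combination of two examples and of their labels,
    with a Beta(alpha, alpha)-distributed mixing weight."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x1, y1 = np.ones(4), np.array([1.0, 0.0])    # example with a one-hot label
x2, y2 = np.zeros(4), np.array([0.0, 1.0])
x_mix, y_mix = mixup(x1, y1, x2, y2)
# The mixed label remains a valid probability distribution.
```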
However, it is not well-understood why such improvement occurs. In this paper, we provide theoretical analysis to demonstrate how using Mixup in training helps model robustness and generalization. For robustness, we show that minimizing the Mixup loss corresponds to approximately minimizing an upper bound of the adversarial loss. This explains why models obtained by Mixup training exhibit robustness to several kinds of adversarial attacks such as the Fast Gradient Sign Method (FGSM). For generalization, we prove that Mixup augmentation corresponds to a specific type of data-adaptive regularization which reduces overfitting. Our analysis provides new insights and a framework to understand Mixup. \ No newline at end of file diff --git a/data/2021/iclr/How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks? b/data/2021/iclr/How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks? new file mode 100644 index 0000000000..dacd0b277b --- /dev/null +++ b/data/2021/iclr/How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks? @@ -0,0 +1 @@ +A recent line of research on deep learning focuses on the extremely over-parameterized setting, and shows that when the network width is larger than a high degree polynomial of the training sample size $n$ and the inverse of the target accuracy $\epsilon^{-1}$, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees. Very recently, it has been shown that under a certain margin assumption on the training data, a polylogarithmic width condition suffices for two-layer ReLU networks to converge and generalize (Ji and Telgarsky, 2019). However, how much over-parameterization is sufficient to guarantee optimization and generalization for deep neural networks still remains an open question. In this work, we establish sharp optimization and generalization guarantees for deep ReLU networks.
Under various assumptions made in previous work, our optimization and generalization guarantees hold with network width polylogarithmic in $n$ and $\epsilon^{-1}$. Our results push the study of over-parameterized deep neural networks towards more practical settings. \ No newline at end of file diff --git a/data/2021/iclr/How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks b/data/2021/iclr/How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks new file mode 100644 index 0000000000..7399d5ef0c --- /dev/null +++ b/data/2021/iclr/How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks @@ -0,0 +1 @@ +We study how neural networks trained by gradient descent extrapolate, i.e., what they learn outside the support of the training distribution. Previous works report mixed empirical results when extrapolating with neural networks: while multilayer perceptrons (MLPs) do not extrapolate well in certain simple tasks, Graph Neural Network (GNN), a structured network with MLP modules, has shown some success in more complex tasks. Working towards a theoretical explanation, we identify conditions under which MLPs and GNNs extrapolate well. First, we quantify the observation that ReLU MLPs quickly converge to linear functions along any direction from the origin, which implies that ReLU MLPs do not extrapolate most non-linear functions. But, they can provably learn a linear target function when the training distribution is sufficiently "diverse". Second, in connection to analyzing successes and limitations of GNNs, these results suggest a hypothesis for which we provide theoretical and empirical evidence: the success of GNNs in extrapolating algorithmic tasks to new data (e.g., larger graphs or edge weights) relies on encoding task-specific non-linearities in the architecture or features. 
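The first claim above is easy to verify numerically in a simplified setting: a bias-free ReLU MLP is positively homogeneous, so along any ray from the origin its output is exactly linear in the ray parameter. A toy sketch with random weights (an illustration of the claim, not the paper's construction):

```python
import numpy as np

# Toy check of the directional-linearity claim for ReLU MLPs: without biases,
# ReLU(t * z) = t * ReLU(z) for t > 0, so the network is linear along any ray.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(32, 8))
W2 = rng.normal(size=(1, 32))

def mlp(x):
    return (W2 @ np.maximum(W1 @ x, 0.0)).item()

v = rng.normal(size=8)                       # an arbitrary direction
ys = [mlp(t * v) for t in (1.0, 2.0, 3.0)]
second_diff = (ys[2] - ys[1]) - (ys[1] - ys[0])
# A vanishing second difference along the ray means the output is linear in t,
# so the network extrapolates linearly in this direction.
```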
\ No newline at end of file diff --git a/data/2021/iclr/How to Find Your Friendly Neighborhood: Graph Attention Design with Self-Supervision b/data/2021/iclr/How to Find Your Friendly Neighborhood: Graph Attention Design with Self-Supervision new file mode 100644 index 0000000000..f96b2206a3 --- /dev/null +++ b/data/2021/iclr/How to Find Your Friendly Neighborhood: Graph Attention Design with Self-Supervision @@ -0,0 +1 @@ +The attention mechanism in graph neural networks is designed to assign larger weights to important neighbor nodes for better representation. However, what graph attention learns is not understood well, particularly when graphs are noisy. In this paper, we propose a self-supervised graph attention network (SuperGAT), an improved graph attention model for noisy graphs. Specifically, we exploit two attention forms compatible with a self-supervised task to predict edges, whose presence and absence contain the inherent information about the importance of the relationships between nodes. By encoding edges, SuperGAT learns more expressive attention in distinguishing mislinked neighbors. We find that two graph characteristics influence the effectiveness of attention forms and self-supervision: homophily and average degree. Thus, our recipe provides guidance on which attention design to use when those two graph characteristics are known. Our experiment on 17 real-world datasets demonstrates that our recipe generalizes across 15 of them, and models designed by our recipe show improved performance over baselines.
\ No newline at end of file diff --git a/data/2021/iclr/Human-Level Performance in No-Press Diplomacy via Equilibrium Search b/data/2021/iclr/Human-Level Performance in No-Press Diplomacy via Equilibrium Search new file mode 100644 index 0000000000..294789a22c --- /dev/null +++ b/data/2021/iclr/Human-Level Performance in No-Press Diplomacy via Equilibrium Search @@ -0,0 +1 @@ +Prior AI breakthroughs in complex games have focused on either the purely adversarial or purely cooperative settings. In contrast, Diplomacy is a game of shifting alliances that involves both cooperation and competition. For this reason, Diplomacy has proven to be a formidable research challenge. In this paper we describe an agent for the no-press variant of Diplomacy that combines supervised learning on human data with one-step lookahead search via external regret minimization. External regret minimization techniques have been behind previous AI successes in adversarial games, most notably poker, but have not previously been shown to be successful in large-scale games involving cooperation. We show that our agent greatly exceeds the performance of past no-press Diplomacy bots, is unexploitable by expert humans, and achieves a rank of 23 out of 1,128 human players when playing anonymous games on a popular Diplomacy website. \ No newline at end of file diff --git a/data/2021/iclr/HyperDynamics: Meta-Learning Object and Agent Dynamics with Hypernetworks b/data/2021/iclr/HyperDynamics: Meta-Learning Object and Agent Dynamics with Hypernetworks new file mode 100644 index 0000000000..a77a4f53f5 --- /dev/null +++ b/data/2021/iclr/HyperDynamics: Meta-Learning Object and Agent Dynamics with Hypernetworks @@ -0,0 +1 @@ +We propose HyperDynamics, a dynamics meta-learning framework that conditions on an agent's interactions with the environment and optionally its visual observations, and generates the parameters of neural dynamics models based on inferred properties of the dynamical system. 
Physical and visual properties of the environment that are not part of the low-dimensional state yet affect its temporal dynamics are inferred from the interaction history and visual observations, and are implicitly captured in the generated parameters. We test HyperDynamics on a set of object pushing and locomotion tasks. It outperforms existing dynamics models in the literature that adapt to environment variations by learning dynamics over high-dimensional visual observations, capturing the interactions of the agent in recurrent state representations, or using gradient-based meta-optimization. We also show our method matches the performance of an ensemble of separately trained experts, while also being able to generalize well to unseen environment variations at test time. We attribute its good performance to the multiplicative interactions between the inferred system properties -- captured in the generated parameters -- and the low-dimensional state representation of the dynamical system. \ No newline at end of file diff --git a/data/2021/iclr/HyperGrid Transformers: Towards A Single Model for Multiple Tasks b/data/2021/iclr/HyperGrid Transformers: Towards A Single Model for Multiple Tasks new file mode 100644 index 0000000000..63914e59d5 --- /dev/null +++ b/data/2021/iclr/HyperGrid Transformers: Towards A Single Model for Multiple Tasks @@ -0,0 +1 @@ +Achieving state-of-the-art performance on natural language understanding tasks typically relies on fine-tuning a fresh model for every task. Consequently, this approach leads to a higher overall parameter cost, along with higher technical maintenance for serving multiple models. Learning a single multi-task model that is able to do well for all the tasks has been a challenging and yet attractive proposition. In this paper, we propose HyperGrid Transformers, a new Transformer architecture that leverages task-conditioned hypernetworks for controlling its feed-forward layers.
Specifically, we propose a decomposable hypernetwork that learns grid-wise projections that help to specialize regions in weight matrices for different tasks. In order to construct the proposed hypernetwork, our method learns the interactions and composition between a global (task-agnostic) state and a local task-specific state. We conduct an extensive set of experiments on GLUE/SuperGLUE. On the SuperGLUE test set, we match the performance of the state-of-the-art while being 16 times more parameter-efficient. Our method helps bridge the gap between fine-tuning and multi-task learning approaches. \ No newline at end of file diff --git a/data/2021/iclr/Hyperbolic Neural Networks++ b/data/2021/iclr/Hyperbolic Neural Networks++ new file mode 100644 index 0000000000..2aab166e00 --- /dev/null +++ b/data/2021/iclr/Hyperbolic Neural Networks++ @@ -0,0 +1 @@ +Hyperbolic spaces have recently gained momentum in the context of machine learning due to their high capacity and tree-likeness properties. However, the representational power of hyperbolic geometry is not yet on par with Euclidean geometry, mostly because of the absence of corresponding hyperbolic neural network layers. This makes it hard to use hyperbolic embeddings in downstream tasks. Here, we bridge this gap in a principled manner by combining the formalism of Mobius gyrovector spaces with the Riemannian geometry of the Poincare model of hyperbolic spaces. As a result, we derive hyperbolic versions of important deep learning tools: multinomial logistic regression, feed-forward and recurrent neural networks such as gated recurrent units. This allows us to embed sequential data and perform classification in the hyperbolic space. Empirically, we show that, even if hyperbolic optimization tools are limited, hyperbolic sentence embeddings either outperform or are on par with their Euclidean variants on textual entailment and noisy-prefix recognition tasks.
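For reference, the basic gyrovector operation underlying such layers, Mobius addition on the Poincare ball of curvature -c, has a standard closed form. A small numpy sketch of that textbook formula (not code from the paper itself):

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition on the Poincare ball with curvature -c (standard formula)."""
    xy = float(np.dot(x, y))
    x2 = float(np.dot(x, x))
    y2 = float(np.dot(y, y))
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

x = np.array([0.3, 0.1])
y = np.array([0.2, -0.4])
origin = np.zeros(2)
# The origin acts as the additive identity, and sums of points inside the
# unit ball stay inside the ball.
```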
\ No newline at end of file diff --git a/data/2021/iclr/IDF++: Analyzing and Improving Integer Discrete Flows for Lossless Compression b/data/2021/iclr/IDF++: Analyzing and Improving Integer Discrete Flows for Lossless Compression new file mode 100644 index 0000000000..532c07b6b9 --- /dev/null +++ b/data/2021/iclr/IDF++: Analyzing and Improving Integer Discrete Flows for Lossless Compression @@ -0,0 +1 @@ +In this paper we analyse and improve integer discrete flows for lossless compression. Integer discrete flows are a recently proposed class of models that learn invertible transformations for integer-valued random variables. Due to their discrete nature, they can be combined in a straightforward manner with entropy coding schemes for lossless compression without the need for bits-back coding. We discuss the potential difference in flexibility between invertible flows for discrete random variables and flows for continuous random variables and show that (integer) discrete flows are more flexible than previously claimed. We furthermore investigate the influence of quantization operators on optimization and gradient bias in integer discrete flows. Finally, we introduce modifications to the architecture to improve the performance of this model class for lossless compression.
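The core invertibility property is easy to illustrate: an additive coupling layer that shifts one half of an integer vector by a rounded, data-dependent amount is exactly invertible over the integers. A toy sketch (the real model uses learned networks and stacked layers; the linear `shift_net` here is a stand-in):

```python
import numpy as np

def coupling_forward(xa, xb, shift_net):
    """Shift xb by a rounded function of xa; xa passes through unchanged."""
    return xa, xb + np.round(shift_net(xa)).astype(int)

def coupling_inverse(ya, yb, shift_net):
    """Undo the shift exactly: rounding makes the map bijective on integers."""
    return ya, yb - np.round(shift_net(ya)).astype(int)

shift_net = lambda z: 0.7 * z        # stand-in for a learned shift network
xa, xb = np.array([2, -1]), np.array([5, 3])
ya, yb = coupling_forward(xa, xb, shift_net)
xa_rec, xb_rec = coupling_inverse(ya, yb, shift_net)
# Exact round-trip on integers, with no bits-back machinery required.
```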
\ No newline at end of file diff --git a/data/2021/iclr/IEPT: Instance-Level and Episode-Level Pretext Tasks for Few-Shot Learning b/data/2021/iclr/IEPT: Instance-Level and Episode-Level Pretext Tasks for Few-Shot Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving b/data/2021/iclr/INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving new file mode 100644 index 0000000000..f9dfa9cc14 --- /dev/null +++ b/data/2021/iclr/INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving @@ -0,0 +1 @@ +In learning-assisted theorem proving, one of the most critical challenges is to generalize to theorems unlike those seen at training time. In this paper, we introduce INT, an INequality Theorem proving benchmark, specifically designed to test agents' generalization ability. INT is based on a procedure for generating theorems and proofs; this procedure's knobs allow us to measure 6 different types of generalization, each reflecting a distinct challenge characteristic to automated theorem proving. In addition, unlike prior benchmarks for learning-assisted theorem proving, INT provides a lightweight and user-friendly theorem proving environment with fast simulations, conducive to performing learning-based and search-based research. We introduce learning-based baselines and evaluate them across 6 dimensions of generalization with the benchmark. We then evaluate the same agents augmented with Monte Carlo Tree Search (MCTS) at test time, and show that MCTS can help to prove new theorems. 
\ No newline at end of file diff --git a/data/2021/iclr/IOT: Instance-wise Layer Reordering for Transformer Structures b/data/2021/iclr/IOT: Instance-wise Layer Reordering for Transformer Structures new file mode 100644 index 0000000000..699ac5a89b --- /dev/null +++ b/data/2021/iclr/IOT: Instance-wise Layer Reordering for Transformer Structures @@ -0,0 +1 @@ +With sequentially stacked self-attention, (optional) encoder-decoder attention, and feed-forward layers, the Transformer has achieved great success in natural language processing (NLP), and many variants have been proposed. Currently, almost all these models assume that the layer order is fixed and kept the same across data samples. We observe that different data samples actually favor different orders of the layers. Based on this observation, in this work, we break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure. Our Instance-wise Ordered Transformer (IOT) can model a variety of functions via reordered layers, which enables each sample to select a better-suited order, improving model performance under the constraint of almost the same number of parameters. To achieve this, we introduce a light predictor with negligible parameter and inference cost to decide the most capable and favorable layer order for any input sequence. Experiments on 3 tasks (neural machine translation, abstractive summarization, and code generation) and 9 datasets demonstrate consistent improvements of our method. We further show that our method can also be applied to other architectures beyond Transformer. Our code is released on GitHub.
\ No newline at end of file diff --git a/data/2021/iclr/Identifying Physical Law of Hamiltonian Systems via Meta-Learning b/data/2021/iclr/Identifying Physical Law of Hamiltonian Systems via Meta-Learning new file mode 100644 index 0000000000..f8281e7319 --- /dev/null +++ b/data/2021/iclr/Identifying Physical Law of Hamiltonian Systems via Meta-Learning @@ -0,0 +1 @@ +Hamiltonian mechanics is an effective tool to represent many physical processes with concise yet well-generalized mathematical expressions. A well-modeled Hamiltonian makes it easy for researchers to analyze and forecast many related phenomena that are governed by the same physical law. However, in general, identifying a functional or shared expression of the Hamiltonian is very difficult. It requires carefully designed experiments and the researcher's insight that comes from years of experience. We propose that meta-learning algorithms can be potentially powerful data-driven tools for identifying the physical law governing Hamiltonian systems without any mathematical assumptions on the representation, but with observations from a set of systems governed by the same physical law. We show that a well meta-trained learner can identify the shared representation of the Hamiltonian by evaluating our method on several types of physical systems with various experimental settings. \ No newline at end of file diff --git a/data/2021/iclr/Identifying nonlinear dynamical systems with multiple time scales and long-range dependencies b/data/2021/iclr/Identifying nonlinear dynamical systems with multiple time scales and long-range dependencies new file mode 100644 index 0000000000..93d1d01c4d --- /dev/null +++ b/data/2021/iclr/Identifying nonlinear dynamical systems with multiple time scales and long-range dependencies @@ -0,0 +1 @@ +A main theoretical interest in biology and physics is to identify the nonlinear dynamical system (DS) that generated observed time series. 
Recurrent Neural Networks (RNNs) are, in principle, powerful enough to approximate any underlying DS, but in their vanilla form suffer from the exploding vs. vanishing gradients problem. Previous attempts to alleviate this problem resulted either in more complicated, mathematically less tractable RNN architectures \ No newline at end of file diff --git a/data/2021/iclr/Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels b/data/2021/iclr/Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels new file mode 100644 index 0000000000..c9587b5637 --- /dev/null +++ b/data/2021/iclr/Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels @@ -0,0 +1 @@ +We propose a simple data augmentation technique that can be applied to standard model-free reinforcement learning algorithms, enabling robust learning directly from pixels without the need for auxiliary losses or pre-training. The approach leverages input perturbations commonly used in computer vision tasks to regularize the value function. Existing model-free approaches, such as Soft Actor-Critic (SAC), are not able to train deep networks effectively from image pixels. However, the addition of our augmentation method dramatically improves SAC's performance, enabling it to reach state-of-the-art performance on the DeepMind control suite, surpassing model-based (Dreamer, PlaNet, and SLAC) methods and recently proposed contrastive learning (CURL). Our approach can be combined with any model-free reinforcement learning algorithm, requiring only minor modifications. An implementation can be found at this https URL. 
\ No newline at end of file diff --git a/data/2021/iclr/Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering b/data/2021/iclr/Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering new file mode 100644 index 0000000000..4a26e828a8 --- /dev/null +++ b/data/2021/iclr/Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering @@ -0,0 +1 @@ +Differentiable rendering has paved the way to training neural networks to perform "inverse graphics" tasks such as predicting 3D geometry from monocular photographs. To train high-performing models, most of the current approaches rely on multi-view imagery, which is not readily available in practice. Recent Generative Adversarial Networks (GANs) that synthesize images, in contrast, seem to acquire 3D knowledge implicitly during training: object viewpoints can be manipulated by simply manipulating the latent codes. However, these latent codes often lack further physical interpretation and thus GANs cannot easily be inverted to perform explicit 3D reasoning. In this paper, we aim to extract and disentangle 3D knowledge learned by generative models by utilizing differentiable renderers. Key to our approach is to exploit GANs as a multi-view data generator to train an inverse graphics network using an off-the-shelf differentiable renderer, and the trained inverse graphics network as a teacher to disentangle the GAN's latent code into interpretable 3D properties. The entire architecture is trained iteratively using cycle consistency losses. We show that our approach significantly outperforms state-of-the-art inverse graphics networks trained on existing datasets, both quantitatively and via user studies. We further showcase the disentangled GAN as a controllable 3D "neural renderer", complementing traditional graphics renderers.
\ No newline at end of file diff --git a/data/2021/iclr/Impact of Representation Learning in Linear Bandits b/data/2021/iclr/Impact of Representation Learning in Linear Bandits new file mode 100644 index 0000000000..2868f5856f --- /dev/null +++ b/data/2021/iclr/Impact of Representation Learning in Linear Bandits @@ -0,0 +1 @@ +We study how representation learning can improve the efficiency of bandit problems. We study the setting where we play $T$ linear bandits with dimension $d$ concurrently, and these $T$ bandit tasks share a common $k (\ll d)$ dimensional linear representation. For the finite-action setting, we present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit. When $T$ is sufficiently large, our algorithm significantly outperforms the naive algorithm (playing $T$ bandits independently) that achieves $\widetilde{O}(T\sqrt{d N})$ regret. We also provide an $\Omega(T\sqrt{kN} + \sqrt{dkNT})$ regret lower bound, showing that our algorithm is minimax-optimal up to poly-logarithmic factors. Furthermore, we extend our algorithm to the infinite-action setting and obtain a corresponding regret bound which demonstrates the benefit of representation learning in certain regimes. We also present experiments on synthetic and real-world data to illustrate our theoretical findings and demonstrate the effectiveness of our proposed algorithms. 
\ No newline at end of file diff --git a/data/2021/iclr/Implicit Convex Regularizers of CNN Architectures: Convex Optimization of Two- and Three-Layer Networks in Polynomial Time b/data/2021/iclr/Implicit Convex Regularizers of CNN Architectures: Convex Optimization of Two- and Three-Layer Networks in Polynomial Time new file mode 100644 index 0000000000..32206a811b --- /dev/null +++ b/data/2021/iclr/Implicit Convex Regularizers of CNN Architectures: Convex Optimization of Two- and Three-Layer Networks in Polynomial Time @@ -0,0 +1 @@ +We study training of Convolutional Neural Networks (CNNs) with ReLU activations and introduce exact convex optimization formulations with a polynomial complexity with respect to the number of data samples, the number of neurons, and data dimension. More specifically, we develop a convex analytic framework utilizing semi-infinite duality to obtain equivalent convex optimization problems for several two- and three-layer CNN architectures. We first prove that two-layer CNNs can be globally optimized via an $\ell_2$ norm regularized convex program. We then show that three-layer CNN training problems are equivalent to an $\ell_1$ regularized convex program that encourages sparsity in the spectral domain. We also extend these results to multi-layer CNN architectures including three-layer networks with two ReLU layers and deeper circular convolutions with a single ReLU layer. Furthermore, we present extensions of our approach to different pooling methods, which elucidates the implicit architectural bias as convex regularizers. \ No newline at end of file diff --git a/data/2021/iclr/Implicit Gradient Regularization b/data/2021/iclr/Implicit Gradient Regularization new file mode 100644 index 0000000000..52899f3490 --- /dev/null +++ b/data/2021/iclr/Implicit Gradient Regularization @@ -0,0 +1 @@ +Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization. 
We find that the discrete steps of gradient descent implicitly regularize models by penalizing gradient descent trajectories that have large loss gradients. We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization. We confirm empirically that implicit gradient regularization biases gradient descent toward flat minima, where test errors are small and solutions are robust to noisy parameter perturbations. Furthermore, we demonstrate that the implicit gradient regularization term can be used as an explicit regularizer, allowing us to control this gradient regularization directly. More broadly, our work indicates that backward error analysis is a useful theoretical approach to the perennial question of how learning rate, model size, and parameter regularization interact to determine the properties of overparameterized models optimized with gradient descent. \ No newline at end of file diff --git a/data/2021/iclr/Implicit Normalizing Flows b/data/2021/iclr/Implicit Normalizing Flows new file mode 100644 index 0000000000..04983e0e6a --- /dev/null +++ b/data/2021/iclr/Implicit Normalizing Flows @@ -0,0 +1 @@ +Normalizing flows define a probability distribution by an explicit invertible transformation $\boldsymbol{\mathbf{z}}=f(\boldsymbol{\mathbf{x}})$. In this work, we present implicit normalizing flows (ImpFlows), which generalize normalizing flows by allowing the mapping to be implicitly defined by the roots of an equation $F(\boldsymbol{\mathbf{z}}, \boldsymbol{\mathbf{x}})= \boldsymbol{\mathbf{0}}$. ImpFlows build on residual flows (ResFlows) with a proper balance between expressiveness and tractability. Through theoretical analysis, we show that the function space of ImpFlow is strictly richer than that of ResFlows. Furthermore, for any ResFlow with a fixed number of blocks, there exists some function for which ResFlow has a non-negligible approximation error.
However, the function is exactly representable by a single-block ImpFlow. We propose a scalable algorithm to train and draw samples from ImpFlows. Empirically, we evaluate ImpFlow on several classification and density modeling tasks, and ImpFlow outperforms ResFlow with a comparable amount of parameters on all the benchmarks. \ No newline at end of file diff --git a/data/2021/iclr/Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning b/data/2021/iclr/Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning new file mode 100644 index 0000000000..78175fb49d --- /dev/null +++ b/data/2021/iclr/Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning @@ -0,0 +1 @@ +We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network. We characterize this loss of expressivity in terms of a drop in the rank of the learned value network features, and show that this corresponds to a drop in performance. We demonstrate this phenomenon on widely studied domains, including Atari and Gym benchmarks, in both offline and online RL settings. We formally analyze this phenomenon and show that it results from a pathological interaction between bootstrapping and gradient-based optimization. We further show that mitigating implicit under-parameterization by controlling rank collapse improves performance.
\ No newline at end of file diff --git a/data/2021/iclr/Improve Object Detection with Feature-based Knowledge Distillation: Towards Accurate and Efficient Detectors b/data/2021/iclr/Improve Object Detection with Feature-based Knowledge Distillation: Towards Accurate and Efficient Detectors new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Improved Autoregressive Modeling with Distribution Smoothing b/data/2021/iclr/Improved Autoregressive Modeling with Distribution Smoothing new file mode 100644 index 0000000000..f3a25669e7 --- /dev/null +++ b/data/2021/iclr/Improved Autoregressive Modeling with Distribution Smoothing @@ -0,0 +1 @@ +While autoregressive models excel at image compression, their sample quality is often lacking. Although not realistic, generated images often have high likelihood according to the model, resembling the case of adversarial examples. Inspired by a successful adversarial defense method, we incorporate randomized smoothing into autoregressive generative modeling. We first model a smoothed version of the data distribution, and then reverse the smoothing process to recover the original data distribution. This procedure drastically improves the sample quality of existing autoregressive models on several synthetic and real-world image datasets while obtaining competitive likelihoods on synthetic datasets. \ No newline at end of file diff --git "a/data/2021/iclr/Improved Estimation of Concentration Under \342\204\223p-Norm Distance Metrics Using Half Spaces" "b/data/2021/iclr/Improved Estimation of Concentration Under \342\204\223p-Norm Distance Metrics Using Half Spaces" new file mode 100644 index 0000000000..0be34e6dcd --- /dev/null +++ "b/data/2021/iclr/Improved Estimation of Concentration Under \342\204\223p-Norm Distance Metrics Using Half Spaces" @@ -0,0 +1 @@ +Concentration of measure has been argued to be the fundamental cause of adversarial vulnerability. Mahloujifar et al. 
presented an empirical way to measure the concentration of a data distribution using samples, and employed it to find lower bounds on intrinsic robustness for several benchmark datasets. However, it remains unclear whether these lower bounds are tight enough to provide a useful approximation for the intrinsic robustness of a dataset. To gain a deeper understanding of the concentration of measure phenomenon, we first extend the Gaussian Isoperimetric Inequality to non-spherical Gaussian measures and arbitrary $\ell_p$-norms ($p \geq 2$). We leverage these theoretical insights to design a method that uses half-spaces to estimate the concentration of any empirical dataset under $\ell_p$-norm distance metrics. Our proposed algorithm is more efficient than Mahloujifar et al.'s, and our experiments on synthetic datasets and image benchmarks demonstrate that it is able to find much tighter intrinsic robustness bounds. These tighter estimates provide further evidence that rules out intrinsic dataset concentration as a possible explanation for the adversarial vulnerability of state-of-the-art classifiers. \ No newline at end of file diff --git a/data/2021/iclr/Improving Adversarial Robustness via Channel-wise Activation Suppressing b/data/2021/iclr/Improving Adversarial Robustness via Channel-wise Activation Suppressing new file mode 100644 index 0000000000..1681920813 --- /dev/null +++ b/data/2021/iclr/Improving Adversarial Robustness via Channel-wise Activation Suppressing @@ -0,0 +1 @@ +The study of adversarial examples and their activation has attracted significant attention for secure and robust learning with deep neural networks (DNNs). 
Different from existing works, in this paper, we highlight two new characteristics of adversarial examples from the channel-wise activation perspective: 1) the activation magnitudes of adversarial examples are higher than those of natural examples; and 2) the channels are activated more uniformly by adversarial examples than by natural examples. We find that the state-of-the-art defense, adversarial training, has addressed the first issue of high activation magnitudes via training on adversarial examples, while the second issue of uniform activation remains. This motivates us to suppress redundant activations from being triggered by adversarial perturbations via a Channel-wise Activation Suppressing (CAS) strategy. We show that CAS can train a model that inherently suppresses adversarial activation, and can be easily applied to existing defense methods to further improve their robustness. Our work provides a simple but generic training strategy for robustifying the intermediate layer activation of DNNs. \ No newline at end of file diff --git a/data/2021/iclr/Improving Relational Regularized Autoencoders with Spherical Sliced Fused Gromov Wasserstein b/data/2021/iclr/Improving Relational Regularized Autoencoders with Spherical Sliced Fused Gromov Wasserstein new file mode 100644 index 0000000000..be45378156 --- /dev/null +++ b/data/2021/iclr/Improving Relational Regularized Autoencoders with Spherical Sliced Fused Gromov Wasserstein @@ -0,0 +1 @@ +Relational regularized autoencoder (RAE) is a framework to learn the distribution of data by minimizing a reconstruction loss together with a relational regularization on the latent space. A recent attempt to reduce the inner discrepancy between the prior and aggregated posterior distributions is to incorporate sliced fused Gromov-Wasserstein (SFG) between these distributions. That approach has a weakness since it treats every slicing direction similarly, even though several directions are not useful for the discriminative task.
To improve the discrepancy and consequently the relational regularization, we propose a new relational discrepancy, named spherical sliced fused Gromov Wasserstein (SSFG), that can find an important area of projections characterized by a von Mises-Fisher (vMF) distribution. Then, we introduce two variants of SSFG to improve its performance. The first variant, named mixture spherical sliced fused Gromov Wasserstein (MSSFG), replaces the vMF distribution by a mixture of von Mises-Fisher distributions to capture multiple important areas of directions that are far from each other. The second variant, named power spherical sliced fused Gromov Wasserstein (PSSFG), replaces the vMF distribution by a power spherical distribution to improve the sampling time in high-dimensional settings. We then apply the new discrepancies to the RAE framework to achieve its new variants. Finally, we conduct extensive experiments to show that the new proposed autoencoders have favorable performance in learning latent manifold structure, image generation, and reconstruction. \ No newline at end of file diff --git a/data/2021/iclr/Improving Transformation Invariance in Contrastive Representation Learning b/data/2021/iclr/Improving Transformation Invariance in Contrastive Representation Learning new file mode 100644 index 0000000000..29a5b7d1ba --- /dev/null +++ b/data/2021/iclr/Improving Transformation Invariance in Contrastive Representation Learning @@ -0,0 +1 @@ +We propose methods to strengthen the invariance properties of representations obtained by contrastive learning. While existing approaches implicitly induce a degree of invariance as representations are learned, we look to more directly enforce invariance in the encoding process. To this end, we first introduce a training objective for contrastive learning that uses a novel regularizer to control how the representation changes under transformation.
We show that representations trained with this objective perform better on downstream tasks and are more robust to the introduction of nuisance transformations at test time. Second, we propose a change to how test-time representations are generated by introducing a feature averaging approach that combines encodings from multiple transformations of the original input, finding that this leads to across-the-board performance gains. Finally, we introduce the novel Spirograph dataset to explore our ideas in the context of a differentiable generative process with multiple downstream tasks, showing that our techniques for learning invariance are highly beneficial. \ No newline at end of file diff --git a/data/2021/iclr/Improving VAEs' Robustness to Adversarial Attack b/data/2021/iclr/Improving VAEs' Robustness to Adversarial Attack new file mode 100644 index 0000000000..6ca2fc7568 --- /dev/null +++ b/data/2021/iclr/Improving VAEs' Robustness to Adversarial Attack @@ -0,0 +1 @@ +Variational autoencoders (VAEs) have recently been shown to be vulnerable to adversarial attacks, wherein they are fooled into reconstructing a chosen target image. However, how to defend against such attacks remains an open problem. We make significant advances in addressing this issue by introducing methods for producing adversarially robust VAEs. Namely, we first demonstrate that methods used to obtain disentangled latent representations produce VAEs that are more robust to these attacks. However, this robustness comes at the cost of reducing the quality of the reconstructions. We, therefore, introduce a new hierarchical VAE, the $\textit{Seatbelt-VAE}$, which can produce high-fidelity autoencoders that are also adversarially robust. We confirm the capabilities of the Seatbelt-VAE on several different datasets and with current state-of-the-art VAE adversarial attacks.
\ No newline at end of file diff --git a/data/2021/iclr/Improving Zero-Shot Voice Style Transfer via Disentangled Representation Learning b/data/2021/iclr/Improving Zero-Shot Voice Style Transfer via Disentangled Representation Learning new file mode 100644 index 0000000000..158d0e3558 --- /dev/null +++ b/data/2021/iclr/Improving Zero-Shot Voice Style Transfer via Disentangled Representation Learning @@ -0,0 +1 @@ +Voice style transfer, also called voice conversion, seeks to modify one speaker's voice to generate speech as if it came from another (target) speaker. Previous works have made progress on voice conversion with parallel training data and pre-known speakers. However, zero-shot voice style transfer, which learns from non-parallel data and generates voices for previously unseen speakers, remains a challenging problem. We propose a novel zero-shot voice transfer method via disentangled representation learning. The proposed method first encodes speaker-related style and voice content of each input voice into separated low-dimensional embedding spaces, and then transfers to a new voice by combining the source content embedding and target style embedding through a decoder. With information-theoretic guidance, the style and content embedding spaces are representative and (ideally) independent of each other. On real-world VCTK datasets, our method outperforms other baselines and obtains state-of-the-art results in terms of transfer accuracy and voice naturalness for voice style transfer experiments under both many-to-many and zero-shot setups. 
\ No newline at end of file diff --git a/data/2021/iclr/In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning b/data/2021/iclr/In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning new file mode 100644 index 0000000000..69c59f5a15 --- /dev/null +++ b/data/2021/iclr/In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning @@ -0,0 +1 @@ +The recent research in semi-supervised learning (SSL) is mostly dominated by consistency regularization based methods which achieve strong performance. However, they heavily rely on domain-specific data augmentations, which are not easy to generate for all data modalities. Pseudo-labeling (PL) is a general SSL approach that does not have this constraint but performs relatively poorly in its original formulation. We argue that PL underperforms due to the erroneous high confidence predictions from poorly calibrated models; these predictions generate many incorrect pseudo-labels, leading to noisy training. We propose an uncertainty-aware pseudo-label selection (UPS) framework which improves pseudo labeling accuracy by drastically reducing the amount of noise encountered in the training process. Furthermore, UPS generalizes the pseudo-labeling process, allowing for the creation of negative pseudo-labels; these negative pseudo-labels can be used for multi-label classification as well as negative learning to improve the single-label classification. We achieve strong performance when compared to recent SSL methods on the CIFAR-10 and CIFAR-100 datasets. Also, we demonstrate the versatility of our method on the video dataset UCF-101 and the multi-label dataset Pascal VOC. 
\ No newline at end of file diff --git a/data/2021/iclr/In Search of Lost Domain Generalization b/data/2021/iclr/In Search of Lost Domain Generalization new file mode 100644 index 0000000000..3dda7233c6 --- /dev/null +++ b/data/2021/iclr/In Search of Lost Domain Generalization @@ -0,0 +1 @@ +The goal of domain generalization algorithms is to predict well on distributions different from those seen during training. While a myriad of domain generalization algorithms exist, inconsistencies in experimental conditions -- datasets, architectures, and model selection criteria -- render fair and realistic comparisons difficult. In this paper, we are interested in understanding how useful domain generalization algorithms are in realistic settings. As a first step, we realize that model selection is non-trivial for domain generalization tasks. Contrary to prior work, we argue that domain generalization algorithms without a model selection strategy should be regarded as incomplete. Next, we implement DomainBed, a testbed for domain generalization including seven multi-domain datasets, nine baseline algorithms, and three model selection criteria. We conduct extensive experiments using DomainBed and find that, when carefully implemented, empirical risk minimization shows state-of-the-art performance across all datasets. Looking forward, we hope that the release of DomainBed, along with contributions from fellow researchers, will streamline reproducible and rigorous research in domain generalization. 
\ No newline at end of file diff --git a/data/2021/iclr/In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness b/data/2021/iclr/In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness new file mode 100644 index 0000000000..f82903b723 --- /dev/null +++ b/data/2021/iclr/In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness @@ -0,0 +1 @@ +Consider a prediction setting where a few inputs (e.g., satellite images) are expensively annotated with the prediction targets (e.g., crop types), and many inputs are cheaply annotated with auxiliary information (e.g., climate information). How should we best leverage this auxiliary information for the prediction task? Empirically across three image and time-series datasets, and theoretically in a multi-task linear regression setting, we show that (i) using auxiliary information as input features improves in-distribution error but can hurt out-of-distribution (OOD) error; while (ii) using auxiliary information as outputs of auxiliary tasks to pre-train a model improves OOD error. To get the best of both worlds, we introduce In-N-Out, which first trains a model with auxiliary inputs and uses it to pseudolabel all the in-distribution inputs, then pre-trains a model on OOD auxiliary outputs and fine-tunes this model with the pseudolabels (self-training). We show both theoretically and empirically that In-N-Out outperforms auxiliary inputs or outputs alone on both in-distribution and OOD error. 
\ No newline at end of file diff --git a/data/2021/iclr/Incorporating Symmetry into Deep Dynamics Models for Improved Generalization b/data/2021/iclr/Incorporating Symmetry into Deep Dynamics Models for Improved Generalization new file mode 100644 index 0000000000..700870732d --- /dev/null +++ b/data/2021/iclr/Incorporating Symmetry into Deep Dynamics Models for Improved Generalization @@ -0,0 +1 @@ +Recent work has shown deep learning can accelerate the prediction of physical dynamics relative to numerical solvers. However, limited physical accuracy and an inability to generalize under distributional shift limit its applicability to the real world. We propose to improve accuracy and generalization by incorporating symmetries into deep neural networks. Specifically, we employ a variety of methods, each tailored to enforce a different symmetry. Our models are both theoretically and experimentally robust to distributional shift by the symmetry group transformations and enjoy favorable sample complexity. We demonstrate the advantage of our approach on a variety of physical dynamics including Rayleigh-Benard Convection and real-world ocean currents and temperatures. This is the first time that equivariant neural networks have been used to forecast physical dynamics. \ No newline at end of file diff --git a/data/2021/iclr/Incremental few-shot learning via vector quantization in deep embedded space b/data/2021/iclr/Incremental few-shot learning via vector quantization in deep embedded space new file mode 100644 index 0000000000..c2986ec1bc --- /dev/null +++ b/data/2021/iclr/Incremental few-shot learning via vector quantization in deep embedded space @@ -0,0 +1 @@ +The capability of incrementally learning new tasks without forgetting old ones is a challenging problem due to catastrophic forgetting. This challenge becomes greater when novel tasks contain very few labelled training samples.
Currently, most methods are dedicated to class-incremental learning and rely on sufficient training data to learn additional weights for newly added classes. Those methods cannot be easily extended to incremental regression tasks and could suffer from severe overfitting when learning few-shot novel tasks. In this study, we propose a nonparametric method in deep embedded space to tackle incremental few-shot learning problems. The knowledge about the learned tasks is compressed into a small number of quantized reference vectors. The proposed method learns new tasks sequentially by adding more reference vectors to the model using few-shot samples in each novel task. For classification problems, we employ the nearest neighbor scheme to make classification on sparsely available data and incorporate intra-class variation, less forgetting regularization and calibration of reference vectors to mitigate catastrophic forgetting. In addition, the proposed learning vector quantization (LVQ) in deep embedded space can be customized as a kernel smoother to handle incremental few-shot regression tasks. Experimental results demonstrate that the proposed method outperforms other state-of-the-art methods in incremental learning. \ No newline at end of file diff --git a/data/2021/iclr/Individually Fair Gradient Boosting b/data/2021/iclr/Individually Fair Gradient Boosting new file mode 100644 index 0000000000..085bbf42fb --- /dev/null +++ b/data/2021/iclr/Individually Fair Gradient Boosting @@ -0,0 +1 @@ +We consider the task of enforcing individual fairness in gradient boosting. Gradient boosting is a popular method for machine learning from tabular data, which arise often in applications where algorithmic fairness is a concern. At a high level, our approach is a functional gradient descent on a (distributionally) robust loss function that encodes our intuition of algorithmic fairness for the ML task at hand. 
Unlike prior approaches to individual fairness that only work with smooth ML models, our approach also works with non-smooth models such as decision trees. We show that our algorithm converges globally and generalizes. We also demonstrate the efficacy of our algorithm on three ML problems susceptible to algorithmic bias. \ No newline at end of file diff --git a/data/2021/iclr/Individually Fair Rankings b/data/2021/iclr/Individually Fair Rankings new file mode 100644 index 0000000000..90e58d4699 --- /dev/null +++ b/data/2021/iclr/Individually Fair Rankings @@ -0,0 +1 @@ +Rankings on online platforms help their end-users find the relevant information—people, news, media, and products—quickly. Fair ranking tasks, which ask to rank a set of items to maximize utility subject to satisfying group-fairness constraints, have gained significant interest in the Algorithmic Fairness, Information Retrieval, and Machine Learning literature. Recent works, however, identify uncertainty in the utilities of items as a primary cause of unfairness and propose introducing randomness in the output. This randomness is carefully chosen to guarantee an adequate representation of each item (while accounting for the uncertainty). However, due to this randomness, the output rankings may violate group fairness constraints. We give an efficient algorithm that samples rankings from an individually-fair distribution while ensuring that every output ranking is group fair. The expected utility of the output ranking is at least α times the utility of the optimal fair solution. Here, α depends on the utilities, position-discounts, and constraints—it approaches 1 as the range of utilities or the position-discounts shrinks, or when utilities satisfy distributional assumptions. Empirically, we observe that our algorithm achieves individual and group fairness and that it Pareto-dominates the state-of-the-art baselines.
\ No newline at end of file diff --git a/data/2021/iclr/Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks b/data/2021/iclr/Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks new file mode 100644 index 0000000000..5271e4f26f --- /dev/null +++ b/data/2021/iclr/Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks @@ -0,0 +1 @@ +Temporal networks serve as abstractions of many real-world dynamic systems. These networks typically evolve according to certain laws, such as the law of triadic closure, which is universal in social networks. Inductive representation learning of temporal networks should be able to capture such laws and further be applied to systems that follow the same laws but have not been seen during the training stage. Previous works in this area depend on either network node identities or rich edge attributes and typically fail to extract these laws. Here, we propose Causal Anonymous Walks (CAWs) to inductively represent a temporal network. CAWs are extracted by temporal random walks and work as automatic retrieval of temporal network motifs to represent network dynamics while avoiding the time-consuming selection and counting of those motifs. CAWs adopt a novel anonymization strategy that replaces node identities with the hitting counts of the nodes based on a set of sampled walks to keep the method inductive, and simultaneously establish the correlation between motifs. We further propose a neural-network model CAW-N to encode CAWs, and pair it with a CAW sampling strategy with constant memory and time cost to support online training and inference. CAW-N is evaluated to predict links over 6 real temporal networks and uniformly outperforms previous SOTA methods by an average 10% AUC gain in the inductive setting. CAW-N also outperforms previous methods in 4 out of the 6 networks in the transductive setting. 
\ No newline at end of file diff --git a/data/2021/iclr/Influence Estimation for Generative Adversarial Networks b/data/2021/iclr/Influence Estimation for Generative Adversarial Networks new file mode 100644 index 0000000000..3b8bd2ea54 --- /dev/null +++ b/data/2021/iclr/Influence Estimation for Generative Adversarial Networks @@ -0,0 +1 @@ +Identifying harmful instances, whose absence in a training dataset improves model performance, is important for building better machine learning models. Although previous studies have succeeded in estimating harmful instances under supervised settings, they cannot be trivially extended to generative adversarial networks (GANs). This is because previous approaches require that (1) the absence of a training instance directly affects the loss value and that (2) the change in the loss directly measures the harmfulness of the instance for the performance of a model. In GAN training, however, neither of the requirements is satisfied. This is because (1) the generator's loss is not directly affected by the training instances as they are not part of the generator's training steps, and (2) the values of GAN's losses normally do not capture the generative performance of a model. To this end, (1) we propose an influence estimation method that uses the Jacobian of the gradient of the generator's loss with respect to the discriminator's parameters (and vice versa) to trace how the absence of an instance in the discriminator's training affects the generator's parameters, and (2) we propose a novel evaluation scheme, in which we assess the harmfulness of each training instance on the basis of how a GAN evaluation metric (e.g., inception score) is expected to change due to the removal of the instance. We experimentally verified that our influence estimation method correctly inferred the changes in GAN evaluation metrics. 
Further, we demonstrated that the removal of the identified harmful instances effectively improved the model's generative performance with respect to various GAN evaluation metrics. \ No newline at end of file diff --git a/data/2021/iclr/Influence Functions in Deep Learning Are Fragile b/data/2021/iclr/Influence Functions in Deep Learning Are Fragile new file mode 100644 index 0000000000..557602a438 --- /dev/null +++ b/data/2021/iclr/Influence Functions in Deep Learning Are Fragile @@ -0,0 +1 @@ +Influence functions approximate the effect of training samples on test-time predictions and have a wide variety of applications in machine learning interpretability and uncertainty estimation. A commonly-used (first-order) influence function can be implemented efficiently as a post-hoc method requiring access only to the gradients and Hessian of the model. For linear models, influence functions are well-defined due to the convexity of the underlying loss function and are generally accurate even across difficult settings where model changes are fairly large, such as estimating group influences. Influence functions, however, are not well-understood in the context of deep learning with non-convex loss functions. In this paper, we provide a comprehensive and large-scale empirical study of successes and failures of influence functions in neural network models trained on datasets such as Iris, MNIST, CIFAR-10 and ImageNet. Through our extensive experiments, we show that the network architecture, its depth and width, as well as the extent of model parameterization and regularization techniques have strong effects on the accuracy of influence functions. 
In particular, we find that (i) influence estimates are fairly accurate for shallow networks, while for deeper networks the estimates are often erroneous; (ii) for certain network architectures and datasets, training with weight-decay regularization is important to get high-quality influence estimates; and (iii) the accuracy of influence estimates can vary significantly depending on the examined test points. These results suggest that in general influence functions in deep learning are fragile and call for developing improved influence estimation methods to mitigate these issues in non-convex setups. \ No newline at end of file diff --git a/data/2021/iclr/InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective b/data/2021/iclr/InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective new file mode 100644 index 0000000000..cdd01c2c62 --- /dev/null +++ b/data/2021/iclr/InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective @@ -0,0 +1 @@ +Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks. Recent studies, however, show that such BERT-based models are vulnerable to textual adversarial attacks. We aim to address this problem from an information-theoretic perspective, and propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models. InfoBERT contains two mutual-information-based regularizers for model training: (i) an Information Bottleneck regularizer, which suppresses noisy mutual information between the input and the feature representation; and (ii) a Robust Feature regularizer, which increases the mutual information between local robust features and global features. We provide a principled way to theoretically analyze and improve the robustness of representation learning for language models in both standard and adversarial training. 
Extensive experiments demonstrate that InfoBERT achieves state-of-the-art robust accuracy over several adversarial datasets on Natural Language Inference (NLI) and Question Answering (QA) tasks. \ No newline at end of file diff --git a/data/2021/iclr/Information Laundering for Model Privacy b/data/2021/iclr/Information Laundering for Model Privacy new file mode 100644 index 0000000000..5f3929dbbc --- /dev/null +++ b/data/2021/iclr/Information Laundering for Model Privacy @@ -0,0 +1 @@ +In this work, we propose information laundering, a novel framework for enhancing model privacy. Unlike data privacy that concerns the protection of raw data information, model privacy aims to protect an already-learned model that is to be deployed for public use. The private model can be obtained from general learning methods, and its deployment means that it will return a deterministic or random response for a given input query. An information-laundered model consists of probabilistic components that deliberately maneuver the intended input and output for queries to the model, so the model's adversarial acquisition is less likely. Under the proposed framework, we develop an information-theoretic principle to quantify the fundamental tradeoffs between model utility and privacy leakage and derive the optimal design. \ No newline at end of file diff --git a/data/2021/iclr/Initialization and Regularization of Factorized Neural Layers b/data/2021/iclr/Initialization and Regularization of Factorized Neural Layers new file mode 100644 index 0000000000..d52ef78336 --- /dev/null +++ b/data/2021/iclr/Initialization and Regularization of Factorized Neural Layers @@ -0,0 +1 @@ +Factorized layers--operations parameterized by products of two or more matrices--occur in a variety of deep learning contexts, including compressed model training, certain types of knowledge distillation, and multi-head self-attention architectures. 
We study how to initialize and regularize deep nets containing such layers, examining two simple, understudied schemes, spectral initialization and Frobenius decay, for improving their performance. The guiding insight is to design optimization routines for these networks that are as close as possible to that of their well-tuned, non-decomposed counterparts; we back this intuition with an analysis of how the initialization and regularization schemes impact training with gradient descent, drawing on modern attempts to understand the interplay of weight-decay and batch-normalization. Empirically, we highlight the benefits of spectral initialization and Frobenius decay across a variety of settings. In model compression, we show that they enable low-rank methods to significantly outperform both unstructured sparsity and tensor methods on the task of training low-memory residual networks; analogs of the schemes also improve the performance of tensor decomposition techniques. For knowledge distillation, Frobenius decay enables a simple, overcomplete baseline that yields a compact model from over-parameterized training without requiring retraining with or pruning a teacher network. Finally, we show how both schemes applied to multi-head attention lead to improved performance on both translation and unsupervised pre-training. \ No newline at end of file diff --git a/data/2021/iclr/Integrating Categorical Semantics into Unsupervised Domain Translation b/data/2021/iclr/Integrating Categorical Semantics into Unsupervised Domain Translation new file mode 100644 index 0000000000..b3ed94ac64 --- /dev/null +++ b/data/2021/iclr/Integrating Categorical Semantics into Unsupervised Domain Translation @@ -0,0 +1 @@ +While unsupervised domain translation (UDT) has seen a lot of success recently, we argue that allowing its translation to be mediated via categorical semantic features could enable wider applicability. 
In particular, we argue that categorical semantics are important when translating between domains with multiple object categories possessing distinctive styles, or even between domains that are simply too different but still share high-level semantics. We propose a method to learn, in an unsupervised manner, categorical semantic features (such as object labels) that are invariant across the source and target domains. We show that conditioning the style of an unsupervised domain translation method on the learned categorical semantics leads to considerably better preservation of high-level features on tasks such as MNIST$\leftrightarrow$SVHN and to a more realistic stylization on Sketches$\to$Reals. \ No newline at end of file diff --git a/data/2021/iclr/Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling b/data/2021/iclr/Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling new file mode 100644 index 0000000000..c96210a349 --- /dev/null +++ b/data/2021/iclr/Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling @@ -0,0 +1 @@ +Obtaining large annotated datasets is critical for training successful machine learning models and it is often a bottleneck in practice. Weak supervision offers a promising alternative for producing labeled datasets without ground truth annotations by generating probabilistic labels using multiple noisy heuristics. This process can scale to large datasets and has demonstrated state of the art performance in diverse domains such as healthcare and e-commerce. One practical issue with learning from user-generated heuristics is that their creation requires creativity, foresight, and domain expertise from those who hand-craft them, a process which can be tedious and subjective. We develop the first framework for interactive weak supervision in which a method proposes heuristics and learns from user feedback given on each proposed heuristic. 
Our experiments demonstrate that only a small number of feedback iterations are needed to train models that achieve highly competitive test set performance without access to ground truth training labels. We conduct user studies, which show that users are able to effectively provide feedback on heuristics and that test set results track the performance of simulated oracles. \ No newline at end of file diff --git a/data/2021/iclr/Interpretable Models for Granger Causality Using Self-explaining Neural Networks b/data/2021/iclr/Interpretable Models for Granger Causality Using Self-explaining Neural Networks new file mode 100644 index 0000000000..0783459ca2 --- /dev/null +++ b/data/2021/iclr/Interpretable Models for Granger Causality Using Self-explaining Neural Networks @@ -0,0 +1 @@ +Exploratory analysis of time series data can yield a better understanding of complex dynamical systems. Granger causality is a practical framework for analysing interactions in sequential data, applied in a wide range of domains. In this paper, we propose a novel framework for inferring multivariate Granger causality under nonlinear dynamics based on an extension of self-explaining neural networks. This framework is more interpretable than other neural-network-based techniques for inferring Granger causality, since in addition to relational inference, it also allows detecting signs of Granger-causal effects and inspecting their variability over time. In comprehensive experiments on simulated data, we show that our framework performs on par with several powerful baseline methods at inferring Granger causality and that it achieves better performance at inferring interaction signs. The results suggest that our framework is a viable and more interpretable alternative to sparse-input neural networks for inferring Granger causality. 
\ No newline at end of file diff --git a/data/2021/iclr/Interpretable Neural Architecture Search via Bayesian Optimisation with Weisfeiler-Lehman Kernels b/data/2021/iclr/Interpretable Neural Architecture Search via Bayesian Optimisation with Weisfeiler-Lehman Kernels new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking b/data/2021/iclr/Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking new file mode 100644 index 0000000000..bb42b9c088 --- /dev/null +++ b/data/2021/iclr/Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking @@ -0,0 +1 @@ +Graph neural networks (GNNs) have become a popular approach to integrating structural inductive biases into NLP models. However, there has been little work on interpreting them, and specifically on understanding which parts of the graphs (e.g. syntactic trees or co-reference structures) contribute to a prediction. In this work, we introduce a post-hoc method for interpreting the predictions of GNNs which identifies unnecessary edges. Given a trained GNN model, we learn a simple classifier that, for every edge in every layer, predicts if that edge can be dropped. We demonstrate that such a classifier can be trained in a fully differentiable fashion, employing stochastic gates and encouraging sparsity through the expected $L_0$ norm. We use our technique as an attribution method to analyze GNN models for two tasks -- question answering and semantic role labeling -- providing insights into the information flow in these models. We show that we can drop a large proportion of edges without deteriorating the performance of the model, while we can analyse the remaining edges for interpreting model predictions. 
\ No newline at end of file diff --git a/data/2021/iclr/Interpreting Knowledge Graph Relation Representation from Word Embeddings b/data/2021/iclr/Interpreting Knowledge Graph Relation Representation from Word Embeddings new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Interpreting and Boosting Dropout from a Game-Theoretic View b/data/2021/iclr/Interpreting and Boosting Dropout from a Game-Theoretic View new file mode 100644 index 0000000000..0a99597408 --- /dev/null +++ b/data/2021/iclr/Interpreting and Boosting Dropout from a Game-Theoretic View @@ -0,0 +1 @@ +This paper aims to understand and improve the utility of the dropout operation from the perspective of game-theoretic interactions. We prove that dropout can suppress the strength of interactions between input variables of deep neural networks (DNNs). The theoretic proof is also verified by various experiments. Furthermore, we find that such interactions were strongly related to the over-fitting problem in deep learning. Thus, the utility of dropout can be regarded as decreasing interactions to alleviate the significance of over-fitting. Based on this understanding, we propose an interaction loss to further improve the utility of dropout. Experimental results have shown that the interaction loss can effectively improve the utility of dropout and boost the performance of DNNs. \ No newline at end of file diff --git a/data/2021/iclr/Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds b/data/2021/iclr/Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds new file mode 100644 index 0000000000..3b4d593870 --- /dev/null +++ b/data/2021/iclr/Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds @@ -0,0 +1 @@ +Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. 
However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos. Prior audio-visual separation work assumed artificial limitations on the domain of sound classes (e.g., to speech or music), constrained the number of sources, and required strong sound separation or visual segmentation labels. AudioScope overcomes these limitations, operating on an open domain of sounds, with variable numbers of sources, and without labels or prior visual segmentation. The training procedure for AudioScope uses mixture invariant training (MixIT) to separate synthetic mixtures of mixtures (MoMs) into individual sources, where noisy labels for mixtures are provided by an unsupervised audio-visual coincidence model. Using the noisy labels, along with attention between video and audio features, AudioScope learns to identify audio-visual similarity and to suppress off-screen sounds. We demonstrate the effectiveness of our approach using a dataset of video clips extracted from open-domain YFCC100m video data. This dataset contains a wide diversity of sound classes recorded in unconstrained conditions, making the application of previous methods unsuitable. For evaluation and semi-supervised experiments, we collected human labels for presence of on-screen and off-screen sounds on a small subset of clips. 
\ No newline at end of file diff --git a/data/2021/iclr/Intraclass clustering: an implicit learning ability that regularizes DNNs b/data/2021/iclr/Intraclass clustering: an implicit learning ability that regularizes DNNs new file mode 100644 index 0000000000..61c0bd39a3 --- /dev/null +++ b/data/2021/iclr/Intraclass clustering: an implicit learning ability that regularizes DNNs @@ -0,0 +1 @@ +Several works have shown that the regularization mechanisms underlying deep neural networks' generalization performances are still poorly understood. In this paper, we hypothesize that deep neural networks are regularized through their ability to extract meaningful clusters among the samples of a class. This constitutes an implicit form of regularization, as no explicit training mechanisms or supervision target such behaviour. To support our hypothesis, we design four different measures of intraclass clustering, based on the neuron- and layer-level representations of the training data. We then show that these measures constitute accurate predictors of generalization performance across variations of a large set of hyperparameters (learning rate, batch size, optimizer, weight decay, dropout rate, data augmentation, network depth and width). \ No newline at end of file diff --git a/data/2021/iclr/Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures b/data/2021/iclr/Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures new file mode 100644 index 0000000000..d0b610c482 --- /dev/null +++ b/data/2021/iclr/Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures @@ -0,0 +1 @@ +None of the commonly used algorithms in protein learning were specifically designed for protein data, and none are able to capture all relevant structural levels of a protein during learning. To fill this gap, we propose two new learning operators, specifically designed to process protein structures. 
First, we introduce a novel convolution operator that considers the primary, secondary, and tertiary structure of a protein by using n-D convolutions defined on both the Euclidean distance, as well as multiple geodesic distances between the atoms in a multi-graph. Second, we introduce a set of hierarchical pooling operators that enable multi-scale protein analysis. We further evaluate the accuracy of our algorithms on common downstream tasks, where we outperform state-of-the-art protein learning algorithms. \ No newline at end of file diff --git a/data/2021/iclr/Is Attention Better Than Matrix Decomposition? b/data/2021/iclr/Is Attention Better Than Matrix Decomposition? new file mode 100644 index 0000000000..89678f9e3e --- /dev/null +++ b/data/2021/iclr/Is Attention Better Than Matrix Decomposition? @@ -0,0 +1 @@ +As an essential ingredient of modern deep learning, attention mechanism, especially self-attention, plays a vital role in the global correlation discovery. However, is hand-crafted attention irreplaceable when modeling the global context? Our intriguing finding is that self-attention is not better than the matrix decomposition (MD) model developed 20 years ago regarding the performance and computational cost for encoding the long-distance dependencies. We model the global context issue as a low-rank recovery problem and show that its optimization algorithms can help design global information blocks. This paper then proposes a series of Hamburgers, in which we employ the optimization algorithms for solving MDs to factorize the input representations into sub-matrices and reconstruct a low-rank embedding. Hamburgers with different MDs can perform favorably against the popular global context module self-attention when carefully coping with gradients back-propagated through MDs. 
Comprehensive experiments are conducted in the vision tasks where it is crucial to learn the global context, including semantic segmentation and image generation, demonstrating significant improvements over self-attention and its variants. \ No newline at end of file diff --git a/data/2021/iclr/Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study b/data/2021/iclr/Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study new file mode 100644 index 0000000000..83c2389b6f --- /dev/null +++ b/data/2021/iclr/Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study @@ -0,0 +1 @@ +This work aims to empirically clarify a recently discovered perspective that label smoothing is incompatible with knowledge distillation. We begin by introducing the motivation behind on how this incompatibility is raised, i.e., label smoothing erases relative information between teacher logits. We provide a novel connection on how label smoothing affects distributions of semantically similar and dissimilar classes. Then we propose a metric to quantitatively measure the degree of erased information in sample's representation. After that, we study its one-sidedness and imperfection of the incompatibility view through massive analyses, visualizations and comprehensive experiments on Image Classification, Binary Networks, and Neural Machine Translation. Finally, we broadly discuss several circumstances wherein label smoothing will indeed lose its effectiveness. Project page: http://zhiqiangshen.com/projects/LS_and_KD/index.html. 
\ No newline at end of file diff --git a/data/2021/iclr/IsarStep: a Benchmark for High-level Mathematical Reasoning b/data/2021/iclr/IsarStep: a Benchmark for High-level Mathematical Reasoning new file mode 100644 index 0000000000..7e9e7c7f5e --- /dev/null +++ b/data/2021/iclr/IsarStep: a Benchmark for High-level Mathematical Reasoning @@ -0,0 +1 @@ +A well-defined benchmark is essential for measuring and accelerating research progress of machine learning models. In this paper, we present a benchmark for high-level mathematical reasoning and study the reasoning capabilities of neural sequence-to-sequence models. We build a non-synthetic dataset from the largest repository of proofs written by human experts in a theorem prover. The dataset has a broad coverage of undergraduate and research-level mathematical and computer science theorems. In our defined task, a model is required to fill in a missing intermediate proposition given surrounding proofs. This task provides a starting point for the long-term goal of having machines generate human-readable proofs automatically. Our experiments and analysis reveal that while the task is challenging, neural models can capture non-trivial mathematical reasoning. We further design a hierarchical transformer that outperforms the transformer baseline. We will make the dataset and models publicly available. \ No newline at end of file diff --git a/data/2021/iclr/Isometric Propagation Network for Generalized Zero-shot Learning b/data/2021/iclr/Isometric Propagation Network for Generalized Zero-shot Learning new file mode 100644 index 0000000000..58c16d4469 --- /dev/null +++ b/data/2021/iclr/Isometric Propagation Network for Generalized Zero-shot Learning @@ -0,0 +1 @@ +Zero-shot learning (ZSL) aims to classify images of an unseen class only based on a few attributes describing that class but no access to any training sample. 
A popular strategy is to learn a mapping between the semantic space of class attributes and the visual space of images based on the seen classes and their data. Thus, an unseen class image can be ideally mapped to its corresponding class attributes. The key challenge is how to align the representations in the two spaces. For most ZSL settings, the attributes for each seen/unseen class are only represented by a vector while the seen-class data provide much more information. Thus, the imbalanced supervision from the semantic and the visual space can make the learned mapping easily overfit to the seen classes. To resolve this problem, we propose Isometric Propagation Network (IPN), which learns to strengthen the relation between classes within each space and align the class dependency in the two spaces. Specifically, IPN learns to propagate the class representations on an auto-generated graph within each space. In contrast to only aligning the resulting static representation, we regularize the two dynamic propagation procedures to be isometric in terms of the two graphs' edge weights per step by minimizing a consistency loss between them. IPN achieves state-of-the-art performance on three popular ZSL benchmarks. To evaluate the generalization capability of IPN, we further build two larger benchmarks with more diverse unseen classes and demonstrate the advantages of IPN on them. \ No newline at end of file diff --git a/data/2021/iclr/Isometric Transformation Invariant and Equivariant Graph Convolutional Networks b/data/2021/iclr/Isometric Transformation Invariant and Equivariant Graph Convolutional Networks new file mode 100644 index 0000000000..b195044960 --- /dev/null +++ b/data/2021/iclr/Isometric Transformation Invariant and Equivariant Graph Convolutional Networks @@ -0,0 +1 @@ +Graphs are one of the most important data structures for representing pairwise relations between objects. 
Specifically, a graph embedded in a Euclidean space is essential to solving real problems, such as object detection, structural chemistry analyses, and physical simulation. A crucial requirement to applying a graph in a Euclidean space is learning the isometric transformation invariant and equivariant features. In the present paper, we propose a set of transformation invariant and equivariant models based on graph convolutional networks (GCNs), called IsoGCNs. We demonstrate that the proposed model outperforms state-of-the-art methods on tasks related with geometrical and physical data. Moreover, the proposed model can scale up to the graphs with 1M vertices and conduct an inference faster than a conventional finite element analysis. \ No newline at end of file diff --git a/data/2021/iclr/Isotropy in the Contextual Embedding Space: Clusters and Manifolds b/data/2021/iclr/Isotropy in the Contextual Embedding Space: Clusters and Manifolds new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Iterated learning for emergent systematicity in VQA b/data/2021/iclr/Iterated learning for emergent systematicity in VQA new file mode 100644 index 0000000000..9132f4fbf7 --- /dev/null +++ b/data/2021/iclr/Iterated learning for emergent systematicity in VQA @@ -0,0 +1 @@ +Although neural module networks have an architectural bias towards compositionality, they require gold standard layouts to generalize systematically in practice. When instead learning layouts and modules jointly, compositionality does not arise automatically and an explicit pressure is necessary for the emergence of layouts exhibiting the right structure. We propose to address this problem using iterated learning, a cognitive science theory of the emergence of compositional languages in nature that has primarily been applied to simple referential games in machine learning. 
Considering the layouts of module networks as samples from an emergent language, we use iterated learning to encourage the development of structure within this language. We show that the resulting layouts support systematic generalization in neural agents solving the more complex task of visual question-answering. Our regularized iterated learning method can outperform baselines without iterated learning on SHAPES-SyGeT (SHAPES Systematic Generalization Test), a new split of the SHAPES dataset we introduce to evaluate systematic generalization, and on CLOSURE, an extension of CLEVR also designed to test systematic generalization. We demonstrate superior performance in recovering ground-truth compositional program structure with limited supervision on both SHAPES-SyGeT and CLEVR. \ No newline at end of file diff --git a/data/2021/iclr/Iterative Empirical Game Solving via Single Policy Best Response b/data/2021/iclr/Iterative Empirical Game Solving via Single Policy Best Response new file mode 100644 index 0000000000..617c4f814f --- /dev/null +++ b/data/2021/iclr/Iterative Empirical Game Solving via Single Policy Best Response @@ -0,0 +1 @@ +Policy-Space Response Oracles (PSRO) is a general algorithmic framework for learning policies in multiagent systems by interleaving empirical game analysis with deep reinforcement learning (Deep RL). At each iteration, Deep RL is invoked to train a best response to a mixture of opponent policies. The repeated application of Deep RL poses an expensive computational burden as we look to apply this algorithm to more complex domains. We introduce two variations of PSRO designed to reduce the amount of simulation required during Deep RL training. Both algorithms modify how PSRO adds new policies to the empirical game, based on learned responses to a single opponent policy. The first, Mixed-Oracles, transfers knowledge from previous iterations of Deep RL, requiring training only against the opponent's newest policy. 
The second, Mixed-Opponents, constructs a pure-strategy opponent by mixing existing strategies' action-value estimates, instead of their policies. Learning against a single policy mitigates variance in state outcomes that is induced by an unobserved distribution of opponents. We empirically demonstrate that these algorithms substantially reduce the amount of simulation during training required by PSRO, while producing equivalent or better solutions to the game. \ No newline at end of file diff --git a/data/2021/iclr/Kanerva++: Extending the Kanerva Machine With Differentiable, Locally Block Allocated Latent Memory b/data/2021/iclr/Kanerva++: Extending the Kanerva Machine With Differentiable, Locally Block Allocated Latent Memory new file mode 100644 index 0000000000..d8611f50c0 --- /dev/null +++ b/data/2021/iclr/Kanerva++: Extending the Kanerva Machine With Differentiable, Locally Block Allocated Latent Memory @@ -0,0 +1 @@ +Episodic and semantic memory are critical components of the human memory model. The theory of complementary learning systems (McClelland et al., 1995) suggests that the compressed representation produced by a serial event (episodic memory) is later restructured to build a more generalized form of reusable knowledge (semantic memory). In this work we develop a new principled Bayesian memory allocation scheme that bridges the gap between episodic and semantic memory via a hierarchical latent variable model. We take inspiration from traditional heap allocation and extend the idea of locally contiguous memory to the Kanerva Machine, enabling a novel differentiable block allocated latent memory. In contrast to the Kanerva Machine, we simplify the process of memory writing by treating it as a fully feed forward deterministic process, relying on the stochasticity of the read key distribution to disperse information within the memory.
We demonstrate that this allocation scheme improves performance in memory conditional image generation, resulting in new state-of-the-art conditional likelihood values on binarized MNIST (<=41.58 nats/image), binarized Omniglot (<=66.24 nats/image), as well as presenting competitive performance on CIFAR10, DMLab Mazes, Celeb-A and ImageNet32x32. \ No newline at end of file diff --git a/data/2021/iclr/Knowledge Distillation as Semiparametric Inference b/data/2021/iclr/Knowledge Distillation as Semiparametric Inference new file mode 100644 index 0000000000..2765740dd8 --- /dev/null +++ b/data/2021/iclr/Knowledge Distillation as Semiparametric Inference @@ -0,0 +1 @@ +A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model. Surprisingly, this two-step knowledge distillation process often leads to higher accuracy than training the student directly on labeled data. To explain and enhance this phenomenon, we cast knowledge distillation as a semiparametric inference problem with the optimal student model as the target, the unknown Bayes class probabilities as nuisance, and the teacher probabilities as a plug-in nuisance estimate. By adapting modern semiparametric tools, we derive new guarantees for the prediction error of standard distillation and develop two enhancements -- cross-fitting and loss correction -- to mitigate the impact of teacher overfitting and underfitting on student performance. We validate our findings empirically on both tabular and image data and observe consistent improvements from our knowledge distillation enhancements.
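The two-step distillation recipe described above (train the student to mimic the teacher's class probabilities) is commonly implemented with a temperature-softened KL objective. A minimal sketch; the temperature `T` and the KL direction are conventional choices, not specified by the abstract:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax over the last axis, numerically stabilized.
    e = np.exp(z / T - np.max(z / T, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened class probabilities:
    # the teacher's probabilities act as the plug-in soft targets.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean())
```

The loss is zero exactly when the student reproduces the teacher's probabilities, and positive otherwise.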
\ No newline at end of file diff --git a/data/2021/iclr/Knowledge distillation via softmax regression representation learning b/data/2021/iclr/Knowledge distillation via softmax regression representation learning new file mode 100644 index 0000000000..b5cbd39dc9 --- /dev/null +++ b/data/2021/iclr/Knowledge distillation via softmax regression representation learning @@ -0,0 +1 @@ +This paper addresses the problem of model compression via knowledge distillation. We advocate for a method that optimizes the output feature of the penultimate layer of the student network and hence is directly related to representation learning. To this end, we firstly propose a direct feature matching approach which focuses on optimizing the student’s penultimate layer only. Secondly, and more importantly, because feature matching does not take into account the classification problem at hand, we propose a second approach that decouples representation learning and classification and utilizes the teacher’s pre-trained classifier to train the student’s penultimate layer feature. In particular, for the same input image, we wish the teacher’s and student’s features to produce the same output when passed through the teacher’s classifier, which is achieved with a simple L2 loss. Our method is extremely simple to implement and straightforward to train and is shown to consistently outperform previous state-of-the-art methods over a large set of experimental settings including different (a) network architectures, (b) teacher-student capacities, (c) datasets, and (d) domains. The code is available at https://github.com/jingyang2017/KD_SRRL.
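The second approach in the abstract above (pass both the teacher's and the student's penultimate feature through the teacher's frozen classifier and match the outputs with an L2 loss) can be sketched as follows; the feature dimension, class count, and the equal weighting of the two terms are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W_t = rng.normal(size=(10, 64))  # teacher's pre-trained classifier, frozen (hypothetical shapes)

def srrl_loss(f_student, f_teacher):
    # Direct feature matching on the penultimate features, plus the
    # softmax-regression term: both features are passed through the
    # teacher's classifier and their outputs matched with L2.
    z_s = f_student @ W_t.T
    z_t = f_teacher @ W_t.T
    feat = np.mean((f_student - f_teacher) ** 2)
    sr = np.mean((z_s - z_t) ** 2)
    return feat + sr
```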
\ No newline at end of file diff --git a/data/2021/iclr/LEAF: A Learnable Frontend for Audio Classification b/data/2021/iclr/LEAF: A Learnable Frontend for Audio Classification new file mode 100644 index 0000000000..9fb175b007 --- /dev/null +++ b/data/2021/iclr/LEAF: A Learnable Frontend for Audio Classification @@ -0,0 +1 @@ +Mel-filterbanks are fixed, engineered audio features which emulate human perception and have been used through the history of audio understanding up to today. However, their undeniable qualities are counterbalanced by the fundamental limitations of handmade representations. In this work we show that we can train a single learnable frontend that outperforms mel-filterbanks on a wide range of audio signals, including speech, music, audio events and animal sounds, providing a general-purpose learned frontend for audio classification. To do so, we introduce a new principled, lightweight, fully learnable architecture that can be used as a drop-in replacement of mel-filterbanks. Our system learns all operations of audio features extraction, from filtering to pooling, compression and normalization, and can be integrated into any neural network at a negligible parameter cost. We perform multi-task training on eight diverse audio classification tasks, and show consistent improvements of our model over mel-filterbanks and previous learnable alternatives. Moreover, our system outperforms the current state-of-the-art learnable frontend on Audioset, with orders of magnitude fewer parameters. 
\ No newline at end of file diff --git a/data/2021/iclr/LambdaNetworks: Modeling long-range Interactions without Attention b/data/2021/iclr/LambdaNetworks: Modeling long-range Interactions without Attention new file mode 100644 index 0000000000..584dd7e31a --- /dev/null +++ b/data/2021/iclr/LambdaNetworks: Modeling long-range Interactions without Attention @@ -0,0 +1 @@ +We present lambda layers -- an alternative framework to self-attention -- for capturing long-range interactions between an input and structured contextual information (e.g. a pixel surrounded by other pixels). Lambda layers capture such interactions by transforming available contexts into linear functions, termed lambdas, and applying these linear functions to each input separately. Similar to linear attention, lambda layers bypass expensive attention maps, but in contrast, they model both content and position-based interactions which enables their application to large structured inputs such as images. The resulting neural network architectures, LambdaNetworks, significantly outperform their convolutional and attentional counterparts on ImageNet classification, COCO object detection and COCO instance segmentation, while being more computationally efficient. Additionally, we design LambdaResNets, a family of hybrid architectures across different scales, that considerably improves the speed-accuracy tradeoff of image classification models. LambdaResNets reach excellent accuracies on ImageNet while being 3.2 - 4.4x faster than the popular EfficientNets on modern machine learning accelerators. When training with an additional 130M pseudo-labeled images, LambdaResNets achieve up to a 9.5x speed-up over the corresponding EfficientNet checkpoints. 
\ No newline at end of file diff --git a/data/2021/iclr/Language-Agnostic Representation Learning of Source Code from Structure and Context b/data/2021/iclr/Language-Agnostic Representation Learning of Source Code from Structure and Context new file mode 100644 index 0000000000..99ceb7bad0 --- /dev/null +++ b/data/2021/iclr/Language-Agnostic Representation Learning of Source Code from Structure and Context @@ -0,0 +1 @@ +Source code (Context) and its parsed abstract syntax tree (AST; Structure) are two complementary representations of the same computer program. Traditionally, designers of machine learning models have relied predominantly either on Structure or Context. We propose a new model, which jointly learns on Context and Structure of source code. In contrast to previous approaches, our model uses only language-agnostic features, i.e., source code and features that can be computed directly from the AST. Besides obtaining state-of-the-art on monolingual code summarization on all five programming languages considered in this work, we propose the first multilingual code summarization model. We show that jointly training on non-parallel data from multiple programming languages improves results on all individual languages, where the strongest gains are on low-resource languages. Remarkably, multilingual training only from Context does not lead to the same improvements, highlighting the benefits of combining Structure and Context for representation learning on code. 
\ No newline at end of file diff --git a/data/2021/iclr/Large Associative Memory Problem in Neurobiology and Machine Learning b/data/2021/iclr/Large Associative Memory Problem in Neurobiology and Machine Learning new file mode 100644 index 0000000000..ca97a1624e --- /dev/null +++ b/data/2021/iclr/Large Associative Memory Problem in Neurobiology and Machine Learning @@ -0,0 +1 @@ +Dense Associative Memories or modern Hopfield networks permit storage and reliable retrieval of an exponentially large (in the dimension of feature space) number of memories. At the same time, their naive implementation is non-biological, since it seemingly requires the existence of many-body synaptic junctions between the neurons. We show that these models are effective descriptions of a more microscopic (written in terms of biological degrees of freedom) theory that has additional (hidden) neurons and only requires two-body interactions between them. For this reason our proposed microscopic theory is a valid model of large associative memory with a degree of biological plausibility. The dynamics of our network and its reduced dimensional equivalent both minimize energy (Lyapunov) functions. When certain dynamical variables (hidden neurons) are integrated out from our microscopic theory, one can recover many of the models that were previously discussed in the literature, e.g. the model presented in ''Hopfield Networks is All You Need'' paper. We also provide an alternative derivation of the energy function and the update rule proposed in the aforementioned paper and clarify the relationships between various models of this class. 
\ No newline at end of file diff --git a/data/2021/iclr/Large Batch Simulation for Deep Reinforcement Learning b/data/2021/iclr/Large Batch Simulation for Deep Reinforcement Learning new file mode 100644 index 0000000000..a030d9e975 --- /dev/null +++ b/data/2021/iclr/Large Batch Simulation for Deep Reinforcement Learning @@ -0,0 +1 @@ +We accelerate deep reinforcement learning-based training in visually complex 3D environments by two orders of magnitude over prior work, realizing end-to-end training speeds of over 19,000 frames of experience per second on a single GPU and up to 72,000 frames per second on a single eight-GPU machine. The key idea of our approach is to design a 3D renderer and embodied navigation simulator around the principle of "batch simulation": accepting and executing large batches of requests simultaneously. Beyond exposing large amounts of work at once, batch simulation allows implementations to amortize in-memory storage of scene assets, rendering work, data loading, and synchronization costs across many simulation requests, dramatically improving the number of simulated agents per GPU and overall simulation throughput. To balance DNN inference and training costs with faster simulation, we also build a computationally efficient policy DNN that maintains high task performance, and modify training algorithms to maintain sample efficiency when training with large mini-batches. By combining batch simulation and DNN performance optimizations, we demonstrate that PointGoal navigation agents can be trained in complex 3D environments on a single GPU in 1.5 days to 97% of the accuracy of agents trained on a prior state-of-the-art system using a 64-GPU cluster over three days. We provide open-source reference implementations of our batch 3D renderer and simulator to facilitate incorporation of these ideas into RL systems.
\ No newline at end of file diff --git a/data/2021/iclr/Large Scale Image Completion via Co-Modulated Generative Adversarial Networks b/data/2021/iclr/Large Scale Image Completion via Co-Modulated Generative Adversarial Networks new file mode 100644 index 0000000000..c25c1a2023 --- /dev/null +++ b/data/2021/iclr/Large Scale Image Completion via Co-Modulated Generative Adversarial Networks @@ -0,0 +1 @@ +Numerous task-specific variants of conditional generative adversarial networks have been developed for image completion. Yet, a serious limitation remains that all existing algorithms tend to fail when handling large-scale missing regions. To overcome this challenge, we propose a generic new approach that bridges the gap between image-conditional and recent modulated unconditional generative architectures via co-modulation of both conditional and stochastic style representations. Also, due to the lack of good quantitative metrics for image completion, we propose the new Paired/Unpaired Inception Discriminative Score (P-IDS/U-IDS), which robustly measures the perceptual fidelity of inpainted images compared to real images via linear separability in a feature space. Experiments demonstrate superior performance in terms of both quality and diversity over state-of-the-art methods in free-form image completion and easy generalization to image-to-image translation. Code is available at https://github.com/zsyzzsoft/co-mod-gan. 
\ No newline at end of file diff --git a/data/2021/iclr/Large-width functional asymptotics for deep Gaussian neural networks b/data/2021/iclr/Large-width functional asymptotics for deep Gaussian neural networks new file mode 100644 index 0000000000..d09953de3a --- /dev/null +++ b/data/2021/iclr/Large-width functional asymptotics for deep Gaussian neural networks @@ -0,0 +1 @@ +In this paper, we consider fully connected feed-forward deep neural networks where weights and biases are independent and identically distributed according to Gaussian distributions. Extending previous results (Matthews et al., 2018a;b; Yang, 2019) we adopt a function-space perspective, i.e. we look at neural networks as infinite-dimensional random elements on the input space $\mathbb{R}^I$. Under suitable assumptions on the activation function we show that: i) a network defines a continuous Gaussian process on the input space $\mathbb{R}^I$; ii) a network with re-scaled weights converges weakly to a continuous Gaussian process in the large-width limit; iii) the limiting Gaussian process has almost surely locally $\gamma$-H\"older continuous paths, for $0<\gamma<1$. Our results contribute to recent theoretical studies on the interplay between infinitely wide deep neural networks and Gaussian processes by establishing weak convergence in function-space with respect to a stronger metric. \ No newline at end of file diff --git a/data/2021/iclr/Latent Convergent Cross Mapping b/data/2021/iclr/Latent Convergent Cross Mapping new file mode 100644 index 0000000000..a383911f49 --- /dev/null +++ b/data/2021/iclr/Latent Convergent Cross Mapping @@ -0,0 +1 @@ +Discovering causal structures of temporal processes is a major tool of scientific inquiry because it helps us better understand and explain the mechanisms driving a phenomenon of interest, thereby facilitating analysis, reasoning, and synthesis for such systems. 
However, accurately inferring causal structures within a phenomenon based on observational data only is still an open problem. Indeed, this type of data usually consists in short time series with missing or noisy values for which causal inference is increasingly difficult. In this work, we propose a method to uncover causal relations in chaotic dynamical systems from short, noisy and sporadic time series (that is, incomplete observations at infrequent and irregular intervals) where the classical convergent cross mapping (CCM) fails. Our method works by learning a Neural ODE latent process modeling the state-space dynamics of the time series and by checking the existence of a continuous map between the resulting processes. We provide theoretical analysis and show empirically that Latent-CCM can reliably uncover the true causal pattern, unlike traditional methods. \ No newline at end of file diff --git a/data/2021/iclr/Latent Skill Planning for Exploration and Transfer b/data/2021/iclr/Latent Skill Planning for Exploration and Transfer new file mode 100644 index 0000000000..4f975e5e36 --- /dev/null +++ b/data/2021/iclr/Latent Skill Planning for Exploration and Transfer @@ -0,0 +1 @@ +To quickly solve new tasks in complex environments, intelligent agents need to build up reusable knowledge. For example, a learned world model captures knowledge about the environment that applies to new tasks. Similarly, skills capture general behaviors that can apply to new tasks. In this paper, we investigate how these two approaches can be integrated into a single reinforcement learning agent. Specifically, we leverage the idea of partial amortization for fast adaptation at test time. For this, actions are produced by a policy that is learned over time while the skills it conditions on are chosen using online planning. 
We demonstrate the benefits of our design decisions across a suite of challenging locomotion tasks and demonstrate improved sample efficiency in single tasks as well as in transfer from one task to another, as compared to competitive baselines. Videos are available at: https://sites.google.com/view/latent-skill-planning/ \ No newline at end of file diff --git a/data/2021/iclr/Layer-adaptive Sparsity for the Magnitude-based Pruning b/data/2021/iclr/Layer-adaptive Sparsity for the Magnitude-based Pruning new file mode 100644 index 0000000000..dbb36f5f3e --- /dev/null +++ b/data/2021/iclr/Layer-adaptive Sparsity for the Magnitude-based Pruning @@ -0,0 +1 @@ +Recent discoveries on neural network pruning reveal that, with a carefully chosen layerwise sparsity, a simple magnitude-based pruning achieves state-of-the-art tradeoff between sparsity and performance. However, without a clear consensus on "how to choose," the layerwise sparsities are mostly selected algorithm-by-algorithm, often resorting to handcrafted heuristics or an extensive hyperparameter search. To fill this gap, we propose a novel importance score for global pruning, coined layer-adaptive magnitude-based pruning (LAMP) score; the score is a rescaled version of weight magnitude that incorporates the model-level $\ell_2$ distortion incurred by pruning, and does not require any hyperparameter tuning or heavy computation. Under various image classification setups, LAMP consistently outperforms popular existing schemes for layerwise sparsity selection. Furthermore, we observe that LAMP continues to outperform baselines even in weight-rewinding setups, while the connectivity-oriented layerwise sparsity (the strongest baseline overall) performs worse than a simple global magnitude-based pruning in this case.
Code: https://github.com/jaeho-lee/layer-adaptive-sparsity \ No newline at end of file diff --git a/data/2021/iclr/Learnable Embedding sizes for Recommender Systems b/data/2021/iclr/Learnable Embedding sizes for Recommender Systems new file mode 100644 index 0000000000..8fb170d4ad --- /dev/null +++ b/data/2021/iclr/Learnable Embedding sizes for Recommender Systems @@ -0,0 +1 @@ +The embedding-based representation learning is commonly used in deep learning recommendation models to map the raw sparse features to dense vectors. The traditional embedding manner that assigns a uniform size to all features has two issues. First, the numerous features inevitably lead to a gigantic embedding table that causes a high memory usage cost. Second, it is likely to cause the over-fitting problem for those features that do not require too large representation capacity. Existing works that try to address the problem always cause a significant drop in recommendation performance or suffer from the limitation of unaffordable training time cost. In this paper, we propose a novel approach, named PEP (short for Plug-in Embedding Pruning), to reduce the size of the embedding table while avoiding a drop in accuracy and heavy computational cost. PEP prunes embedding parameters, where the pruning threshold(s) can be adaptively learned from data. Therefore we can automatically obtain a mixed-dimension embedding-scheme by pruning redundant parameters for each feature. PEP is a general framework that can be plugged into various base recommendation models. Extensive experiments demonstrate it can efficiently cut down embedding parameters and boost the base model's performance. Specifically, it achieves strong recommendation performance while reducing 97-99% of parameters. As for the computation cost, PEP only brings an additional 20-30% time cost compared with base models. Codes are available at https://github.com/ssui-liu/learnable-embed-sizes-for-RecSys.
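The learnable-threshold pruning idea behind PEP can be illustrated with a soft-threshold reparameterization of the embedding table; the exact parameterization below (a sigmoid-transformed scalar threshold `s`) is our assumption for illustration, not the paper's specification:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prune_embeddings(W, s):
    # Soft-threshold reparameterization: embedding entries whose magnitude
    # falls below the learned threshold sigmoid(s) are zeroed, so each
    # feature ends up with its own effective embedding size.
    thr = sigmoid(s)
    return np.sign(W) * np.maximum(np.abs(W) - thr, 0.0)
```

In training, `s` would be optimized jointly with `W` by gradient descent, letting the data decide how aggressively each region of the table is pruned.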
\ No newline at end of file diff --git "a/data/2021/iclr/Learning \"What-if\" Explanations for Sequential Decision-Making" "b/data/2021/iclr/Learning \"What-if\" Explanations for Sequential Decision-Making" new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Learning A Minimax Optimizer: A Pilot Study b/data/2021/iclr/Learning A Minimax Optimizer: A Pilot Study new file mode 100644 index 0000000000..eb7f88ffed --- /dev/null +++ b/data/2021/iclr/Learning A Minimax Optimizer: A Pilot Study @@ -0,0 +1 @@ +Solving continuous minimax optimization is of extensive practical interest, yet notoriously unstable and difficult. This paper introduces the learning to optimize (L2O) methodology to minimax problems for the first time and addresses its accompanying unique challenges. We first present Twin-L2O, the first dedicated minimax L2O framework consisting of two LSTMs for updating min and max variables separately. The decoupled design is found to facilitate learning, particularly when the min and max variables are highly asymmetric. Empirical experiments on a variety of minimax problems corroborate the effectiveness of Twin-L2O. We then discuss a crucial concern of Twin-L2O, i.e., its inevitably limited generalizability to unseen optimizees. To address this issue, we present two complementary strategies. Our first solution, Enhanced Twin-L2O, is empirically applicable to general minimax problems, improving L2O training via leveraging curriculum learning. Our second alternative, called Safeguarded Twin-L2O, is a preliminary theoretical exploration stating that under some strong assumptions, it is possible to theoretically establish the convergence of Twin-L2O. We benchmark our algorithms on several testbed problems and compare against state-of-the-art minimax solvers. The code is available at: https://github.
\ No newline at end of file diff --git a/data/2021/iclr/Learning Accurate Entropy Model with Global Reference for Image Compression b/data/2021/iclr/Learning Accurate Entropy Model with Global Reference for Image Compression new file mode 100644 index 0000000000..89225176da --- /dev/null +++ b/data/2021/iclr/Learning Accurate Entropy Model with Global Reference for Image Compression @@ -0,0 +1 @@ +In recent deep image compression neural networks, the entropy model plays a critical role in estimating the prior distribution of deep image encodings. Existing methods combine hyperprior with local context in the entropy estimation function. This greatly limits their performance due to the absence of a global vision. In this work, we propose a novel Global Reference Model for image compression to effectively leverage both the local and the global context information, leading to an enhanced compression rate. The proposed method scans decoded latents and then finds the most relevant latent to assist the distribution estimating of the current latent. A by-product of this work is the innovation of a mean-shifting GDN module that further improves the performance. Experimental results demonstrate that the proposed model outperforms the rate-distortion performance of most of the state-of-the-art methods in the industry. \ No newline at end of file diff --git a/data/2021/iclr/Learning Associative Inference Using Fast Weight Memory b/data/2021/iclr/Learning Associative Inference Using Fast Weight Memory new file mode 100644 index 0000000000..748a6db91b --- /dev/null +++ b/data/2021/iclr/Learning Associative Inference Using Fast Weight Memory @@ -0,0 +1 @@ +Humans can quickly associate stimuli to solve problems in novel contexts. Our novel neural network model learns state representations of facts that can be composed to perform such associative inference. To this end, we augment the LSTM model with an associative memory, dubbed Fast Weight Memory (FWM). 
Through differentiable operations at every step of a given input sequence, the LSTM updates and maintains compositional associations stored in the rapidly changing FWM weights. Our model is trained end-to-end by gradient descent and yields excellent performance on compositional language reasoning problems, meta-reinforcement-learning for POMDPs, and small-scale word-level language modelling. \ No newline at end of file diff --git a/data/2021/iclr/Learning Better Structured Representations Using Low-rank Adaptive Label Smoothing b/data/2021/iclr/Learning Better Structured Representations Using Low-rank Adaptive Label Smoothing new file mode 100644 index 0000000000..9c044852a6 --- /dev/null +++ b/data/2021/iclr/Learning Better Structured Representations Using Low-rank Adaptive Label Smoothing @@ -0,0 +1 @@ +Training with soft targets instead of hard targets has been shown to improve performance and calibration of deep neural networks. Label smoothing is a popular way of computing soft targets, where one-hot encoding of a class is smoothed with a uniform distribution. Owing to its simplicity, label smoothing has found wide-spread use for training deep neural networks on a wide variety of tasks, ranging from image and text classification to machine translation and semantic parsing. Complementing recent empirical justification for label smoothing, we obtain PAC-Bayesian generalization bounds for label smoothing and show that the generalization error depends on the choice of the noise (smoothing) distribution. Then we propose low-rank adaptive label smoothing (LORAS): a simple yet novel method for training with learned soft targets that generalizes label smoothing and adapts to the latent structure of the label space in structured prediction tasks. Specifically, we evaluate our method on semantic parsing tasks and show that training with appropriately smoothed soft targets can significantly improve accuracy and model calibration, especially in low-resource settings. 
Used in conjunction with pre-trained sequence-to-sequence models, our method achieves state-of-the-art performance on four semantic parsing data sets. LORAS can be used with any model, improves performance and implicit model calibration without increasing the number of model parameters, and can be scaled to problems with large label spaces containing tens of thousands of labels. \ No newline at end of file diff --git a/data/2021/iclr/Learning Cross-Domain Correspondence for Control with Dynamics Cycle-Consistency b/data/2021/iclr/Learning Cross-Domain Correspondence for Control with Dynamics Cycle-Consistency new file mode 100644 index 0000000000..2330c3dbef --- /dev/null +++ b/data/2021/iclr/Learning Cross-Domain Correspondence for Control with Dynamics Cycle-Consistency @@ -0,0 +1 @@ +At the heart of many robotics problems is the challenge of learning correspondences across domains. For instance, imitation learning requires obtaining correspondence between humans and robots; sim-to-real requires correspondence between physics simulators and the real world; transfer learning requires correspondences between different robotics environments. This paper aims to learn correspondence across domains differing in representation (vision vs. internal state), physics parameters (mass and friction), and morphology (number of limbs). Importantly, correspondences are learned using unpaired and randomly collected data from the two domains. We propose \textit{dynamics cycles} that align dynamic robot behavior across two domains using a cycle-consistency constraint. Once this correspondence is found, we can directly transfer the policy trained on one domain to the other, without needing any additional fine-tuning on the second domain. We perform experiments across a variety of problem domains, both in simulation and on a real robot.
Our framework is able to align uncalibrated monocular video of a real robot arm to dynamic state-action trajectories of a simulated arm without paired data. Video demonstrations of our results are available at: this https URL . \ No newline at end of file diff --git a/data/2021/iclr/Learning Deep Features in Instrumental Variable Regression b/data/2021/iclr/Learning Deep Features in Instrumental Variable Regression new file mode 100644 index 0000000000..daa044b998 --- /dev/null +++ b/data/2021/iclr/Learning Deep Features in Instrumental Variable Regression @@ -0,0 +1 @@ +Instrumental variable (IV) regression is a standard strategy for learning causal relationships between confounded treatment and outcome variables from observational data by utilizing an instrumental variable, which affects the outcome only through the treatment. In classical IV regression, learning proceeds in two stages: stage 1 performs linear regression from the instrument to the treatment; and stage 2 performs linear regression from the treatment to the outcome, conditioned on the instrument. We propose a novel method, deep feature instrumental variable regression (DFIV), to address the case where relations between instruments, treatments, and outcomes may be nonlinear. In this case, deep neural nets are trained to define informative nonlinear features on the instruments and treatments. We propose an alternating training regime for these features to ensure good end-to-end performance when composing stages 1 and 2, thus obtaining highly flexible feature maps in a computationally efficient manner. DFIV outperforms recent state-of-the-art methods on challenging IV benchmarks, including settings involving high dimensional image data. DFIV also exhibits competitive performance in off-policy policy evaluation for reinforcement learning, which can be understood as an IV regression task. 
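The classical two-stage IV regression that DFIV generalizes can be sketched directly from the description above: stage 1 regresses the treatment on the instrument, and stage 2 regresses the outcome on the stage-1 predicted treatment. A minimal least-squares sketch:

```python
import numpy as np

def two_stage_least_squares(Z, T, Y):
    # Stage 1: linear regression from instrument Z to treatment T.
    beta1, *_ = np.linalg.lstsq(Z, T, rcond=None)
    T_hat = Z @ beta1
    # Stage 2: linear regression from the predicted treatment to outcome Y;
    # using T_hat instead of T removes the confounding bias.
    beta2, *_ = np.linalg.lstsq(T_hat, Y, rcond=None)
    return beta2
```

On simulated confounded data, this recovers the true causal effect where a direct regression of Y on T would be biased.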
\ No newline at end of file diff --git a/data/2021/iclr/Learning Energy-Based Generative Models via Coarse-to-Fine Expanding and Sampling b/data/2021/iclr/Learning Energy-Based Generative Models via Coarse-to-Fine Expanding and Sampling new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Learning Energy-Based Models by Diffusion Recovery Likelihood b/data/2021/iclr/Learning Energy-Based Models by Diffusion Recovery Likelihood new file mode 100644 index 0000000000..6477b8da59 --- /dev/null +++ b/data/2021/iclr/Learning Energy-Based Models by Diffusion Recovery Likelihood @@ -0,0 +1 @@ +While energy-based models (EBMs) exhibit a number of desirable properties, training and sampling on high-dimensional datasets remains challenging. Inspired by recent progress on diffusion probabilistic models, we present a diffusion recovery likelihood method to tractably learn and sample from a sequence of EBMs trained on increasingly noisy versions of a dataset. Each EBM is trained by maximizing the recovery likelihood: the conditional probability of the data at a certain noise level given their noisy versions at a higher noise level. The recovery likelihood objective is more tractable than the marginal likelihood objective, since it only requires MCMC sampling from a relatively concentrated conditional distribution. Moreover, we show that this estimation method is theoretically consistent: it learns the correct conditional and marginal distributions at each noise level, given sufficient data. After training, synthesized images can be generated efficiently by a sampling process that initializes from a spherical Gaussian distribution and progressively samples the conditional distributions at decreasingly lower noise levels. Our method generates high fidelity samples on various image datasets. On unconditional CIFAR-10 our method achieves FID 9.60 and inception score 8.58, superior to the majority of GANs. 
Moreover, we demonstrate that unlike previous work on EBMs, our long-run MCMC samples from the conditional distributions do not diverge and still represent realistic images, allowing us to accurately estimate the normalized density of data even for high-dimensional datasets. \ No newline at end of file diff --git a/data/2021/iclr/Learning Generalizable Visual Representations via Interactive Gameplay b/data/2021/iclr/Learning Generalizable Visual Representations via Interactive Gameplay new file mode 100644 index 0000000000..186ff87782 --- /dev/null +++ b/data/2021/iclr/Learning Generalizable Visual Representations via Interactive Gameplay @@ -0,0 +1 @@ +Numerous approaches have recently emerged in the realm of self-supervised visual representation learning. While these methods have demonstrated empirical success, a theoretical foundation that understands and unifies these diverse techniques remains to be established. In this work, we draw inspiration from the principles underlying brain-based learning and propose a new method named self-supervised information bottleneck. Our method aims to maximize the mutual information between representations of views derived from the same image, while maintaining a minimal mutual information between the view and its corresponding representation at the same time. The brain-inspired method provides a unified information-theoretic perspective on various self-supervised approaches. This unified framework also empowers the model to learn generalizable visual representations for diverse downstream tasks and data distributions, achieving state-of-the-art performance across a wide variety of image and video tasks. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning Hyperbolic Representations of Topological Features b/data/2021/iclr/Learning Hyperbolic Representations of Topological Features new file mode 100644 index 0000000000..98fad6f81e --- /dev/null +++ b/data/2021/iclr/Learning Hyperbolic Representations of Topological Features @@ -0,0 +1 @@ +Learning task-specific representations of persistence diagrams is an important problem in topological data analysis and machine learning. However, current state of the art methods are restricted in terms of their expressivity as they are focused on Euclidean representations. Persistence diagrams often contain features of infinite persistence (i.e., essential features) and Euclidean spaces shrink their importance relative to non-essential features because they cannot assign infinite distance to finite points. To deal with this issue, we propose a method to learn representations of persistence diagrams on hyperbolic spaces, more specifically on the Poincare ball. By representing features of infinite persistence infinitesimally close to the boundary of the ball, their distance to non-essential features approaches infinity, thereby their relative importance is preserved. This is achieved without utilizing extremely high values for the learnable parameters, thus the representation can be fed into downstream optimization methods and trained efficiently in an end-to-end fashion. We present experimental results on graph and image classification tasks and show that the performance of our method is on par with or exceeds the performance of other state of the art methods. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning Incompressible Fluid Dynamics from Scratch - Towards Fast, Differentiable Fluid Models that Generalize b/data/2021/iclr/Learning Incompressible Fluid Dynamics from Scratch - Towards Fast, Differentiable Fluid Models that Generalize new file mode 100644 index 0000000000..e92be09f22 --- /dev/null +++ b/data/2021/iclr/Learning Incompressible Fluid Dynamics from Scratch - Towards Fast, Differentiable Fluid Models that Generalize @@ -0,0 +1 @@ +Fast and stable fluid simulations are an essential prerequisite for applications ranging from computer-generated imagery to computer-aided design in research and development. However, solving the partial differential equations of incompressible fluids is a challenging task and traditional numerical approximation schemes come at high computational costs. Recent deep learning based approaches promise vast speed-ups but do not generalize to new fluid domains, require fluid simulation data for training, or rely on complex pipelines that outsource major parts of the fluid simulation to traditional methods. In this work, we propose a novel physics-constrained training approach that generalizes to new fluid domains, requires no fluid simulation data, and allows convolutional neural networks to map a fluid state from time-point t to a subsequent state at time t + dt in a single forward pass. This simplifies the pipeline to train and evaluate neural fluid models. After training, the framework yields models that are capable of fast fluid simulations and can handle various fluid phenomena including the Magnus effect and Karman vortex streets. We present an interactive real-time demo to show the speed and generalization capabilities of our trained models. Moreover, the trained neural networks are efficient differentiable fluid solvers as they offer a differentiable update step to advance the fluid simulation in time. 
We exploit this fact in a proof-of-concept optimal control experiment. Our models significantly outperform a recent differentiable fluid solver in terms of computational speed and accuracy. \ No newline at end of file diff --git a/data/2021/iclr/Learning Invariant Representations for Reinforcement Learning without Reconstruction b/data/2021/iclr/Learning Invariant Representations for Reinforcement Learning without Reconstruction new file mode 100644 index 0000000000..2aa4679b25 --- /dev/null +++ b/data/2021/iclr/Learning Invariant Representations for Reinforcement Learning without Reconstruction @@ -0,0 +1 @@ +We study how representation learning can accelerate reinforcement learning from rich observations, such as images, without relying either on domain knowledge or pixel-reconstruction. Our goal is to learn representations that both provide for effective downstream control and invariance to task-irrelevant details. Bisimulation metrics quantify behavioral similarity between states in continuous MDPs, which we propose using to learn robust latent representations which encode only the task-relevant information from observations. Our method trains encoders such that distances in latent space equal bisimulation distances in state space. We demonstrate the effectiveness of our method at disregarding task-irrelevant information using modified visual MuJoCo tasks, where the background is replaced with moving distractors and natural videos, while achieving SOTA performance. We also test a first-person highway driving task where our method learns invariance to clouds, weather, and time of day. Finally, we provide generalization results drawn from properties of bisimulation metrics, and links to causal inference. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning Long-term Visual Dynamics with Region Proposal Interaction Networks b/data/2021/iclr/Learning Long-term Visual Dynamics with Region Proposal Interaction Networks new file mode 100644 index 0000000000..d7aa90eb64 --- /dev/null +++ b/data/2021/iclr/Learning Long-term Visual Dynamics with Region Proposal Interaction Networks @@ -0,0 +1 @@ +Learning long-term dynamics models is the key to understanding physical common sense. Most existing approaches to learning dynamics from visual input sidestep long-term predictions by resorting to rapid re-planning with short-term models. This not only requires such models to be super accurate but also limits them only to tasks where an agent can continuously obtain feedback and take action at each step until completion. In this paper, we aim to leverage the ideas from success stories in visual recognition tasks to build object representations that can capture inter-object and object-environment interactions over a long range. To this end, we propose Region Proposal Interaction Networks (RPIN), which reason about each object's trajectory in a latent region-proposal feature space. Thanks to the simple yet effective object representation, our approach outperforms prior methods by a significant margin both in terms of prediction quality and its ability to plan for downstream tasks, and also generalizes well to novel environments. Code, pre-trained models, and more visualization results are available at this https URL. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning Manifold Patch-Based Representations of Man-Made Shapes b/data/2021/iclr/Learning Manifold Patch-Based Representations of Man-Made Shapes new file mode 100644 index 0000000000..54771d1e01 --- /dev/null +++ b/data/2021/iclr/Learning Manifold Patch-Based Representations of Man-Made Shapes @@ -0,0 +1 @@ +Choosing the right shape representation for geometry is crucial for making 3D models compatible with existing applications. Focusing on piecewise-smooth man-made shapes, we propose a new representation that is usable in conventional CAD modeling pipelines and can also be learned by deep neural networks. We demonstrate the benefits of our representation by applying it to the task of sketch-based modeling. Given a raster image, our system infers a set of parametric surfaces that realize the input in 3D. To capture the piecewise smooth geometry of man-made shapes, we learn a special shape representation: a deformable parametric template composed of Coons patches. Naively training such a system, however, would suffer from non-manifold artifacts of the parametric shapes as well as from a lack of data. To address this, we introduce loss functions that bias the network to output non-self-intersecting shapes and implement them as part of a fully self-supervised system, automatically generating both shape templates and synthetic training data. To test the efficacy of our system, we develop a testbed for sketch-based modeling and show results on a gallery of synthetic and real artist sketches. As additional applications, we also demonstrate shape interpolation and provide comparison to related work. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning Mesh-Based Simulation with Graph Networks b/data/2021/iclr/Learning Mesh-Based Simulation with Graph Networks new file mode 100644 index 0000000000..b7487b8eb0 --- /dev/null +++ b/data/2021/iclr/Learning Mesh-Based Simulation with Graph Networks @@ -0,0 +1 @@ +Mesh-based simulations are central to modeling complex physical systems in many disciplines across science and engineering. Mesh representations support powerful numerical integration methods and their resolution can be adapted to strike favorable trade-offs between accuracy and efficiency. However, high-dimensional scientific simulations are very expensive to run, and solvers and parameters must often be tuned individually to each system studied. Here we introduce MeshGraphNets, a framework for learning mesh-based simulations using graph neural networks. Our model can be trained to pass messages on a mesh graph and to adapt the mesh discretization during forward simulation. Our results show it can accurately predict the dynamics of a wide range of physical systems, including aerodynamics, structural mechanics, and cloth. The model's adaptivity supports learning resolution-independent dynamics and can scale to more complex state spaces at test time. Our method is also highly efficient, running 1-2 orders of magnitude faster than the simulation on which it is trained. Our approach broadens the range of problems on which neural network simulators can operate and promises to improve the efficiency of complex, scientific modeling tasks. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning N: M Fine-grained Structured Sparse Neural Networks From Scratch b/data/2021/iclr/Learning N: M Fine-grained Structured Sparse Neural Networks From Scratch new file mode 100644 index 0000000000..92a3e0c15b --- /dev/null +++ b/data/2021/iclr/Learning N: M Fine-grained Structured Sparse Neural Networks From Scratch @@ -0,0 +1 @@ +Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate models in resource-constrained environments. It can be generally categorized into unstructured fine-grained sparsity, which zeroes out multiple individual weights distributed across the neural network, and structured coarse-grained sparsity, which prunes blocks of sub-networks of a neural network. Fine-grained sparsity can achieve a high compression ratio but is not hardware friendly and hence receives limited speed gains. On the other hand, coarse-grained sparsity cannot simultaneously achieve both apparent acceleration on modern GPUs and decent performance. In this paper, we are the first to study training an N:M fine-grained structured sparse network from scratch, which can maintain the advantages of both unstructured fine-grained sparsity and structured coarse-grained sparsity simultaneously on specifically designed GPUs. Specifically, a 2:4 sparse network could achieve a 2× speed-up without performance drop on Nvidia A100 GPUs. Furthermore, we propose a novel and effective ingredient, the sparse-refined straight-through estimator (SR-STE), to alleviate the negative influence of the approximated gradients computed by vanilla STE during optimization. We also define a metric, Sparse Architecture Divergence (SAD), to measure the sparse network's topology change during the training process. Finally, we justify SR-STE's advantages with SAD and demonstrate the effectiveness of SR-STE by performing comprehensive experiments on various tasks. 
Anonymous code and model will be available at https://github.com/anonymous-NM-sparsity/NM-sparsity. \ No newline at end of file diff --git a/data/2021/iclr/Learning Neural Event Functions for Ordinary Differential Equations b/data/2021/iclr/Learning Neural Event Functions for Ordinary Differential Equations new file mode 100644 index 0000000000..904eb9e07d --- /dev/null +++ b/data/2021/iclr/Learning Neural Event Functions for Ordinary Differential Equations @@ -0,0 +1 @@ +The existing Neural ODE formulation relies on an explicit knowledge of the termination time. We extend Neural ODEs to implicitly defined termination criteria modeled by neural event functions, which can be chained together and differentiated through. Neural Event ODEs are capable of modeling discrete (instantaneous) changes in a continuous-time system, without prior knowledge of when these changes should occur or how many such changes should exist. We test our approach in modeling hybrid discrete and continuous systems such as switching dynamical systems and collisions in multi-body systems, and we propose simulation-based training of point processes with applications in discrete control. \ No newline at end of file diff --git a/data/2021/iclr/Learning Neural Generative Dynamics for Molecular Conformation Generation b/data/2021/iclr/Learning Neural Generative Dynamics for Molecular Conformation Generation new file mode 100644 index 0000000000..7c7b6be996 --- /dev/null +++ b/data/2021/iclr/Learning Neural Generative Dynamics for Molecular Conformation Generation @@ -0,0 +1 @@ +We study how to generate molecule conformations (\textit{i.e.}, 3D structures) from a molecular graph. Traditional methods, such as molecular dynamics, sample conformations via computationally expensive simulations. Recently, machine learning methods have shown great potential by training on a large collection of conformation data. 
Challenges arise from the limited model capacity for capturing complex distributions of conformations and the difficulty in modeling long-range dependencies between atoms. Inspired by the recent progress in deep generative models, in this paper, we propose a novel probabilistic framework to generate valid and diverse conformations given a molecular graph. We propose a method combining the advantages of both flow-based and energy-based models, enjoying: (1) a high model capacity to estimate the multimodal conformation distribution; (2) explicitly capturing the complex long-range dependencies between atoms in the observation space. Extensive experiments demonstrate the superior performance of the proposed method on several benchmarks, including conformation generation and distance modeling tasks, with a significant improvement over existing generative models for molecular conformation sampling. \ No newline at end of file diff --git a/data/2021/iclr/Learning Parametrised Graph Shift Operators b/data/2021/iclr/Learning Parametrised Graph Shift Operators new file mode 100644 index 0000000000..0639224726 --- /dev/null +++ b/data/2021/iclr/Learning Parametrised Graph Shift Operators @@ -0,0 +1 @@ +In many domains, data is currently represented as graphs, and therefore the graph representation of this data is becoming increasingly important in machine learning. Network data is, implicitly or explicitly, always represented using a graph shift operator (GSO), with the most common choices being the adjacency and Laplacian matrices and their normalisations. In this paper, a novel parametrised GSO (PGSO) is proposed, where specific parameter values result in the most commonly used GSOs and message-passing operators in graph neural network (GNN) frameworks. The PGSO is suggested as a replacement for the standard GSOs used in state-of-the-art GNN architectures, and the optimisation of the PGSO parameters is seamlessly included in the model training. 
It is proved that the PGSO has real eigenvalues and a set of real eigenvectors independent of the parameter values, and spectral bounds on the PGSO are derived. PGSO parameters are shown to adapt to the sparsity of the graph structure in a study on stochastic blockmodel networks, where they are found to automatically replicate the GSO regularisation found in the literature. On several real-world datasets the accuracy of state-of-the-art GNN architectures is improved by the inclusion of the PGSO in both node- and graph-classification tasks. \ No newline at end of file diff --git a/data/2021/iclr/Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues b/data/2021/iclr/Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues new file mode 100644 index 0000000000..c3683082b9 --- /dev/null +++ b/data/2021/iclr/Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues @@ -0,0 +1 @@ +Compared to traditional visual question answering, video-grounded dialogues require additional reasoning over dialogue context to answer questions in a multi-turn setting. Previous approaches to video-grounded dialogues mostly use dialogue context as a simple text input without modelling the inherent information flows at the turn level. In this paper, we propose a novel framework of Reasoning Paths in Dialogue Context (PDC). The PDC model discovers information flows among dialogue turns through a semantic graph constructed based on lexical components in each question and answer. The PDC model then learns to predict reasoning paths over this semantic graph. Our path prediction model predicts a path from the current turn through past dialogue turns that contain additional visual cues to answer the current question. Our reasoning model sequentially processes both visual and textual information through this reasoning path, and the propagated features are used to generate the answer. 
Our experimental results demonstrate the effectiveness of our method and provide additional insights on how models use semantic dependencies in a dialogue context to retrieve visual cues. \ No newline at end of file diff --git a/data/2021/iclr/Learning Robust State Abstractions for Hidden-Parameter Block MDPs b/data/2021/iclr/Learning Robust State Abstractions for Hidden-Parameter Block MDPs new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Learning Safe Multi-agent Control with Decentralized Neural Barrier Certificates b/data/2021/iclr/Learning Safe Multi-agent Control with Decentralized Neural Barrier Certificates new file mode 100644 index 0000000000..77fc9ef466 --- /dev/null +++ b/data/2021/iclr/Learning Safe Multi-agent Control with Decentralized Neural Barrier Certificates @@ -0,0 +1 @@ +We study the multi-agent safe control problem where agents should avoid collisions with static obstacles and with each other while reaching their goals. Our core idea is to learn the multi-agent control policy jointly with learning the control barrier functions as safety certificates. We propose a novel joint-learning framework that can be implemented in a decentralized fashion, with generalization guarantees for certain function classes. Such a decentralized framework can adapt to an arbitrarily large number of agents. Building upon this framework, we further improve the scalability by incorporating neural network architectures that are invariant to the quantity and permutation of neighboring agents. In addition, we propose a new spontaneous policy refinement method to further enforce the certificate condition during testing. We provide extensive experiments to demonstrate that our method significantly outperforms other leading multi-agent control approaches in terms of maintaining safety and completing original tasks. 
Our approach also shows exceptional generalization capability in that the control policy can be trained with 8 agents in one scenario, while being used on other scenarios with up to 1024 agents in complex multi-agent environments and dynamics. \ No newline at end of file diff --git a/data/2021/iclr/Learning Structural Edits via Incremental Tree Transformations b/data/2021/iclr/Learning Structural Edits via Incremental Tree Transformations new file mode 100644 index 0000000000..4d546e569b --- /dev/null +++ b/data/2021/iclr/Learning Structural Edits via Incremental Tree Transformations @@ -0,0 +1 @@ +While most neural generative models generate outputs in a single pass, the human creative process is usually one of iterative building and refinement. Recent work has proposed models of editing processes, but these mostly focus on editing sequential data and/or only model a single editing pass. In this paper, we present a generic model for incremental editing of structured data (i.e. ''structural edits''). Particularly, we focus on tree-structured data, taking abstract syntax trees of computer programs as our canonical example. Our editor learns to iteratively generate tree edits (e.g. deleting or adding a subtree) and applies them to the partially edited data, thereby the entire editing process can be formulated as consecutive, incremental tree transformations. To show the unique benefits of modeling tree edits directly, we further propose a novel edit encoder for learning to represent edits, as well as an imitation learning method that allows the editor to be more robust. We evaluate our proposed editor on two source code edit datasets, where results show that, with the proposed edit encoder, our editor significantly improves accuracy over previous approaches that generate the edited program directly in one pass. Finally, we demonstrate that training our editor to imitate experts and correct its mistakes dynamically can further improve its performance. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning Subgoal Representations with Slow Dynamics b/data/2021/iclr/Learning Subgoal Representations with Slow Dynamics new file mode 100644 index 0000000000..5d05fc6234 --- /dev/null +++ b/data/2021/iclr/Learning Subgoal Representations with Slow Dynamics @@ -0,0 +1 @@ +In goal-conditioned Hierarchical Reinforcement Learning (HRL), a high-level policy periodically sets subgoals for a low-level policy, and the low-level policy is trained to reach those subgoals. A proper subgoal representation function, which abstracts a state space to a latent subgoal space, is crucial for effective goal-conditioned HRL, since different low-level behaviors are induced by reaching subgoals in the compressed representation space. Observing that the high-level agent operates at an abstract temporal scale, we propose a slowness objective to effectively learn the subgoal representation (i.e., the high-level action space). We provide a theoretical grounding for the slowness objective. That is, selecting slow features as the subgoal space can achieve efficient hierarchical exploration. As a result of better exploration ability, our approach significantly outperforms state-of-the-art HRL and exploration methods on a number of benchmark continuous-control tasks. Thanks to the generality of the proposed subgoal representation learning method, empirical results also demonstrate that the learned representation and corresponding low-level policies can be transferred between distinct tasks. \ No newline at end of file diff --git a/data/2021/iclr/Learning Task Decomposition with Ordered Memory Policy Network b/data/2021/iclr/Learning Task Decomposition with Ordered Memory Policy Network new file mode 100644 index 0000000000..3409b6431f --- /dev/null +++ b/data/2021/iclr/Learning Task Decomposition with Ordered Memory Policy Network @@ -0,0 +1 @@ +Many complex real-world tasks are composed of several levels of sub-tasks. 
Humans leverage these hierarchical structures to accelerate the learning process and achieve better generalization. In this work, we study the inductive bias and propose Ordered Memory Policy Network (OMPN) to discover subtask hierarchy by learning from demonstration. The discovered subtask hierarchy could be used to perform task decomposition, recovering the subtask boundaries in an unstructured demonstration. Experiments on Craft and Dial demonstrate that our model can achieve higher task decomposition performance under both unsupervised and weakly supervised settings, compared with strong baselines. OMPN can also be directly applied to partially observable environments and still achieve higher task decomposition performance. Our visualization further confirms that the subtask hierarchy can emerge in our model. \ No newline at end of file diff --git a/data/2021/iclr/Learning Task-General Representations with Generative Neuro-Symbolic Modeling b/data/2021/iclr/Learning Task-General Representations with Generative Neuro-Symbolic Modeling new file mode 100644 index 0000000000..77c9161bc3 --- /dev/null +++ b/data/2021/iclr/Learning Task-General Representations with Generative Neuro-Symbolic Modeling @@ -0,0 +1 @@ +A hallmark of human intelligence is the ability to interact directly with raw data and acquire rich, general-purpose conceptual representations. In machine learning, symbolic models can capture the compositional and causal knowledge that enables flexible generalization, but they struggle to learn from raw inputs, relying on strong abstractions and simplifying assumptions. Neural network models can learn directly from raw data, but they struggle to capture compositional and causal structure and typically must retrain to tackle new tasks. To help bridge this gap, we propose Generative Neuro-Symbolic (GNS) Modeling, a framework for learning task-general representations by combining the structure of symbolic models with the expressivity of neural networks. 
Concepts and conceptual background knowledge are represented as probabilistic programs with neural network sub-routines, maintaining explicit causal and compositional structure while capturing nonparametric relationships and learning directly from raw data. We apply GNS to the Omniglot challenge of learning simple visual concepts at a human level. We report competitive results on 4 unique tasks including one-shot classification, parsing, generating new exemplars, and generating new concepts. To our knowledge, this is the strongest neurally-grounded model to complete a diverse set of Omniglot tasks. \ No newline at end of file diff --git a/data/2021/iclr/Learning Value Functions in Deep Policy Gradients using Residual Variance b/data/2021/iclr/Learning Value Functions in Deep Policy Gradients using Residual Variance new file mode 100644 index 0000000000..e182b97f77 --- /dev/null +++ b/data/2021/iclr/Learning Value Functions in Deep Policy Gradients using Residual Variance @@ -0,0 +1 @@ +Policy gradient algorithms have proven to be successful in diverse decision making and control tasks. However, these methods suffer from high sample complexity and instability issues. In this paper, we address these challenges by providing a different approach for training the critic in the actor-critic framework. Our work builds on recent studies indicating that traditional actor-critic algorithms do not succeed in fitting the true value function, calling for the need to identify a better objective for the critic. In our method, the critic uses a new state-value (resp. state-actionvalue) function approximation that learns the value of the states (resp. state-action pairs) relative to their mean value rather than the absolute value as in conventional actor-critic. We prove the theoretical consistency of the new gradient estimator and observe dramatic empirical improvement across a variety of continuous control tasks and algorithms. 
Furthermore, we validate our method in tasks with sparse rewards, where we provide experimental evidence and theoretical insights. \ No newline at end of file diff --git a/data/2021/iclr/Learning What To Do by Simulating the Past b/data/2021/iclr/Learning What To Do by Simulating the Past new file mode 100644 index 0000000000..45c905d4e1 --- /dev/null +++ b/data/2021/iclr/Learning What To Do by Simulating the Past @@ -0,0 +1 @@ +Since reward functions are hard to specify, recent work has focused on learning policies from human feedback. However, such approaches are impeded by the expense of acquiring such feedback. Recent work proposed that agents have access to a source of information that is effectively free: in any environment that humans have acted in, the state will already be optimized for human preferences, and thus an agent can extract information about what humans want from the state. Such learning is possible in principle, but requires simulating all possible past trajectories that could have led to the observed state. This is feasible in gridworlds, but how do we scale it to complex tasks? In this work, we show that by combining a learned feature encoder with learned inverse models, we can enable agents to simulate human actions backwards in time to infer what they must have done. The resulting algorithm is able to reproduce a specific skill in MuJoCo environments given a single state sampled from the optimal policy for that skill. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning a Latent Search Space for Routing Problems using Variational Autoencoders b/data/2021/iclr/Learning a Latent Search Space for Routing Problems using Variational Autoencoders new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Learning a Latent Simplex in Input Sparsity Time b/data/2021/iclr/Learning a Latent Simplex in Input Sparsity Time new file mode 100644 index 0000000000..ea5a4bbfbe --- /dev/null +++ b/data/2021/iclr/Learning a Latent Simplex in Input Sparsity Time @@ -0,0 +1 @@ +We consider the problem of learning a latent $k$-vertex simplex $K\subset\mathbb{R}^d$, given access to $A\in\mathbb{R}^{d\times n}$, which can be viewed as a data matrix with $n$ points that are obtained by randomly perturbing latent points in the simplex $K$ (potentially beyond $K$). A large class of latent variable models, such as adversarial clustering, mixed membership stochastic block models, and topic models can be cast as learning a latent simplex. Bhattacharyya and Kannan (SODA, 2020) give an algorithm for learning such a latent simplex in time roughly $O(k\cdot\textrm{nnz}(A))$, where $\textrm{nnz}(A)$ is the number of non-zeros in $A$. We show that the dependence on $k$ in the running time is unnecessary given a natural assumption about the mass of the top $k$ singular values of $A$, which holds in many of these applications. Further, we show this assumption is necessary, as otherwise an algorithm for learning a latent simplex would imply an algorithmic breakthrough for spectral low rank approximation. At a high level, Bhattacharyya and Kannan provide an adaptive algorithm that makes $k$ matrix-vector product queries to $A$ and each query is a function of all queries preceding it. Since each matrix-vector product requires $\textrm{nnz}(A)$ time, their overall running time appears unavoidable. 
Instead, we obtain a low-rank approximation to $A$ in input-sparsity time and show that the column space thus obtained has small $\sin\Theta$ (angular) distance to the right top-$k$ singular space of $A$. Our algorithm then selects $k$ points in the low-rank subspace with the largest inner product with $k$ carefully chosen random vectors. By working in the low-rank subspace, we avoid reading the entire matrix in each iteration and thus circumvent the $\Theta(k\cdot\textrm{nnz}(A))$ running time. \ No newline at end of file diff --git a/data/2021/iclr/Learning advanced mathematical computations from examples b/data/2021/iclr/Learning advanced mathematical computations from examples new file mode 100644 index 0000000000..d46088aa68 --- /dev/null +++ b/data/2021/iclr/Learning advanced mathematical computations from examples @@ -0,0 +1 @@ +Using transformers over large generated datasets, we train models to learn mathematical properties of differential systems, such as local stability, behavior at infinity and controllability. We achieve near perfect prediction of qualitative characteristics, and good approximations of numerical features of the system. This demonstrates that neural networks can learn to perform complex computations, grounded in advanced theory, from examples, without built-in mathematical knowledge \ No newline at end of file diff --git a/data/2021/iclr/Learning and Evaluating Representations for Deep One-Class Classification b/data/2021/iclr/Learning and Evaluating Representations for Deep One-Class Classification new file mode 100644 index 0000000000..8de35d785b --- /dev/null +++ b/data/2021/iclr/Learning and Evaluating Representations for Deep One-Class Classification @@ -0,0 +1 @@ +We present a two-stage framework for deep one-class classification. We first learn self-supervised representations from one-class data, and then build one-class classifiers on learned representations. 
The framework not only allows us to learn better representations, but also permits building one-class classifiers that are faithful to the target task. In particular, we present a novel distribution-augmented contrastive learning that extends training distributions via data augmentation to obstruct the uniformity of contrastive representations. Moreover, we argue that classifiers inspired by the statistical perspective in generative or discriminative models are more effective than existing approaches, such as an average of normality scores from a surrogate classifier. In experiments, we demonstrate state-of-the-art performance on visual domain one-class classification benchmarks. Finally, we present visual explanations, confirming that the decision-making process of our deep one-class classifier is intuitive to humans. The code is available at: this https URL. \ No newline at end of file diff --git a/data/2021/iclr/Learning continuous-time PDEs from sparse data with graph neural networks b/data/2021/iclr/Learning continuous-time PDEs from sparse data with graph neural networks new file mode 100644 index 0000000000..a45437b526 --- /dev/null +++ b/data/2021/iclr/Learning continuous-time PDEs from sparse data with graph neural networks @@ -0,0 +1 @@ +The behavior of many dynamical systems follows complex, yet still unknown partial differential equations (PDEs). While several machine learning methods have been proposed to learn PDEs directly from data, previous methods are limited to discrete-time approximations or make the limiting assumption of the observations arriving at regular grids. We propose a general continuous-time differential model for dynamical systems whose governing equations are parameterized by message passing graph neural networks. The model admits arbitrary space and time discretizations, which removes constraints on the locations of observation points and time intervals between the observations.
The model is trained with the continuous-time adjoint method, enabling efficient neural PDE inference. We demonstrate the model's ability to work with unstructured grids, arbitrary time steps, and noisy observations. We compare our method with existing approaches on several well-known physical systems that involve first and higher-order PDEs with state-of-the-art predictive performance. \ No newline at end of file diff --git a/data/2021/iclr/Learning explanations that are hard to vary b/data/2021/iclr/Learning explanations that are hard to vary new file mode 100644 index 0000000000..167e367725 --- /dev/null +++ b/data/2021/iclr/Learning explanations that are hard to vary @@ -0,0 +1 @@ +In this paper, we investigate the principle that `good explanations are hard to vary' in the context of deep learning. We show that averaging gradients across examples -- akin to a logical OR of patterns -- can favor memorization and `patchwork' solutions that sew together different strategies, instead of identifying invariances. To inspect this, we first formalize a notion of consistency for minima of the loss surface, which measures to what extent a minimum appears only when examples are pooled. We then propose and experimentally validate a simple alternative algorithm based on a logical AND, that focuses on invariances and prevents memorization in a set of real-world tasks. Finally, using a synthetic dataset with a clear distinction between invariant and spurious mechanisms, we dissect learning signals and compare this approach to well-established regularizers.
\ No newline at end of file diff --git a/data/2021/iclr/Learning from Demonstration with Weakly Supervised Disentanglement b/data/2021/iclr/Learning from Demonstration with Weakly Supervised Disentanglement new file mode 100644 index 0000000000..890e019c01 --- /dev/null +++ b/data/2021/iclr/Learning from Demonstration with Weakly Supervised Disentanglement @@ -0,0 +1 @@ +Robotic manipulation tasks, such as wiping with a soft sponge, require control from multiple rich sensory modalities. Human-robot interaction, aimed at teaching robots, is difficult in this setting as there is potential for mismatch between human and machine comprehension of the rich data streams. We treat the task of interpretable learning from demonstration as an optimisation problem over a probabilistic generative model. To account for the high-dimensionality of the data, a high-capacity neural network is chosen to represent the model. The latent variables in this model are explicitly aligned with high-level notions and concepts that are manifested in a set of demonstrations. We show that such alignment is best achieved through the use of labels from the end user, in an appropriately restricted vocabulary, in contrast to the conventional approach of the designer picking a prior over the latent variables. Our approach is evaluated in the context of a table-top robot manipulation task performed by a PR2 robot -- that of dabbing liquids with a sponge (forcefully pressing a sponge and moving it along a surface). The robot provides visual information, arm joint positions and arm joint efforts. 
We have made videos of the task and data available - see supplementary materials at this https URL \ No newline at end of file diff --git a/data/2021/iclr/Learning from Protein Structure with Geometric Vector Perceptrons b/data/2021/iclr/Learning from Protein Structure with Geometric Vector Perceptrons new file mode 100644 index 0000000000..0fe6d0c048 --- /dev/null +++ b/data/2021/iclr/Learning from Protein Structure with Geometric Vector Perceptrons @@ -0,0 +1 @@ +Learning on 3D structures of large biomolecules is emerging as a distinct area in machine learning, but there has yet to emerge a unifying network architecture that simultaneously leverages the graph-structured and geometric aspects of the problem domain. To address this gap, we introduce geometric vector perceptrons, which extend standard dense layers to operate on collections of Euclidean vectors. Graph neural networks equipped with such layers are able to perform both geometric and relational reasoning on efficient and natural representations of macromolecular structure. We demonstrate our approach on two important problems in learning from protein structure: model quality assessment and computational protein design. Our approach improves over existing classes of architectures, including state-of-the-art graph-based and voxel-based methods. \ No newline at end of file diff --git a/data/2021/iclr/Learning from others' mistakes: Avoiding dataset biases without modeling them b/data/2021/iclr/Learning from others' mistakes: Avoiding dataset biases without modeling them new file mode 100644 index 0000000000..5dfd0acf6a --- /dev/null +++ b/data/2021/iclr/Learning from others' mistakes: Avoiding dataset biases without modeling them @@ -0,0 +1 @@ +State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface form correlations instead of features that target the intended underlying task. 
Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available. We consider cases where the bias issues may not be explicitly identified, and show a method for training models that learn to ignore these problematic correlations. Our approach relies on the observation that models with limited capacity primarily learn to exploit biases in the dataset. We can leverage the errors of such limited capacity models to train a more robust model in a product of experts, thus bypassing the need to hand-craft a biased model. We show the effectiveness of this method to retain improvements in out-of-distribution settings even if no particular bias is targeted by the biased model. \ No newline at end of file diff --git a/data/2021/iclr/Learning perturbation sets for robust machine learning b/data/2021/iclr/Learning perturbation sets for robust machine learning new file mode 100644 index 0000000000..5db32894fe --- /dev/null +++ b/data/2021/iclr/Learning perturbation sets for robust machine learning @@ -0,0 +1 @@ +Although much progress has been made towards robust deep learning, a significant gap in robustness remains between real-world perturbations and more narrowly defined sets typically studied in adversarial defenses. In this paper, we aim to bridge this gap by learning perturbation sets from data, in order to characterize real-world effects for robust training and evaluation. Specifically, we use a conditional generator that defines the perturbation set over a constrained region of the latent space. We formulate desirable properties that measure the quality of a learned perturbation set, and theoretically prove that a conditional variational autoencoder naturally satisfies these criteria. Using this framework, our approach can generate a variety of perturbations at different complexities and scales, ranging from baseline spatial transformations, through common image corruptions, to lighting variations. 
We measure the quality of our learned perturbation sets both quantitatively and qualitatively, finding that our models are capable of producing a diverse set of meaningful perturbations beyond the limited data seen during training. Finally, we leverage our learned perturbation sets to train models which are empirically and certifiably robust to adversarial image corruptions and adversarial lighting variations, while improving generalization on non-adversarial data. All code and configuration files for reproducing the experiments as well as pretrained model weights can be found at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/Learning the Pareto Front with Hypernetworks b/data/2021/iclr/Learning the Pareto Front with Hypernetworks new file mode 100644 index 0000000000..0b5c78ecb5 --- /dev/null +++ b/data/2021/iclr/Learning the Pareto Front with Hypernetworks @@ -0,0 +1,2 @@ +Multi-objective optimization problems are prevalent in machine learning. These problems have a set of optimal solutions, called the Pareto front, where each point on the front represents a different trade-off between possibly conflicting objectives. Recent optimization algorithms can target a specific desired ray in loss space, but still face two grave limitations: (i) A separate model has to be trained for each point on the front; and (ii) The exact trade-off must be known prior to the optimization process. Here, we tackle the problem of learning the entire Pareto front, with the capability of selecting a desired operating point on the front after training. We call this new setup Pareto-Front Learning (PFL). +We describe an approach to PFL implemented using HyperNetworks, which we term Pareto HyperNetworks (PHNs). PHN learns the entire Pareto front simultaneously using a single hypernetwork, which receives as input a desired preference vector and returns a Pareto-optimal model whose loss vector is in the desired ray. 
The unified model is runtime efficient compared to training multiple models, and generalizes to new operating points not used during training. We evaluate our method on a wide set of problems, from multi-task regression and classification to fairness. PHNs learn the entire Pareto front in roughly the same time as learning a single point on the front, and also reach a better solution set. PFL opens the door to new applications where models are selected based on preferences that are only available at run time. \ No newline at end of file diff --git a/data/2021/iclr/Learning to Deceive Knowledge Graph Augmented Models via Targeted Perturbation b/data/2021/iclr/Learning to Deceive Knowledge Graph Augmented Models via Targeted Perturbation new file mode 100644 index 0000000000..3ab099d923 --- /dev/null +++ b/data/2021/iclr/Learning to Deceive Knowledge Graph Augmented Models via Targeted Perturbation @@ -0,0 +1 @@ +Knowledge graphs (KGs) have helped neural-symbolic models improve performance on various knowledge-intensive tasks, like question answering and item recommendation. By using attention over the KG, such models can also "explain" which KG information was most relevant for making a given prediction. In this paper, we question whether these models are really behaving as we expect. We demonstrate that, through a reinforcement learning policy (or even simple heuristics), one can produce deceptively perturbed KGs which maintain the downstream performance of the original KG while significantly deviating from the original semantics and structure. Our findings raise doubts about KG-augmented models' ability to leverage KG information and provide plausible explanations.
\ No newline at end of file diff --git a/data/2021/iclr/Learning to Generate 3D Shapes with Generative Cellular Automata b/data/2021/iclr/Learning to Generate 3D Shapes with Generative Cellular Automata new file mode 100644 index 0000000000..b29c37a1b8 --- /dev/null +++ b/data/2021/iclr/Learning to Generate 3D Shapes with Generative Cellular Automata @@ -0,0 +1 @@ +We present a probabilistic 3D generative model, named Generative Cellular Automata, which is able to produce diverse and high quality shapes. We formulate the shape generation process as sampling from the transition kernel of a Markov chain, where the sampling chain eventually evolves to the full shape of the learned distribution. The transition kernel employs the local update rules of cellular automata, effectively reducing the search space in a high-resolution 3D grid space by exploiting the connectivity and sparsity of 3D shapes. Our progressive generation only focuses on the sparse set of occupied voxels and their neighborhood, thus enabling the utilization of an expressive sparse convolutional network. We propose an effective training scheme to obtain the local homogeneous rule of generative cellular automata with sequences that are slightly different from the sampling chain but converge to the full shapes in the training data. Extensive experiments on probabilistic shape completion and shape generation demonstrate that our method achieves competitive performance against recent methods. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning to Make Decisions via Submodular Regularization b/data/2021/iclr/Learning to Make Decisions via Submodular Regularization new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Learning to Reach Goals via Iterated Supervised Learning b/data/2021/iclr/Learning to Reach Goals via Iterated Supervised Learning new file mode 100644 index 0000000000..ef0472ce0e --- /dev/null +++ b/data/2021/iclr/Learning to Reach Goals via Iterated Supervised Learning @@ -0,0 +1 @@ +Current reinforcement learning (RL) algorithms can be brittle and difficult to use, especially when learning goal-reaching behaviors from sparse rewards. Although supervised imitation learning provides a simple and stable alternative, it requires access to demonstrations from a human supervisor. In this paper, we study RL algorithms that use imitation learning to acquire goal reaching policies from scratch, without the need for expert demonstrations or a value function. In lieu of demonstrations, we leverage the property that any trajectory is a successful demonstration for reaching the final state in that same trajectory. We propose a simple algorithm in which an agent continually relabels and imitates the trajectories it generates to progressively learn goal-reaching behaviors from scratch. Each iteration, the agent collects new trajectories using the latest policy, and maximizes the likelihood of the actions along these trajectories under the goal that was actually reached, so as to improve the policy. We formally show that this iterated supervised learning procedure optimizes a bound on the RL objective, derive performance bounds of the learned policy, and empirically demonstrate improved goal-reaching performance and robustness over current RL algorithms in several benchmark tasks. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning to Recombine and Resample Data For Compositional Generalization b/data/2021/iclr/Learning to Recombine and Resample Data For Compositional Generalization new file mode 100644 index 0000000000..330e4ac9e9 --- /dev/null +++ b/data/2021/iclr/Learning to Recombine and Resample Data For Compositional Generalization @@ -0,0 +1 @@ +Flexible neural models outperform grammar- and automaton-based counterparts on a variety of sequence modeling tasks. However, neural models perform poorly in settings requiring compositional generalization beyond the training data -- particularly to rare or unseen subsequences. Past work has found symbolic scaffolding (e.g. grammars or automata) essential in these settings. Here we present a family of learned data augmentation schemes that support a large category of compositional generalizations without appeal to latent symbolic structure. Our approach to data augmentation has two components: recombination of original training examples via a prototype-based generative model and resampling of generated examples to encourage extrapolation. Training an ordinary neural sequence model on a dataset augmented with recombined and resampled examples significantly improves generalization in two language processing problems---instruction following (SCAN) and morphological analysis (Sigmorphon 2018)---where our approach enables learning of new constructions and tenses from as few as eight initial examples. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning to Represent Action Values as a Hypergraph on the Action Vertices b/data/2021/iclr/Learning to Represent Action Values as a Hypergraph on the Action Vertices new file mode 100644 index 0000000000..6463e0d1ce --- /dev/null +++ b/data/2021/iclr/Learning to Represent Action Values as a Hypergraph on the Action Vertices @@ -0,0 +1 @@ +Action-value estimation is a critical component of many reinforcement learning (RL) methods whereby sample complexity relies heavily on how fast a good estimator for action value can be learned. By viewing this problem through the lens of representation learning, good representations of both state and action can facilitate action-value estimation. While advances in deep learning have seamlessly driven progress in learning state representations, given the specificity of the notion of agency to RL, little attention has been paid to learning action representations. We conjecture that leveraging the combinatorial structure of multi-dimensional action spaces is a key ingredient for learning good representations of action. To test this, we set forth the action hypergraph networks framework---a class of functions for learning action representations with a relational inductive bias. Using this framework we realise an agent class based on a combination with deep Q-networks, which we dub hypergraph Q-networks. We show the effectiveness of our approach on a myriad of domains: illustrative prediction problems under minimal confounding effects, Atari 2600 games, and physical control benchmarks. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning to Sample with Local and Global Contexts in Experience Replay Buffer b/data/2021/iclr/Learning to Sample with Local and Global Contexts in Experience Replay Buffer new file mode 100644 index 0000000000..b78c782d74 --- /dev/null +++ b/data/2021/iclr/Learning to Sample with Local and Global Contexts in Experience Replay Buffer @@ -0,0 +1 @@ +Experience replay, which enables the agents to remember and reuse experience from the past, plays a significant role in the success of off-policy reinforcement learning (RL). To utilize the experience replay efficiently, experience transitions should be sampled with consideration of their significance, such that the known prioritized experience replay (PER) further allows to sample more important experience. Yet, the conventional PER may result in generating highly biased samples due to considering a single metric such as TD-error and computing the sampling rate independently for each experience. To tackle this issue, we propose a Neural Experience Replay Sampler (NERS), which adaptively evaluates the relative importance of a sampled transition by obtaining context from not only its (local) values that characterize itself such as TD-error or the raw features but also other (global) transitions. We validate our framework on multiple benchmark tasks for both continuous and discrete controls and show that the proposed framework significantly improves the performance of various off-policy RL methods. Further analysis confirms that the improvements indeed come from the use of diverse features and the consideration of the relative importance of experiences. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning to Set Waypoints for Audio-Visual Navigation b/data/2021/iclr/Learning to Set Waypoints for Audio-Visual Navigation new file mode 100644 index 0000000000..fcfc7a16a7 --- /dev/null +++ b/data/2021/iclr/Learning to Set Waypoints for Audio-Visual Navigation @@ -0,0 +1 @@ +In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations. We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements: 1) waypoints that are dynamically set and learned end-to-end within the navigation policy, and 2) an acoustic memory that provides a structured, spatially grounded record of what the agent has heard as it moves. Both new ideas capitalize on the synergy of audio and visual data for revealing the geometry of an unmapped space. We demonstrate our approach on two challenging datasets of real-world 3D scenes, Replica and Matterport3D. Our model improves the state of the art by a substantial margin, and our experiments reveal that learning the links between sights, sounds, and space is essential for audio-visual navigation. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning to live with Dale's principle: ANNs with separate excitatory and inhibitory units b/data/2021/iclr/Learning to live with Dale's principle: ANNs with separate excitatory and inhibitory units new file mode 100644 index 0000000000..401dab8464 --- /dev/null +++ b/data/2021/iclr/Learning to live with Dale's principle: ANNs with separate excitatory and inhibitory units @@ -0,0 +1 @@ +The units in artificial neural networks (ANNs) can be thought of as abstractions of biological neurons, and ANNs are increasingly used in neuroscience research. However, there are many important differences between ANN units and real neurons. One of the most notable is the absence of Dale’s principle, which ensures that biological neurons are either exclusively excitatory or inhibitory. Dale’s principle is typically left out of ANNs because its inclusion impairs learning. This is problematic, because one of the great advantages of ANNs for neuroscience research is their ability to learn complicated, realistic tasks. Here, by taking inspiration from feedforward inhibitory interneurons in the brain we show that we can develop ANNs with separate populations of excitatory and inhibitory units that learn just as well as standard ANNs. We call these networks Dale’s ANNs (DANNs). We present two insights that enable DANNs to learn well: (1) DANNs are related to normalization schemes, and can be initialized such that the inhibition centres and standardizes the excitatory activity, (2) updates to inhibitory neuron parameters should be scaled using corrections based on the Fisher Information matrix. These results demonstrate how ANNs that respect Dale’s principle can be built without sacrificing learning performance, which is important for future work using ANNs as models of the brain. The results also may have interesting implications for how inhibitory plasticity in the real brain operates. 
\ No newline at end of file diff --git a/data/2021/iclr/Learning with AMIGo: Adversarially Motivated Intrinsic Goals b/data/2021/iclr/Learning with AMIGo: Adversarially Motivated Intrinsic Goals new file mode 100644 index 0000000000..7d56b39787 --- /dev/null +++ b/data/2021/iclr/Learning with AMIGo: Adversarially Motivated Intrinsic Goals @@ -0,0 +1 @@ +A key challenge for reinforcement learning (RL) consists of learning in environments with sparse extrinsic rewards. In contrast to current RL methods, humans are able to learn new skills with little or no reward by using various forms of intrinsic motivation. We propose AMIGo, a novel agent incorporating a goal-generating teacher that proposes Adversarially Motivated Intrinsic Goals to train a goal-conditioned "student" policy in the absence of (or alongside) environment reward. Specifically, through a simple but effective "constructively adversarial" objective, the teacher learns to propose increasingly challenging---yet achievable---goals that allow the student to learn general skills for acting in a new environment, independent of the task to be solved. We show that our method generates a natural curriculum of self-proposed goals which ultimately allows the agent to solve challenging procedurally-generated tasks where other forms of intrinsic motivation and state-of-the-art RL methods fail. \ No newline at end of file diff --git a/data/2021/iclr/Learning with Feature-Dependent Label Noise: A Progressive Approach b/data/2021/iclr/Learning with Feature-Dependent Label Noise: A Progressive Approach new file mode 100644 index 0000000000..f9b70229fe --- /dev/null +++ b/data/2021/iclr/Learning with Feature-Dependent Label Noise: A Progressive Approach @@ -0,0 +1 @@ +Label noise is frequently observed in real-world large-scale datasets. The noise is introduced due to a variety of reasons; it is heterogeneous and feature-dependent. 
Most existing approaches to handling noisy labels fall into two categories: they either assume an ideal feature-independent noise, or remain heuristic without theoretical guarantees. In this paper, we propose to target a new family of feature-dependent label noise, which is much more general than commonly used i.i.d. label noise and encompasses a broad spectrum of noise patterns. Focusing on this general noise family, we propose a progressive label correction algorithm that iteratively corrects labels and refines the model. We provide theoretical guarantees showing that for a wide variety of (unknown) noise patterns, a classifier trained with this strategy converges to be consistent with the Bayes classifier. In experiments, our method outperforms SOTA baselines and is robust to various noise types and levels. \ No newline at end of file diff --git a/data/2021/iclr/Learning with Instance-Dependent Label Noise: A Sample Sieve Approach b/data/2021/iclr/Learning with Instance-Dependent Label Noise: A Sample Sieve Approach new file mode 100644 index 0000000000..759e1f914d --- /dev/null +++ b/data/2021/iclr/Learning with Instance-Dependent Label Noise: A Sample Sieve Approach @@ -0,0 +1 @@ +Human-annotated labels are often prone to noise, and the presence of such noise will degrade the performance of the resulting deep neural network (DNN) models. Much of the literature (with several recent exceptions) of learning with noisy labels focuses on the case when the label noise is independent from features. Practically, annotations errors tend to be instance-dependent and often depend on the difficulty levels of recognizing a certain task. Applying existing results from instance-independent settings would require a significant amount of estimation of noise rates. Therefore, learning with instance-dependent label noise remains a challenge. In this paper, we propose CORES^2 (COnfidence REgularized Sample Sieve), which progressively sieves out corrupted samples. 
The implementation of CORES^2 does not require specifying noise rates and yet we are able to provide theoretical guarantees of CORES^2 in filtering out the corrupted examples. This high-quality sample sieve allows us to treat clean examples and the corrupted ones separately in training a DNN solution, and such a separation is shown to be advantageous in the instance-dependent noise setting. We demonstrate the performance of CORES^2 on CIFAR10 and CIFAR100 datasets with synthetic instance-dependent label noise and Clothing1M with real-world human noise. Of independent interest, our sample sieve provides generic machinery for anatomizing noisy datasets and provides a flexible interface for various robust training techniques to further improve the performance. \ No newline at end of file diff --git a/data/2021/iclr/Learning-based Support Estimation in Sublinear Time b/data/2021/iclr/Learning-based Support Estimation in Sublinear Time new file mode 100644 index 0000000000..135f70f29d --- /dev/null +++ b/data/2021/iclr/Learning-based Support Estimation in Sublinear Time @@ -0,0 +1 @@ +We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to $ \pm \varepsilon n$ from a sample of size $O(\log^2(1/\varepsilon) \cdot n/\log n)$, where $n$ is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimation of its frequency.
We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to \[ \log (1/\varepsilon) \cdot n^{1-\Theta(1/\log(1/\varepsilon))}. \] We evaluate the proposed algorithms on a collection of data sets, using the neural-network based estimators from {Hsu et al, ICLR'19} as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state-of-the-art algorithm. \ No newline at end of file diff --git a/data/2021/iclr/Lifelong Learning of Compositional Structures b/data/2021/iclr/Lifelong Learning of Compositional Structures new file mode 100644 index 0000000000..f9bd259cec --- /dev/null +++ b/data/2021/iclr/Lifelong Learning of Compositional Structures @@ -0,0 +1 @@ +A hallmark of human intelligence is the ability to construct self-contained chunks of knowledge and adequately reuse them in novel combinations for solving different yet structurally related problems. Learning such compositional structures has been a significant challenge for artificial systems, due to the combinatorial nature of the underlying search problem. To date, research into compositional learning has largely proceeded separately from work on lifelong or continual learning. We integrate these two lines of work to present a general-purpose framework for lifelong learning of compositional structures that can be used for solving a stream of related tasks. Our framework separates the learning process into two broad stages: learning how to best combine existing components in order to assimilate a novel problem, and learning how to adapt the set of existing components to accommodate the new problem. This separation explicitly handles the trade-off between the stability required to remember how to solve earlier tasks and the flexibility required to solve new tasks, as we show empirically in an extensive evaluation.
\ No newline at end of file diff --git a/data/2021/iclr/LiftPool: Bidirectional ConvNet Pooling b/data/2021/iclr/LiftPool: Bidirectional ConvNet Pooling new file mode 100644 index 0000000000..23ec92f96a --- /dev/null +++ b/data/2021/iclr/LiftPool: Bidirectional ConvNet Pooling @@ -0,0 +1 @@ +Pooling is a critical operation in convolutional neural networks for increasing receptive fields and improving robustness to input variations. Most existing pooling operations downsample the feature maps, which is a lossy process. Moreover, they are not invertible: upsampling a downscaled feature map cannot recover the lost information in the downsampling. By adopting the philosophy of the classical Lifting Scheme from signal processing, we propose LiftPool for bidirectional pooling layers, including LiftDownPool and LiftUpPool. LiftDownPool decomposes a feature map into various downsized sub-bands, each of which contains information with different frequencies. As the pooling function in LiftDownPool is perfectly invertible, by performing LiftDownPool backward, a corresponding up-pooling layer LiftUpPool is able to generate a refined upsampled feature map using the detail sub-bands, which is useful for image-to-image translation challenges. Experiments show the proposed methods achieve better results on image classification and semantic segmentation, using various backbones. Moreover, LiftDownPool offers better robustness to input corruptions and perturbations. \ No newline at end of file diff --git a/data/2021/iclr/Linear Convergent Decentralized Optimization with Compression b/data/2021/iclr/Linear Convergent Decentralized Optimization with Compression new file mode 100644 index 0000000000..43f16ddc3e --- /dev/null +++ b/data/2021/iclr/Linear Convergent Decentralized Optimization with Compression @@ -0,0 +1 @@ +Communication compression has been extensively adopted to speed up large-scale distributed optimization.
However, most existing decentralized algorithms with compression are unsatisfactory in terms of convergence rate and stability. In this paper, we delineate two key obstacles in the algorithm design -- data heterogeneity and compression error. Our attempt to explicitly overcome these obstacles leads to a novel decentralized algorithm named LEAD. This algorithm is the first \underline{L}in\underline{EA}r convergent \underline{D}ecentralized algorithm with communication compression. Our theory describes the coupled dynamics of the inaccurate model propagation and optimization process. We also provide the first consensus error bound without assuming bounded gradients. Empirical experiments validate our theoretical analysis and show that the proposed algorithm achieves state-of-the-art computation and communication efficiency. \ No newline at end of file diff --git a/data/2021/iclr/Linear Last-iterate Convergence in Constrained Saddle-point Optimization b/data/2021/iclr/Linear Last-iterate Convergence in Constrained Saddle-point Optimization new file mode 100644 index 0000000000..79ea474d82 --- /dev/null +++ b/data/2021/iclr/Linear Last-iterate Convergence in Constrained Saddle-point Optimization @@ -0,0 +1,2 @@ +Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative Weights Update (OMWU) for saddle-point optimization have received growing attention due to their favorable last-iterate convergence. However, their behaviors for simple bilinear games over the probability simplex are still not fully understood -- previous analysis lacks explicit convergence rates, only applies to an exponentially small learning rate, or requires additional assumptions such as the uniqueness of the optimal solution. +In this work, we significantly expand the understanding of last-iterate convergence for OGDA and OMWU in the constrained setting. 
Specifically, for OMWU in bilinear games over the simplex, we show that when the equilibrium is unique, linear last-iterate convergence is achievable with a constant learning rate, which improves the result of (Daskalakis & Panageas, 2019) under the same assumption. We then significantly extend the results to more general objectives and feasible sets for the projected OGDA algorithm, by introducing a sufficient condition under which OGDA exhibits concrete last-iterate convergence rates with a constant learning rate. We show that bilinear games over any polytope satisfy this condition and OGDA converges exponentially fast even without the unique equilibrium assumption. Our condition also holds for strongly-convex-strongly-concave functions, recovering the result of (Hsieh et al., 2019). Finally, we provide experimental results to further support our theory. \ No newline at end of file diff --git a/data/2021/iclr/Linear Mode Connectivity in Multitask and Continual Learning b/data/2021/iclr/Linear Mode Connectivity in Multitask and Continual Learning new file mode 100644 index 0000000000..2371b78ef2 --- /dev/null +++ b/data/2021/iclr/Linear Mode Connectivity in Multitask and Continual Learning @@ -0,0 +1 @@ +Continual (sequential) training and multitask (simultaneous) training are often attempting to solve the same overall objective: to find a solution that performs well on all considered tasks. The main difference is in the training regimes, where continual learning can only have access to one task at a time, which for neural networks typically leads to catastrophic forgetting. That is, the solution found for a subsequent task does not perform well on the previous ones anymore. However, the relationship between the different minima that the two training regimes arrive at is not well understood. What sets them apart? Is there a local structure that could explain the difference in performance achieved by the two different schemes? 
Motivated by recent work showing that different minima of the same task are typically connected by very simple curves of low error, we investigate whether multitask and continual solutions are similarly connected. We empirically find that indeed such connectivity can be reliably achieved and, more interestingly, it can be done by a linear path, conditioned on having the same initialization for both. We thoroughly analyze this observation and discuss its significance for the continual learning process. Furthermore, we exploit this finding to propose an effective algorithm that constrains the sequentially learned minima to behave as the multitask solution. We show that our method outperforms several state-of-the-art continual learning algorithms on various vision benchmarks. \ No newline at end of file diff --git a/data/2021/iclr/Local Convergence Analysis of Gradient Descent Ascent with Finite Timescale Separation b/data/2021/iclr/Local Convergence Analysis of Gradient Descent Ascent with Finite Timescale Separation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Local Search Algorithms for Rank-Constrained Convex Optimization b/data/2021/iclr/Local Search Algorithms for Rank-Constrained Convex Optimization new file mode 100644 index 0000000000..c6807ad0ca --- /dev/null +++ b/data/2021/iclr/Local Search Algorithms for Rank-Constrained Convex Optimization @@ -0,0 +1 @@ +We propose greedy and local search algorithms for rank-constrained convex optimization, namely solving $\underset{\mathrm{rank}(A)\leq r^*}{\min}\, R(A)$ given a convex function $R:\mathbb{R}^{m\times n}\rightarrow \mathbb{R}$ and a parameter $r^*$. These algorithms consist of repeating two steps: (a) adding a new rank-1 matrix to $A$ and (b) enforcing the rank constraint on $A$. We refine and improve the theoretical analysis of Shalev-Shwartz et al.
(2011), and show that if the rank-restricted condition number of $R$ is $\kappa$, a solution $A$ with rank $O(r^*\cdot \min\{\kappa \log \frac{R(\mathbf{0})-R(A^*)}{\epsilon}, \kappa^2\})$ and $R(A) \leq R(A^*) + \epsilon$ can be recovered, where $A^*$ is the optimal solution. This significantly generalizes associated results on sparse convex optimization, as well as rank-constrained convex optimization for smooth functions. We then introduce new practical variants of these algorithms that have superior runtime and recover better solutions in practice. We demonstrate the versatility of these methods on a wide range of applications involving matrix completion and robust principal component analysis. \ No newline at end of file diff --git a/data/2021/iclr/Locally Free Weight Sharing for Network Width Search b/data/2021/iclr/Locally Free Weight Sharing for Network Width Search new file mode 100644 index 0000000000..f1da214533 --- /dev/null +++ b/data/2021/iclr/Locally Free Weight Sharing for Network Width Search @@ -0,0 +1 @@ +Searching for network width is an effective way to slim deep neural networks with hardware budgets. With this aim, a one-shot supernet is usually leveraged as a performance evaluator to rank the performance w.r.t. different widths. Nevertheless, current methods mainly follow a manually fixed weight sharing pattern, which limits their ability to distinguish the performance gaps of different widths. In this paper, to better evaluate each width, we propose a locally free weight sharing strategy (CafeNet). In CafeNet, weights are more freely shared, and each width is jointly indicated by its base channels and free channels, where free channels are supposed to loCAte FrEely in a local zone to better represent each width. Besides, we propose to further reduce the search space by leveraging our introduced FLOPs-sensitive bins. As a result, our CafeNet can be trained stochastically and optimized within a min-min strategy.
Extensive experiments on the ImageNet, CIFAR-10, CelebA and MS COCO datasets verify our superiority compared with other state-of-the-art baselines. For example, our method can further boost the benchmark NAS network EfficientNet-B0 by 0.41\% via searching its width more delicately. \ No newline at end of file diff --git a/data/2021/iclr/Long Live the Lottery: The Existence of Winning Tickets in Lifelong Learning b/data/2021/iclr/Long Live the Lottery: The Existence of Winning Tickets in Lifelong Learning new file mode 100644 index 0000000000..62d4f32c8a --- /dev/null +++ b/data/2021/iclr/Long Live the Lottery: The Existence of Winning Tickets in Lifelong Learning @@ -0,0 +1 @@ +The lottery ticket hypothesis states that a highly sparsified sub-network can be trained in isolation, given the appropriate weight initialization. This paper extends that hypothesis from one-shot task learning, and demonstrates for the first time that such extremely compact and independently trainable sub-networks can also be identified in the lifelong learning scenario, which we call lifelong tickets. We show that the resulting lifelong ticket can further be leveraged to improve the performance of learning over continual tasks. However, it is highly non-trivial to conduct network pruning in the lifelong setting. Two critical roadblocks arise: i) As many tasks now arrive sequentially, finding tickets in a greedy weight pruning fashion will inevitably suffer from an intrinsic bias, in which earlier-arriving tasks have a larger impact; ii) As lifelong learning is consistently challenged by catastrophic forgetting, the compact network capacity of tickets might amplify the risk of forgetting. In view of those, we introduce two pruning options, namely top-down and bottom-up, for finding lifelong tickets.
Compared to the top-down pruning that extends vanilla (iterative) pruning over sequential tasks, we show that the bottom-up one, which can dynamically shrink and (re-)expand model capacity, effectively avoids the undesirable excessive pruning in the early stage. We additionally introduce lottery teaching that further overcomes forgetting via knowledge distillation aided by external unlabeled data. Unifying those ingredients, we demonstrate the existence of very competitive lifelong tickets, e.g., achieving 3-8% of the dense model size with even higher accuracy, compared to strong class-incremental learning baselines on CIFAR-10/CIFAR-100/Tiny-ImageNet datasets. \ No newline at end of file diff --git a/data/2021/iclr/Long Range Arena : A Benchmark for Efficient Transformers b/data/2021/iclr/Long Range Arena : A Benchmark for Efficient Transformers new file mode 100644 index 0000000000..c88a4b2ccc --- /dev/null +++ b/data/2021/iclr/Long Range Arena : A Benchmark for Efficient Transformers @@ -0,0 +1 @@ +Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity. In recent months, a wide spectrum of efficient, fast Transformers has been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative model quality amongst many models. This paper proposes a systematic and unified benchmark, LRA, specifically focused on evaluating model quality under long-context scenarios.
Our benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens, encompassing a wide range of data types and modalities such as text, natural and synthetic images, and mathematical expressions, requiring similarity, structural, and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers, and Longformers) on our newly proposed benchmark suite. LRA paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle. Our benchmark code will be released at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/Long-tail learning via logit adjustment b/data/2021/iclr/Long-tail learning via logit adjustment new file mode 100644 index 0000000000..d21eba522f --- /dev/null +++ b/data/2021/iclr/Long-tail learning via logit adjustment @@ -0,0 +1 @@ +Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels are associated with only a few samples. This poses a challenge for generalisation on such labels, and also makes naive learning biased towards dominant labels. In this paper, we present two simple modifications of standard softmax cross-entropy training to cope with these challenges. Our techniques revisit the classic idea of logit adjustment based on the label frequencies, either applied post-hoc to a trained model, or enforced in the loss during training. Such adjustment encourages a large relative margin between logits of rare versus dominant labels. These techniques unify and generalise several recent proposals in the literature, while possessing firmer statistical grounding and empirical performance.
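The post-hoc variant of logit adjustment described in this abstract can be sketched in a few lines (a toy illustration, not the authors' code; the function name and the example numbers are hypothetical, and the loss-based variant instead applies the same offset inside the softmax during training):

```python
import math

def logit_adjust(logits, class_priors, tau=1.0):
    """Post-hoc logit adjustment: subtract tau * log(prior) from each
    class logit, enlarging the relative margin of rare classes over
    dominant ones."""
    return [z - tau * math.log(p) for z, p in zip(logits, class_priors)]

# Class 0 is dominant (90% of labels), class 1 is rare (10%).  The raw
# logits favour the head class; after adjustment the rare class wins
# the argmax, since 1.8 - log(0.1) > 2.0 - log(0.9).
logits = [2.0, 1.8]
priors = [0.9, 0.1]
adjusted = logit_adjust(logits, priors)
```

Because the offset depends only on the label frequencies, it can be applied to any trained classifier's logits without retraining.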
\ No newline at end of file diff --git a/data/2021/iclr/Long-tailed Recognition by Routing Diverse Distribution-Aware Experts b/data/2021/iclr/Long-tailed Recognition by Routing Diverse Distribution-Aware Experts new file mode 100644 index 0000000000..2134575b69 --- /dev/null +++ b/data/2021/iclr/Long-tailed Recognition by Routing Diverse Distribution-Aware Experts @@ -0,0 +1 @@ +Natural data are often long-tail distributed over semantic classes. Existing recognition methods tend to focus on tail performance gain, often at the expense of head performance loss from increased classifier variance. The low tail performance manifests itself in large inter-class confusion and high classifier variance. We aim to reduce both the bias and the variance of a long-tailed classifier by RoutIng Diverse Experts (RIDE). It has three components: 1) a shared architecture for multiple classifiers (experts); 2) a distribution-aware diversity loss that encourages more diverse decisions for classes with fewer training instances; and 3) an expert routing module that dynamically assigns more ambiguous instances to additional experts. With on-par computational complexity, RIDE significantly outperforms the state-of-the-art methods by 5% to 7% on all the benchmarks including CIFAR100-LT, ImageNet-LT and iNaturalist. RIDE is also a universal framework that can be applied to different backbone networks and integrated into various long-tailed algorithms and training mechanisms for consistent performance gains. 
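The routing idea in the RIDE abstract, assigning more ambiguous instances to additional experts, can be sketched as follows. This is a minimal stand-in, not the paper's method: RIDE learns its router, whereas here a hypothetical fixed softmax-confidence threshold decides when to consult another expert.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def route_experts(expert_logits, confidence=0.7):
    """Consult experts one at a time, averaging their logits, and stop
    once the averaged prediction is confident.  Easy instances use few
    experts; ambiguous ones are routed to additional experts."""
    for k in range(1, len(expert_logits) + 1):
        avg = [sum(col) / k for col in zip(*expert_logits[:k])]
        if max(softmax(avg)) >= confidence:
            break
    return avg, k  # averaged logits, number of experts consulted

# An easy instance resolves with one expert; an ambiguous one needs two.
easy_k = route_experts([[4.0, 0.0], [3.5, 0.5]])[1]
hard_k = route_experts([[0.2, 0.0], [2.0, -2.0]])[1]
```

This illustrates why the scheme keeps computational complexity on par with a single model: extra experts are only evaluated for the hard instances.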
\ No newline at end of file diff --git a/data/2021/iclr/Loss Function Discovery for Object Detection via Convergence-Simulation Driven Search b/data/2021/iclr/Loss Function Discovery for Object Detection via Convergence-Simulation Driven Search new file mode 100644 index 0000000000..c000182b59 --- /dev/null +++ b/data/2021/iclr/Loss Function Discovery for Object Detection via Convergence-Simulation Driven Search @@ -0,0 +1 @@ +Designing proper loss functions for vision tasks has been a long-standing research direction to advance the capability of existing models. For object detection, the well-established classification and regression loss functions have been carefully designed by considering diverse learning challenges. Inspired by the recent progress in network architecture search, it is interesting to explore the possibility of discovering new loss function formulations via directly searching primitive operation combinations, so that the learned losses not only fit diverse object detection challenges, alleviating huge human effort, but also align better with the evaluation metric and enjoy good mathematical convergence properties. Beyond the previous auto-loss works on face recognition and image classification, our work makes the first attempt to discover new loss functions for the challenging object detection from primitive operation levels. We propose an effective convergence-simulation driven evolutionary search algorithm, called CSE-Autoloss, for speeding up the search progress by regularizing the mathematical rationality of loss candidates via convergence property verification and model optimization simulation. CSE-Autoloss searches a space that covers a wide range of possible variants of existing losses and discovers the best loss function combination within a short time (around 1.5 wall-clock days).
We conduct extensive evaluations of loss function search on popular detectors and validate the good generalization capability of searched losses across diverse architectures and datasets. Our experiments show that the best-discovered loss function combinations outperform default combinations by 1.1% and 0.8% in terms of mAP for two-stage and one-stage detectors on COCO respectively. Our searched losses are available at https://github.com/PerdonLiu/CSE-Autoloss. \ No newline at end of file diff --git a/data/2021/iclr/Lossless Compression of Structured Convolutional Models via Lifting b/data/2021/iclr/Lossless Compression of Structured Convolutional Models via Lifting new file mode 100644 index 0000000000..bfe6bc269b --- /dev/null +++ b/data/2021/iclr/Lossless Compression of Structured Convolutional Models via Lifting @@ -0,0 +1 @@ +Lifting is an efficient technique to scale up graphical models generalized to relational domains by exploiting the underlying symmetries. Concurrently, neural models are continuously expanding from grid-like tensor data into structured representations, such as various attributed graphs and relational databases. To address the irregular structure of the data, the models typically extrapolate on the idea of convolution, effectively introducing parameter sharing in their dynamically unfolded computation graphs. The computation graphs themselves then reflect the symmetries of the underlying data, similarly to the lifted graphical models. Inspired by lifting, we introduce a simple and efficient technique to detect the symmetries and compress the neural models without loss of any information. We demonstrate through experiments that such compression can lead to significant speedups of structured convolutional models, such as various Graph Neural Networks, across various tasks, such as molecule classification and knowledge-base completion.
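The symmetry-detection idea behind such lossless compression can be sketched on a toy unfolded computation graph. This is an illustrative sketch under stated assumptions (node/weight names are hypothetical, and sorting children assumes a permutation-invariant aggregation, as in common GNN message passing), not the paper's implementation:

```python
def compress(nodes):
    """Losslessly merge nodes of an unfolded computation graph.
    nodes: topologically ordered tuples (node_id, op, shared_weight_id,
    child_ids).  Nodes with the same operation, the same shared
    parameters and canonically identical inputs compute identical
    values, so they can share a single evaluation."""
    sig_to_canon = {}  # signature -> canonical node id
    canon = {}         # node id   -> canonical node id
    for nid, op, wid, children in nodes:
        sig = (op, wid, tuple(sorted(canon[c] for c in children)))
        canon[nid] = sig_to_canon.setdefault(sig, nid)
    return canon

# Two symmetric leaves "a" and "b" share weights w0, so they merge;
# consequently the aggregation nodes "u" and "v" merge as well.
nodes = [
    ("a", "embed", "w0", []),
    ("b", "embed", "w0", []),
    ("c", "embed", "w1", []),
    ("u", "agg", "w2", ["a", "c"]),
    ("v", "agg", "w2", ["b", "c"]),
]
canon = compress(nodes)
```

The merge propagates bottom-up, mirroring how lifted inference groups interchangeable variables, which is where the speedup with no change in output comes from.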
\ No newline at end of file diff --git a/data/2021/iclr/LowKey: Leveraging Adversarial Attacks to Protect Social Media Users from Facial Recognition b/data/2021/iclr/LowKey: Leveraging Adversarial Attacks to Protect Social Media Users from Facial Recognition new file mode 100644 index 0000000000..86981ed9c8 --- /dev/null +++ b/data/2021/iclr/LowKey: Leveraging Adversarial Attacks to Protect Social Media Users from Facial Recognition @@ -0,0 +1 @@ +Facial recognition systems are increasingly deployed by private corporations, government agencies, and contractors for consumer services and mass surveillance programs alike. These systems are typically built by scraping social media profiles for user images. Adversarial perturbations have been proposed for bypassing facial recognition systems. However, existing methods fail on full-scale systems and commercial APIs. We develop our own adversarial filter that accounts for the entire image processing pipeline and is demonstrably effective against industrial-grade pipelines that include face detection and large scale databases. Additionally, we release an easy-to-use webtool that significantly degrades the accuracy of Amazon Rekognition and the Microsoft Azure Face Recognition API, reducing the accuracy of each to below 1%. \ No newline at end of file diff --git a/data/2021/iclr/MALI: A memory efficient and reverse accurate integrator for Neural ODEs b/data/2021/iclr/MALI: A memory efficient and reverse accurate integrator for Neural ODEs new file mode 100644 index 0000000000..b623dd9f1f --- /dev/null +++ b/data/2021/iclr/MALI: A memory efficient and reverse accurate integrator for Neural ODEs @@ -0,0 +1 @@ +Neural ordinary differential equations (Neural ODEs) are a new family of deep-learning models with continuous depth. 
However, the numerical estimation of the gradient in the continuous case is not well solved: existing implementations of the adjoint method suffer from inaccuracy in reverse-time trajectory, while the naive method and the adaptive checkpoint adjoint method (ACA) have a memory cost that grows with integration time. In this project, based on the asynchronous leapfrog (ALF) solver, we propose the Memory-efficient ALF Integrator (MALI), which has a constant memory cost \textit{w.r.t.} the number of solver steps in integration, similar to the adjoint method, and guarantees accuracy in reverse-time trajectory (hence accuracy in gradient estimation). We validate MALI in various tasks: on image recognition tasks, to our knowledge, MALI is the first to enable feasible training of a Neural ODE on ImageNet and outperform a well-tuned ResNet, while existing methods fail due to either heavy memory burden or inaccuracy; for time series modeling, MALI significantly outperforms the adjoint method; and for continuous generative models, MALI achieves new state-of-the-art performance. We provide a pypi package at \url{https://jzkay12.github.io/TorchDiffEqPack/} \ No newline at end of file diff --git a/data/2021/iclr/MARS: Markov Molecular Sampling for Multi-objective Drug Discovery b/data/2021/iclr/MARS: Markov Molecular Sampling for Multi-objective Drug Discovery new file mode 100644 index 0000000000..2345b94b0b --- /dev/null +++ b/data/2021/iclr/MARS: Markov Molecular Sampling for Multi-objective Drug Discovery @@ -0,0 +1 @@ +Searching for novel molecules with desired chemical properties is crucial in drug discovery. Existing work focuses on developing neural models to generate either molecular sequences or chemical graphs. However, it remains a big challenge to find novel and diverse compounds satisfying several properties. In this paper, we propose MARS, a method for multi-objective drug molecule discovery.
MARS is based on the idea of generating the chemical candidates by iteratively editing fragments of molecular graphs. To search for high-quality candidates, it employs Markov chain Monte Carlo sampling (MCMC) on molecules with an annealing scheme and an adaptive proposal. To further improve sample efficiency, MARS uses a graph neural network (GNN) to represent and select candidate edits, where the GNN is trained on-the-fly with samples from MCMC. Experiments show that MARS achieves state-of-the-art performance in various multi-objective settings where molecular bio-activity, drug-likeness, and synthesizability are considered. Remarkably, in the most challenging setting where all four objectives are simultaneously optimized, our approach outperforms previous methods significantly in comprehensive evaluations. The code is available at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/MELR: Meta-Learning via Modeling Episode-Level Relationships for Few-Shot Learning b/data/2021/iclr/MELR: Meta-Learning via Modeling Episode-Level Relationships for Few-Shot Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space b/data/2021/iclr/MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space new file mode 100644 index 0000000000..1c71e08e3a --- /dev/null +++ b/data/2021/iclr/MODALS: Modality-agnostic Automated Data Augmentation in the Latent Space @@ -0,0 +1 @@ +Data augmentation is an efficient way to expand a training dataset by creating additional artificial data. While data augmentation is found to be effective in improving the generalization capabilities of models for various machine learning tasks, the underlying augmentation methods are usually manually designed and carefully evaluated for each data modality separately. These include image processing functions for image data and word-replacing rules for text data. 
In this work, we propose an automated data augmentation approach called MODALS (Modality-agnostic Automated Data Augmentation in the Latent Space) to augment data for any modality in a generic way. MODALS exploits automated data augmentation to fine-tune four universal data transformation operations in the latent space to adapt the transform to data of different modalities. Through comprehensive experiments, we demonstrate the effectiveness of MODALS on multiple datasets for text, tabular, time-series and image modalities. \ No newline at end of file diff --git a/data/2021/iclr/MONGOOSE: A Learnable LSH Framework for Efficient Neural Network Training b/data/2021/iclr/MONGOOSE: A Learnable LSH Framework for Efficient Neural Network Training new file mode 100644 index 0000000000..58b7180543 --- /dev/null +++ b/data/2021/iclr/MONGOOSE: A Learnable LSH Framework for Efficient Neural Network Training @@ -0,0 +1 @@ +Recent advances by practitioners in the deep learning community have breathed new life into Locality Sensitive Hashing (LSH), using it to reduce memory and time bottlenecks in neural network (NN) training. However, while LSH has sublinear guarantees for approximate near-neighbor search in theory, it is known to have inefficient query time in practice due to its use of random hash functions. Moreover, when model parameters are changing, LSH suffers from update overhead. This work is motivated by an observation that model parameters evolve slowly, such that the changes do not always require an LSH update to maintain performance. This phenomenon points to the potential for a reduction in update time and allows for a modified learnable version of data-dependent LSH to improve query time at a low cost. We use the above insights to build MONGOOSE, an end-to-end LSH framework for efficient NN training.
In particular, MONGOOSE is equipped with a scheduling algorithm to adaptively perform LSH updates with provable guarantees and learnable hash functions to improve query efficiency. Empirically, we validate MONGOOSE on large-scale deep learning models for recommendation systems and language modeling. We find that it achieves up to 8% better accuracy compared to previous LSH approaches, with 6.5× speed-up and 6× reduction in memory usage. \ No newline at end of file diff --git a/data/2021/iclr/Mapping the Timescale Organization of Neural Language Models b/data/2021/iclr/Mapping the Timescale Organization of Neural Language Models new file mode 100644 index 0000000000..bf7219a0ef --- /dev/null +++ b/data/2021/iclr/Mapping the Timescale Organization of Neural Language Models @@ -0,0 +1 @@ +In the human brain, sequences of language input are processed within a distributed and hierarchical architecture, in which higher stages of processing encode contextual information over longer timescales. In contrast, in recurrent neural networks which perform natural language processing, we know little about how the multiple timescales of contextual information are functionally organized. Therefore, we applied tools developed in neuroscience to map the "processing timescales" of individual units within a word-level LSTM language model. This timescale-mapping method assigned long timescales to units previously found to track long-range syntactic dependencies, and revealed a new cluster of previously unreported long-timescale units. Next, we explored the functional role of units by examining the relationship between their processing timescales and network connectivity. 
We identified two classes of long-timescale units: "Controller" units composed a densely interconnected subnetwork and strongly projected to the forget and input gates of the rest of the network, while "Integrator" units showed the longest timescales in the network, and expressed projection profiles closer to the mean projection profile. Ablating integrator and controller units affected model performance at different positions in a sentence, suggesting distinctive functions of these two sets of units. Finally, we tested the generalization of these results to a character-level LSTM model. In summary, we demonstrated a model-free technique for mapping the timescale organization in neural network models, and we applied this method to reveal the timescale and functional organization of LSTM language models. \ No newline at end of file diff --git a/data/2021/iclr/Mathematical Reasoning via Self-supervised Skip-tree Training b/data/2021/iclr/Mathematical Reasoning via Self-supervised Skip-tree Training new file mode 100644 index 0000000000..22eda5b4a5 --- /dev/null +++ b/data/2021/iclr/Mathematical Reasoning via Self-supervised Skip-tree Training @@ -0,0 +1 @@ +We examine whether self-supervised language modeling applied to mathematical formulas enables logical reasoning. We suggest several logical reasoning tasks that can be used to evaluate language models trained on formal mathematical statements, such as type inference, suggesting missing assumptions and completing equalities. To train language models for formal mathematics, we propose a novel skip-tree task. We find that models trained on the skip-tree task show surprisingly strong mathematical reasoning abilities, and outperform models trained on standard skip-sequence tasks. We also analyze the models' ability to formulate new conjectures by measuring how often the predictions are provable and useful in other proofs.
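The skip-tree idea, masking a whole subtree of a formula and predicting it, can be sketched on toy s-expressions. This is a hypothetical illustration (the paper operates on formal mathematical terms, not nested Python tuples, and all names here are invented):

```python
import random

def subtrees(tree, path=()):
    """Enumerate (path, subtree) pairs of a formula given as nested
    tuples, e.g. ("add", ("mul", "x", "y"), "z")."""
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def skip_tree_example(tree, rng):
    """Build one skip-tree training pair: mask a random proper subtree;
    the masked formula is the model input, the removed subtree is the
    prediction target."""
    path, target = rng.choice([pt for pt in subtrees(tree) if pt[0] != ()])
    def mask(t, p):
        if not p:
            return "<MASK>"
        i = p[0]
        return t[:i] + (mask(t[i], p[1:]),) + t[i + 1:]
    return mask(tree, path), target

formula = ("add", ("mul", "x", "y"), "z")
masked, target = skip_tree_example(formula, random.Random(0))
```

Unlike skip-sequence masking over flat token spans, the masked region here is always a syntactically complete subterm, which is the property the abstract credits for the stronger reasoning performance.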
\ No newline at end of file diff --git a/data/2021/iclr/Measuring Massive Multitask Language Understanding b/data/2021/iclr/Measuring Massive Multitask Language Understanding new file mode 100644 index 0000000000..0585f8cfd5 --- /dev/null +++ b/data/2021/iclr/Measuring Massive Multitask Language Understanding @@ -0,0 +1 @@ +We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings. \ No newline at end of file diff --git a/data/2021/iclr/Memory Optimization for Deep Networks b/data/2021/iclr/Memory Optimization for Deep Networks new file mode 100644 index 0000000000..a551b8ecb6 --- /dev/null +++ b/data/2021/iclr/Memory Optimization for Deep Networks @@ -0,0 +1 @@ +Deep learning is slowly, but steadily, hitting a memory bottleneck. While the tensor computation in top-of-the-line GPUs increased by 32x over the last five years, the total available memory only grew by 2.5x. This prevents researchers from exploring larger architectures, as training large networks requires more memory for storing intermediate outputs. 
In this paper, we present MONeT, an automatic framework that minimizes both the memory footprint and computational overhead of deep networks. MONeT jointly optimizes the checkpointing schedule and the implementation of various operators. MONeT is able to outperform all prior hand-tuned operations as well as automated checkpointing. MONeT reduces the overall memory requirement by 3x for various PyTorch models, with a 9-16% overhead in computation. For the same computation cost, MONeT requires 1.2-1.8x less memory than current state-of-the-art automated checkpointing frameworks. Our code is available at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/Meta Back-Translation b/data/2021/iclr/Meta Back-Translation new file mode 100644 index 0000000000..8d52e9ff88 --- /dev/null +++ b/data/2021/iclr/Meta Back-Translation @@ -0,0 +1 @@ +Back-translation is an effective strategy to improve the performance of Neural Machine Translation (NMT) by generating pseudo-parallel data. However, several recent works have found that better translation quality of the pseudo-parallel data does not necessarily lead to better final translation models, while lower-quality but more diverse data often yields stronger results. In this paper, we propose a novel method to generate pseudo-parallel data from a pre-trained back-translation model. Our method is a meta-learning algorithm which adapts a pre-trained back-translation model so that the pseudo-parallel data it generates would train a forward-translation model to do well on a validation set. In our evaluations on both the standard datasets WMT En-De'14 and WMT En-Fr'14, as well as in a multilingual translation setting, our method leads to significant improvements over strong baselines. Our code will be made available.
\ No newline at end of file diff --git a/data/2021/iclr/Meta-GMVAE: Mixture of Gaussian VAE for Unsupervised Meta-Learning b/data/2021/iclr/Meta-GMVAE: Mixture of Gaussian VAE for Unsupervised Meta-Learning new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Meta-Learning of Structured Task Distributions in Humans and Machines b/data/2021/iclr/Meta-Learning of Structured Task Distributions in Humans and Machines new file mode 100644 index 0000000000..b5c73fafa4 --- /dev/null +++ b/data/2021/iclr/Meta-Learning of Structured Task Distributions in Humans and Machines @@ -0,0 +1 @@ +In recent years, meta-learning, in which a model is trained on a family of tasks (i.e. a task distribution), has emerged as an approach to training neural networks to perform tasks that were previously assumed to require structured representations, making strides toward closing the gap between humans and machines. However, we argue that evaluating meta-learning remains a challenge, and can miss whether meta-learning actually uses the structure embedded within the tasks. These meta-learners might therefore still be significantly different from human learners. To demonstrate this difference, we first define a new meta-reinforcement learning task in which a structured task distribution is generated using a compositional grammar. We then introduce a novel approach to constructing a "null task distribution" with the same statistical complexity as this structured task distribution but without the explicit rule-based structure used to generate the structured task. We train a standard meta-learning agent, a recurrent network trained with model-free reinforcement learning, and compare it with human performance across the two task distributions. We find a double dissociation in which humans do better in the structured task distribution whereas agents do better in the null task distribution -- despite comparable statistical complexity.
This work highlights that multiple strategies can achieve reasonable meta-test performance, and that careful construction of control task distributions is a valuable way to understand which strategies meta-learners acquire, and how they might differ from humans. \ No newline at end of file diff --git a/data/2021/iclr/Meta-Learning with Neural Tangent Kernels b/data/2021/iclr/Meta-Learning with Neural Tangent Kernels new file mode 100644 index 0000000000..c7f7048a6e --- /dev/null +++ b/data/2021/iclr/Meta-Learning with Neural Tangent Kernels @@ -0,0 +1 @@ +Model Agnostic Meta-Learning (MAML) has emerged as a standard framework for meta-learning, where a meta-model is learned with the ability to adapt quickly to new tasks. However, as a double-looped optimization problem, MAML needs to differentiate through the whole inner-loop optimization path for every outer-loop training step, which may lead to both computational inefficiency and sub-optimal solutions. In this paper, we generalize MAML to allow meta-learning to be defined in function spaces, and propose the first meta-learning paradigm in the Reproducing Kernel Hilbert Space (RKHS) induced by the meta-model's Neural Tangent Kernel (NTK). Within this paradigm, we introduce two meta-learning algorithms in the RKHS, which no longer need a sub-optimal iterative inner-loop adaptation as in the MAML framework. We achieve this goal by 1) replacing the adaptation with a fast-adaptive regularizer in the RKHS; and 2) solving the adaptation analytically based on the NTK theory. Extensive experimental studies demonstrate the advantages of our paradigm in both efficiency and quality of solutions compared to related meta-learning algorithms. Another interesting feature of our proposed methods is that they are more robust to adversarial attacks and out-of-distribution adaptation than popular baselines, as demonstrated in our experiments.
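The "solving the adaptation analytically" step in the NTK abstract admits a standard closed form in an RKHS. As an illustration only (the abstract does not state the paper's exact rule), kernel ridge regression with a kernel $k$, here the NTK, adapts the predictor on support data $(X, y)$ via:

```latex
f(x) = k(x, X)\,\bigl(K + \lambda I\bigr)^{-1} y,
\qquad K_{ij} = k(x_i, x_j),
```

where $\lambda$ is a regularization weight. No iterative inner loop is needed, because the adapted function is available in closed form; this is the generic mechanism by which NTK theory can replace MAML's inner-loop gradient steps.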
\ No newline at end of file diff --git a/data/2021/iclr/Meta-learning Symmetries by Reparameterization b/data/2021/iclr/Meta-learning Symmetries by Reparameterization new file mode 100644 index 0000000000..9f03d80f6f --- /dev/null +++ b/data/2021/iclr/Meta-learning Symmetries by Reparameterization @@ -0,0 +1 @@ +Many successful deep learning architectures are equivariant to certain transformations in order to conserve parameters and improve generalization: most famously, convolution layers are equivariant to shifts of the input. This approach only works when practitioners know the symmetries of the task a priori and can manually construct an architecture with the corresponding equivariances. Our goal is a general approach for learning equivariances from data, without needing prior knowledge of a task's symmetries or custom task-specific architectures. We present a method for learning and encoding equivariances into networks by learning corresponding parameter sharing patterns from data. Our method can provably encode equivariance-inducing parameter sharing for any finite group of symmetry transformations, and we find experimentally that it can automatically learn a variety of equivariances from symmetries in data. We provide our experiment code and pre-trained models at this https URL. \ No newline at end of file diff --git a/data/2021/iclr/Meta-learning with negative learning rates b/data/2021/iclr/Meta-learning with negative learning rates new file mode 100644 index 0000000000..5877e732f2 --- /dev/null +++ b/data/2021/iclr/Meta-learning with negative learning rates @@ -0,0 +1 @@ +Deep learning models require a large amount of data to perform well. When data is scarce for a target task, we can transfer the knowledge gained by training on similar tasks to quickly learn the target. A successful approach is meta-learning, or learning to learn a distribution of tasks, where learning to learn is represented by an outer loop and learning itself by an inner loop of gradient descent.
However, a number of recent empirical studies argue that the inner loop is unnecessary and simpler models work equally well or even better. We study the performance of MAML as a function of the learning rate of the inner loop, where zero learning rate implies that there is no inner loop. Using random matrix theory and exact solutions of linear models, we calculate an algebraic expression for the test loss of MAML applied to mixed linear regression and nonlinear regression with overparameterized models. Surprisingly, while the optimal learning rate for adaptation is positive, we find that the optimal learning rate for training is always negative, a setting that has never been considered before. Therefore, not only does the performance increase by decreasing the learning rate to zero, as suggested by recent work, but it can be increased even further by decreasing the learning rate to negative values. These results help clarify under what circumstances meta-learning performs best. \ No newline at end of file diff --git a/data/2021/iclr/MetaNorm: Learning to Normalize Few-Shot Batches Across Domains b/data/2021/iclr/MetaNorm: Learning to Normalize Few-Shot Batches Across Domains new file mode 100644 index 0000000000..0547b6cb13 --- /dev/null +++ b/data/2021/iclr/MetaNorm: Learning to Normalize Few-Shot Batches Across Domains @@ -0,0 +1 @@ +Batch normalization plays a crucial role when training deep neural networks. However, batch statistics become unstable with small batch sizes and are unreliable in the presence of distribution shifts. We propose MetaNorm, a simple yet effective meta-learning normalization. It tackles the aforementioned issues in a unified way by leveraging the meta-learning setting and learns to infer adaptive statistics for batch normalization. MetaNorm is generic, flexible and model-agnostic, making it a simple plug-and-play module that is seamlessly embedded into existing meta-learning approaches.
It can be efficiently implemented by lightweight hyper-networks with low computational cost. We verify its effectiveness by extensive evaluation on representative tasks suffering from the small batch and domain shift problems: few-shot learning and domain generalization. We further introduce an even more challenging setting: few-shot domain generalization. Results demonstrate that MetaNorm consistently achieves better, or at least competitive, accuracy compared to existing batch normalization methods. \ No newline at end of file diff --git a/data/2021/iclr/MiCE: Mixture of Contrastive Experts for Unsupervised Image Clustering b/data/2021/iclr/MiCE: Mixture of Contrastive Experts for Unsupervised Image Clustering new file mode 100644 index 0000000000..f64d57560b --- /dev/null +++ b/data/2021/iclr/MiCE: Mixture of Contrastive Experts for Unsupervised Image Clustering @@ -0,0 +1 @@ +We present Mixture of Contrastive Experts (MiCE), a unified probabilistic clustering framework that simultaneously exploits the discriminative representations learned by contrastive learning and the semantic structures captured by a latent mixture model. Motivated by the mixture of experts, MiCE employs a gating function to partition an unlabeled dataset into subsets according to the latent semantics and multiple experts to discriminate distinct subsets of instances assigned to them in a contrastive learning manner. To solve the nontrivial inference and learning problems caused by the latent variables, we further develop a scalable variant of the Expectation-Maximization (EM) algorithm for MiCE and provide proof of the convergence. Empirically, we evaluate the clustering performance of MiCE on four widely adopted natural image datasets. MiCE achieves significantly better results than various previous methods and a strong contrastive learning baseline. 
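The gating step the MiCE abstract describes, partitioning instances among experts according to latent semantics, can be sketched as a softmax over instance-prototype similarities (the E-step responsibilities in an EM-style scheme). This is a generic illustration; the function names, shapes, and the temperature value are assumptions, not details from the paper.

```python
import numpy as np

def gating_posterior(z, expert_means, tau=0.5):
    """Soft assignment of one instance embedding to K experts.

    z: (C,) L2-normalized instance embedding.
    expert_means: (K, C) one mean direction per expert.
    tau: assumed temperature controlling assignment sharpness.
    Returns a length-K probability vector.
    """
    logits = expert_means @ z / tau
    logits -= logits.max()          # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# toy usage: 3 experts, an instance aligned with expert 1
means = np.eye(3)
z = np.array([0.1, 0.9, 0.1])
z /= np.linalg.norm(z)
resp = gating_posterior(z, means)
```

In a full EM scheme these responsibilities would weight each expert's contrastive loss in the M-step.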
\ No newline at end of file diff --git a/data/2021/iclr/Mind the Gap when Conditioning Amortised Inference in Sequential Latent-Variable Models b/data/2021/iclr/Mind the Gap when Conditioning Amortised Inference in Sequential Latent-Variable Models new file mode 100644 index 0000000000..ac9643bf50 --- /dev/null +++ b/data/2021/iclr/Mind the Gap when Conditioning Amortised Inference in Sequential Latent-Variable Models @@ -0,0 +1 @@ +Amortised inference enables scalable learning of sequential latent-variable models (LVMs) with the evidence lower bound (ELBO). In this setting, variational posteriors are often only partially conditioned. While the true posteriors depend, e.g., on the entire sequence of observations, approximate posteriors are only informed by past observations. This mimics the Bayesian filter -- a mixture of smoothing posteriors. Yet, we show that the ELBO objective forces partially-conditioned amortised posteriors to approximate products of smoothing posteriors instead. Consequently, the learned generative model is compromised. We demonstrate these theoretical findings in three scenarios: traffic flow, handwritten digits, and aerial vehicle dynamics. Using fully-conditioned approximate posteriors, performance improves in terms of generative modelling and multi-step prediction. \ No newline at end of file diff --git a/data/2021/iclr/Mind the Pad - CNNs Can Develop Blind Spots b/data/2021/iclr/Mind the Pad - CNNs Can Develop Blind Spots new file mode 100644 index 0000000000..e2854513a9 --- /dev/null +++ b/data/2021/iclr/Mind the Pad - CNNs Can Develop Blind Spots @@ -0,0 +1 @@ +We show how feature maps in convolutional networks are susceptible to spatial bias. Due to a combination of architectural choices, the activation at certain locations is systematically elevated or weakened. The major source of this bias is the padding mechanism. 
Depending on several aspects of convolution arithmetic, this mechanism can apply the padding unevenly, leading to asymmetries in the learned weights. We demonstrate how such bias can be detrimental to certain tasks such as small object detection: the activation is suppressed if the stimulus lies in the impacted area, leading to blind spots and misdetection. We propose solutions to mitigate spatial bias and demonstrate how they can improve model accuracy. \ No newline at end of file diff --git a/data/2021/iclr/Minimum Width for Universal Approximation b/data/2021/iclr/Minimum Width for Universal Approximation new file mode 100644 index 0000000000..144b7ffb28 --- /dev/null +++ b/data/2021/iclr/Minimum Width for Universal Approximation @@ -0,0 +1 @@ +The universal approximation property of width-bounded networks has been studied as a dual of classical universal approximation results on depth-bounded networks. However, the critical width enabling the universal approximation has not been exactly characterized in terms of the input dimension $d_x$ and the output dimension $d_y$. In this work, we provide the first definitive result in this direction for networks using the ReLU activation functions: The minimum width required for the universal approximation of the $L^p$ functions is exactly $\max\{d_x+1,d_y\}$. We also prove that the same conclusion does not hold for the uniform approximation with ReLU, but does hold with an additional threshold activation function. Our proof technique can be also used to derive a tighter upper bound on the minimum width required for the universal approximation using networks with general activation functions. 
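The minimum-width result above can be restated compactly, with a worked instance that follows directly from the stated theorem:

```latex
w_{\min}(d_x, d_y) = \max\{d_x + 1,\; d_y\}
```

For example, for $L^p$ approximation of functions $f: \mathbb{R}^2 \to \mathbb{R}$ ($d_x = 2$, $d_y = 1$), the minimum width is $\max\{3, 1\} = 3$: width-3 ReLU networks are universal approximators in this setting, while width-2 networks are not.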
\ No newline at end of file diff --git a/data/2021/iclr/Mirostat: a Neural Text decoding Algorithm that directly controls perplexity b/data/2021/iclr/Mirostat: a Neural Text decoding Algorithm that directly controls perplexity new file mode 100644 index 0000000000..d64b974c7a --- /dev/null +++ b/data/2021/iclr/Mirostat: a Neural Text decoding Algorithm that directly controls perplexity @@ -0,0 +1 @@ +Neural text decoding algorithms strongly influence the quality of texts generated using language models, but popular algorithms like top-k, top-p (nucleus), and temperature-based sampling may yield texts that have objectionable repetition or incoherence. Although these methods generate high-quality text after ad hoc parameter tuning that depends on the language model and the length of generated text, not much is known about the control they provide over the statistics of the output. This is important, however, since recent reports show that humans prefer text whose perplexity is neither too high nor too low, and since we experimentally show that cross-entropy (log of perplexity) has a near-linear relation with repetition. First we provide a theoretical analysis of perplexity in top-k, top-p, and temperature sampling, under Zipfian statistics. Then, we use this analysis to design a feedback-based adaptive top-k text decoding algorithm called mirostat that generates text (of any length) with a predetermined target value of perplexity without any tuning. Experiments show that for low values of k and p, perplexity drops significantly with generated text length and leads to excessive repetitions (the boredom trap). Conversely, for large values of k and p, perplexity increases with generated text length and leads to incoherence (the confusion trap). Mirostat avoids both traps. Specifically, we show that setting the target perplexity value beyond a threshold yields negligible sentence-level repetitions.
Experiments with human raters for fluency, coherence, and quality further verify our findings. \ No newline at end of file diff --git a/data/2021/iclr/MixKD: Towards Efficient Distillation of Large-scale Language Models b/data/2021/iclr/MixKD: Towards Efficient Distillation of Large-scale Language Models new file mode 100644 index 0000000000..c389eb96c5 --- /dev/null +++ b/data/2021/iclr/MixKD: Towards Efficient Distillation of Large-scale Language Models @@ -0,0 +1 @@ +Large-scale language models have recently demonstrated impressive empirical performance. Nevertheless, the improved results are attained at the price of bigger models, more power consumption, and slower inference, which hinder their applicability to low-resource (memory and computation) platforms. Knowledge distillation (KD) has been demonstrated as an effective framework for compressing such big models. However, large-scale neural network systems are prone to memorize training instances, and thus tend to make inconsistent predictions when the data distribution is altered slightly. Moreover, the student model has few opportunities to request useful information from the teacher model when there is limited task-specific data available. To address these issues, we propose MixKD, a data-agnostic distillation framework that leverages mixup, a simple yet efficient data augmentation approach, to endow the resulting model with stronger generalization ability. Concretely, in addition to the original training examples, the student model is encouraged to mimic the teacher's behavior on the linear interpolation of example pairs as well. We prove, from a theoretical perspective, that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error. To verify its effectiveness, we conduct experiments on the GLUE benchmark, where MixKD consistently leads to significant gains over the standard KD training, and outperforms several competitive baselines. 
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach. \ No newline at end of file diff --git a/data/2021/iclr/Mixed-Features Vectors and Subspace Splitting b/data/2021/iclr/Mixed-Features Vectors and Subspace Splitting new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/MoPro: Webly Supervised Learning with Momentum Prototypes b/data/2021/iclr/MoPro: Webly Supervised Learning with Momentum Prototypes new file mode 100644 index 0000000000..c1bb3ec35f --- /dev/null +++ b/data/2021/iclr/MoPro: Webly Supervised Learning with Momentum Prototypes @@ -0,0 +1 @@ +We propose a webly-supervised representation learning method that does not suffer from the annotation unscalability of supervised learning, nor the computation unscalability of self-supervised learning. Most existing works on webly-supervised representation learning adopt a vanilla supervised learning method without accounting for the prevalent noise in the training data, whereas most prior methods in learning with label noise are less effective for real-world large-scale noisy data. We propose momentum prototypes (MoPro), a simple contrastive learning method that achieves online label noise correction, out-of-distribution sample removal, and representation learning. MoPro achieves state-of-the-art performance on WebVision, a weakly-labeled noisy dataset. MoPro also shows superior performance when the pretrained model is transferred to down-stream image classification and detection tasks. It outperforms the ImageNet supervised pretrained model by +10.5 on 1-shot classification on VOC, and outperforms the best self-supervised pretrained model by +17.3 when finetuned on 1\% of ImageNet labeled samples. Furthermore, MoPro is more robust to distribution shifts. Code and pretrained models are available at this https URL. 
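The "momentum prototypes" named in the MoPro abstract suggest an exponential-moving-average update of per-class prototype vectors, with noisy labels corrected by reassigning samples to the nearest prototype. The sketch below illustrates that generic mechanism; the function names, the momentum value, and the toy data are assumptions, not details from the paper.

```python
import numpy as np

def update_prototype(proto, z, m=0.9):
    """EMA update of a class prototype toward an embedding z.

    proto, z: L2-normalized feature vectors; m is an assumed
    momentum coefficient. Returns the renormalized prototype.
    """
    proto = m * proto + (1.0 - m) * z
    return proto / np.linalg.norm(proto)   # keep prototype on the unit sphere

def pseudo_label(z, prototypes):
    """Correct a possibly noisy label: pick the prototype most
    similar (cosine) to the embedding."""
    sims = prototypes @ z
    return int(np.argmax(sims))

# toy usage: two class prototypes, one embedding close to class 0
protos = np.eye(2)
z = np.array([0.9, 0.1])
z /= np.linalg.norm(z)
protos[0] = update_prototype(protos[0], z, m=0.9)
label = pseudo_label(z, protos)
```

The EMA keeps prototypes stable against individual noisy samples, which is what makes prototype-based label correction usable on weakly-labeled web data.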
\ No newline at end of file diff --git a/data/2021/iclr/MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond b/data/2021/iclr/MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond new file mode 100644 index 0000000000..1ef6c73089 --- /dev/null +++ b/data/2021/iclr/MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond @@ -0,0 +1 @@ +This paper focuses on visual counting, which aims to predict the number of occurrences given a natural image and a query (e.g., a question or a category). Unlike most prior works that use explicit, symbolic models which can be computationally expensive and limited in generalization, we propose a simple and effective alternative by revisiting modulated convolutions that fuse the query and the image locally. Following the design of residual bottlenecks, we call our method MoVie, short for Modulated conVolutional bottlenecks. Notably, MoVie reasons implicitly and holistically and only needs a single forward pass during inference. Nevertheless, MoVie showcases strong performance for counting: 1) advancing the state-of-the-art on counting-specific VQA tasks while being more efficient; 2) outperforming prior art on difficult benchmarks like COCO for common object counting; 3) it helped us secure first place in the 2020 VQA challenge when integrated as a module for ‘number’-related questions in generic VQA models. Finally, we show evidence that modulated convolutions such as MoVie can serve as a general mechanism for reasoning tasks beyond counting.
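"Modulated convolutions that fuse the query and the image locally" generally means the query embedding produces per-channel scale and shift parameters applied to a convolutional feature map (FiLM-style conditioning). The sketch below shows that generic pattern; all names, shapes, and weight scales are assumptions rather than the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def modulate(features, query_emb, w_gamma, w_beta):
    """FiLM-style modulation of a conv feature map by a query.

    features: (H, W, C) feature map from a conv backbone.
    query_emb: (D,) query embedding.
    w_gamma, w_beta: (D, C) assumed projection weights producing
    per-channel scale and shift.
    """
    gamma = query_emb @ w_gamma          # (C,) per-channel scale offset
    beta = query_emb @ w_beta            # (C,) per-channel shift
    # broadcast over spatial positions: the query modulates every location
    return features * (1.0 + gamma) + beta

H, W, C, D = 4, 4, 8, 16
feats = rng.standard_normal((H, W, C))
q = rng.standard_normal(D)
out = modulate(feats, q,
               rng.standard_normal((D, C)) * 0.01,
               rng.standard_normal((D, C)) * 0.01)
```

Because the same modulation is applied at every spatial location, the fusion stays local and requires only one forward pass, matching the efficiency claim in the abstract.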
\ No newline at end of file diff --git a/data/2021/iclr/Model Patching: Closing the Subgroup Performance Gap with Data Augmentation b/data/2021/iclr/Model Patching: Closing the Subgroup Performance Gap with Data Augmentation new file mode 100644 index 0000000000..43e2bee46c --- /dev/null +++ b/data/2021/iclr/Model Patching: Closing the Subgroup Performance Gap with Data Augmentation @@ -0,0 +1 @@ +Classifiers in machine learning are often brittle when deployed. Particularly concerning are models with inconsistent performance on specific subgroups of a class, e.g., exhibiting disparities in skin cancer classification in the presence or absence of a spurious bandage. To mitigate these performance differences, we introduce model patching, a two-stage framework for improving robustness that encourages the model to be invariant to subgroup differences, and to focus on class information shared by subgroups. Model patching first models subgroup features within a class and learns semantic transformations between them, and then trains a classifier with data augmentations that deliberately manipulate subgroup features. We instantiate model patching with CAMEL, which (1) uses a CycleGAN to learn the intra-class, inter-subgroup augmentations, and (2) balances subgroup performance using a theoretically-motivated subgroup consistency regularizer, accompanied by a new robust objective. We demonstrate CAMEL's effectiveness on 3 benchmark datasets, with reductions in robust error of up to 33% relative to the best baseline. Lastly, CAMEL successfully patches a model that fails due to spurious features on a real-world skin cancer dataset. \ No newline at end of file diff --git a/data/2021/iclr/Model-Based Offline Planning b/data/2021/iclr/Model-Based Offline Planning new file mode 100644 index 0000000000..4e9fafc1ad --- /dev/null +++ b/data/2021/iclr/Model-Based Offline Planning @@ -0,0 +1 @@ +Offline learning is a key part of making reinforcement learning (RL) usable in real systems.
Offline RL looks at scenarios where there is data from a system's operation, but no direct access to the system when learning a policy. Recent work on training RL policies from offline data has shown results both with model-free policies learned directly from the data and with planning on top of learnt models of the data. Model-free policies tend to be more performant, but are more opaque, harder to command externally, and less easy to integrate into larger systems. We propose an offline learner that generates a model that can be used to control the system directly through planning. This allows us to have easily controllable policies directly from data, without ever interacting with the system. We show the performance of our algorithm, Model-Based Offline Planning (MBOP), on a series of robotics-inspired tasks, and demonstrate its ability to leverage planning to respect environmental constraints. We are able to find near-optimal policies for certain simulated systems from as little as 50 seconds of real-time system interaction, and create zero-shot goal-conditioned policies on a series of environments. \ No newline at end of file diff --git a/data/2021/iclr/Model-Based Visual Planning with Self-Supervised Functional Distances b/data/2021/iclr/Model-Based Visual Planning with Self-Supervised Functional Distances new file mode 100644 index 0000000000..908ae60b50 --- /dev/null +++ b/data/2021/iclr/Model-Based Visual Planning with Self-Supervised Functional Distances @@ -0,0 +1 @@ +A generalist robot must be able to complete a variety of tasks in its environment. One appealing way to specify each task is in terms of a goal observation. However, learning goal-reaching policies with reinforcement learning remains a challenging problem, particularly when hand-engineered reward functions are not available.
Learned dynamics models are a promising approach for learning about the environment without rewards or task-directed data, but planning to reach goals with such a model requires a notion of functional similarity between observations and goal states. We present a self-supervised method for model-based visual goal reaching, which uses both a visual dynamics model and a dynamical distance function learned using model-free reinforcement learning. Our approach learns entirely using offline, unlabeled data, making it practical to scale to large and diverse datasets. In our experiments, we find that our method can successfully learn models that perform a variety of tasks at test-time, moving objects amid distractors with a simulated robotic arm and even learning to open and close a drawer using a real-world robot. In comparisons, we find that this approach substantially outperforms both model-free and model-based prior methods. Videos and visualizations are available here: http://sites.google.com/berkeley.edu/mbold. \ No newline at end of file diff --git a/data/2021/iclr/Model-based micro-data reinforcement learning: what are the crucial model properties and which model to choose? b/data/2021/iclr/Model-based micro-data reinforcement learning: what are the crucial model properties and which model to choose? new file mode 100644 index 0000000000..c7abaf9590 --- /dev/null +++ b/data/2021/iclr/Model-based micro-data reinforcement learning: what are the crucial model properties and which model to choose? @@ -0,0 +1 @@ +We contribute to micro-data model-based reinforcement learning (MBRL) by rigorously comparing popular generative models using a fixed (random shooting) control agent. We find that on an environment that requires multimodal posterior predictives, mixture density nets outperform all other models by a large margin.
When multimodality is not required, our surprising finding is that we do not need probabilistic posterior predictives: deterministic models are on par, in fact they consistently (although non-significantly) outperform their probabilistic counterparts. We also found that heteroscedasticity at training time, perhaps acting as a regularizer, improves predictions at longer horizons. On the methodological side, we design metrics and an experimental protocol which can be used to evaluate the various models, predicting their asymptotic performance when using them on the control problem. Using this framework, we improve the state-of-the-art sample complexity of MBRL on Acrobot by a factor of two to four, using an aggressive training schedule which is outside of the hyperparameter interval usually considered. \ No newline at end of file diff --git a/data/2021/iclr/Modeling the Second Player in Distributionally Robust Optimization b/data/2021/iclr/Modeling the Second Player in Distributionally Robust Optimization new file mode 100644 index 0000000000..521d60ed52 --- /dev/null +++ b/data/2021/iclr/Modeling the Second Player in Distributionally Robust Optimization @@ -0,0 +1 @@ +Distributionally robust optimization (DRO) provides a framework for training machine learning models that are able to perform well on a collection of related data distributions (the "uncertainty set"). This is done by solving a min-max game: the model is trained to minimize its maximum expected loss among all distributions in the uncertainty set. While careful design of the uncertainty set is critical to the success of the DRO procedure, previous work has been limited to relatively simple alternatives that keep the min-max optimization problem exactly tractable, such as $f$-divergence balls. In this paper, we argue instead for the use of neural generative models to characterize the worst-case distribution, allowing for more flexible and problem-specific selection of the uncertainty set.
However, while simple conceptually, this approach poses a number of implementation and optimization challenges. To circumvent these issues, we propose a relaxation of the KL-constrained inner maximization objective that makes the DRO problem more amenable to gradient-based optimization of large scale generative models, and develop model selection heuristics to guide hyper-parameter search. On both toy settings and realistic NLP tasks, we find that the proposed approach yields models that are more robust than comparable baselines. \ No newline at end of file diff --git a/data/2021/iclr/Modelling Hierarchical Structure between Dialogue Policy and Natural Language Generator with Option Framework for Task-oriented Dialogue System b/data/2021/iclr/Modelling Hierarchical Structure between Dialogue Policy and Natural Language Generator with Option Framework for Task-oriented Dialogue System new file mode 100644 index 0000000000..666c25581d --- /dev/null +++ b/data/2021/iclr/Modelling Hierarchical Structure between Dialogue Policy and Natural Language Generator with Option Framework for Task-oriented Dialogue System @@ -0,0 +1 @@ +Designing task-oriented dialogue systems is a challenging research topic, since it needs not only to generate utterances fulfilling user requests but also to guarantee the comprehensibility. Many previous works trained end-to-end (E2E) models with supervised learning (SL), however, the bias in annotated system utterances remains as a bottleneck. Reinforcement learning (RL) deals with the problem through using non-differentiable evaluation metrics (e.g., the success rate) as rewards. Nonetheless, existing works with RL showed that the comprehensibility of generated system utterances could be corrupted when improving the performance on fulfilling user requests. 
In our work, we (1) propose modelling the hierarchical structure between dialogue policy and natural language generator (NLG) with the option framework, called HDNO, where the latent dialogue act is applied to avoid designing specific dialogue act representations; (2) train HDNO via hierarchical reinforcement learning (HRL), as well as suggest the asynchronous updates between dialogue policy and NLG during training to theoretically guarantee their convergence to a local maximizer; and (3) propose using a discriminator modelled with language models as an additional reward to further improve the comprehensibility. We test HDNO on MultiWoz 2.0 and MultiWoz 2.1, the datasets on multi-domain dialogues, in comparison with a word-level E2E model trained with RL, LaRL and HDSA, showing improvements in the performance evaluated by automatic evaluation metrics and human evaluation. Finally, we demonstrate the semantic meanings of latent dialogue acts to show their interpretability. \ No newline at end of file diff --git a/data/2021/iclr/Molecule Optimization by Explainable Evolution b/data/2021/iclr/Molecule Optimization by Explainable Evolution new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Monotonic Kronecker-Factored Lattice b/data/2021/iclr/Monotonic Kronecker-Factored Lattice new file mode 100644 index 0000000000..37e74dcf7a --- /dev/null +++ b/data/2021/iclr/Monotonic Kronecker-Factored Lattice @@ -0,0 +1 @@ +It is computationally challenging to learn flexible monotonic functions that guarantee model behavior and provide interpretability beyond a few input features, and at a time when minimizing resource use is increasingly important, we must be able to learn such models that are still efficient. In this paper we show how to effectively and efficiently learn such functions using Kronecker-Factored Lattice (KFL), an efficient reparameterization of flexible monotonic lattice regression via Kronecker product.
Both computational and storage costs scale linearly in the number of input features, which is a significant improvement over existing methods that grow exponentially. We also show that we can still properly enforce monotonicity and other shape constraints. The KFL function class consists of products of piecewise-linear functions, and the size of the function class can be further increased through ensembling. We prove that the function class of an ensemble of M base KFL models strictly increases as M increases up to a certain threshold. Beyond this threshold, every multilinear interpolated lattice function can be expressed. Our experimental results demonstrate that KFL trains faster with fewer parameters while still achieving accuracy and evaluation speeds comparable to or better than the baseline methods and preserving monotonicity guarantees on the learned model. \ No newline at end of file diff --git a/data/2021/iclr/Monte-Carlo Planning and Learning with Language Action Value Estimates b/data/2021/iclr/Monte-Carlo Planning and Learning with Language Action Value Estimates new file mode 100644 index 0000000000..ddbd6b6489 --- /dev/null +++ b/data/2021/iclr/Monte-Carlo Planning and Learning with Language Action Value Estimates @@ -0,0 +1 @@ +Interactive Fiction (IF) games provide a useful testbed for language-based reinforcement learning agents, posing significant challenges of natural language understanding, commonsense reasoning, and non-myopic planning in the combinatorial search space. Agents using standard planning algorithms struggle to play IF games due to the massive search space of language actions. Thus, language-grounded planning is a key ability of such agents, since inferring the consequences of language actions based on semantic understanding can drastically improve search. In this paper, we introduce Monte-Carlo planning with Language Action Value Estimates (MC-LAVE), which combines Monte-Carlo tree search with language-driven exploration.
MC-LAVE concentrates search effort on semantically promising language actions using locally optimistic language value estimates, yielding a significant reduction in the effective search space of language actions. We then present a reinforcement learning approach built on MC-LAVE, which alternates between MC-LAVE planning and supervised learning of the self-generated language actions. In the experiments, we demonstrate that our method achieves new high scores in various IF games. \ No newline at end of file diff --git a/data/2021/iclr/More or Less: When and How to Build Convolutional Neural Network Ensembles b/data/2021/iclr/More or Less: When and How to Build Convolutional Neural Network Ensembles new file mode 100644 index 0000000000..b7cab75542 --- /dev/null +++ b/data/2021/iclr/More or Less: When and How to Build Convolutional Neural Network Ensembles @@ -0,0 +1 @@ +provide \ No newline at end of file diff --git a/data/2021/iclr/Multi-Class Uncertainty Calibration via Mutual Information Maximization-based Binning b/data/2021/iclr/Multi-Class Uncertainty Calibration via Mutual Information Maximization-based Binning new file mode 100644 index 0000000000..345b7ff010 --- /dev/null +++ b/data/2021/iclr/Multi-Class Uncertainty Calibration via Mutual Information Maximization-based Binning @@ -0,0 +1 @@ +Post-hoc calibration is a common approach for providing high-quality confidence estimates of deep neural network predictions. Recent work has shown that widely used scaling methods underestimate their calibration error, while alternative Histogram Binning (HB) methods with verifiable calibration performance often fail to preserve classification accuracy. In the case of multi-class calibration with a large number of classes K, HB also faces the issue of severe sample-inefficiency due to a large class imbalance resulting from the conversion into K one-vs-rest class-wise calibration problems.
The goal of this paper is to resolve the identified issues of HB in order to provide verified and calibrated confidence estimates using only a small holdout calibration dataset for bin optimization while preserving multi-class ranking accuracy. From an information-theoretic perspective, we derive the I-Max concept for binning, which maximizes the mutual information between labels and binned (quantized) logits. This concept mitigates potential loss in ranking performance due to lossy quantization, and by disentangling the optimization of bin edges and representatives allows simultaneous improvement of ranking and calibration performance. In addition, we propose a shared class-wise (sCW) binning strategy that fits a single calibrator on the merged training sets of all K class-wise problems, yielding reliable estimates from a small calibration set. The combination of sCW and I-Max binning outperforms state-of-the-art calibration methods on various evaluation metrics across different benchmark datasets and models, even when using only a small set of calibration data, e.g. 1k samples for ImageNet. \ No newline at end of file diff --git a/data/2021/iclr/Multi-Level Local SGD: Distributed SGD for Heterogeneous Hierarchical Networks b/data/2021/iclr/Multi-Level Local SGD: Distributed SGD for Heterogeneous Hierarchical Networks new file mode 100644 index 0000000000..6830b3baad --- /dev/null +++ b/data/2021/iclr/Multi-Level Local SGD: Distributed SGD for Heterogeneous Hierarchical Networks @@ -0,0 +1 @@ +We propose Multi-Level Local SGD, a distributed gradient method for learning a smooth, non-convex objective in a heterogeneous multi-level network. Our network model consists of a set of disjoint sub-networks, with a single hub and multiple worker nodes; further, worker nodes may have different operating rates. The hubs exchange information with one another via a connected, but not necessarily complete communication network.
In our algorithm, sub-networks execute a distributed SGD algorithm, using a hub-and-spoke paradigm, and the hubs periodically average their models with neighboring hubs. We first provide a unified mathematical framework that describes the Multi-Level Local SGD algorithm. We then present a theoretical analysis of the algorithm; our analysis shows the dependence of the convergence error on the worker node heterogeneity, hub network topology, and the number of local, sub-network, and global iterations. We back up our theoretical results via simulation-based experiments using both convex and non-convex objectives. \ No newline at end of file diff --git a/data/2021/iclr/Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning A Randomly Weighted Network b/data/2021/iclr/Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning A Randomly Weighted Network new file mode 100644 index 0000000000..af98083381 --- /dev/null +++ b/data/2021/iclr/Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning A Randomly Weighted Network @@ -0,0 +1 @@ +Recently, Frankle & Carbin (2019) demonstrated that randomly-initialized dense networks contain subnetworks that, once found, can be trained to reach test accuracy comparable to the trained dense network. However, finding these high-performing trainable subnetworks is expensive, requiring an iterative process of training and pruning weights. In this paper, we propose (and prove) a stronger Multi-Prize Lottery Ticket Hypothesis: A sufficiently over-parameterized neural network with random weights contains several subnetworks (winning tickets) that (a) have comparable accuracy to a dense target network with learned weights (prize 1), (b) do not require any further training to achieve prize 1 (prize 2), and (c) are robust to extreme forms of quantization (i.e., binary weights and/or activation) (prize 3).
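The hub-and-spoke averaging scheme the Multi-Level Local SGD abstract describes can be sketched in a few lines. This is a toy illustration under assumed settings (a shared quadratic objective, two hubs with two workers each, fixed local and global periods), not the authors' implementation:

```python
import numpy as np

# Toy sketch: two hubs, each with two workers, all minimizing the shared
# quadratic loss f(w) = 0.5 * ||w - target||^2. Workers run local SGD;
# each hub averages its workers every LOCAL steps, and the hubs average
# with each other every GLOBAL rounds.

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0])

def grad(w):
    return w - target  # gradient of 0.5 * ||w - target||^2

workers = [rng.normal(size=2) for _ in range(4)]  # workers 0,1 -> hub A; 2,3 -> hub B
LOCAL, GLOBAL, LR = 5, 3, 0.1

for rnd in range(GLOBAL * 4):
    for _ in range(LOCAL):                      # local SGD steps on each worker
        workers = [w - LR * grad(w) for w in workers]
    hub_a = np.mean(workers[:2], axis=0)        # sub-network (hub) averaging
    hub_b = np.mean(workers[2:], axis=0)
    if (rnd + 1) % GLOBAL == 0:                 # periodic hub-to-hub averaging
        hub_a = hub_b = 0.5 * (hub_a + hub_b)
    workers = [hub_a, hub_a, hub_b, hub_b]      # broadcast hub model back to workers
```

Because every local step contracts each worker toward the minimizer and averaging preserves that contraction, all workers end up near the shared target, which is the qualitative behavior the paper's analysis quantifies.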
This provides a new paradigm for learning compact yet highly accurate binary neural networks simply by pruning and quantizing randomly weighted full precision neural networks. We also propose an algorithm for finding multi-prize tickets (MPTs) and test it by performing a series of experiments on CIFAR-10 and ImageNet datasets. Empirical results indicate that as models grow deeper and wider, multi-prize tickets start to reach similar (and sometimes even higher) test accuracy compared to their significantly larger and full-precision counterparts that have been weight-trained. Without ever updating the weight values, our MPTs-1/32 not only set new binary weight network state-of-the-art (SOTA) Top-1 accuracy -- 94.8% on CIFAR-10 and 74.03% on ImageNet -- but also outperform their full-precision counterparts by 1.78% and 0.76%, respectively. Further, our MPT-1/1 achieves SOTA Top-1 accuracy (91.9%) for binary neural networks on CIFAR-10. Code and pre-trained models are available at: https://github.com/chrundle/biprop. \ No newline at end of file diff --git a/data/2021/iclr/Multi-Time Attention Networks for Irregularly Sampled Time Series b/data/2021/iclr/Multi-Time Attention Networks for Irregularly Sampled Time Series new file mode 100644 index 0000000000..c826c47d48 --- /dev/null +++ b/data/2021/iclr/Multi-Time Attention Networks for Irregularly Sampled Time Series @@ -0,0 +1 @@ +Irregular sampling occurs in many time series modeling applications where it presents a significant challenge to standard deep learning models. This work is motivated by the analysis of physiological time series data in electronic health records, which are sparse, irregularly sampled, and multivariate. In this paper, we propose a new deep learning framework for this setting that we call Multi-Time Attention Networks. 
Multi-Time Attention Networks learn an embedding of continuous time values and use an attention mechanism to produce a fixed-length representation of a time series containing a variable number of observations. We investigate the performance of our framework on interpolation and classification tasks using multiple datasets. Our results show that our approach performs as well as or better than a range of baseline and recently proposed models while offering significantly faster training times than current state-of-the-art methods. \ No newline at end of file diff --git a/data/2021/iclr/Multi-resolution modeling of a discrete stochastic process identifies causes of cancer b/data/2021/iclr/Multi-resolution modeling of a discrete stochastic process identifies causes of cancer new file mode 100644 index 0000000000..fdfecb6950 --- /dev/null +++ b/data/2021/iclr/Multi-resolution modeling of a discrete stochastic process identifies causes of cancer @@ -0,0 +1 @@ +Detection of cancer-causing mutations within the vast and mostly unexplored human genome is a major challenge. Doing so requires modeling the background mutation rate, a highly non-stationary stochastic process, across regions of interest varying in size from one to millions of positions. Here, we present the split-Poisson-Gamma (SPG) distribution, an extension of the classical Poisson-Gamma formulation, to model a discrete stochastic process at multiple resolutions. We demonstrate that the probability model has a closed-form posterior, enabling efficient and accurate linear-time prediction over any length scale after the parameters of the model have been inferred a single time. We apply our framework to model mutation rates in tumors and show that model parameters can be accurately inferred from high-dimensional epigenetic data using a convolutional neural network, Gaussian process, and maximum-likelihood estimation.
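The fixed-length attention summary that the Multi-Time Attention Networks abstract describes can be illustrated with a minimal numpy sketch. The fixed sinusoidal time embedding and single attention head below are simplifying assumptions (the paper learns the time embedding and uses a richer architecture):

```python
import numpy as np

# Minimal sketch: embed continuous observation times with sinusoids, then
# attend from a fixed grid of reference times to the observed times,
# producing a fixed-length representation regardless of how many
# observations the irregularly sampled series contains.

rng = np.random.default_rng(0)

def time_embed(t, dim=8):
    freqs = 2.0 ** np.arange(dim // 2)          # fixed frequencies (assumed; the paper learns them)
    ang = np.outer(t, freqs)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

def mtan_repr(obs_times, obs_values, ref_times):
    q = time_embed(ref_times)                   # queries: reference time grid
    k = time_embed(obs_times)                   # keys: irregular observation times
    scores = q @ k.T / np.sqrt(q.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)     # softmax over observations
    return attn @ obs_values[:, None]           # shape (len(ref_times), 1), fixed length

obs_t = np.sort(rng.uniform(0, 1, size=7))      # 7 irregularly spaced observations
obs_v = np.sin(2 * np.pi * obs_t)
rep = mtan_repr(obs_t, obs_v, np.linspace(0, 1, 16))
print(rep.shape)
```

The output shape depends only on the reference grid, which is what lets downstream classifiers consume series with any number of observations.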
Our method is both more accurate and more efficient than existing models over a large range of length scales. We demonstrate the usefulness of multi-resolution modeling by detecting genomic elements that drive tumor emergence and are of vastly differing sizes. \ No newline at end of file diff --git a/data/2021/iclr/Multi-timescale Representation Learning in LSTM Language Models b/data/2021/iclr/Multi-timescale Representation Learning in LSTM Language Models new file mode 100644 index 0000000000..e7ac24b1f0 --- /dev/null +++ b/data/2021/iclr/Multi-timescale Representation Learning in LSTM Language Models @@ -0,0 +1 @@ +Although neural language models are effective at capturing statistics of natural language, their representations are challenging to interpret. In particular, it is unclear how these models retain information over multiple timescales. In this work, we construct explicitly multi-timescale language models by manipulating the input and forget gate biases in a long short-term memory (LSTM) network. The distribution of timescales is selected to approximate power law statistics of natural language through a combination of exponentially decaying memory cells. We then empirically analyze the timescale of information routed through each part of the model using word ablation experiments and forget gate visualizations. These experiments show that the multi-timescale model successfully learns representations at the desired timescales, and that the distribution includes longer timescales than a standard LSTM. Further, information about high-, mid-, and low-frequency words is routed preferentially through units with the appropriate timescales. Thus, we show how to construct language models with interpretable representations of different information timescales.
\ No newline at end of file diff --git a/data/2021/iclr/MultiModalQA: complex question answering over text, tables and images b/data/2021/iclr/MultiModalQA: complex question answering over text, tables and images new file mode 100644 index 0000000000..eaa3c2e112 --- /dev/null +++ b/data/2021/iclr/MultiModalQA: complex question answering over text, tables and images @@ -0,0 +1 @@ +When answering complex questions, people can seamlessly combine information from visual, textual and tabular sources. While interest in models that reason over multiple pieces of evidence has surged in recent years, there has been relatively little work on question answering models that reason across multiple modalities. In this paper, we present MultiModalQA (MMQA): a challenging question answering dataset that requires joint reasoning over text, tables and images. We create MMQA using a new framework for generating complex multi-modal questions at scale, harvesting tables from Wikipedia, and attaching images and text paragraphs using entities that appear in each table. We then define a formal language that allows us to take questions that can be answered from a single modality, and combine them to generate cross-modal questions. Lastly, crowdsourcing workers take these automatically-generated questions and rephrase them into more fluent language.
We create 29,918 questions through this procedure, and empirically demonstrate the necessity of a multi-modal multi-hop approach to solve our task: our multi-hop model, ImplicitDecomp, achieves an average F1 of 51.7 over cross-modal questions, substantially outperforming a strong baseline that achieves 38.2 F1, but still lagging significantly behind human performance, which is at 90.1 F1. \ No newline at end of file diff --git a/data/2021/iclr/Multiplicative Filter Networks b/data/2021/iclr/Multiplicative Filter Networks new file mode 100644 index 0000000000..d72f65ed66 --- /dev/null +++ b/data/2021/iclr/Multiplicative Filter Networks @@ -0,0 +1 @@ +Although deep networks are typically used to approximate functions over high-dimensional inputs, recent work has increased interest in neural networks as function approximators for low-dimensional-but-complex functions, such as representing images as a function of pixel coordinates, solving differential equations, or representing signed distance functions or neural radiance fields. Key to these recent successes has been the use of new elements such as sinusoidal nonlinearities or Fourier features in positional encodings, which vastly outperform simple ReLU networks. In this paper, we propose and empirically demonstrate that an arguably simpler class of function approximators can work just as well for such problems: multiplicative filter networks. In these networks, we avoid traditional compositional depth altogether, and simply multiply together (linear functions of) sinusoidal or Gabor wavelet functions applied to the input. This representation has the notable advantage that the entire function can simply be viewed as a linear function approximator over an exponential number of Fourier or Gabor basis functions, respectively.
Despite this simplicity, when compared to recent approaches that use Fourier features with ReLU networks or sinusoidal activation networks, we show that these multiplicative filter networks largely outperform or match the performance of these approaches on the domains highlighted in these past works. \ No newline at end of file diff --git a/data/2021/iclr/Multiscale Score Matching for Out-of-Distribution Detection b/data/2021/iclr/Multiscale Score Matching for Out-of-Distribution Detection new file mode 100644 index 0000000000..1bf7f91737 --- /dev/null +++ b/data/2021/iclr/Multiscale Score Matching for Out-of-Distribution Detection @@ -0,0 +1 @@ +We present a new methodology for detecting out-of-distribution (OOD) images by utilizing norms of the score estimates at multiple noise scales. A score is defined to be the gradient of the log density with respect to the input data. Our methodology is completely unsupervised and follows a straightforward training scheme. First, we train a deep network to estimate scores for L levels of noise. Once trained, we calculate the noisy score estimates for N in-distribution samples and take the L2-norms across the input dimensions (resulting in an NxL matrix). Then we train an auxiliary model (such as a Gaussian Mixture Model) to learn the in-distribution spatial regions in this L-dimensional space. This auxiliary model can now be used to identify points that reside outside the learned space. Despite its simplicity, our experiments show that this methodology significantly outperforms the state-of-the-art in detecting out-of-distribution images. For example, our method can effectively separate CIFAR-10 (inlier) and SVHN (OOD) images, a setting which has been previously shown to be difficult for deep likelihood models.
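The multiscale score-matching pipeline above can be mimicked end to end on synthetic data. In this sketch an analytic Gaussian score stands in for the trained score network, and a single fitted Gaussian replaces the auxiliary Gaussian Mixture Model; both substitutions are assumptions for illustration only:

```python
import numpy as np

# Toy version of the pipeline: for Gaussian data N(0, I) corrupted with
# noise scale sigma, the score of the noisy density is -x / (1 + sigma^2),
# so no network training is needed here. We build the N x L matrix of
# score norms at L noise scales, fit a Gaussian to the in-distribution
# features, and use Mahalanobis distance as the OOD score.

rng = np.random.default_rng(0)
D, SIGMAS = 16, np.array([0.1, 0.5, 1.0])

def score_norms(x):                             # x: (N, D) -> (N, L) feature matrix
    feats = [np.linalg.norm(-x / (1.0 + s**2), axis=1) for s in SIGMAS]
    return np.stack(feats, axis=1)

inliers = rng.normal(0.0, 1.0, size=(500, D))   # in-distribution: N(0, I)
outliers = rng.normal(3.0, 1.0, size=(50, D))   # mean-shifted OOD samples

F = score_norms(inliers)
mu, cov = F.mean(axis=0), np.cov(F.T) + 1e-6 * np.eye(len(SIGMAS))
inv = np.linalg.inv(cov)

def maha(x):
    d = score_norms(x) - mu
    return np.einsum('ij,jk,ik->i', d, inv, d)  # squared Mahalanobis distance

print(np.median(maha(inliers)), np.median(maha(outliers)))
```

The shifted samples produce much larger score norms at every scale, so their Mahalanobis distances separate cleanly from the inliers, mirroring the CIFAR-10 vs. SVHN separation the abstract reports.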
\ No newline at end of file diff --git a/data/2021/iclr/Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows b/data/2021/iclr/Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows new file mode 100644 index 0000000000..f42c74b65f --- /dev/null +++ b/data/2021/iclr/Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows @@ -0,0 +1 @@ +Probabilistic forecasting of irregularly sampled multivariate time series with missing values is an important problem in many fields, including health care, astronomy, and climate. State-of-the-art methods for the task estimate only marginal distributions of observations in single channels and at single timepoints, assuming a fixed-shape parametric distribution. In this work, we propose a novel model, ProFITi, for probabilistic forecasting of irregularly sampled time series with missing values using conditional normalizing flows. The model learns joint distributions over the future values of the time series conditioned on past observations and queried channels and times, without assuming any fixed shape of the underlying distribution. As model components, we introduce a novel invertible triangular attention layer and an invertible non-linear activation function on and onto the whole real line. We conduct extensive experiments on four datasets and demonstrate that the proposed model provides $4$ times higher likelihood over the previously best model. \ No newline at end of file diff --git a/data/2021/iclr/Mutual Information State Intrinsic Control b/data/2021/iclr/Mutual Information State Intrinsic Control new file mode 100644 index 0000000000..269659c040 --- /dev/null +++ b/data/2021/iclr/Mutual Information State Intrinsic Control @@ -0,0 +1 @@ +Reinforcement learning has been shown to be highly successful at many challenging tasks. However, success heavily relies on well-shaped rewards. 
Intrinsically motivated RL attempts to remove this constraint by defining an intrinsic reward function. Motivated by the self-consciousness concept in psychology, we make a natural assumption that the agent knows what constitutes itself, and propose a new intrinsic objective that encourages the agent to have maximum control over the environment. We mathematically formalize this reward as the mutual information between the agent state and the surrounding state under the current agent policy. With this new intrinsic motivation, we are able to outperform previous methods, including being able to complete the pick-and-place task for the first time without using any task reward. A video showing experimental results is available at https://youtu.be/AUCwc9RThpk. \ No newline at end of file diff --git a/data/2021/iclr/My Body is a Cage: the Role of Morphology in Graph-Based Incompatible Control b/data/2021/iclr/My Body is a Cage: the Role of Morphology in Graph-Based Incompatible Control new file mode 100644 index 0000000000..aa1ce0e23d --- /dev/null +++ b/data/2021/iclr/My Body is a Cage: the Role of Morphology in Graph-Based Incompatible Control @@ -0,0 +1 @@ +Multitask Reinforcement Learning is a promising way to obtain models with better performance, generalisation, data efficiency, and robustness. Most existing work is limited to compatible settings, where the state and action space dimensions are the same across tasks. Graph Neural Networks (GNN) are one way to address incompatible environments, because they can process graphs of arbitrary size. They also allow practitioners to inject biases encoded in the structure of the input graph. Existing work in graph-based continuous control uses the physical morphology of the agent to construct the input graph, i.e., encoding limb features as node labels and using edges to connect the nodes if their corresponding limbs are physically connected.
In this work, we present a series of ablations on existing methods that show that morphological information encoded in the graph does not improve their performance. Motivated by the hypothesis that any benefits GNNs extract from the graph structure are outweighed by difficulties they create for message passing, we also propose Amorpheus, a transformer-based approach. Further results show that, while Amorpheus ignores the morphological information that GNNs encode, it nonetheless substantially outperforms GNN-based methods. \ No newline at end of file diff --git a/data/2021/iclr/NAS-Bench-ASR: Reproducible Neural Architecture Search for Speech Recognition b/data/2021/iclr/NAS-Bench-ASR: Reproducible Neural Architecture Search for Speech Recognition new file mode 100644 index 0000000000..788636ffba --- /dev/null +++ b/data/2021/iclr/NAS-Bench-ASR: Reproducible Neural Architecture Search for Speech Recognition @@ -0,0 +1 @@ +to \ No newline at end of file diff --git a/data/2021/iclr/NBDT: Neural-Backed Decision Tree b/data/2021/iclr/NBDT: Neural-Backed Decision Tree new file mode 100644 index 0000000000..be766b5f02 --- /dev/null +++ b/data/2021/iclr/NBDT: Neural-Backed Decision Tree @@ -0,0 +1 @@ +Machine learning applications such as finance and medicine demand accurate and justifiable predictions, barring most deep learning methods from use. In response, previous work combines decision trees with deep learning, yielding models that (1) sacrifice interpretability for accuracy or (2) sacrifice accuracy for interpretability. We forgo this dilemma by jointly improving accuracy and interpretability using Neural-Backed Decision Trees (NBDTs). NBDTs replace a neural network’s final linear layer with a differentiable sequence of decisions and a surrogate loss. 
This forces the model to learn high-level concepts and lessens reliance on highly uncertain decisions, yielding (1) accuracy: NBDTs match or outperform modern neural networks on CIFAR and ImageNet, and generalize better to unseen classes by up to 16%. Furthermore, our surrogate loss improves the original model’s accuracy by up to 2%. NBDTs also afford (2) interpretability: improving human trust by clearly identifying model mistakes and assisting in dataset debugging. Code and pretrained NBDTs are at github.com/alvinwan/neural-backed-decision-trees. \ No newline at end of file diff --git a/data/2021/iclr/NOVAS: Non-convex Optimization via Adaptive Stochastic Search for End-to-end Learning and Control b/data/2021/iclr/NOVAS: Non-convex Optimization via Adaptive Stochastic Search for End-to-end Learning and Control new file mode 100644 index 0000000000..49b3e7cae5 --- /dev/null +++ b/data/2021/iclr/NOVAS: Non-convex Optimization via Adaptive Stochastic Search for End-to-end Learning and Control @@ -0,0 +1 @@ +In this work we propose the use of adaptive stochastic search as a building block for general, non-convex optimization operations within deep neural network architectures. Specifically, for an objective function located at some layer in the network and parameterized by some network parameters, we employ adaptive stochastic search to perform optimization over its output. This operation is differentiable and does not obstruct the passing of gradients during backpropagation, thus enabling us to incorporate it as a component in end-to-end learning. We study the proposed optimization module's properties and benchmark it against two existing alternatives on a synthetic energy-based structured prediction task, and further showcase its use in stochastic optimal control applications.
\ No newline at end of file diff --git a/data/2021/iclr/NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation b/data/2021/iclr/NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation new file mode 100644 index 0000000000..ae21f5baf8 --- /dev/null +++ b/data/2021/iclr/NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation @@ -0,0 +1 @@ +3D pose estimation is a challenging but important task in computer vision. In this work, we show that standard deep learning approaches to 3D pose estimation are not robust when objects are partially occluded or viewed from a previously unseen pose. Inspired by the robustness of generative vision models to partial occlusion, we propose to integrate deep neural networks with 3D generative representations of objects into a unified neural architecture that we term NeMo. In particular, NeMo learns a generative model of neural feature activations at each vertex on a dense 3D mesh. Using differentiable rendering we estimate the 3D object pose by minimizing the reconstruction error between NeMo and the feature representation of the target image. To avoid local optima in the reconstruction loss, we train the feature extractor to maximize the distance between the individual feature representations on the mesh using contrastive learning. Our extensive experiments on PASCAL3D+, occluded-PASCAL3D+ and ObjectNet3D show that NeMo is much more robust to partial occlusion and unseen pose compared to standard deep networks, while retaining competitive performance on regular data. Interestingly, our experiments also show that NeMo performs reasonably well even when the mesh representation only crudely approximates the true object geometry with a cuboid, hence revealing that the detailed 3D geometry is not needed for accurate 3D pose estimation. The code is publicly available at https://github.com/Angtian/NeMo. 
\ No newline at end of file diff --git a/data/2021/iclr/Nearest Neighbor Machine Translation b/data/2021/iclr/Nearest Neighbor Machine Translation new file mode 100644 index 0000000000..fa02398610 --- /dev/null +++ b/data/2021/iclr/Nearest Neighbor Machine Translation @@ -0,0 +1 @@ +We introduce $k$-nearest-neighbor machine translation ($k$NN-MT), which predicts tokens with a nearest neighbor classifier over a large datastore of cached examples, using representations from a neural translation model for similarity search. This approach requires no additional training and scales to give the decoder direct access to billions of examples at test time, resulting in a highly expressive model that consistently improves performance across many settings. Simply adding nearest neighbor search improves a state-of-the-art German-English translation model by 1.5 BLEU. $k$NN-MT allows a single model to be adapted to diverse domains by using a domain-specific datastore, improving results by an average of 9.2 BLEU over zero-shot transfer, and achieving new state-of-the-art results---without training on these domains. A massively multilingual model can also be specialized for particular language pairs, with improvements of 3 BLEU for translating from English into German and Chinese. Qualitatively, $k$NN-MT is easily interpretable; it combines source and target context to retrieve highly relevant examples. \ No newline at end of file diff --git a/data/2021/iclr/Negative Data Augmentation b/data/2021/iclr/Negative Data Augmentation new file mode 100644 index 0000000000..a8ec3fbf2c --- /dev/null +++ b/data/2021/iclr/Negative Data Augmentation @@ -0,0 +1 @@ +In practical applications, the generalization capability of face anti-spoofing (FAS) models on unseen domains is of paramount importance to adapt to diverse camera sensors, device drift, environmental variation, and unpredictable attack types. 
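The kNN-MT prediction rule from the Nearest Neighbor Machine Translation abstract reduces to a few operations per decoding step. The toy datastore, vocabulary size, temperature, and interpolation weight below are assumptions for illustration; a real system would use cached decoder hidden states and approximate nearest-neighbor search over billions of entries:

```python
import numpy as np

# Schematic kNN-MT decoding step: a datastore maps cached decoder states to
# the next tokens observed after them; at test time we retrieve the k
# nearest states, turn negative distances into a token distribution, and
# interpolate it with the base translation model's distribution.

rng = np.random.default_rng(0)
VOCAB, DIM, K, LAM = 5, 4, 3, 0.5

keys = rng.normal(size=(100, DIM))              # cached decoder states (datastore keys)
vals = rng.integers(0, VOCAB, size=100)         # next token observed for each state

def knn_mt_step(query, p_model, temp=1.0):
    d = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(d)[:K]                      # k nearest datastore entries
    w = np.exp(-d[nn] / temp)
    w /= w.sum()                                # softmax over negative distances
    p_knn = np.zeros(VOCAB)
    np.add.at(p_knn, vals[nn], w)               # aggregate weight per retrieved token
    return LAM * p_knn + (1 - LAM) * p_model    # interpolate with the base model

p_model = np.full(VOCAB, 1.0 / VOCAB)
p = knn_mt_step(keys[0], p_model)               # query identical to a cached state
print(p)
```

Because retrieval happens only at test time, swapping in a different datastore adapts the same base model to a new domain, which is the mechanism behind the paper's domain-adaptation results.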
Recently, various domain generalization (DG) methods have been developed to improve the generalization capability of FAS models via training on multiple source domains. These DG methods commonly require collecting sufficient real-world attack samples of different attack types for each source domain. This work aims to learn a FAS model that can generalize well to unseen domains without using any real-world attack samples from any source domain, which can significantly reduce the learning cost. Toward this goal, we draw inspiration from the theoretical error bound of domain generalization to use negative data augmentation instead of real-world attack samples for training. We show that using only a few types of simple synthesized negative samples, e.g., color jitter and color mask, the learned model can achieve competitive performance over state-of-the-art DG methods trained using real-world attack samples. Moreover, a dynamic global common loss and a local contrast loss are proposed to prompt the model to learn a compact and common feature representation for real face samples from different source domains, which can further improve the generalization capability. Experimental results of extensive cross-dataset testing demonstrate that our method can even outperform state-of-the-art DG methods using real-world attack samples for training. The code for reproducing the results of our method is available at https://github.com/WeihangWANG/NDA-FAS. \ No newline at end of file diff --git a/data/2021/iclr/Net-DNF: Effective Deep Modeling of Tabular Data b/data/2021/iclr/Net-DNF: Effective Deep Modeling of Tabular Data new file mode 100644 index 0000000000..19c5fbd5f1 --- /dev/null +++ b/data/2021/iclr/Net-DNF: Effective Deep Modeling of Tabular Data @@ -0,0 +1 @@ +A challenging open question in deep learning is how to handle tabular data.
Unlike domains such as image and natural language processing, where deep architectures prevail, there is still no widely accepted neural architecture that dominates tabular data. As a step toward bridging this gap, we present Net-DNF, a novel generic architecture whose inductive bias elicits models whose structure corresponds to logical Boolean formulas in disjunctive normal form (DNF) over affine soft-threshold decision terms. Net-DNFs also promote localized decisions that are taken over small subsets of the features. We present extensive experiments showing that Net-DNFs significantly and consistently outperform fully connected networks over tabular data. With relatively few hyperparameters, Net-DNFs open the door to practical end-to-end handling of tabular data using neural networks. We present ablation studies, which justify the design choices of Net-DNF including the inductive bias elements, namely, Boolean formulation, locality, and feature selection. \ No newline at end of file diff --git a/data/2021/iclr/Network Pruning That Matters: A Case Study on Retraining Variants b/data/2021/iclr/Network Pruning That Matters: A Case Study on Retraining Variants new file mode 100644 index 0000000000..6cd0195ddf --- /dev/null +++ b/data/2021/iclr/Network Pruning That Matters: A Case Study on Retraining Variants @@ -0,0 +1 @@ +Network pruning is an effective method to reduce the computational expense of over-parameterized neural networks for deployment on low-resource systems. Recent state-of-the-art techniques for retraining pruned networks such as weight rewinding and learning rate rewinding have been shown to outperform the traditional fine-tuning technique in recovering the lost accuracy (Renda et al., 2020), but so far it is unclear what accounts for such performance. In this work, we conduct extensive experiments to verify and analyze the uncanny effectiveness of learning rate rewinding.
We find that the reason behind the success of learning rate rewinding is the use of a large learning rate. A similar phenomenon can be observed in other learning rate schedules that involve large learning rates, e.g., the 1-cycle learning rate schedule (Smith et al., 2019). By leveraging the right learning rate schedule in retraining, we demonstrate a counter-intuitive phenomenon: randomly pruned networks can even achieve better performance than methodically pruned networks (fine-tuned with the conventional approach). Our results emphasize how crucial the learning rate schedule is in pruned-network retraining - a detail often overlooked by practitioners during the implementation of network pruning. One-sentence Summary: We study the effectiveness of different retraining mechanisms when pruning. \ No newline at end of file diff --git a/data/2021/iclr/Neural Approximate Sufficient Statistics for Implicit Models b/data/2021/iclr/Neural Approximate Sufficient Statistics for Implicit Models new file mode 100644 index 0000000000..d5f291f297 --- /dev/null +++ b/data/2021/iclr/Neural Approximate Sufficient Statistics for Implicit Models @@ -0,0 +1 @@ +We consider the fundamental problem of how to automatically construct summary statistics for implicit generative models, where evaluating the likelihood function is intractable but sampling / simulating data from the model is possible. The idea is to frame the task of constructing sufficient statistics as learning a mutual-information-maximizing representation of the data. This representation is computed by a deep neural network trained with a joint statistic-posterior learning strategy. We apply our approach to both traditional approximate Bayesian computation (ABC) and recent neural likelihood approaches, boosting their performance on a range of tasks.
\ No newline at end of file diff --git a/data/2021/iclr/Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective b/data/2021/iclr/Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective new file mode 100644 index 0000000000..91eee41550 --- /dev/null +++ b/data/2021/iclr/Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective @@ -0,0 +1 @@ +Neural Architecture Search (NAS) has been studied intensively to automate the discovery of top-performing neural networks. Current works require heavy training of a supernet or intensive architecture evaluations, thus suffering from heavy resource consumption and often incurring search bias due to truncated training or approximations. Can we select the best neural architectures without involving any training and eliminate a drastic portion of the search cost? We provide an affirmative answer by proposing a novel framework called training-free neural architecture search (TE-NAS). TE-NAS ranks architectures by analyzing the spectrum of the neural tangent kernel (NTK) and the number of linear regions in the input space. Both are motivated by recent theory advances in deep networks and can be computed without any training and any label. We show that: (1) these two measurements imply the trainability and expressivity of a neural network; (2) they strongly correlate with the network's test accuracy. Furthermore, we design a pruning-based NAS mechanism to achieve a more flexible and superior trade-off between trainability and expressivity during the search. In the NAS-Bench-201 and DARTS search spaces, TE-NAS completes a high-quality search at a cost of only 0.5 and 4 GPU hours with one 1080Ti on CIFAR-10 and ImageNet, respectively. We hope our work inspires more attempts at bridging the theoretical findings of deep networks and practical impacts in real NAS applications. Code is available at: https://github.com/VITA-Group/TENAS.
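As a toy illustration of the linear-region measurement that TE-NAS combines with the NTK spectrum, one can count distinct ReLU activation patterns over random inputs. This is a sketch under my own assumptions (a plain fully connected ReLU network, Monte-Carlo sampling), not the authors' implementation:

```python
import numpy as np

def activation_pattern(x, Ws, bs):
    """Return the ReLU on/off pattern a single input induces in an MLP.
    Inputs in the same linear region share the same pattern."""
    pattern = []
    h = x
    for W, b in zip(Ws, bs):
        pre = h @ W + b
        pattern.append(tuple(bool(v) for v in (pre > 0)))
        h = np.maximum(pre, 0)
    return tuple(pattern)

def estimate_linear_regions(Ws, bs, n_samples=1000, seed=0):
    """Monte-Carlo estimate of the number of linear regions: count distinct
    activation patterns over random inputs (no training, no labels needed)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, Ws[0].shape[0]))
    return len({activation_pattern(x, Ws, bs) for x in X})
```

A randomly initialized architecture that carves the input space into more regions would score higher on the expressivity half of such a training-free ranking.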
\ No newline at end of file diff --git a/data/2021/iclr/Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks b/data/2021/iclr/Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks new file mode 100644 index 0000000000..1604e0b778 --- /dev/null +++ b/data/2021/iclr/Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks @@ -0,0 +1 @@ +Deep neural networks (DNNs) are known to be vulnerable to backdoor attacks, a training-time attack that injects a trigger pattern into a small proportion of training data so as to control the model's prediction at test time. Backdoor attacks are notably dangerous since they do not affect the model's performance on clean examples, yet can fool the model into making incorrect predictions whenever the trigger pattern appears during testing. In this paper, we propose a novel defense framework, Neural Attention Distillation (NAD), to erase backdoor triggers from backdoored DNNs. NAD utilizes a teacher network to guide the finetuning of the backdoored student network on a small clean subset of data such that the intermediate-layer attention of the student network aligns with that of the teacher network. The teacher network can be obtained by an independent finetuning process on the same clean subset. We empirically show that, against 6 state-of-the-art backdoor attacks, NAD can effectively erase the backdoor triggers using only 5\% of clean training data without causing obvious performance degradation on clean examples. Code is available at https://github.com/bboylyg/NAD. \ No newline at end of file diff --git a/data/2021/iclr/Neural Delay Differential Equations b/data/2021/iclr/Neural Delay Differential Equations new file mode 100644 index 0000000000..8a7ef8edac --- /dev/null +++ b/data/2021/iclr/Neural Delay Differential Equations @@ -0,0 +1 @@ +The intersection of machine learning and dynamical systems has generated considerable interest recently.
Neural Ordinary Differential Equations (NODEs) represent a rich overlap between these fields. In this paper, we develop a continuous-time neural network approach based on Delay Differential Equations (DDEs). Our model uses the adjoint sensitivity method to learn the model parameters and delay directly from data. Our approach is inspired by that of NODEs and extends earlier neural DDE models, which have assumed that the value of the delay is known a priori. We perform a sensitivity analysis on our proposed approach and demonstrate its ability to learn DDE parameters from benchmark systems. We conclude our discussion with potential future directions and applications. \ No newline at end of file diff --git a/data/2021/iclr/Neural Jump Ordinary Differential Equations: Consistent Continuous-Time Prediction and Filtering b/data/2021/iclr/Neural Jump Ordinary Differential Equations: Consistent Continuous-Time Prediction and Filtering new file mode 100644 index 0000000000..254aa6822b --- /dev/null +++ b/data/2021/iclr/Neural Jump Ordinary Differential Equations: Consistent Continuous-Time Prediction and Filtering @@ -0,0 +1 @@ +Combinations of neural ODEs with recurrent neural networks (RNNs), like GRU-ODE-Bayes or ODE-RNN, are well suited to model irregularly-sampled time series. While those models outperform existing discrete-time approaches, no theoretical guarantees for their predictive capabilities are available. Assuming that the irregularly-sampled time series data originates from a continuous stochastic process, the optimal on-line prediction is the conditional expectation given the currently available information. We introduce the Neural Jump ODE (NJ-ODE), which provides a data-driven approach to learn, continuously in time, the conditional expectation of a stochastic process. Our approach models the conditional expectation between two observations with a neural ODE and jumps whenever a new observation is made.
We define a novel training framework, which allows us to prove theoretical convergence guarantees for the first time. In particular, we demonstrate the predictive capabilities of our model by proving that, under some regularity assumptions, the output process converges to the conditional expectation process. We provide experiments showing that the theoretical results also hold empirically. Moreover, we experimentally show that our model outperforms a state-of-the-art model on more complex learning tasks and give comparisons on a real-world dataset. \ No newline at end of file diff --git a/data/2021/iclr/Neural Learning of One-of-Many Solutions for Combinatorial Problems in Structured Output Spaces b/data/2021/iclr/Neural Learning of One-of-Many Solutions for Combinatorial Problems in Structured Output Spaces new file mode 100644 index 0000000000..aee3de2fb8 --- /dev/null +++ b/data/2021/iclr/Neural Learning of One-of-Many Solutions for Combinatorial Problems in Structured Output Spaces @@ -0,0 +1 @@ +Recent research has proposed neural architectures for solving combinatorial problems in structured output spaces. In many such problems, there may exist multiple solutions for a given input, e.g. a partially filled Sudoku puzzle may have many completions satisfying all constraints. Further, we are often interested in finding {\em any one} of the possible solutions, without any preference between them. Existing approaches completely ignore this solution multiplicity. In this paper, we argue that being oblivious to the presence of multiple solutions can severely hamper their training ability. Our contribution is twofold. First, we formally define the task of learning one-of-many solutions for combinatorial problems in structured output spaces, which is applicable to solving several problems of interest such as N-Queens and Sudoku.
Second, we present a generic learning framework that adapts an existing prediction network for a combinatorial problem to handle solution multiplicity. Our framework uses a selection module, whose goal is to dynamically determine, for every input, the solution that is most effective for training the network parameters in any given learning iteration. We propose an RL based approach to jointly train the selection module with the prediction network. Experiments on three different domains, and using two different prediction networks, demonstrate that our framework significantly improves the accuracy in our setting, obtaining up to $21$ pt gain over the baselines. \ No newline at end of file diff --git a/data/2021/iclr/Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics b/data/2021/iclr/Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics new file mode 100644 index 0000000000..82c5b050f3 --- /dev/null +++ b/data/2021/iclr/Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics @@ -0,0 +1 @@ +Predicting the dynamics of neural network parameters during training is one of the key challenges in building a theoretical foundation for deep learning. A central obstacle is that the motion of a network in high-dimensional parameter space undergoes discrete finite steps along complex stochastic gradients derived from real-world datasets. We circumvent this obstacle through a unifying theoretical framework based on intrinsic symmetries embedded in a network's architecture that are present for any dataset. We show that any such symmetry imposes stringent geometric constraints on gradients and Hessians, leading to an associated conservation law in the continuous-time limit of stochastic gradient descent (SGD), akin to Noether's theorem in physics. We further show that finite learning rates used in practice can actually break these symmetry induced conservation laws. 
We apply tools from finite difference methods to derive modified gradient flow, a differential equation that better approximates the numerical trajectory taken by SGD at finite learning rates. We combine modified gradient flow with our framework of symmetries to derive exact integral expressions for the dynamics of certain parameter combinations. We empirically validate our analytic predictions for learning dynamics on VGG-16 trained on Tiny ImageNet. Overall, by exploiting symmetry, our work demonstrates that we can analytically describe the learning dynamics of various parameter combinations at finite learning rates and batch sizes for state of the art architectures trained on any dataset. \ No newline at end of file diff --git a/data/2021/iclr/Neural Networks for Learning Counterfactual G-Invariances from Single Environments b/data/2021/iclr/Neural Networks for Learning Counterfactual G-Invariances from Single Environments new file mode 100644 index 0000000000..907fb2f2e9 --- /dev/null +++ b/data/2021/iclr/Neural Networks for Learning Counterfactual G-Invariances from Single Environments @@ -0,0 +1 @@ +Despite -- or maybe because of -- their astonishing capacity to fit data, neural networks are believed to have difficulties extrapolating beyond training data distribution. This work shows that, for extrapolations based on finite transformation groups, a model's inability to extrapolate is unrelated to its capacity. Rather, the shortcoming is inherited from a learning hypothesis: Examples not explicitly observed with infinitely many training examples have underspecified outcomes in the learner's model. In order to endow neural networks with the ability to extrapolate over group transformations, we introduce a learning framework counterfactually-guided by the learning hypothesis that any group invariance to (known) transformation groups is mandatory even without evidence, unless the learner deems it inconsistent with the training data. 
Unlike existing invariance-driven methods for (counterfactual) extrapolations, this framework allows extrapolations from a single environment. Finally, we introduce sequence and image extrapolation tasks that validate our framework and showcase the shortcomings of traditional approaches. \ No newline at end of file diff --git a/data/2021/iclr/Neural ODE Processes b/data/2021/iclr/Neural ODE Processes new file mode 100644 index 0000000000..abdf748337 --- /dev/null +++ b/data/2021/iclr/Neural ODE Processes @@ -0,0 +1 @@ +Neural Ordinary Differential Equations (NODEs) use a neural network to model the instantaneous rate of change in the state of a system. However, despite their apparent suitability for dynamics-governed time-series, NODEs present a few disadvantages. First, they are unable to adapt to incoming data points, a fundamental requirement for real-time applications imposed by the natural direction of time. Second, time series are often composed of a sparse set of measurements that could be explained by many possible underlying dynamics. NODEs do not capture this uncertainty. In contrast, Neural Processes (NPs) are a family of models providing uncertainty estimation and fast data adaptation but lack an explicit treatment of the flow of time. To address these problems, we introduce Neural ODE Processes (NDPs), a new class of stochastic processes determined by a distribution over Neural ODEs. By maintaining an adaptive data-dependent distribution over the underlying ODE, we show that our model can successfully capture the dynamics of low-dimensional systems from just a few data points. At the same time, we demonstrate that NDPs scale up to challenging high-dimensional time-series with unknown latent dynamics such as rotating MNIST digits. 
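The core mechanic of NDPs, maintaining a distribution over ODE dynamics and decoding by integration, can be caricatured in a few lines. Everything below (the Gaussian latent, the fixed-step Euler integrator, the scalar dynamics in the usage note) is an illustrative assumption of mine, not the paper's architecture:

```python
import numpy as np

def integrate_ode(x0, z, f, t0=0.0, t1=1.0, steps=100):
    """Euler-integrate dx/dt = f(x, z); the latent z parameterizes the dynamics."""
    x = np.asarray(x0, dtype=float)
    dt = (t1 - t0) / steps
    for _ in range(steps):
        x = x + dt * f(x, z)
    return x

def ndp_sample(x0, context_mu, context_sigma, f, seed=0):
    """One sample from a toy distribution over ODEs: draw a latent z from a
    context-dependent Gaussian, then roll the corresponding ODE forward."""
    rng = np.random.default_rng(seed)
    z = context_mu + context_sigma * rng.standard_normal(np.shape(context_mu))
    return integrate_ode(x0, z, f)
```

With `f = lambda x, z: z * x` and a degenerate posterior (`context_sigma=0`, `context_mu=1`), the sample from `x0=1` over the unit interval approximates `e`; widening `context_sigma` spreads the trajectories, which is how uncertainty about the underlying dynamics is expressed.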
\ No newline at end of file diff --git a/data/2021/iclr/Neural Pruning via Growing Regularization b/data/2021/iclr/Neural Pruning via Growing Regularization new file mode 100644 index 0000000000..d291f5fe62 --- /dev/null +++ b/data/2021/iclr/Neural Pruning via Growing Regularization @@ -0,0 +1 @@ +Regularization has long been utilized to learn sparsity in deep neural network pruning. However, its role has mainly been explored in the small penalty strength regime. In this work, we extend its application to a new scenario where the regularization grows large gradually to tackle two central problems of pruning: the pruning schedule and weight importance scoring. (1) The former topic is newly brought up in this work; we find it critical to pruning performance, while it receives little research attention. Specifically, we propose an L2 regularization variant with rising penalty factors and show it can bring significant accuracy gains compared with its one-shot counterpart, even when the same weights are removed. (2) The growing penalty scheme also gives us an approach to exploit Hessian information for more accurate pruning without knowing the specific Hessian values, thus avoiding the common Hessian approximation problems. Empirically, the proposed algorithms are easy to implement and scalable to large datasets and networks in both structured and unstructured pruning. Their effectiveness is demonstrated with modern deep neural networks on the CIFAR and ImageNet datasets, achieving competitive results compared to many state-of-the-art algorithms. Our code and trained models are publicly available at this https URL.
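A minimal caricature of the rising-penalty idea: weights slated for pruning receive an extra L2 penalty whose factor grows over iterations, shrinking them gradually before the final removal. The task gradient is zeroed out for brevity and the names and schedule are illustrative, not the authors' code:

```python
import numpy as np

def growing_l2_step(w, grad, penalty, to_prune, lr=0.1):
    """One SGD step where only weights slated for pruning get an extra,
    currently-scheduled L2 penalty pushing them toward zero."""
    reg_grad = np.where(to_prune, penalty * w, 0.0)
    return w - lr * (grad + reg_grad)

def prune_with_growing_reg(w, to_prune, reg_step=0.05, iters=100, lr=0.1):
    """Gradually raise the penalty factor, then remove the targeted weights."""
    penalty = 0.0
    for _ in range(iters):
        grad = np.zeros_like(w)          # task gradient omitted in this sketch
        w = growing_l2_step(w, grad, penalty, to_prune, lr)
        penalty += reg_step              # the "growing" schedule
    return np.where(to_prune, 0.0, w)    # removal after gradual shrinkage
```

Because the penalized weights are driven near zero before they are cut, the removal is far less abrupt than one-shot pruning with the same mask.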
\ No newline at end of file diff --git a/data/2021/iclr/Neural Spatio-Temporal Point Processes b/data/2021/iclr/Neural Spatio-Temporal Point Processes new file mode 100644 index 0000000000..56254fbb78 --- /dev/null +++ b/data/2021/iclr/Neural Spatio-Temporal Point Processes @@ -0,0 +1 @@ +We propose a new class of parameterizations for spatio-temporal point processes which leverage Neural ODEs as a computational method and enable flexible, high-fidelity models of discrete events that are localized in continuous time and space. Central to our approach is a combination of recurrent continuous-time neural networks with two novel neural architectures, i.e., Jump and Attentive Continuous-time Normalizing Flows. This approach allows us to learn complex distributions for both the spatial and temporal domain and to condition non-trivially on the observed event history. We validate our models on data sets from a wide variety of contexts such as seismology, epidemiology, urban mobility, and neuroscience. \ No newline at end of file diff --git a/data/2021/iclr/Neural Synthesis of Binaural Speech From Mono Audio b/data/2021/iclr/Neural Synthesis of Binaural Speech From Mono Audio new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/Neural Thompson Sampling b/data/2021/iclr/Neural Thompson Sampling new file mode 100644 index 0000000000..0fa596d2b2 --- /dev/null +++ b/data/2021/iclr/Neural Thompson Sampling @@ -0,0 +1 @@ +We study the Combinatorial Thompson Sampling policy (CTS) for combinatorial multi-armed bandit problems (CMAB), within an approximation regret setting. Although CTS has attracted a lot of interest, it has a drawback that other usual CMAB policies do not have when considering non-exact oracles: for some oracles, CTS has a poor approximation regret (scaling linearly with the time horizon $T$) [Wang and Chen, 2018]. A study is then necessary to discriminate the oracles on which CTS could learn. This study was started by Kong et al. 
[2021]: they gave the first approximation regret analysis of CTS for the greedy oracle, obtaining an upper bound of order $\mathcal{O}(\log(T)/\Delta^2)$, where $\Delta$ is some minimal reward gap. In this paper, our objective is to push this study further than the simple case of the greedy oracle. We provide the first $\mathcal{O}(\log(T)/\Delta)$ approximation regret upper bound for CTS, obtained under a specific condition on the approximation oracle, allowing a reduction to the exact oracle analysis. We thus term this condition REDUCE2EXACT, and observe that it is satisfied in many concrete examples. Moreover, it can be extended to the probabilistically triggered arms setting, thus capturing even more problems, such as online influence maximization. \ No newline at end of file diff --git a/data/2021/iclr/Neural Topic Model via Optimal Transport b/data/2021/iclr/Neural Topic Model via Optimal Transport new file mode 100644 index 0000000000..d88a526376 --- /dev/null +++ b/data/2021/iclr/Neural Topic Model via Optimal Transport @@ -0,0 +1 @@ +Recently, Neural Topic Models (NTMs) inspired by variational autoencoders have attracted increasing research interest due to their promising results on text analysis. However, it is usually hard for existing NTMs to achieve good document representation and coherent/diverse topics at the same time. Moreover, they often degrade their performance severely on short documents. The requirement of reparameterisation could also compromise their training quality and model flexibility. To address these shortcomings, we present a new neural topic model via the theory of optimal transport (OT). Specifically, we propose to learn the topic distribution of a document by directly minimising its OT distance to the document's word distributions. Importantly, the cost matrix of the OT distance models the weights between topics and words, which is constructed by the distances between topics and words in an embedding space.
Our proposed model can be trained efficiently with a differentiable loss. Extensive experiments show that our framework significantly outperforms the state-of-the-art NTMs on discovering more coherent and diverse topics and deriving better document representations for both regular and short texts. \ No newline at end of file diff --git a/data/2021/iclr/Neural gradients are near-lognormal: improved quantized and sparse training b/data/2021/iclr/Neural gradients are near-lognormal: improved quantized and sparse training new file mode 100644 index 0000000000..f1c9cfdb22 --- /dev/null +++ b/data/2021/iclr/Neural gradients are near-lognormal: improved quantized and sparse training @@ -0,0 +1 @@ +While training can mostly be accelerated by reducing the time needed to propagate neural gradients back throughout the model, most previous works focus on the quantization/pruning of weights and activations. These methods are often not applicable to neural gradients, which have very different statistical properties. Distinguished from weights and activations, we find that the distribution of neural gradients is approximately lognormal. Considering this, we suggest two closed-form analytical methods to reduce the computational and memory burdens of neural gradients. The first method optimizes the floating-point format and scale of the gradients. The second method accurately sets sparsity thresholds for gradient pruning. Each method achieves state-of-the-art results on ImageNet. To the best of our knowledge, this paper is the first to (1) quantize the gradients to 6-bit floating-point formats, or (2) achieve up to 85% gradient sparsity -- in each case without accuracy degradation. Reference implementation accompanies the paper. 
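Assuming, as the gradient-statistics abstract argues, that gradient magnitudes are approximately lognormal, a pruning threshold for any target sparsity follows in closed form from the fitted log-moments. A sketch under that assumption (function names are mine, not the paper's):

```python
import numpy as np
from statistics import NormalDist

def lognormal_sparsity_threshold(grads, target_sparsity):
    """Analytic threshold assuming |gradients| are lognormal: fit (mu, sigma)
    to log|g|, then invert the Gaussian CDF at the target sparsity level."""
    log_mag = np.log(np.abs(grads) + 1e-30)   # epsilon guards exact zeros
    mu, sigma = log_mag.mean(), log_mag.std()
    z = NormalDist().inv_cdf(target_sparsity)
    return float(np.exp(mu + sigma * z))

def sparsify(grads, target_sparsity):
    """Zero out all gradients whose magnitude falls below the analytic threshold."""
    t = lognormal_sparsity_threshold(grads, target_sparsity)
    return np.where(np.abs(grads) < t, 0.0, grads)
```

The appeal of such a closed form is that no sorting or per-step percentile search over the gradient tensor is needed; two moments of `log|g|` suffice.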
\ No newline at end of file diff --git a/data/2021/iclr/Neural networks with late-phase weights b/data/2021/iclr/Neural networks with late-phase weights new file mode 100644 index 0000000000..651a7da109 --- /dev/null +++ b/data/2021/iclr/Neural networks with late-phase weights @@ -0,0 +1 @@ +The largely successful method of training neural networks is to learn their weights using some variant of stochastic gradient descent (SGD). Here, we show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning. At the end of learning, we obtain back a single model by taking a spatial average in weight space. To avoid incurring increased computational costs, we investigate a family of low-dimensional late-phase weight models which interact multiplicatively with the remaining parameters. Our results show that augmenting standard models with late-phase weights improves generalization in established benchmarks such as CIFAR-10/100, ImageNet and enwik8. These findings are complemented with a theoretical analysis of a noisy quadratic problem which provides a simplified picture of the late phases of neural network learning. \ No newline at end of file diff --git a/data/2021/iclr/Neural representation and generation for RNA secondary structures b/data/2021/iclr/Neural representation and generation for RNA secondary structures new file mode 100644 index 0000000000..b8fb298d52 --- /dev/null +++ b/data/2021/iclr/Neural representation and generation for RNA secondary structures @@ -0,0 +1 @@ +Our work is concerned with the generation and targeted design of RNA, a type of genetic macromolecule that can adopt complex structures which influence their cellular activities and functions. The design of large scale and complex biological structures spurs dedicated graph-based deep generative modeling techniques, which represents a key but underappreciated aspect of computational drug discovery. 
In this work, we investigate the principles behind representing and generating different RNA structural modalities, and propose a flexible framework to jointly embed and generate these molecular structures along with their sequence in a meaningful latent space. Equipped with a deep understanding of RNA molecular structures, our most sophisticated encoding and decoding methods operate on the molecular graph as well as the junction tree hierarchy, integrating strong inductive bias about RNA structural regularity and folding mechanism such that high structural validity, stability and diversity of generated RNAs are achieved. Also, we seek to adequately organize the latent space of RNA molecular embeddings with regard to the interaction with proteins, and targeted optimization is used to navigate in this latent space to search for desired novel RNA molecules. \ No newline at end of file diff --git a/data/2021/iclr/Neurally Augmented ALISTA b/data/2021/iclr/Neurally Augmented ALISTA new file mode 100644 index 0000000000..ccaff4de09 --- /dev/null +++ b/data/2021/iclr/Neurally Augmented ALISTA @@ -0,0 +1 @@ +It is well-established that many iterative sparse reconstruction algorithms can be unrolled to yield a learnable neural network for improved empirical performance. A prime example is learned ISTA (LISTA) where weights, step sizes and thresholds are learned from training data. Recently, Analytic LISTA (ALISTA) has been introduced, combining the strong empirical performance of a fully learned approach like LISTA, while retaining theoretical guarantees of classical compressed sensing algorithms and significantly reducing the number of parameters to learn. However, these parameters are trained to work in expectation, often leading to suboptimal reconstruction of individual targets. In this work we therefore introduce Neurally Augmented ALISTA, in which an LSTM network is used to compute step sizes and thresholds individually for each target vector during reconstruction. 
This adaptive approach is theoretically motivated by revisiting the recovery guarantees of ALISTA. We show that our approach further improves empirical performance in sparse reconstruction, in particular outperforming existing algorithms by an increasing margin as the compression ratio becomes more challenging. \ No newline at end of file diff --git a/data/2021/iclr/New Bounds For Distributed Mean Estimation and Variance Reduction b/data/2021/iclr/New Bounds For Distributed Mean Estimation and Variance Reduction new file mode 100644 index 0000000000..b8443a6ee3 --- /dev/null +++ b/data/2021/iclr/New Bounds For Distributed Mean Estimation and Variance Reduction @@ -0,0 +1 @@ +We consider the problem of distributed mean estimation (DME), in which n machines are each given a local d-dimensional vector x_v ∈ ℝ^d, and must cooperate to estimate the mean of their inputs, μ = (1/n) ∑_{v=1}^{n} x_v, while minimizing total communication cost. DME is a fundamental construct in distributed machine learning, and there has been considerable work on variants of this problem, especially in the context of distributed variance reduction for stochastic gradients in parallel SGD. Previous work typically assumes an upper bound on the norm of the input vectors, and achieves an error bound in terms of this norm. However, in many real applications, the input vectors are concentrated around the correct output μ, but μ itself has large norm. In such cases, previous output error bounds perform poorly. In this paper, we show that output error bounds need not depend on input norm. We provide a method of quantization which allows distributed mean estimation to be performed with solution quality dependent only on the distance between inputs, not on input norm, and show an analogous result for distributed variance reduction. The technique is based on a new connection with lattice theory.
We also provide lower bounds showing that the communication-to-error trade-off of our algorithms is asymptotically optimal. As the lattices achieving optimal bounds under the ℓ2-norm can be computationally impractical, we also present an extension which leverages easy-to-use cubic lattices, and is loose only up to a logarithmic factor in d. We show experimentally that our method yields practical improvements for common applications, relative to prior approaches. \ No newline at end of file diff --git a/data/2021/iclr/No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks b/data/2021/iclr/No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks new file mode 100644 index 0000000000..9d8cebd436 --- /dev/null +++ b/data/2021/iclr/No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks @@ -0,0 +1 @@ +There has been increasing interest in building deep hierarchy-aware classifiers that aim to quantify and reduce the severity of mistakes, and not just reduce the number of errors. The idea is to exploit the label hierarchy (e.g., the WordNet ontology) and consider graph distances as a proxy for mistake severity. Surprisingly, on examining mistake-severity distributions of the top-1 prediction, we find that current state-of-the-art hierarchy-aware deep classifiers do not always show practical improvement over the standard cross-entropy baseline in making better mistakes. The reason for the reduction in average mistake-severity can be attributed to the increase in low-severity mistakes, which may also explain the noticeable drop in their accuracy. To this end, we use the classical Conditional Risk Minimization (CRM) framework for hierarchy-aware classification.
Given a cost matrix and a reliable estimate of likelihoods (obtained from a trained network), CRM simply amends mistakes at inference time; it needs no extra hyperparameters and requires adding just a few lines of code to the standard cross-entropy baseline. It significantly outperforms the state-of-the-art and consistently obtains large reductions in the average hierarchical distance of top-$k$ predictions across datasets, with very little loss in accuracy. CRM, because of its simplicity, can be used with any off-the-shelf trained model that provides reliable likelihood estimates. \ No newline at end of file diff --git a/data/2021/iclr/No MCMC for me: Amortized sampling for fast and stable training of energy-based models b/data/2021/iclr/No MCMC for me: Amortized sampling for fast and stable training of energy-based models new file mode 100644 index 0000000000..cae43f695c --- /dev/null +++ b/data/2021/iclr/No MCMC for me: Amortized sampling for fast and stable training of energy-based models @@ -0,0 +1 @@ +Energy-Based Models (EBMs) present a flexible and appealing way to represent uncertainty. Despite recent advances, training EBMs on high-dimensional data remains a challenging problem as the state-of-the-art approaches are costly, unstable, and require considerable tuning and domain expertise to apply successfully. In this work, we present a simple method for training EBMs at scale which uses an entropy-regularized generator to amortize the MCMC sampling typically used in EBM training. We improve upon prior MCMC-based entropy regularization methods with a fast variational approximation. We demonstrate the effectiveness of our approach by using it to train tractable likelihood models. Next, we apply our estimator to the recently proposed Joint Energy Model (JEM), where we match the original performance with faster and stable training. This allows us to extend JEM models to semi-supervised classification on tabular data from a variety of continuous domains. 
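The inference-time amendment CRM makes, as described in the hierarchy-aware classification abstract above, is just an expected-cost argmin over a cost matrix derived from the label hierarchy. A minimal sketch with a made-up 4-class hierarchy (two sibling pairs; within-pair mistakes cost 1, cross-pair mistakes cost 3):

```python
import numpy as np

def crm_predict(probs, cost):
    """Conditional Risk Minimization at inference time: pick the class with
    the lowest expected cost under the posterior, risk[k] = sum_j cost[k, j] * p(j|x)."""
    return int(np.argmin(cost @ probs))

# Hypothetical hierarchy: classes {0, 1} and {2, 3} are sibling pairs.
COST = np.array([[0, 1, 3, 3],
                 [1, 0, 3, 3],
                 [3, 3, 0, 1],
                 [3, 3, 1, 0]], dtype=float)
```

For a posterior like [0.32, 0.28, 0.35, 0.05], plain argmax picks class 2, but CRM prefers class 0 because its sibling's probability mass backs it up, so a mistake there is cheap; this is the "no extra hyperparameters, few lines of code" flavor the abstract describes.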
\ No newline at end of file diff --git a/data/2021/iclr/Noise against noise: stochastic label noise helps combat inherent label noise b/data/2021/iclr/Noise against noise: stochastic label noise helps combat inherent label noise new file mode 100644 index 0000000000..5c8b69d722 --- /dev/null +++ b/data/2021/iclr/Noise against noise: stochastic label noise helps combat inherent label noise @@ -0,0 +1 @@ +The noise in stochastic gradient descent (SGD) provides a crucial implicit regularization effect, previously studied in optimization by analyzing the dynamics of parameter updates. In this paper, we are interested in learning with noisy labels, where we have a collection of samples with potential mislabeling. We show that a previously rarely discussed SGD noise, induced by stochastic label noise (SLN), mitigates the effects of inherent label noise. In contrast, the common SGD noise directly applied to model parameters does not. We formalize the differences and connections of SGD noise variants, showing that SLN induces SGD noise dependent on the sharpness of output landscape and the confidence of output probability, which may help escape from sharp minima and prevent overconfidence. SLN not only improves generalization in its simplest form but also boosts popular robust training methods, including sample selection and label correction. Specifically, we present an enhanced algorithm by applying SLN to label correction. Our code is released. \ No newline at end of file diff --git a/data/2021/iclr/Noise or Signal: The Role of Image Backgrounds in Object Recognition b/data/2021/iclr/Noise or Signal: The Role of Image Backgrounds in Object Recognition new file mode 100644 index 0000000000..513f7fe84e --- /dev/null +++ b/data/2021/iclr/Noise or Signal: The Role of Image Backgrounds in Object Recognition @@ -0,0 +1 @@ +We assess the tendency of state-of-the-art object recognition models to depend on signals from image backgrounds.
We create a toolkit for disentangling foreground and background signal on ImageNet images, and find that (a) models can achieve non-trivial accuracy by relying on the background alone, (b) models often misclassify images even in the presence of correctly classified foregrounds--up to 87.5% of the time with adversarially chosen backgrounds, and (c) more accurate models tend to depend on backgrounds less. Our analysis of backgrounds brings us closer to understanding which correlations machine learning models use, and how they determine models' out of distribution performance. \ No newline at end of file diff --git a/data/2021/iclr/Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds b/data/2021/iclr/Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds new file mode 100644 index 0000000000..a78ad5b630 --- /dev/null +++ b/data/2021/iclr/Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds @@ -0,0 +1 @@ +Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies. Therefore, OPE is a key step in applying reinforcement learning to real-world domains such as medical treatment, where interactive data collection is expensive or even unsafe. As the observed data tends to be noisy and limited, it is essential to provide rigorous uncertainty quantification, not just a point estimation, when applying OPE to make high stakes decisions. This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation, which remains a challenging open question. We develop a practical algorithm through a primal-dual optimization-based approach, which leverages the kernel Bellman loss (KBL) of Feng et al.(2019) and a new martingale concentration inequality of KBL applicable to time-dependent data with unknown mixing conditions. 
Our algorithm makes minimum assumptions on the data and the function class of the Q-function, and works for the behavior-agnostic settings where the data is collected under a mix of arbitrary unknown behavior policies. We present empirical results that clearly demonstrate the advantages of our approach over existing methods. \ No newline at end of file diff --git a/data/2021/iclr/Nonseparable Symplectic Neural Networks b/data/2021/iclr/Nonseparable Symplectic Neural Networks new file mode 100644 index 0000000000..e85e2f53fe --- /dev/null +++ b/data/2021/iclr/Nonseparable Symplectic Neural Networks @@ -0,0 +1 @@ +Predicting the behaviors of Hamiltonian systems has been drawing increasing attention in scientific machine learning. However, the vast majority of the literature was focused on predicting separable Hamiltonian systems with their kinematic and potential energy terms being explicitly decoupled, while building data-driven paradigms to predict nonseparable Hamiltonian systems that are ubiquitous in fluid dynamics and quantum mechanics were rarely explored. The main computational challenge lies in the effective embedding of symplectic priors to describe the inherently coupled evolution of position and momentum, which typically exhibits intricate dynamics with many degrees of freedom. To solve the problem, we propose a novel neural network architecture, Nonseparable Symplectic Neural Networks (NSSNNs), to uncover and embed the symplectic structure of a nonseparable Hamiltonian system from limited observation data. The enabling mechanics of our approach is an augmented symplectic time integrator to decouple the position and momentum energy terms and facilitate their evolution. We demonstrated the efficacy and versatility of our method by predicting a wide range of Hamiltonian systems, both separable and nonseparable, including vortical flow and quantum system. 
We showed the unique computational merits of our approach to yield long-term, accurate, and robust predictions for large-scale Hamiltonian systems by rigorously enforcing symplectomorphism. \ No newline at end of file diff --git a/data/2021/iclr/OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning b/data/2021/iclr/OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning new file mode 100644 index 0000000000..4facec0277 --- /dev/null +++ b/data/2021/iclr/OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning @@ -0,0 +1 @@ +Reinforcement learning (RL) has achieved impressive performance in a variety of online settings in which an agent's ability to query the environment for transitions and rewards is effectively unlimited. However, in many practical applications, the situation is reversed: an agent may have access to large amounts of undirected offline experience data, while access to the online environment is severely limited. In this work, we focus on this offline setting. Our main insight is that, when presented with offline data composed of a variety of behaviors, an effective way to leverage this data is to extract a continuous space of recurring and temporally extended primitive behaviors before using these primitives for downstream task learning. Primitives extracted in this way serve two purposes: they delineate the behaviors that are supported by the data from those that are not, making them useful for avoiding distributional shift in offline RL; and they provide a degree of temporal abstraction, which reduces the effective horizon yielding better learning in theory, and improved offline RL in practice. In addition to benefiting offline policy optimization, we show that performing offline primitive learning in this way can also be leveraged for improving few-shot imitation learning as well as exploration and transfer in online RL on a variety of benchmark domains. 
Visualizations are available at this https URL \ No newline at end of file diff --git a/data/2021/iclr/Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers b/data/2021/iclr/Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers new file mode 100644 index 0000000000..dacf2e9ddc --- /dev/null +++ b/data/2021/iclr/Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers @@ -0,0 +1 @@ +We propose a simple, practical, and intuitive approach for domain adaptation in reinforcement learning. Our approach stems from the idea that the agent's experience in the source domain should look similar to its experience in the target domain. Building off of a probabilistic view of RL, we formally show that we can achieve this goal by compensating for the difference in dynamics by modifying the reward function. This modified reward function is simple to estimate by learning auxiliary classifiers that distinguish source-domain transitions from target-domain transitions. Intuitively, the modified reward function penalizes the agent for visiting states and taking actions in the source domain which are not possible in the target domain. Said another way, the agent is penalized for transitions that would indicate that the agent is interacting with the source domain, rather than the target domain. Our approach is applicable to domains with continuous states and actions and does not require learning an explicit model of the dynamics. On discrete and continuous control tasks, we illustrate the mechanics of our approach and demonstrate its scalability to high-dimensional tasks. 
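The reward modification in the "Off-Dynamics Reinforcement Learning" abstract above can be sketched with two domain classifiers. The log-ratio form below is my reading of the idea, with the assumption (not spelled out in the abstract) that each classifier exposes a logit for "this came from the target domain":

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def modified_reward(r, logit_sas, logit_sa):
    """Reward correction for training in the source domain.

    logit_sas: classifier logit that (s, a, s') came from the TARGET domain.
    logit_sa:  classifier logit that (s, a) came from the TARGET domain.
    The correction is the implied log-ratio of target vs. source dynamics;
    transitions that are unlikely in the target domain get penalized.
    """
    q_sas = sigmoid(logit_sas)
    q_sa = sigmoid(logit_sa)
    # log(q / (1 - q)) recovers the logit, so delta == logit_sas - logit_sa.
    delta = (np.log(q_sas) - np.log(1.0 - q_sas)) - (np.log(q_sa) - np.log(1.0 - q_sa))
    return r + delta
```

Because log-odds of a sigmoid recover the logit exactly, the whole correction collapses to a difference of classifier logits, which is why no explicit dynamics model is needed.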
\ No newline at end of file diff --git a/data/2021/iclr/Offline Model-Based Optimization via Normalized Maximum Likelihood Estimation b/data/2021/iclr/Offline Model-Based Optimization via Normalized Maximum Likelihood Estimation new file mode 100644 index 0000000000..d06daba4eb --- /dev/null +++ b/data/2021/iclr/Offline Model-Based Optimization via Normalized Maximum Likelihood Estimation @@ -0,0 +1 @@ +In this work we consider data-driven optimization problems where one must maximize a function given only queries at a fixed set of points. This problem setting emerges in many domains where function evaluation is a complex and expensive process, such as in the design of materials, vehicles, or neural network architectures. Because the available data typically only covers a small manifold of the possible space of inputs, a principal challenge is to be able to construct algorithms that can reason about uncertainty and out-of-distribution values, since a naive optimizer can easily exploit an estimated model to return adversarial inputs. We propose to tackle this problem by leveraging the normalized maximum-likelihood (NML) estimator, which provides a principled approach to handling uncertainty and out-of-distribution inputs. While in the standard formulation NML is intractable, we propose a tractable approximation that allows us to scale our method to high-capacity neural network models. We demonstrate that our method can effectively optimize high-dimensional design problems in a variety of disciplines such as chemistry, biology, and materials engineering. 
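The NML estimator in the abstract above can be illustrated in the simplest possible setting. The Bernoulli family here is my choice for illustration, not the paper's neural model: for each candidate outcome, refit the maximum-likelihood parameter on the data augmented with that outcome, score the outcome under the refit model, and normalize.

```python
def conditional_nml(d):
    """Conditional NML for the next binary outcome under the Bernoulli family.

    For each candidate outcome y, refit the MLE on d + [y], score y under
    that refit model, then normalize over candidates. Unlike the plain MLE,
    NML hedges on outcomes the data has never ruled out.
    """
    scores = []
    for y in (0, 1):
        aug = d + [y]
        theta = sum(aug) / len(aug)           # Bernoulli MLE on augmented data
        scores.append(theta if y == 1 else 1.0 - theta)
    z = sum(scores)
    return [s / z for s in scores]            # [p_NML(y=0), p_NML(y=1)]

# Four successes in a row: the plain MLE says p(next=1) = 1.0,
# while NML keeps probability on the unseen outcome.
print(conditional_nml([1, 1, 1, 1]))
```

This hedging against never-observed values is exactly the property the abstract leverages to avoid adversarial, out-of-distribution optima.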
\ No newline at end of file diff --git a/data/2021/iclr/On Data-Augmentation and Consistency-Based Semi-Supervised Learning b/data/2021/iclr/On Data-Augmentation and Consistency-Based Semi-Supervised Learning new file mode 100644 index 0000000000..7adbf5e7a7 --- /dev/null +++ b/data/2021/iclr/On Data-Augmentation and Consistency-Based Semi-Supervised Learning @@ -0,0 +1 @@ +Recently proposed consistency-based Semi-Supervised Learning (SSL) methods such as the $\Pi$-model, temporal ensembling, the mean teacher, or the virtual adversarial training, have advanced the state of the art in several SSL tasks. These methods can typically reach performances that are comparable to their fully supervised counterparts while using only a fraction of labelled examples. Despite these methodological advances, the understanding of these methods is still relatively limited. In this text, we analyse (variations of) the $\Pi$-model in settings where analytically tractable results can be obtained. We establish links with Manifold Tangent Classifiers and demonstrate that the quality of the perturbations is key to obtaining reasonable SSL performances. Importantly, we propose a simple extension of the Hidden Manifold Model that naturally incorporates data-augmentation schemes and offers a framework for understanding and experimenting with SSL methods. \ No newline at end of file diff --git a/data/2021/iclr/On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections b/data/2021/iclr/On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections new file mode 100644 index 0000000000..33f52d3ac9 --- /dev/null +++ b/data/2021/iclr/On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections @@ -0,0 +1 @@ +Disparate impact has raised serious concerns in machine learning applications and its societal impacts. In response to the need of mitigating discrimination, fairness has been regarded as a crucial property in algorithmic design. 
In this work, we study the problem of disparate impact on graph-structured data. Specifically, we focus on dyadic fairness, which articulates a fairness concept that a predictive relationship between two instances should be independent of the sensitive attributes. Based on this, we theoretically relate the graph connections to dyadic fairness on link predictive scores in learning graph neural networks, and reveal that regulating weights on existing edges in a graph contributes to dyadic fairness conditionally. Subsequently, we propose our algorithm, FairAdj, to empirically learn a fair adjacency matrix with proper graph structural constraints for fair link prediction, and meanwhile preserve predictive accuracy as much as possible. Empirical validation demonstrates that our method delivers effective dyadic fairness in terms of various statistics, and at the same time enjoys a favorable fairness-utility tradeoff. \ No newline at end of file diff --git a/data/2021/iclr/On Fast Adversarial Robustness Adaptation in Model-Agnostic Meta-Learning b/data/2021/iclr/On Fast Adversarial Robustness Adaptation in Model-Agnostic Meta-Learning new file mode 100644 index 0000000000..1ad17056f8 --- /dev/null +++ b/data/2021/iclr/On Fast Adversarial Robustness Adaptation in Model-Agnostic Meta-Learning @@ -0,0 +1 @@ +Model-agnostic meta-learning (MAML) has emerged as one of the most successful meta-learning techniques in few-shot learning. It enables us to learn a meta-initialization of model parameters (that we call meta-model) to rapidly adapt to new tasks using a small amount of labeled training data. Despite the generalization power of the meta-model, it remains elusive how adversarial robustness can be maintained by MAML in few-shot learning. In addition to generalization, robustness is also desired for a meta-model to defend against adversarial examples (attacks).
Toward promoting adversarial robustness in MAML, we first study WHEN a robustness-promoting regularization should be incorporated, given the fact that MAML adopts a bi-level (fine-tuning vs. meta-update) learning procedure. We show that robustifying the meta-update stage is sufficient to make robustness adapted to the task-specific fine-tuning stage even if the latter uses a standard training protocol. We also make additional justification on the acquired robustness adaptation by peering into the interpretability of neurons' activation maps. Furthermore, we investigate HOW robust regularization can efficiently be designed in MAML. We propose a general but easily-optimized robustness-regularized meta-learning framework, which allows the use of unlabeled data augmentation, fast adversarial attack generation, and computationally-light fine-tuning. In particular, we for the first time show that the auxiliary contrastive learning task can enhance the adversarial robustness of MAML. Finally, extensive experiments are conducted to demonstrate the effectiveness of our proposed methods in robust few-shot learning. \ No newline at end of file diff --git a/data/2021/iclr/On Graph Neural Networks versus Graph-Augmented MLPs b/data/2021/iclr/On Graph Neural Networks versus Graph-Augmented MLPs new file mode 100644 index 0000000000..686dcd7124 --- /dev/null +++ b/data/2021/iclr/On Graph Neural Networks versus Graph-Augmented MLPs @@ -0,0 +1 @@ +From the perspective of expressive power, this work compares multi-layer Graph Neural Networks (GNNs) with a simplified alternative that we call Graph-Augmented Multi-Layer Perceptrons (GA-MLPs), which first augments node features with certain multi-hop operators on the graph and then applies an MLP in a node-wise fashion. 
From the perspective of graph isomorphism testing, we show both theoretically and numerically that GA-MLPs with suitable operators can distinguish almost all non-isomorphic graphs, just like the Weisfeiler-Lehman (WL) test. However, by viewing them as node-level functions and examining the equivalence classes they induce on rooted graphs, we prove a separation in expressive power between GA-MLPs and GNNs that grows exponentially in depth. In particular, unlike GNNs, GA-MLPs are unable to count the number of attributed walks. We also demonstrate via community detection experiments that GA-MLPs can be limited by their choice of operator family, as compared to GNNs with higher flexibility in learning. \ No newline at end of file diff --git a/data/2021/iclr/On InstaHide, Phase Retrieval, and Sparse Matrix Factorization b/data/2021/iclr/On InstaHide, Phase Retrieval, and Sparse Matrix Factorization new file mode 100644 index 0000000000..77f1c32b83 --- /dev/null +++ b/data/2021/iclr/On InstaHide, Phase Retrieval, and Sparse Matrix Factorization @@ -0,0 +1,2 @@ +In this work, we examine the security of InstaHide, a scheme recently proposed by [Huang, Song, Li and Arora, ICML'20] for preserving the security of private datasets in the context of distributed learning. To generate a synthetic training example to be shared among the distributed learners, InstaHide takes a convex combination of private feature vectors and randomly flips the sign of each entry of the resulting vector with probability 1/2. A salient question is whether this scheme is secure in any provable sense, perhaps under a plausible hardness assumption and assuming the distributions generating the public and private data satisfy certain properties. +We show that the answer to this appears to be quite subtle and closely related to the average-case complexity of a new multi-task, missing-data version of the classic problem of phase retrieval.
Motivated by this connection, we design a provable algorithm that can recover private vectors using only the public vectors and synthetic vectors generated by InstaHide, under the assumption that the private and public vectors are isotropic Gaussian. \ No newline at end of file diff --git a/data/2021/iclr/On Learning Universal Representations Across Languages b/data/2021/iclr/On Learning Universal Representations Across Languages new file mode 100644 index 0000000000..9ae4ed1fac --- /dev/null +++ b/data/2021/iclr/On Learning Universal Representations Across Languages @@ -0,0 +1 @@ +Recent studies have demonstrated the overwhelming advantage of cross-lingual pre-trained models (PTMs), such as multilingual BERT and XLM, on cross-lingual NLP tasks. However, existing approaches essentially capture the co-occurrence among tokens through involving the masked language model (MLM) objective with token-level cross entropy. In this work, we extend these approaches to learn sentence-level representations, and show the effectiveness on cross-lingual understanding and generation. We propose Hierarchical Contrastive Learning (HiCTL) to (1) learn universal representations for parallel sentences distributed in one or multiple languages and (2) distinguish the semantically-related words from a shared cross-lingual vocabulary for each sentence. We conduct evaluations on three benchmarks: language understanding tasks (QQP, QNLI, SST-2, MRPC, STS-B and MNLI) in the GLUE benchmark, cross-lingual natural language inference (XNLI) and machine translation. Experimental results show that the HiCTL obtains an absolute gain of 1.0%/2.2% accuracy on GLUE/XNLI as well as achieves substantial improvements of +1.7-+3.6 BLEU on both the high-resource and low-resource English-to-X translation tasks over strong baselines. We will release the source codes as soon as possible. 
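The HiCTL abstract above does not spell out its exact loss; as a hedged illustration, a generic sentence-level contrastive (InfoNCE-style) objective over parallel sentence pairs, where each pair's representations attract each other and other batch rows serve as negatives, looks like:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss over parallel sentence representations.

    anchors[i] and positives[i] encode a parallel pair (e.g. a sentence and
    its translation); all other rows in the batch act as negatives.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                    # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # matched pairs sit on the diagonal
```

Correctly matched pairs drive the loss toward zero, while mismatched pairs are heavily penalized.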
\ No newline at end of file diff --git a/data/2021/iclr/On Position Embeddings in BERT b/data/2021/iclr/On Position Embeddings in BERT new file mode 100644 index 0000000000..90818e09a3 --- /dev/null +++ b/data/2021/iclr/On Position Embeddings in BERT @@ -0,0 +1 @@ +relative \ No newline at end of file diff --git a/data/2021/iclr/On Self-Supervised Image Representations for GAN Evaluation b/data/2021/iclr/On Self-Supervised Image Representations for GAN Evaluation new file mode 100644 index 0000000000..e69de29bb2 diff --git a/data/2021/iclr/On Statistical Bias In Active Learning: How and When to Fix It b/data/2021/iclr/On Statistical Bias In Active Learning: How and When to Fix It new file mode 100644 index 0000000000..3a5314acf7 --- /dev/null +++ b/data/2021/iclr/On Statistical Bias In Active Learning: How and When to Fix It @@ -0,0 +1 @@ +Active learning is a powerful tool when labelling data is expensive, but it introduces a bias because the training data no longer follows the population distribution. We formalize this bias and investigate the situations in which it can be harmful and sometimes even helpful. We further introduce novel corrective weights to remove bias when doing so is beneficial. Through this, our work not only provides a useful mechanism that can improve the active learning approach, but also an explanation of the empirical successes of various existing approaches which ignore this bias. In particular, we show that this bias can be actively helpful when training overparameterized models -- like neural networks -- with relatively little data. 
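The corrective-weight idea in the "On Statistical Bias In Active Learning" abstract above can be illustrated with a toy importance-weighting scheme. The sampling-with-replacement setup and inverse-probability weights below are a textbook simplification, not the paper's exact estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic pool of N per-point losses (in practice these come from a model).
N = 1000
losses = rng.normal(loc=1.0, scale=0.3, size=N)

# An "active" acquisition distribution that prefers high-loss points,
# mimicking the bias active learning introduces into the training sample.
q = np.exp(losses)
q /= q.sum()

M = 200
idx = rng.choice(N, size=M, replace=True, p=q)

naive = losses[idx].mean()                      # biased: overweights high-loss points
weighted = (losses[idx] / (N * q[idx])).mean()  # importance-weighted, unbiased

print(naive, weighted, losses.mean())
```

The naive average systematically overestimates the population risk, while the inverse-probability weights remove the acquisition bias; the abstract's point is that in some regimes (e.g. overparameterized models with little data) leaving the bias in can actually help.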
\ No newline at end of file diff --git a/data/2021/iclr/On the Bottleneck of Graph Neural Networks and its Practical Implications b/data/2021/iclr/On the Bottleneck of Graph Neural Networks and its Practical Implications new file mode 100644 index 0000000000..20e48c5aa8 --- /dev/null +++ b/data/2021/iclr/On the Bottleneck of Graph Neural Networks and its Practical Implications @@ -0,0 +1 @@ +Graph neural networks (GNNs) were shown to effectively learn from highly structured data containing elements (nodes) with relationships (edges) between them. GNN variants differ in how each node in the graph absorbs the information flowing from its neighbor nodes. In this paper, we highlight an inherent problem in GNNs: the mechanism of propagating information between neighbors creates a bottleneck when every node aggregates messages from its neighbors. This bottleneck causes the over-squashing of exponentially-growing information into fixed-size vectors. As a result, the graph fails to propagate messages flowing from distant nodes and performs poorly when the prediction task depends on long-range information. We demonstrate that the bottleneck hinders popular GNNs from fitting the training data. We show that GNNs that absorb incoming edges equally, like GCN and GIN, are more susceptible to over-squashing than other GNN types. We further show that existing, extensively-tuned, GNN-based models suffer from over-squashing and that breaking the bottleneck improves state-of-the-art results without any hyperparameter tuning or additional weights. 
\ No newline at end of file diff --git a/data/2021/iclr/On the Critical Role of Conventions in Adaptive Human-AI Collaboration b/data/2021/iclr/On the Critical Role of Conventions in Adaptive Human-AI Collaboration new file mode 100644 index 0000000000..8cc31fe142 --- /dev/null +++ b/data/2021/iclr/On the Critical Role of Conventions in Adaptive Human-AI Collaboration @@ -0,0 +1 @@ +Humans can quickly adapt to new partners in collaborative tasks (e.g. playing basketball), because they understand which fundamental skills of the task (e.g. how to dribble, how to shoot) carry over across new partners. Humans can also quickly adapt to similar tasks with the same partners by carrying over conventions that they have developed (e.g. raising hand signals pass the ball), without learning to coordinate from scratch. To collaborate seamlessly with humans, AI agents should adapt quickly to new partners and new tasks as well. However, current approaches have not attempted to distinguish between the complexities intrinsic to a task and the conventions used by a partner, and more generally there has been little focus on leveraging conventions for adapting to new settings. In this work, we propose a learning framework that teases apart rule-dependent representation from convention-dependent representation in a principled way. We show that, under some assumptions, our rule-dependent representation is a sufficient statistic of the distribution over best-response strategies across partners. Using this separation of representations, our agents are able to adapt quickly to new partners, and to coordinate with old partners on new tasks in a zero-shot manner. We experimentally validate our approach on three collaborative tasks varying in complexity: a contextual multi-armed bandit, a block placing task, and the card game Hanabi. 
\ No newline at end of file diff --git a/data/2021/iclr/On the Curse of Memory in Recurrent Neural Networks: Approximation and Optimization Analysis b/data/2021/iclr/On the Curse of Memory in Recurrent Neural Networks: Approximation and Optimization Analysis new file mode 100644 index 0000000000..2988b51e7e --- /dev/null +++ b/data/2021/iclr/On the Curse of Memory in Recurrent Neural Networks: Approximation and Optimization Analysis @@ -0,0 +1 @@ +We study the approximation properties and optimization dynamics of recurrent neural networks (RNNs) when applied to learn input-output relationships in temporal data. We consider the simple but representative setting of using continuous-time linear RNNs to learn from data generated by linear relationships. Mathematically, the latter can be understood as a sequence of linear functionals. We prove a universal approximation theorem of such linear functionals, and characterize the approximation rate and its relation with memory. Moreover, we perform a fine-grained dynamical analysis of training linear RNNs, which further reveal the intricate interactions between memory and learning. A unifying theme uncovered is the non-trivial effect of memory, a notion that can be made precise in our framework, on approximation and optimization: when there is long term memory in the target, it takes a large number of neurons to approximate it. Moreover, the training process will suffer from slow downs. In particular, both of these effects become exponentially more pronounced with memory - a phenomenon we call the "curse of memory". These analyses represent a basic step towards a concrete mathematical understanding of new phenomenon that may arise in learning temporal relationships using recurrent architectures. 
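The continuous-time linear setting in the "Curse of Memory" abstract above has a concrete standard form (notation mine, not quoted from the paper):

```latex
% Continuous-time linear RNN with hidden state h and input signal x:
\dot{h}(t) = W h(t) + U x(t), \qquad \hat{y}(t) = c^{\top} h(t),
% which (for stable W) realizes the memory kernel \hat{\rho}(s) = c^{\top} e^{W s} U:
\hat{y}(t) = \int_{0}^{\infty} c^{\top} e^{W s} U \, x(t-s) \, ds .
% The target is a sequence of linear functionals with kernel \rho:
y(t) = \int_{0}^{\infty} \rho(s) \, x(t-s) \, ds .
% A slowly decaying \rho (long-term memory) forces many slow eigenmodes of W,
% hence many neurons and slow training dynamics: the "curse of memory".
```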
\ No newline at end of file diff --git a/data/2021/iclr/On the Dynamics of Training Attention Models b/data/2021/iclr/On the Dynamics of Training Attention Models new file mode 100644 index 0000000000..e438904c63 --- /dev/null +++ b/data/2021/iclr/On the Dynamics of Training Attention Models @@ -0,0 +1 @@ +The attention mechanism has been widely used in deep neural networks as a model component. By now, it has become a critical building block in many state-of-the-art natural language models. Despite its great success established empirically, the working mechanism of attention has not been investigated at a sufficient theoretical depth to date. In this paper, we set up a simple text classification task and study the dynamics of training a simple attention-based classification model using gradient descent. In this setting, we show that, for the discriminative words that the model should attend to, a persisting identity exists relating its embedding and the inner product of its key and the query. This allows us to prove that training must converge to attending to the discriminative words when the attention output is classified by a linear classifier. Experiments are performed, which validates our theoretical analysis and provides further insights. \ No newline at end of file diff --git a/data/2021/iclr/On the Impossibility of Global Convergence in Multi-Loss Optimization b/data/2021/iclr/On the Impossibility of Global Convergence in Multi-Loss Optimization new file mode 100644 index 0000000000..639bca6905 --- /dev/null +++ b/data/2021/iclr/On the Impossibility of Global Convergence in Multi-Loss Optimization @@ -0,0 +1 @@ +Under mild regularity conditions, gradient-based methods converge globally to a critical point in the single-loss setting. This is known to break down for vanilla gradient descent when moving to multi-loss optimization, but can we hope to build some algorithm with global guarantees? 
We negatively resolve this open problem by proving that any reasonable algorithm will exhibit limit cycles or diverge to infinite losses in some differentiable game, even in two-player games with zero-sum interactions. A reasonable algorithm is simply one which avoids strict maxima, an exceedingly weak assumption since converging to maxima would be the opposite of minimization. This impossibility theorem holds even if we impose existence of a strict minimum and no other critical points. The proof is constructive, enabling us to display explicit limit cycles for existing gradient-based methods. Nonetheless, it remains an open question whether cycles arise in high-dimensional games of interest to ML practitioners, such as GANs or multi-agent RL. \ No newline at end of file diff --git a/data/2021/iclr/On the Origin of Implicit Regularization in Stochastic Gradient Descent b/data/2021/iclr/On the Origin of Implicit Regularization in Stochastic Gradient Descent new file mode 100644 index 0000000000..c81bbc0b6b --- /dev/null +++ b/data/2021/iclr/On the Origin of Implicit Regularization in Stochastic Gradient Descent @@ -0,0 +1 @@ +For infinitesimal learning rates, stochastic gradient descent (SGD) follows the path of gradient flow on the full batch loss function. However moderately large learning rates can achieve higher test accuracies, and this generalization benefit is not explained by convergence bounds, since the learning rate which maximizes test accuracy is often larger than the learning rate which minimizes training loss. To interpret this phenomenon we prove that for SGD with random shuffling, the mean SGD iterate also stays close to the path of gradient flow if the learning rate is small and finite, but on a modified loss. This modified loss is composed of the original loss function and an implicit regularizer, which penalizes the norms of the minibatch gradients. 
Under mild assumptions, when the batch size is small the scale of the implicit regularization term is proportional to the ratio of the learning rate to the batch size. We verify empirically that explicitly including the implicit regularizer in the loss can enhance the test accuracy when the learning rate is small. \ No newline at end of file diff --git a/data/2021/iclr/On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines b/data/2021/iclr/On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines new file mode 100644 index 0000000000..e34864c365 --- /dev/null +++ b/data/2021/iclr/On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines @@ -0,0 +1 @@ +Fine-tuning pre-trained transformer-based language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks. Despite the strong empirical performance of fine-tuned models, fine-tuning is an unstable process: training the same model with multiple random seeds can result in a large variance of the task performance. Previous literature (Devlin et al., 2019; Lee et al., 2020; Dodge et al., 2020) identified two potential reasons for the observed instability: catastrophic forgetting and a small size of the fine-tuning datasets. In this paper, we show that both hypotheses fail to explain the fine-tuning instability. We analyze BERT, RoBERTa, and ALBERT, fine-tuned on three commonly used datasets from the GLUE benchmark and show that the observed instability is caused by optimization difficulties that lead to vanishing gradients. Additionally, we show that the remaining variance of the downstream task performance can be attributed to differences in generalization where fine-tuned models with the same training loss exhibit noticeably different test performance. 
Based on our analysis, we present a simple but strong baseline that makes fine-tuning BERT-based models significantly more stable than previously proposed approaches. Code to reproduce our results is available online: this https URL . \ No newline at end of file diff --git a/data/2021/iclr/On the Theory of Implicit Deep Learning: Global Convergence with Implicit Layers b/data/2021/iclr/On the Theory of Implicit Deep Learning: Global Convergence with Implicit Layers new file mode 100644 index 0000000000..57f78ced2e --- /dev/null +++ b/data/2021/iclr/On the Theory of Implicit Deep Learning: Global Convergence with Implicit Layers @@ -0,0 +1 @@ +A deep equilibrium model uses implicit layers, which are implicitly defined through an equilibrium point of an infinite sequence of computation. It avoids any explicit computation of the infinite sequence by finding an equilibrium point directly via root-finding and by computing gradients via implicit differentiation. In this paper, we analyze the gradient dynamics of deep equilibrium models with nonlinearity only on weight matrices and non-convex objective functions of weights for regression and classification. Despite non-convexity, convergence to a global optimum at a linear rate is guaranteed without any assumption on the width of the models, allowing the width to be smaller than the output dimension and the number of data points. Moreover, we prove a relation between the gradient dynamics of the deep implicit layer and the dynamics of the trust region Newton method of a shallow explicit layer. This mathematically proven relation, along with our numerical observations, suggests the importance of understanding the implicit bias of implicit layers and an open problem on the topic. Our proofs deal with implicit layers, weight tying, and nonlinearity on weights, and differ from those in the related literature.
\ No newline at end of file diff --git a/data/2021/iclr/On the Transfer of Disentangled Representations in Realistic Settings b/data/2021/iclr/On the Transfer of Disentangled Representations in Realistic Settings new file mode 100644 index 0000000000..63e6f03504 --- /dev/null +++ b/data/2021/iclr/On the Transfer of Disentangled Representations in Realistic Settings @@ -0,0 +1 @@ +Learning meaningful representations that disentangle the underlying structure of the data generating process is considered to be of key importance in machine learning. While disentangled representations were found to be useful for diverse tasks such as abstract reasoning and fair classification, their scalability and real-world impact remain questionable. We introduce a new high-resolution dataset with 1M simulated images and over 1,800 annotated real-world images of the same robotic setup. In contrast to previous work, this new dataset exhibits correlations and a complex underlying structure, and allows us to evaluate transfer to unseen simulated and real-world settings where the encoder i) remains in distribution or ii) is out of distribution. We propose new architectures in order to scale disentangled representation learning to realistic high-resolution settings and conduct a large-scale empirical study of disentangled representations on this dataset. We observe that disentanglement is a good predictor for out-of-distribution (OOD) task performance. \ No newline at end of file diff --git a/data/2021/iclr/On the Universality of Rotation Equivariant Point Cloud Networks b/data/2021/iclr/On the Universality of Rotation Equivariant Point Cloud Networks new file mode 100644 index 0000000000..faa1e14494 --- /dev/null +++ b/data/2021/iclr/On the Universality of Rotation Equivariant Point Cloud Networks @@ -0,0 +1,2 @@ +Learning functions on point clouds has applications in many fields, including computer vision, computer graphics, physics, and chemistry.
Recently, there has been a growing interest in neural architectures that are invariant or equivariant to all three shape-preserving transformations of point clouds: translation, rotation, and permutation. +In this paper, we present a first study of the approximation power of these architectures. We first derive two sufficient conditions for an equivariant architecture to have the universal approximation property, based on a novel characterization of the space of equivariant polynomials. We then use these conditions to show that two recently suggested models are universal, and to devise two other novel universal architectures. \ No newline at end of file diff --git a/data/2021/iclr/On the Universality of the Double Descent Peak in Ridgeless Regression b/data/2021/iclr/On the Universality of the Double Descent Peak in Ridgeless Regression new file mode 100644 index 0000000000..38dbed1cc3 --- /dev/null +++ b/data/2021/iclr/On the Universality of the Double Descent Peak in Ridgeless Regression @@ -0,0 +1 @@ +We prove a non-asymptotic distribution-independent lower bound for the expected mean squared generalization error caused by label noise in ridgeless linear regression. Our lower bound generalizes a similar known result to the overparameterized (interpolating) regime. In contrast to most previous works, our analysis applies to a broad class of input distributions with almost surely full-rank feature matrices, which allows us to cover various types of deterministic or random feature maps. Our lower bound is asymptotically sharp and implies that in the presence of label noise, ridgeless linear regression does not perform well around the interpolation threshold for any of these feature maps. We analyze the imposed assumptions in detail and provide a theory for analytic (random) feature maps.
Using this theory, we can show that our assumptions are satisfied for input distributions with a (Lebesgue) density and feature maps given by random deep neural networks with analytic activation functions like sigmoid, tanh, softplus, or GELU. As further examples, we show that feature maps from random Fourier features and polynomial kernels also satisfy our assumptions. We complement our theory with further experimental and analytic results. \ No newline at end of file diff --git a/data/2021/iclr/On the geometry of generalization and memorization in deep neural networks b/data/2021/iclr/On the geometry of generalization and memorization in deep neural networks new file mode 100644 index 0000000000..bd6b47dfcc --- /dev/null +++ b/data/2021/iclr/On the geometry of generalization and memorization in deep neural networks @@ -0,0 +1 @@ +Understanding how large neural networks avoid memorizing training data is key to explaining their high generalization performance. To examine the structure of when and where memorization occurs in a deep network, we use a recently developed replica-based mean field theoretic geometric analysis method. We find that all layers preferentially learn from examples which share features, and link this behavior to generalization performance. Memorization predominantly occurs in the deeper layers, due to decreasing object manifolds' radius and dimension, whereas early layers are minimally affected. This predicts that generalization can be restored by reverting the final few layer weights to earlier epochs before significant memorization occurred, which is confirmed by our experiments. Additionally, by studying generalization under different model sizes, we reveal the connection between the double descent phenomenon and the underlying model geometry. Finally, an analytical analysis shows that networks avoid memorization early in training because, close to initialization, the gradient contribution from permuted examples is small.
These findings provide quantitative evidence for the structure of memorization across layers of a deep neural network, the drivers for such structure, and its connection to manifold geometric properties. \ No newline at end of file diff --git a/data/2021/iclr/On the mapping between Hopfield networks and Restricted Boltzmann Machines b/data/2021/iclr/On the mapping between Hopfield networks and Restricted Boltzmann Machines new file mode 100644 index 0000000000..7e42e4dc9a --- /dev/null +++ b/data/2021/iclr/On the mapping between Hopfield networks and Restricted Boltzmann Machines @@ -0,0 +1 @@ +Hopfield networks (HNs) and Restricted Boltzmann Machines (RBMs) are two important models at the interface of statistical physics, machine learning, and neuroscience. Recently, there has been interest in the relationship between HNs and RBMs, due to their similarity under the statistical mechanics formalism. An exact mapping between HNs and RBMs has been previously noted for the special case of orthogonal (uncorrelated) encoded patterns. We present here an exact mapping in the general case of correlated pattern HNs, which are more broadly applicable to existing datasets. Specifically, we show that any HN with $N$ binary variables and $p